@radekosmulski
Last active June 6, 2022 10:24
import torch

# simulated batch of images: (batch, channels, height, width)
x = torch.rand(64, 3, 224, 224)
# or activations some number of layers up the convolutional stack
x = torch.rand(64, 256, 32, 32)

BatchNorm

m = x.mean((0,2,3), keepdim=True) # keepdim=True facilitates broadcasting
v = x.var ((0,2,3), keepdim=True)

Calculate the statistics across all examples in the batch and all spatial positions, separately for each channel.

  • milestone technique enabling various networks to train or train better / faster
  • as batch size tends towards 1, training becomes unstable and eventually impossible (what is the variance of a single example?)
  • not obvious how to use in RNNs
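The BatchNorm statistics above can be turned into the full normalization step. This is a minimal sketch, not the layer's complete implementation: it assumes `eps = 1e-5`, leaves out the learnable scale and shift and the running statistics, and uses a smaller tensor than the notes to keep it quick. It also passes `unbiased=False` to `var`, which differs from the snippet above (default unbiased estimate) but matches what `torch.nn.functional.batch_norm` uses in training mode.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 16, 4, 4)  # (batch, channels, height, width)
eps = 1e-5                   # assumed value, same as PyTorch's default

# One mean and one variance per channel, shared by the whole batch
m = x.mean((0, 2, 3), keepdim=True)                 # shape (1, 16, 1, 1)
v = x.var((0, 2, 3), unbiased=False, keepdim=True)  # biased estimate
x_bn = (x - m) / torch.sqrt(v + eps)

# Reference: functional batch norm in training mode, no running stats, no affine
ref = F.batch_norm(x, None, None, training=True, eps=eps)
bn_matches = torch.allclose(x_bn, ref, atol=1e-4)
print(bn_matches)
```

After normalization every channel has roughly zero mean and unit variance when measured over the batch and spatial dimensions.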

LayerNorm

m = x.mean((1,2,3), keepdim=True)
v = x.var ((1,2,3), keepdim=True)

Calculate the statistics for each example separately, across all channels and spatial positions.

"batch normalization cannot be applied to online learning tasks or to extremely large distributed models where the minibatches have to be small"

  • can be applied in RNNs
  • can throw out potentially useful information
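A sketch of the full LayerNorm step built on the statistics above, under the same assumptions as before (`eps = 1e-5`, no learnable scale and shift, biased variance to match `torch.nn.functional.layer_norm`, a small illustrative tensor). The last check illustrates why the batch size does not matter here: normalizing a single example alone gives the same result as normalizing it as part of a batch.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 16, 4, 4)  # (batch, channels, height, width)
eps = 1e-5                   # assumed value, same as PyTorch's default

# One mean and one variance per example, over channels and spatial positions
m = x.mean((1, 2, 3), keepdim=True)                 # shape (8, 1, 1, 1)
v = x.var((1, 2, 3), unbiased=False, keepdim=True)  # biased estimate
x_ln = (x - m) / torch.sqrt(v + eps)

# Reference: functional layer norm over the last three dimensions, no affine
ref = F.layer_norm(x, x.shape[1:], eps=eps)
ln_matches = torch.allclose(x_ln, ref, atol=1e-4)

# The result for one example does not depend on the rest of the batch
single = F.layer_norm(x[:1], x.shape[1:], eps=eps)
batch_independent = torch.allclose(single, ref[:1], atol=1e-6)
print(ln_matches, batch_independent)
```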

InstanceNorm

m = x.mean((2,3), keepdim=True)
v = x.var ((2,3), keepdim=True)

No concept of running stats anymore. Calculate stats for each example for each channel separately.
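The same pattern applies to InstanceNorm. This sketch again assumes `eps = 1e-5`, omits any learnable scale and shift, uses the biased variance to match `torch.nn.functional.instance_norm`, and works on a small illustrative tensor.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 16, 4, 4)  # (batch, channels, height, width)
eps = 1e-5                   # assumed value, same as PyTorch's default

# One mean and one variance per (example, channel) pair, over spatial positions only
m = x.mean((2, 3), keepdim=True)                 # shape (8, 16, 1, 1)
v = x.var((2, 3), unbiased=False, keepdim=True)  # biased estimate
x_in = (x - m) / torch.sqrt(v + eps)

# Reference: functional instance norm using the input's own statistics
ref = F.instance_norm(x, eps=eps)
in_matches = torch.allclose(x_in, ref, atol=1e-4)
print(in_matches)
```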

GroupNorm

"In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN’s computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants."

  • supports small batches
  • can be used in transfer learning where the target task requires small batches (e.g. image segmentation)
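In the style of the snippets for the other norms, the GroupNorm statistics can be sketched by reshaping the channel dimension into groups. The group count `g = 4`, `eps = 1e-5`, and the small tensor are assumptions chosen for illustration; the learnable scale and shift are omitted, and the biased variance is used to match `torch.nn.functional.group_norm`.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.rand(8, 16, 4, 4)  # (batch, channels, height, width)
eps = 1e-5                   # assumed value, same as PyTorch's default
g = 4                        # number of groups (illustrative; must divide channels)

n, c, h, w = x.shape
xg = x.reshape(n, g, c // g, h, w)  # split channels into g groups

# One mean and one variance per (example, group) pair
m = xg.mean((2, 3, 4), keepdim=True)                 # shape (8, 4, 1, 1, 1)
v = xg.var((2, 3, 4), unbiased=False, keepdim=True)  # biased estimate
x_gn = ((xg - m) / torch.sqrt(v + eps)).reshape(n, c, h, w)

ref = F.group_norm(x, g, eps=eps)
gn_matches = torch.allclose(x_gn, ref, atol=1e-4)

# Edge cases: one group gives LayerNorm statistics, one channel per group gives InstanceNorm
one_group_is_ln = torch.allclose(
    F.group_norm(x, 1, eps=eps), F.layer_norm(x, x.shape[1:], eps=eps), atol=1e-4)
c_groups_is_in = torch.allclose(
    F.group_norm(x, c, eps=eps), F.instance_norm(x, eps=eps), atol=1e-4)
print(gn_matches, one_group_is_ln, c_groups_is_in)
```

The two edge cases show GroupNorm sitting between LayerNorm and InstanceNorm, with the number of groups controlling how many channels share statistics.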

Summary of all norms (from the GroupNorm paper):
