```python
import torch

# simulated batch of images
x = torch.rand(64, 3, 224, 224)
# or some number of layers up the convolutional stack
x = torch.rand(64, 256, 32, 32)

m = x.mean((0, 2, 3), keepdim=True)  # keepdim=True facilitates broadcasting
v = x.var((0, 2, 3), keepdim=True)
```
Calculate the statistics across all examples, separately for each channel.
- a milestone technique that enabled many networks to train at all, or to train better and faster
- as the batch size tends towards 1, training becomes unstable and eventually impossible (what is the variance of a single example?)
- not obvious how to use in RNNs
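The statistics above can be turned into a full normalization step. A minimal sketch (inference-style: no learned affine parameters, no running statistics), cross-checked against `torch.nn.BatchNorm2d` in training mode so the built-in layer also normalizes with the current batch's statistics:

```python
import torch

x = torch.rand(64, 256, 32, 32)
eps = 1e-5

m = x.mean((0, 2, 3), keepdim=True)
# nn.BatchNorm2d normalizes with the biased variance
v = x.var((0, 2, 3), keepdim=True, unbiased=False)
x_hat = (x - m) / torch.sqrt(v + eps)

# cross-check against the built-in layer (affine disabled,
# training mode => normalize with batch statistics)
bn = torch.nn.BatchNorm2d(256, eps=eps, affine=False)
bn.train()
print(torch.allclose(x_hat, bn(x), atol=1e-5))  # True
```

Note the `unbiased=False`: the layer divides by `N` rather than `N - 1` when normalizing (the unbiased estimate is only used for the running statistics).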
```python
m = x.mean((1, 2, 3), keepdim=True)
v = x.var((1, 2, 3), keepdim=True)
```
Calculate the statistics for each example separately, across all channels.
"batch normalization cannot be applied to online learning tasks or to extremely large distributed models where the minibatches have to be small"
- can be applied in RNNs
- can throw out potentially useful information
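The per-example statistics can be verified against `torch.nn.LayerNorm` when it is told to normalize over the last three dimensions. A minimal sketch:

```python
import torch

x = torch.rand(64, 256, 32, 32)
eps = 1e-5

m = x.mean((1, 2, 3), keepdim=True)
v = x.var((1, 2, 3), keepdim=True, unbiased=False)  # biased variance
x_hat = (x - m) / torch.sqrt(v + eps)

# built-in layer normalizing over (C, H, W) for each example
ln = torch.nn.LayerNorm([256, 32, 32], eps=eps, elementwise_affine=False)
print(torch.allclose(x_hat, ln(x), atol=1e-5))  # True
```

Because nothing is aggregated across the batch dimension, the result is identical whether the batch size is 64 or 1.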
```python
m = x.mean((2, 3), keepdim=True)
v = x.var((2, 3), keepdim=True)
```
No concept of running stats anymore: calculate the statistics for each example and each channel separately.
- throws out even more information than LayerNorm
- originally proposed for fast stylization (style transfer)
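The per-example, per-channel statistics match what `torch.nn.InstanceNorm2d` computes (with its defaults, it tracks no running statistics). A minimal sketch:

```python
import torch

x = torch.rand(64, 256, 32, 32)
eps = 1e-5

m = x.mean((2, 3), keepdim=True)
v = x.var((2, 3), keepdim=True, unbiased=False)  # biased variance
x_hat = (x - m) / torch.sqrt(v + eps)

# built-in layer: per-example, per-channel spatial statistics
inorm = torch.nn.InstanceNorm2d(256, eps=eps, affine=False)
print(torch.allclose(x_hat, inorm(x), atol=1e-5))  # True
```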
> In this paper, we present Group Normalization (GN) as a simple alternative to BN. GN divides the channels into groups and computes within each group the mean and variance for normalization. GN's computation is independent of batch sizes, and its accuracy is stable in a wide range of batch sizes. On ResNet-50 trained in ImageNet, GN has 10.6% lower error than its BN counterpart when using a batch size of 2; when using typical batch sizes, GN is comparably good with BN and outperforms other normalization variants.
- supports small batches
- can be used in transfer learning where the target task requires small batches (e.g. image segmentation)
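GroupNorm's computation follows the same pattern as the others: reshape the channels into G groups and normalize over each group's channels and spatial positions, per example. A minimal sketch (G = 32 here is an illustrative choice, a common default in the paper), cross-checked against `torch.nn.GroupNorm`:

```python
import torch

x = torch.rand(64, 256, 32, 32)
eps = 1e-5
G = 32  # number of groups (illustrative choice)

N, C, H, W = x.shape
xg = x.view(N, G, C // G, H, W)
m = xg.mean((2, 3, 4), keepdim=True)
v = xg.var((2, 3, 4), keepdim=True, unbiased=False)
x_hat = ((xg - m) / torch.sqrt(v + eps)).view(N, C, H, W)

gn = torch.nn.GroupNorm(G, C, eps=eps, affine=False)
print(torch.allclose(x_hat, gn(x), atol=1e-5))  # True
```

The two extremes recover the previous norms: `G = 1` gives LayerNorm-style statistics over all channels, and `G = C` gives InstanceNorm (one channel per group).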
Summary of all norms (from the GroupNorm paper):