6 groups of models inherit from BartForConditionalGeneration
.
The major differences between them are:
- pretraining objective & data
- finetuning objective & data
- number of layers and dimension of each layer
- when layernorm is applied
This document focuses on layernorm timing.
Here are the three relevant BartConfig
parameters that control which layernorms are instantiated and when they are applied:
add_final_layer_norm | normalize_embedding | normalize_before | |
---|---|---|---|
facebook/bart-large-cnn | False | True | False |
facebook/mbart-large-en-ro | True | True | True |
facebook/blenderbot-3B | True | False | True |
facebook/blenderbot-90M | False | True | False |
google/pegasus-large | True | False | True |
Helsinki-NLP/opus-mt-en-de | False | False | False |
For background, LayerNorm
tries to normalize inputs using two parameters (called weight and bias) in the following picture:
It was introduced in this paper as an alternative to BatchNorm
we transpose batch normalization into layer normalization by computing the mean and variance used for normalization from all of the summed inputs to the neurons in a layer on a single training case. Like batch normalization, we also give each neuron its own adaptive bias and gain which are applied after the normalization but before the non-linearity. Unlike batch normalization, layer normalization performs exactly the same computation at training and test times.
When normalize_before==True
, as is the case for blenderbot-3B
, mBART
and pegasus
, we apply layernorms before the associated operation. For blenderbot-90
, BART
and marian
, we apply layernorm after the operation.
Here is some pseudocode for the Encoder to illustrate the difference
Encoder
layernorm_embedding
for each EncoderLayer:
if normalize_before: self_attn_layer_norm
x = self_attention(x)
if not normalize_before: self_attn_layer_norm
if normalize_before: final_layer_norm # final of the layer...
x = x + fully_connected_network(x)
if not normalize_before: final_layernorm
done
if config.add_final_layernorm: layernorm # see below
The decoder is very similar, with two wrinkles.
First, each DecoderLayer also has cross attention and an associated layernorm, called encoder_attn_layer_norm
, for a total of 3 layernorms. The timing of these layernorms, just like in EncoderLayer
is controlled by config.normalize_before
.
The other difference only pertains to config.do_blenderbot_90_layernorm
which is only True
for blenderbot-90M.
At the very beginning of the decoder after we compute token embeddings, blenderbot-90 calls its layernorm_embedding
.
The other models wait until the positional embeddings are added to the token embeddings. Here is the real code:
x = self.embed_tokens(input_ids) * self.embed_scale
if self.do_blenderbot_90_layernorm:
x = self.layernorm_embedding(x) + positions
else:
x = self.layernorm_embedding(x + positions)
self.layernorm_embedding
is an Identity operation, not a layernorm for 3/6 models: where config.normalize_embedding=False
in the table above.
This is implemented by the following conditional
self.layernorm_embedding = LayerNorm(embed_dim) if config.normalize_embedding else nn.Identity()
mBart
, pegasus
and blenderbot-3B
both have an extra layernorm that they apply as the last operation of the encoder and decoder. The other 3 checkpoints don't.
In modeling_bart.py, this difference is implemented by the following conditional:
self.layer_norm = LayerNorm(config.d_model) if config.add_final_layer_norm else nn.Identity()