A bunch of random tricks to try and create the ultimate transformer!
ARCHITECTURE
- ADMIN Initialisation
- {{[[TODO]]}} Deeper encoder, shallower decoder
- {{[[TODO]]}} Mish (see the activation sketch after this list)
- DONE? {{[[TODO]]}} Test impact of embedding tying (would need shared vocab)
- {{[[TODO]]}} Use [[PreLayerNorm]]
- Try #ELU and #[[Shifted RELU]]
- Try [[EDITOR]] transformer: https://jlibovicky.github.io/2020/12/12/MT-Weekly-Editor.html
- Adaptive Gradient Clipping
- Snake Activation: https://twitter.com/EdwardDixon3/status/1360211045491617792?s=20 (sketch after this list)
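
A minimal PyTorch sketch of the two activation items above (Mish per Misra 2019, Snake per Ziyin et al. 2020; the module names are my own, and recent PyTorch versions ship a built-in `nn.Mish`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Mish(nn.Module):
    # Mish (Misra, 2019): x * tanh(softplus(x)); smooth and non-monotonic.
    def forward(self, x):
        return x * torch.tanh(F.softplus(x))

class Snake(nn.Module):
    # Snake (Ziyin et al., 2020): x + sin^2(a*x) / a; adds periodicity.
    def __init__(self, a: float = 1.0):
        super().__init__()
        self.a = a

    def forward(self, x):
        return x + torch.sin(self.a * x).pow(2) / self.a
```

Either can be swapped in for the FFN activation inside each block; Snake's `a` can also be made a learnable `nn.Parameter`, as in the paper.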
Attention Variants
- {{[[TODO]]}} Funnel Transformer
- {{[[TODO]]}} PAR Transformer
- {{[[TODO]]}} Use Performer: https://arxiv.org/abs/2009.14794 (see the linear-attention sketch after this list)
- Feedback Transformer with Performer Attention?
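
For the Performer item: the core trick is a positive kernel feature map `phi` so attention factorizes as `phi(Q) (phi(K)^T V)`, linear in sequence length. A minimal non-causal sketch using the simple `elu(x) + 1` feature map of Katharopoulos et al. (2020); the actual FAVOR+ mechanism replaces it with positive orthogonal random features to approximate softmax:

```python
import torch
import torch.nn.functional as F

def linear_attention(q, k, v, eps=1e-6):
    # q, k: (B, H, L, D); v: (B, H, L, M)
    # phi(x) = elu(x) + 1 keeps features positive; Performer's FAVOR+
    # draws positive orthogonal random features here instead.
    q, k = F.elu(q) + 1, F.elu(k) + 1
    # Associativity: (phi(Q) phi(K)^T) V == phi(Q) (phi(K)^T V),
    # so cost is O(L) in sequence length rather than O(L^2).
    kv = torch.einsum('bhld,bhlm->bhdm', k, v)   # sum_l phi(k_l) v_l^T
    z = 1.0 / (torch.einsum('bhld,bhd->bhl', q, k.sum(dim=2)) + eps)
    return torch.einsum('bhld,bhdm,bhl->bhlm', q, kv, z)
```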
OPTIMIZER
- {{[[TODO]]}} AdaHessian Optimizer
- {{[[TODO]]}} Latest Ranger (with Gradient Centralisation)
- Epsilon tuning - http://zna.do/epsilon
- SAM: #[[Sharpness-Aware Minimization for Efficiently Improving Generalization]] (two-step sketch after this list)
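
SAM wraps any base optimizer (so it should compose with Ranger above) in two gradient steps per batch. A minimal sketch of the update from Foret et al. (2020); `loss_fn(model, batch)` is a hypothetical closure returning the loss:

```python
import torch

def sam_step(model, base_optimizer, loss_fn, batch, rho=0.05):
    # Step 1: ascend to the (approximate) worst point within an L2 ball
    # of radius rho around the current weights.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        grads = [p.grad for p in model.parameters() if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(p=2) for g in grads]))
        perturb = {}
        for p in model.parameters():
            if p.grad is None:
                continue
            e = p.grad * (rho / (grad_norm + 1e-12))
            p.add_(e)
            perturb[p] = e
    model.zero_grad()

    # Step 2: take the gradient at the perturbed weights, undo the
    # perturbation, then let the base optimizer do its usual update.
    loss_fn(model, batch).backward()
    with torch.no_grad():
        for p, e in perturb.items():
            p.sub_(e)
    base_optimizer.step()
    model.zero_grad()
```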
TRAINING
- {{[[TODO]]}} Exclude LayerNorm, Embeddings from weight decay! (param-group sketch after this list)
- {{[[TODO]]}} Use #LayerDrop #[[Transformers without Tears: Improving the Normalization of Self-Attention]] #[[Depth-Adaptive Transformer]] like in #M2M-100
- Try #GradAug (GradAug: A New Regularization Method for Deep Neural Networks)
- Try #AutoFreeze for freezing layers for fine-tuning
- Training objectives: BART-style, ProphetNet-style, SpanBERT-style
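
The weight-decay exclusion is usually done with two optimizer parameter groups; a minimal sketch of the common pattern (similar to what fairseq and Hugging Face do):

```python
import torch
import torch.nn as nn

def make_param_groups(model, weight_decay=0.01):
    # Biases, LayerNorm parameters and embedding tables get no weight
    # decay; everything else keeps the configured value.
    decay, no_decay = [], []
    for module in model.modules():
        for name, p in module.named_parameters(recurse=False):
            if not p.requires_grad:
                continue
            if name.endswith("bias") or isinstance(module, (nn.LayerNorm, nn.Embedding)):
                no_decay.append(p)
            else:
                decay.append(p)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]

# usage: torch.optim.AdamW(make_param_groups(model), lr=3e-4)
```

If embeddings are tied to the output projection (see the embedding-tying item under ARCHITECTURE), de-duplicate parameters by `id` before building the groups.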
GENERATION
- Try #[[Diverse Beam Search]] for generation (example call after this list)
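
If generation runs through Hugging Face `transformers`, Diverse Beam Search (Vijayakumar et al., 2016) is exposed in recent versions via `num_beam_groups` and `diversity_penalty`; the checkpoint below is just an example:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("facebook/bart-large-cnn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/bart-large-cnn")

inputs = tok("Text to summarise goes here.", return_tensors="pt")
out = model.generate(
    **inputs,
    num_beams=6,
    num_beam_groups=3,       # 6 beams split into 3 groups of 2
    diversity_penalty=0.5,   # penalise tokens already chosen by other groups
    num_return_sequences=3,  # one hypothesis per group
)
print(tok.batch_decode(out, skip_special_tokens=True))
```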
PRODUCTIONISATION
- Try [[Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT]]
- Try adding LayerNorm and/or [[QuantNoise]] in the Embeddings like in `forward_embedding` in FairSeq (QuantNoise sketch after this list)
  - https://fairseq.readthedocs.io/en/latest/_modules/fairseq/models/transformer.html#TransformerModel
- Use RNN for Decoder:
  - https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html
  - Tweet thread: https://twitter.com/srush_nlp/status/1339608126845292547
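
On QuantNoise: the idea is to fake-quantize only a random fraction of weights during training, so the network learns to tolerate quantization without biasing every forward pass. A deliberately simplified per-tensor int8 sketch with a straight-through estimator (the paper, Fan et al. 2020, applies noise block-wise and also covers product quantization; fairseq ships a `quant_noise` module wrapper):

```python
import torch

def quant_noise_int8(weight: torch.Tensor, p: float = 0.1) -> torch.Tensor:
    # During training, replace a random fraction p of weights with their
    # fake-quantized int8 values; the detach() keeps gradients flowing to
    # the full-precision weights (straight-through estimator).
    if p <= 0:
        return weight
    scale = weight.detach().abs().max() / 127.0 + 1e-12
    fake_q = torch.clamp((weight / scale).round(), -128, 127) * scale
    mask = (torch.rand_like(weight) < p).to(weight.dtype)
    return weight + mask * (fake_q - weight).detach()
```

Applied to e.g. each `nn.Linear` weight during training only; at export time, quantize everything.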