A bunch of random tricks to try and create the ultimate transformer!
ARCHITECTURE
- ADMIN Initialisation
- {{[[TODO]]}} Deeper encoder, shallower decoder
- {{[[TODO]]}} Mish
- DONE? {{[[TODO]]}} Test impact of embedding tying (would need a shared vocab)
- {{[[TODO]]}} Use [[PreLayerNorm]] (see the sketch after this section)
- Try #ELU and #[[Shifted ReLU]]
- Try [[EDITOR]] transformer: https://jlibovicky.github.io/2020/12/12/MT-Weekly-Editor.html
- Adaptive Gradient Clipping (AGC)
- Snake Activation: https://twitter.com/EdwardDixon3/status/1360211045491617792?s=20
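A minimal PyTorch sketch of a Pre-LayerNorm FFN sublayer with Mish, plus Snake as a drop-in alternative activation. Module names and sizes here are illustrative, not from any of the linked papers:

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation from the tweet above: x + sin^2(a*x)/a."""
    def __init__(self, a: float = 1.0):
        super().__init__()
        self.a = a

    def forward(self, x):
        return x + torch.sin(self.a * x) ** 2 / self.a

class PreLNFeedForward(nn.Module):
    """Pre-LayerNorm FFN sublayer: normalise first, then residual-add.

    Post-LN norms after the residual instead; Pre-LN tends to train
    more stably and needs less warmup.
    """
    def __init__(self, d_model=512, d_ff=2048, dropout=0.1):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.Mish(),  # PyTorch >= 1.9; older: x * torch.tanh(F.softplus(x))
            nn.Dropout(dropout),
            nn.Linear(d_ff, d_model),
            nn.Dropout(dropout),
        )

    def forward(self, x):
        return x + self.ff(self.norm(x))  # residual around the normed branch
```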
ATTENTION VARIANTS
- {{[[TODO]]}} Funnel Transformer
- {{[[TODO]]}} PAR Transformer
- {{[[TODO]]}} Use Performer: https://arxiv.org/abs/2009.14794 (sketch after this list)
- Feedback Transformer with Performer Attention?
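A rough, non-causal sketch of the Performer idea: FAVOR+ positive random features that approximate softmax attention in linear time. Function names are mine; the paper's causal variant, stabilising max-subtraction, and feature-redrawing schedule are omitted:

```python
import torch

def softmax_features(x, proj):
    """phi(x) with E[phi(q) . phi(k)] ~= exp(q.k / sqrt(d)).
    x: (batch, heads, seq, d); proj: (m, d), rows drawn from N(0, I)."""
    x = x / x.shape[-1] ** 0.25  # fold the 1/sqrt(d) softmax temperature into x
    u = x @ proj.T               # (batch, heads, seq, m)
    return torch.exp(u - x.pow(2).sum(-1, keepdim=True) / 2) / proj.shape[0] ** 0.5

def performer_attention(q, k, v, proj, eps=1e-6):
    """Linear-time attention: phi(Q) (phi(K)^T V), normalised per query."""
    qp, kp = softmax_features(q, proj), softmax_features(k, proj)
    kv = torch.einsum('bhnm,bhnd->bhmd', kp, v)              # one pass over keys
    z = 1.0 / (torch.einsum('bhnm,bhm->bhn', qp, kp.sum(dim=2)) + eps)
    return torch.einsum('bhnm,bhmd,bhn->bhnd', qp, kv, z)

b, h, n, d, m = 2, 8, 1024, 64, 256
q, k, v = (torch.randn(b, h, n, d) for _ in range(3))
out = performer_attention(q, k, v, torch.randn(m, d))        # (2, 8, 1024, 64)
```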
OPTIMIZER
- {{[[TODO]]}} AdaHessian Optimizer
- {{[[TODO]]}} Latest Ranger (with Gradient Centralisation)
- Epsilon tuning - http://zna.do/epsilon
- SAM: #[[Sharpness-Aware Minimization for Efficiently Improving Generalization]]
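SAM costs two forward/backward passes per step. A minimal sketch of the wrapper (my own condensation of the paper's procedure, not an official implementation); the usage lines assume a `model`, a batch `(x, y)` and a `loss_fn` already in scope, and the `eps` value nods to the epsilon-tuning post above:

```python
import torch

class SAM(torch.optim.Optimizer):
    """Step to the adversarial point w + e(w), take the gradient there,
    then update the original weights with the base optimizer."""
    def __init__(self, params, base_optimizer_cls, rho=0.05, **kwargs):
        super().__init__(params, dict(rho=rho, **kwargs))
        self.base_optimizer = base_optimizer_cls(self.param_groups, **kwargs)

    @torch.no_grad()
    def first_step(self):
        grads = [p.grad for g in self.param_groups
                 for p in g['params'] if p.grad is not None]
        grad_norm = torch.norm(torch.stack([g.norm(2) for g in grads]), 2)
        for group in self.param_groups:
            scale = group['rho'] / (grad_norm + 1e-12)
            for p in group['params']:
                if p.grad is None:
                    continue
                self.state[p]['e'] = p.grad * scale
                p.add_(self.state[p]['e'])   # climb to the local worst case

    @torch.no_grad()
    def second_step(self):
        for group in self.param_groups:
            for p in group['params']:
                if p.grad is None:
                    continue
                p.sub_(self.state[p]['e'])   # restore the original weights
        self.base_optimizer.step()

optimizer = SAM(model.parameters(), torch.optim.AdamW, rho=0.05, lr=3e-4, eps=1e-7)
loss_fn(model(x), y).backward()
optimizer.first_step(); optimizer.zero_grad()
loss_fn(model(x), y).backward()              # second pass, perturbed weights
optimizer.second_step(); optimizer.zero_grad()
```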
TRAINING
- {{[[TODO]]}} Exclude LayerNorm and embeddings from weight decay! (param-group sketch after this section)
- {{[[TODO]]}} Use #LayerDrop #[[Transformers without Tears: Improving the Normalization of Self-Attention]] #[[Depth-Adaptive Transformer]] like in #M2M-100
- Try #GradAug (GradAug: A New Regularization Method for Deep Neural Networks)
- Try #AutoFreeze for freezing layers for fine-tuning
- Training objectives to try: BART-style, ProphetNet-style, SpanBERT-style
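One way to do the no-decay exclusion: split parameters into two optimizer groups. The name matching below is a heuristic; adjust the substrings to the model's actual parameter names:

```python
import torch

def split_decay_groups(model, weight_decay=0.01):
    """Biases, LayerNorm gains and embeddings get weight_decay=0."""
    decay, no_decay = [], []
    for name, p in model.named_parameters():
        if not p.requires_grad:
            continue
        if p.ndim < 2 or 'norm' in name.lower() or 'embed' in name.lower():
            no_decay.append(p)   # 1-D params (biases, LN) and embedding tables
        else:
            decay.append(p)      # everything else: the matmul weights
    return [{'params': decay, 'weight_decay': weight_decay},
            {'params': no_decay, 'weight_decay': 0.0}]

optimizer = torch.optim.AdamW(split_decay_groups(model), lr=3e-4)
```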
GENERATION
- Try #[[Diverse Beam Search]] for generation
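If generation goes through Hugging Face transformers, diverse beam search is already exposed on `generate` via beam groups plus a diversity penalty (values below are illustrative; `model` and `input_ids` assumed in scope):

```python
# num_beams must be divisible by num_beam_groups; diversity_penalty > 0
# pushes the beam groups toward different continuations.
outputs = model.generate(
    input_ids,
    num_beams=6,
    num_beam_groups=3,
    diversity_penalty=1.0,
    num_return_sequences=3,
    max_length=64,
)
```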
PRODUCTIONISATION
- Try [[Q-BERT: Hessian Based Ultra Low Precision Quantization of BERT]]
- Try adding LayerNorm and/or [[QuantNoise]] in the embeddings, like in `forward_embedding` in FairSeq (sketch at the end of this section)
- https://fairseq.readthedocs.io/en/latest/_modules/fairseq/models/transformer.html#TransformerModel
- Use an RNN for the decoder:
- https://ai.googleblog.com/2020/06/recent-advances-in-google-translate.html
- Tweet thread: https://twitter.com/srush_nlp/status/1339608126845292547
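A standalone paraphrase of the `forward_embedding` pattern from the FairSeq link above: scale the token embedding, add positions, then optional LayerNorm and dropout before the first layer. FairSeq would additionally wrap the embedding with its quant_noise helper and handle padding-aware positions; both are elided here:

```python
import math
import torch
import torch.nn as nn

class EmbeddingFrontend(nn.Module):
    def __init__(self, vocab_size, d_model=512, max_len=1024,
                 dropout=0.1, layernorm_embedding=True):
        super().__init__()
        self.embed_tokens = nn.Embedding(vocab_size, d_model)
        self.embed_positions = nn.Embedding(max_len, d_model)  # learned positions
        self.embed_scale = math.sqrt(d_model)
        self.layernorm_embedding = (
            nn.LayerNorm(d_model) if layernorm_embedding else None)
        self.dropout = nn.Dropout(dropout)

    def forward(self, src_tokens):                 # (batch, seq) of token ids
        x = self.embed_scale * self.embed_tokens(src_tokens)
        positions = torch.arange(src_tokens.size(1), device=src_tokens.device)
        x = x + self.embed_positions(positions)    # broadcast over the batch
        if self.layernorm_embedding is not None:
            x = self.layernorm_embedding(x)
        return self.dropout(x)
```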