Released under the open-source MIT License.
By Paul-Emanuel SOTIR <paul-emanuel@outlook.com>
- Fully connected networks = dense neural networks (FC DNN). Warning: not to be confused with FCNs (Fully Convolutional Networks), nor with "Densely Connected Convolutional Networks" (DenseNet), which are only dense in terms of residual/skip links between layer blocks.
- CNN: Convolutional Neural Networks (mainly pioneered by Yann LeCun)
- À trous convolutions = dilated convolutions (popularized by DeepLab); not to be confused with up-convolutions (transposed / upsampling convolutions)
- U-Net = encoder-decoder: could be interpreted as a specialization of residual/skip links in a fully convolutional network (sometimes combined with à trous convolutions)
- DCN: Deformable Convolution Networks: inferred offsets applied to the next convolution layer's kernel sampling positions (offsets are inferred by a dedicated layer from the previous convolution's feature maps)
- Recurrent architectures (RNN)
- Gated RNNs
- LSTM
- RCNN
- AutoEncoders (often used as generative models)
- Variational AutoEncoders (VAE)
- Adversarial architectures
- Generative adversarial networks (often used as generative models)
- Ensembling, stacking and siamese networks
- Attention mechanisms (see the attention sketch below)
- Transformer architecture pattern built around attention mechanisms: from the "Attention Is All You Need" paper, a highly cited Google paper (Dec. 2017); it steered the research community toward attention-based DL architectures, especially in NLP. NOTE for NLP tasks: OpenAI's GPT-1 (2018) and GPT-2 (2019) models, based on large Transformer architectures, significantly improved the SOTA on various NLP tasks: see the GPT-1 and GPT-2 papers. OpenAI released a GPT-2 implementation on GitHub and later released a dataset of GPT-2 outputs along with larger trained models; larger model weights and code were released progressively (see OpenAI's release policy blog post). BERT ("Pre-training of Deep Bidirectional Transformers for Language Understanding", Google AI Language, late 2018 / NAACL 2019) outperformed GPT on various language-understanding tasks. More recently in 2019, Google Brain and Carnegie Mellon University developed the XLNet Transformer model, outperforming BERT ("XLNet: Generalized Autoregressive Pretraining for Language Understanding"). What a nice year for NLP SOTA improvements, what a bad year for lightweight models and low training costs! See also Hugging Face's GitHub repo for various ready-to-use Transformer models.
- ~"Attention-based external memory (DL analog to RAM)" (TODO: fix title of this technique)
- Batch-normalization
- Dropout: could be seen as an ensembling of sampled sub-network architectures (see the sketch below)
- binary or real-valued dropout
- regular dropout or layer-wise / filter-wise / residual block dropout
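A minimal NumPy sketch of binary "inverted" dropout, illustrating the ensemble-of-sub-networks view: each training forward pass samples a random mask over units, while inference uses the full network (function and parameter names are illustrative assumptions):

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: zero units with probability p_drop and rescale survivors by
    1/(1 - p_drop), so expected activations match between training and inference."""
    if not training or p_drop == 0.0:
        return x  # inference: use the full ("ensemble-averaged") network
    mask = rng.random(x.shape) >= p_drop   # binary mask sampling a random sub-network
    return x * mask / (1.0 - p_drop)

activations = np.ones((2, 5))
print(dropout(activations, p_drop=0.5))                   # random sub-network at train time
print(dropout(activations, p_drop=0.5, training=False))   # unchanged at inference
```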
- Residual links: add "shortcut" links between layers (often applied to convolutional neural networks); variants: concatenated / additive, not-gated / gated (= weighted), densely connected (see the sketch below)
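A minimal NumPy sketch contrasting additive (not-gated), gated (weighted, highway-style) and densely connected (concatenated) residual links; the toy layer and gating parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_residual(x, layer):
    """Plain (not-gated) additive residual link: y = F(x) + x."""
    return layer(x) + x

def gated_residual(x, layer, Wg, bg=0.0):
    """Gated (weighted) residual link, highway-style: y = g * F(x) + (1 - g) * x."""
    g = sigmoid(x @ Wg + bg)          # per-unit gate inferred from the input
    return g * layer(x) + (1.0 - g) * x

def dense_link(x, layer):
    """DenseNet-style link: concatenate the input with the new features."""
    return np.concatenate([x, layer(x)], axis=-1)

# Toy usage with a random linear "layer"
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
layer = lambda h: np.tanh(h @ W)
x = rng.normal(size=(4, 8))
print(additive_residual(x, layer).shape,                         # (4, 8)
      gated_residual(x, layer, rng.normal(size=(8, 8))).shape,   # (4, 8)
      dense_link(x, layer).shape)                                # (4, 16)
```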
- Padding
- Pooling (see the sketch below)
- avg / max pooling
- global average pooling
- Pyramid pooling module (see Pyramid Scene Parsing Network)
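A minimal NumPy sketch of max, average and global average pooling over a single-channel 2D feature map with non-overlapping windows (purely illustrative):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2D pooling on an (H, W) feature map (H and W divisible by `size`)."""
    h, w = x.shape
    windows = x.reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

def global_average_pool(x):
    """Collapse the whole (H, W) feature map to a single scalar (used before classifiers)."""
    return x.mean(axis=(-2, -1))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))     # 2x2 map of window maxima
print(pool2d(fmap, 2, "avg"))     # 2x2 map of window means
print(global_average_pool(fmap))  # single scalar per feature map
```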
- Auxiliary losses: could be interpreted as 'special' residual links from intermediate network layers directly to the output/loss
- Model size reduction
- distillation = teacher-student methods
- compression
- quantization (binary / 16-bit / 8-bit / etc.); see the sketch below
- at inference time
- at training time (more difficult, but can also allow faster training)
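A minimal NumPy sketch of post-training (inference-time) 8-bit affine quantization of a weight tensor; it only illustrates the quantize/dequantize arithmetic, real toolchains (e.g. TFLite, TensorRT, PyTorch quantization) handle calibration and quantized kernels end to end:

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) post-training quantization of a float tensor to int8."""
    qmin, qmax = -128, 127
    scale = (w.max() - w.min()) / (qmax - qmin)      # float step represented by 1 int level
    zero_point = np.round(qmin - w.min() / scale)    # int level that maps back to 0.0
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).normal(scale=0.1, size=(4, 4)).astype(np.float32)
q, s, z = quantize_int8(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s, z)).max())
```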
- learning rate
- cosine / exponential decay (see the schedule sketch below)
- cycles
- one-cycle policy (used in various 2019 SOTA papers in combination with AdamW)
- warm restart
- multiple learning rates at once (per part of the network: residual-block-wise, layer-wise, etc.)
- scheduling: a sequence of constant learning rates (step schedule)
- See also adaptive optimization algorithms (e.g. AdaGrad), which complement rather than replace the learning rate techniques above
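A minimal sketch of two of the schedules above: exponential decay and cosine annealing with warm restarts (SGDR-style); all constants are illustrative assumptions:

```python
import math

def exp_decay(step, lr0=0.1, gamma=0.999):
    """Exponential decay: lr = lr0 * gamma^step."""
    return lr0 * gamma ** step

def cosine_warm_restarts(step, lr_max=0.1, lr_min=1e-4, cycle_len=1000):
    """Cosine annealing from lr_max down to lr_min, restarting every `cycle_len` steps."""
    t = step % cycle_len  # position within the current cycle; restart resets lr to lr_max
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

for step in (0, 500, 999, 1000):
    print(step, round(exp_decay(step), 5), round(cosine_warm_restarts(step), 5))
```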
- loss-related techniques
- L1 / L2 regularization = Lasso / Ridge regularization = weight penalty (the L2 case corresponds to weight decay); see the sketch below
- ...
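A minimal NumPy sketch of L1/L2 penalties added to a loss, plus the equivalent "weight decay" form of the L2 gradient step (names and hyper-parameters are illustrative assumptions):

```python
import numpy as np

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """loss = data_loss + l1 * sum|w| + (l2 / 2) * sum(w^2)  (Lasso / Ridge penalties)."""
    return data_loss + l1 * np.abs(weights).sum() + 0.5 * l2 * (weights ** 2).sum()

def sgd_step_with_weight_decay(w, grad, lr=0.01, l2=1e-4):
    """For the L2 penalty, the extra gradient term is l2 * w, i.e. plain "weight decay":
    w <- w - lr * (grad + l2 * w) = (1 - lr * l2) * w - lr * grad."""
    return w - lr * (grad + l2 * w)

w = np.array([1.0, -2.0, 3.0])
print(regularized_loss(0.5, w, l1=0.01, l2=0.01))
print(sgd_step_with_weight_decay(w, grad=np.array([0.1, 0.0, -0.2])))
```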
- Optimizers: Stochastic Gradient Descent (SGD) algorithms
- Momentum
- constant
- momentum scheduling or decaying
- adaptive: see the techniques below (e.g. Adam)
- RMSprop
- Adam = RMSprop + Momentum (see the sketch below)
- AdamW: Adam with decoupled weight decay (used in various 2019 SOTA papers in combination with the one-cycle learning rate policy)
- AdaGrad: adapts the learning rate for each parameter dimension of w
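A minimal NumPy sketch of a single Adam update, making the "RMSprop + momentum" view explicit: the first moment is a momentum-like running mean of gradients, the second moment an RMSprop-like running mean of squared gradients; hyper-parameters follow the usual defaults:

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `state` holds (m, v, t): 1st/2nd moment estimates and step count."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad           # momentum-like running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSprop-like running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-dimension adaptive step size
    return w, (m, v, t)

w = np.array([1.0, -1.0])
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(3):
    grad = 2 * w                                 # gradient of the toy loss ||w||^2
    w, state = adam_step(w, grad, state)
print(w)
```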
- Natural gradient descent
- Second-order and other optimizers (TODO: refactor this part for better understanding of these classes of algorithms)
- Newton optimization (too expensive for regular neural nets)
- Hessian approximation techniques (TODO: refactor this part for better understanding of these classes of algorithms)
- Conjugate gradient
- Hessian-free optimization
- Conjugate gradient = the conjugate of the Jacobian allows approximating the Hessian if ??? (TODO: fix this / recall the reasoning behind this)
- Pretraining and weight initialization
- Fine-tuning of pre-trained models (see the sketch below)
- Greedy layer-wise pretraining
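A minimal PyTorch-style sketch of fine-tuning a pre-trained model: freeze the pre-trained backbone, swap in a new classification head and train only that head (layers can be progressively unfrozen afterwards). The backbone and layer names assume torchvision's ResNet-18:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (assumption: torchvision's resnet18)
model = models.resnet18(pretrained=True)

# Freeze all pre-trained weights
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new head for our task (e.g. 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head: requires_grad=True by default

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data
x, y = torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```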
- Data augmentation techniques
- Active learning and boosting methods
- Convolution filter visualization
- Deep dream and its consequences..
- Deep-art _^o^_/
- Uncertainty estimation
- E.g. variance across multiple output inferences sampled from neural models interpreted as stacked Bayesian networks (see the MC-dropout sketch below)
- ...
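A minimal PyTorch-style sketch of uncertainty estimation by Monte-Carlo dropout: keep dropout active at inference time, run several stochastic forward passes and use the spread of the sampled outputs as an uncertainty estimate; the toy regressor and sample count are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy regressor with dropout layers (illustrative architecture)
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.2),
                      nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
                      nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=30):
    """Run `n_samples` stochastic forward passes with dropout enabled and return
    the predictive mean and standard deviation (uncertainty estimate)."""
    model.train()  # keep dropout active at inference time (MC dropout)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(5, 8)
mean, std = mc_dropout_predict(model, x)
print(mean.squeeze(), std.squeeze())  # higher std = less confident prediction
```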
- Images
- Generation
- Processing / segmentation / denoising / ...
- Classification / detection / ...
- Tabular data
- NLP
- ... you may also be interested in section 4 "Generalization vs Memorization" of OpenAI's paper Language Models are Unsupervised Multitask Learners, which investigates overlaps between WebText and common NLP datasets' training sets using 8-gram Bloom filters. (This paper also gives some insight into how language models learn: "We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on (...) WebText".)
- Time series / sound / low-dimensional frequency-domain data
- Other structured data: graphs / trees / etc.
- Inference on heterogeneous data
- Data embedding techniques
- Character-level embedding
- Word embedding
- Sentence embedding: see this repo for fine-tuned SOTA pre-trained models based on the BERT architecture: UKPLab sentence-transformers GitHub; see also "A curated list of pretrained sentence and word embedding models": awesome-sentence-embedding GitHub (a minimal usage sketch is given below)
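A minimal usage sketch of the UKPLab sentence-transformers library mentioned above; the model name is an assumption from that era, check the repo for currently recommended pre-trained models:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model name; see the sentence-transformers repo for up-to-date pre-trained models
model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = ["Deep learning is fun.", "Neural networks are enjoyable.", "The weather is nice."]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

def cos_sim(a, b):
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(embeddings[0], embeddings[1]))  # semantically close sentences -> higher similarity
print(cos_sim(embeddings[0], embeddings[2]))
```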
- Missing data and errors
- high-dimensional sparse data (e.g. recommendation systems, consumer churn rates, telemetry, sparse Boolean matrices)
- Unsupervised and weakly supervised learning (see also fine-tuning of pre-trained models in the "training techniques" section)
- self-supervised learning
- zero-shot / one-shot / few-shot learning
- meta-learning
- fine-tuning of pretrained models
- Deep reinforcement learning
- MCTS with CNNs as value and policy functions
- Online training