Released under the open-source MIT License.
By Paul-Emanuel SOTIR <paul-emanuel@outlook.com>
- Fully connected networks = dense neural networks (FC DNN). Warning: not to be confused with FCNs (Fully Convolutional Networks), nor with "Densely Connected Convolutional Networks" (DenseNet), which are only dense in terms of residual/skip links between layer blocks.
- CNN: Convolutional Neural Networks (mainly pioneered by Yann LeCun)
- À trous convolutions = dilated convolutions (popularized by DeepLab); not to be confused with up-convolutions (transposed / upsampling convolutions)
- U-Net = encoder-decoder: could be interpreted as a specialization of residual/skip links in a fully convolutional network (sometimes combined with à trous convolutions)
- DCN: Deformable Convolution Networks: inferred offsets applied to the next convolution layer's kernel sampling positions (offsets are inferred by a dedicated layer from the previous convolution's feature maps)
- Recurrent architectures (RNN)
- Gated RNNs
- LSTM
- RCNN
- AutoEncoders (often used as generative models)
- Variational AutoEncoders (VAE)
- Adversarial architectures
- Generative adversarial networks (often used as generative models)
- Ensembling, stacking and siamese networks
- Attention mechanisms (see the attention sketch below)
- Transformer architecture pattern built around attention mechanisms: from the "Attention Is All You Need" paper, a highly cited Google paper (Dec. 2017); it steered the research community toward attention-based DL architectures, especially in NLP. NOTE for NLP tasks: OpenAI's GPT-1 (2018) and GPT-2 (2019) models, based on large Transformer architectures, significantly improved the SOTA on various NLP tasks: see the GPT-1 and GPT-2 papers. OpenAI released a GPT-2 implementation on GitHub and later released a dataset of GPT-2 outputs along with larger trained models; larger model weights and code were released progressively (see OpenAI's release policy blog post). BERT ("Pre-training of Deep Bidirectional Transformers for Language Understanding", Google AI Language, late 2018 / NAACL 2019) outperformed GPT on various language-understanding tasks. More recently in 2019, Google Brain and Carnegie Mellon University developed the XLNet Transformer model, outperforming BERT ("XLNet: Generalized Autoregressive Pretraining for Language Understanding"). What a nice year for NLP SOTA improvements, what a bad year for lightweight models and low training costs! See also Hugging Face's GitHub repo for various ready-to-use Transformer models.
- ~"Attention-based external memory (DL analog to RAM)" (TODO: fix title of this technique)
- Batch-normalization
- Dropout: could be seen as an ensembling of sampled sub-network architectures (see the sketch below)
- binary or real-valued dropout
- regular dropout or layer-wise / filter-wise / residual block dropout
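A minimal NumPy sketch of binary "inverted" dropout, illustrating the ensemble-of-sub-networks view: each training forward pass samples a random mask over units, while inference uses the full network (function and parameter names are illustrative assumptions):

```python
import numpy as np

def dropout(x, p_drop=0.5, training=True, rng=np.random.default_rng()):
    """Inverted dropout: zero units with probability p_drop and rescale survivors by
    1/(1 - p_drop), so expected activations match between training and inference."""
    if not training or p_drop == 0.0:
        return x  # inference: use the full ("ensemble-averaged") network
    mask = rng.random(x.shape) >= p_drop   # binary mask sampling a random sub-network
    return x * mask / (1.0 - p_drop)

activations = np.ones((2, 5))
print(dropout(activations, p_drop=0.5))                   # random sub-network at train time
print(dropout(activations, p_drop=0.5, training=False))   # unchanged at inference
```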
- Residual links: add "shortcut" links between layers (often applied to convolutional neural networks); variants: concatenated / additive, not-gated / gated (= weighted), densely connected (see the sketch below)
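A minimal NumPy sketch contrasting additive (not-gated), gated (weighted, highway-style) and densely connected (concatenated) residual links; the toy layer and gating parameters are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def additive_residual(x, layer):
    """Plain (not-gated) additive residual link: y = F(x) + x."""
    return layer(x) + x

def gated_residual(x, layer, Wg, bg=0.0):
    """Gated (weighted) residual link, highway-style: y = g * F(x) + (1 - g) * x."""
    g = sigmoid(x @ Wg + bg)          # per-unit gate inferred from the input
    return g * layer(x) + (1.0 - g) * x

def dense_link(x, layer):
    """DenseNet-style link: concatenate the input with the new features."""
    return np.concatenate([x, layer(x)], axis=-1)

# Toy usage with a random linear "layer"
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 8)) * 0.1
layer = lambda h: np.tanh(h @ W)
x = rng.normal(size=(4, 8))
print(additive_residual(x, layer).shape,                         # (4, 8)
      gated_residual(x, layer, rng.normal(size=(8, 8))).shape,   # (4, 8)
      dense_link(x, layer).shape)                                # (4, 16)
```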
- Padding
- Pooling (see the sketch below)
- avg / max pooling
- global average pooling
- Pyramid pooling module (see Pyramid Scene Parsing Network)
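A minimal NumPy sketch of max, average and global average pooling over a single-channel 2D feature map with non-overlapping windows (purely illustrative):

```python
import numpy as np

def pool2d(x, size=2, mode="max"):
    """Non-overlapping 2D pooling on an (H, W) feature map (H and W divisible by `size`)."""
    h, w = x.shape
    windows = x.reshape(h // size, size, w // size, size)
    return windows.max(axis=(1, 3)) if mode == "max" else windows.mean(axis=(1, 3))

def global_average_pool(x):
    """Collapse the whole (H, W) feature map to a single scalar (used before classifiers)."""
    return x.mean(axis=(-2, -1))

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(pool2d(fmap, 2, "max"))     # 2x2 map of window maxima
print(pool2d(fmap, 2, "avg"))     # 2x2 map of window means
print(global_average_pool(fmap))  # single scalar per feature map
```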
- Auxiliary losses: could be interpreted as 'special' residual links from intermediate network layers directly to the output/loss
- Model size reduction
- distillation = teacher-student methods
- compression
- quantization (binary / 16-bit / 8-bit / etc.); see the sketch below
- at inference time
- at training time (more difficult, but can also allow faster training)
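A minimal NumPy sketch of post-training (inference-time) 8-bit affine quantization of a weight tensor; it only illustrates the quantize/dequantize arithmetic, real toolchains (e.g. TFLite, TensorRT, PyTorch quantization) handle calibration and quantized kernels end to end:

```python
import numpy as np

def quantize_int8(w):
    """Affine (asymmetric) post-training quantization of a float tensor to int8."""
    qmin, qmax = -128, 127
    scale = (w.max() - w.min()) / (qmax - qmin)      # float step represented by 1 int level
    zero_point = np.round(qmin - w.min() / scale)    # int level that maps back to 0.0
    q = np.clip(np.round(w / scale + zero_point), qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    return (q.astype(np.float32) - zero_point) * scale

w = np.random.default_rng(0).normal(scale=0.1, size=(4, 4)).astype(np.float32)
q, s, z = quantize_int8(w)
print("max abs quantization error:", np.abs(w - dequantize(q, s, z)).max())
```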
- learning rate
- cosine / exponential decay (see the schedule sketch below)
- cycles
- one-cycle policy (used in various 2019 SOTA papers in combination with AdamW)
- warm restart
- multiple learning rates at once (per part of the network: residual-block-wise, layer-wise, etc.)
- scheduling: a sequence of constant learning rates (step schedule)
- See also adaptive optimization algorithms (e.g. AdaGrad), which complement rather than replace the learning rate techniques above
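A minimal sketch of two of the schedules above: exponential decay and cosine annealing with warm restarts (SGDR-style); all constants are illustrative assumptions:

```python
import math

def exp_decay(step, lr0=0.1, gamma=0.999):
    """Exponential decay: lr = lr0 * gamma^step."""
    return lr0 * gamma ** step

def cosine_warm_restarts(step, lr_max=0.1, lr_min=1e-4, cycle_len=1000):
    """Cosine annealing from lr_max down to lr_min, restarting every `cycle_len` steps."""
    t = step % cycle_len  # position within the current cycle; restart resets lr to lr_max
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * t / cycle_len))

for step in (0, 500, 999, 1000):
    print(step, round(exp_decay(step), 5), round(cosine_warm_restarts(step), 5))
```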
- loss-related techniques
- L1 / L2 regularization = Lasso / Ridge regularization = weight penalty (the L2 case corresponds to weight decay); see the sketch below
- ...
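A minimal NumPy sketch of L1/L2 penalties added to a loss, plus the equivalent "weight decay" form of the L2 gradient step (names and hyper-parameters are illustrative assumptions):

```python
import numpy as np

def regularized_loss(data_loss, weights, l1=0.0, l2=0.0):
    """loss = data_loss + l1 * sum|w| + (l2 / 2) * sum(w^2)  (Lasso / Ridge penalties)."""
    return data_loss + l1 * np.abs(weights).sum() + 0.5 * l2 * (weights ** 2).sum()

def sgd_step_with_weight_decay(w, grad, lr=0.01, l2=1e-4):
    """For the L2 penalty, the extra gradient term is l2 * w, i.e. plain "weight decay":
    w <- w - lr * (grad + l2 * w) = (1 - lr * l2) * w - lr * grad."""
    return w - lr * (grad + l2 * w)

w = np.array([1.0, -2.0, 3.0])
print(regularized_loss(0.5, w, l1=0.01, l2=0.01))
print(sgd_step_with_weight_decay(w, grad=np.array([0.1, 0.0, -0.2])))
```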
- Optimizers: Stochastic Gradient Descent (SGD) algorithms
- Momentum
- constant
- momentum scheduling or decaying
- adaptive: see the techniques below (e.g. Adam)
- RMSprop
- Adam = RMSprop + Momentum (see the sketch below)
- AdamW: Adam with decoupled weight decay (used in various 2019 SOTA papers in combination with the one-cycle learning rate policy)
- AdaGrad: adapts the learning rate for each parameter dimension of w
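A minimal NumPy sketch of a single Adam update, making the "RMSprop + momentum" view explicit: the first moment is a momentum-like running mean of gradients, the second moment an RMSprop-like running mean of squared gradients; hyper-parameters follow the usual defaults:

```python
import numpy as np

def adam_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update. `state` holds (m, v, t): 1st/2nd moment estimates and step count."""
    m, v, t = state
    t += 1
    m = beta1 * m + (1 - beta1) * grad           # momentum-like running mean of gradients
    v = beta2 * v + (1 - beta2) * grad ** 2      # RMSprop-like running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                 # bias correction (moments start at zero)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-dimension adaptive step size
    return w, (m, v, t)

w = np.array([1.0, -1.0])
state = (np.zeros_like(w), np.zeros_like(w), 0)
for _ in range(3):
    grad = 2 * w                                 # gradient of the toy loss ||w||^2
    w, state = adam_step(w, grad, state)
print(w)
```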
- Natural gradient descent
- Second-order and other optimizers (TODO: refactor this part for better understanding of these classes of algorithms)
- Newton optimization (too expensive for regular neural nets)
- Hessian approximation techniques (TODO: refactor this part for better understanding of these classes of algorithms)
- Conjugate gradient
- Hessian-free optimization
- Conjugate gradient = the conjugate of the Jacobian allows approximating the Hessian if ??? (TODO: fix this / recall the reasoning behind this)
- Pretraining and weight initialization
- Fine-tuning of pre-trained models (see the sketch below)
- Greedy layer-wise pretraining
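A minimal PyTorch-style sketch of fine-tuning a pre-trained model: freeze the pre-trained backbone, swap in a new classification head and train only that head (layers can be progressively unfrozen afterwards). The backbone and layer names assume torchvision's ResNet-18:

```python
import torch
import torch.nn as nn
from torchvision import models

# Load an ImageNet-pretrained backbone (assumption: torchvision's resnet18)
model = models.resnet18(pretrained=True)

# Freeze all pre-trained weights
for param in model.parameters():
    param.requires_grad = False

# Replace the final fully connected layer with a new head for our task (e.g. 10 classes)
model.fc = nn.Linear(model.fc.in_features, 10)  # new head: requires_grad=True by default

# Only the new head's parameters are passed to the optimizer
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()

# One illustrative training step on dummy data
x, y = torch.randn(4, 3, 224, 224), torch.randint(0, 10, (4,))
optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()
```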
- Data augmentation techniques
- Active learning and boosting methods
- Convolution filter visualization
- Deep dream and its consequences..
- Deep-art _^o^_/
- Uncertainty estimation
- E.g. variance across multiple output inferences sampled from neural models interpreted as stacked Bayesian networks (see the MC-dropout sketch below)
- ...
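A minimal PyTorch-style sketch of uncertainty estimation by Monte-Carlo dropout: keep dropout active at inference time, run several stochastic forward passes and use the spread of the sampled outputs as an uncertainty estimate; the toy regressor and sample count are illustrative assumptions:

```python
import torch
import torch.nn as nn

# Toy regressor with dropout layers (illustrative architecture)
model = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Dropout(p=0.2),
                      nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
                      nn.Linear(64, 1))

def mc_dropout_predict(model, x, n_samples=30):
    """Run `n_samples` stochastic forward passes with dropout enabled and return
    the predictive mean and standard deviation (uncertainty estimate)."""
    model.train()  # keep dropout active at inference time (MC dropout)
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.std(dim=0)

x = torch.randn(5, 8)
mean, std = mc_dropout_predict(model, x)
print(mean.squeeze(), std.squeeze())  # higher std = less confident prediction
```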
- Images
- Generation
- Processing / segmentation / denoising / ...
- Classification / detection / ...
- Tabular data
- NLP
- ... you may also be interested in section 4 "Generalization vs Memorization" of OpenAI's paper Language Models are Unsupervised Multitask Learners, which investigates overlaps between WebText and common NLP datasets' training sets using 8-gram Bloom filters. (This paper also gives some insight into how language models learn: "We demonstrate that language models begin to learn these tasks without any explicit supervision when trained on (...) WebText".)
- Time series / sound / low-dimensional frequency-domain data
- Other structured data: graphs / trees / etc.
- Inference on heterogeneous data
- Data embedding techniques
- Character-level embedding
- Word embedding
- Sentence embedding: see this repo for fine-tuned SOTA pre-trained models based on the BERT architecture: UKPLab sentence-transformers GitHub; see also "A curated list of pretrained sentence and word embedding models": awesome-sentence-embedding GitHub (a minimal usage sketch is given below)
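A minimal usage sketch of the UKPLab sentence-transformers library mentioned above; the model name is an assumption from that era, check the repo for currently recommended pre-trained models:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Assumed model name; see the sentence-transformers repo for up-to-date pre-trained models
model = SentenceTransformer("bert-base-nli-mean-tokens")

sentences = ["Deep learning is fun.", "Neural networks are enjoyable.", "The weather is nice."]
embeddings = model.encode(sentences)  # one fixed-size vector per sentence

def cos_sim(a, b):
    """Cosine similarity between two sentence embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cos_sim(embeddings[0], embeddings[1]))  # semantically close sentences -> higher similarity
print(cos_sim(embeddings[0], embeddings[2]))
```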
- Missing data and errors
- high-dimensional sparse data (e.g. recommendation systems, consumer churn rates, telemetry, sparse Boolean matrices)
- Unsupervised and weakly supervised learning (see also fine-tuning of pre-trained models in the "training techniques" section)
- self-supervised learning
- zero-shot / one-shot / few-shot learning
- meta-learning
- fine-tuning of pretrained models
- Deep reinforcement learning
- MCTS with CNNs as value and policy functions
- Online training