NNs are just computational graphs.
Each node is responsible for computing its own forward pass and backward pass.
The graph object is just a thin abstraction over passing data between nodes.
In principle everything follows from the chain rule and Jacobian matrices, but in practice it is crucial to optimize and exploit sparsity.
We arrange neurons in layers because of the computational gains.
Bigger is almost always better; fight overfitting with regularization rather than by shrinking the network.
Think of it as a kernel trick: differentiable space bending.
Different kinds of non-linearities:
- tanh
- sigm(z) = 1/(1 + e^-z), d sigm/dz (z) = (1 - sigm(z)) * sigm(z)
- relu(z) = max(0, z), d relu/dz (z) = {1 if z > 0 else 0}
- elu, leaky relu
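A minimal sketch of the two main non-linearities above and their derivatives (function names are mine):

```python
import numpy as np

def sigmoid(z):
    # squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return (1.0 - s) * s          # (1 - sigm(z)) * sigm(z)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)  # 1 if z > 0 else 0
```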
People used sigm for historical reasons: it's squashing and represents a firing rate.
Problems of sigm:
- saturating neurons problem
- outputs are not zero-centered - a problem, because if all inputs to a neuron are positive, all its weight gradients have the same sign - slower, zig-zagging convergence
- exp is expensive
tanh, proposed by LeCun:
- is zero centered
relu:
- doesn't saturate (in the positive region)
- computationally cheap
- not zero-centered output
- kills the gradient - doesn't update its ancestors if it isn't activated
- not really differentiable at 0, but that doesn't matter in practice
- dead neurons - outside the data cloud - never activated, never trained
- initialize biases with small positive values (e.g. 0.01) to help
- converges ~6x faster than sigm and tanh
leaky relu:
- max(0.01x, x)
- neurons don't die
parametric rectifier:
- max(ax, x) - where a can be learned by backprop
elu:
- x if x > 0 else a*(exp(x) - 1)
- closer to zero-mean
- exp is expensive
maxout:
- max(W_1x + b_1 , W_2x + b_2)
- doubles the parameter count
- doesn't saturate
- linear in nature, computationally easy
In normal networks, use relu, in LSTMs use sigmoid (TODO)
Basic preprocessing
- Mean/std normalization, or min/max scaling
- PCA whitening - decorrelates the data, sets variances to 1
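A sketch of both preprocessing steps, assuming X is an N x D data matrix (function names are mine):

```python
import numpy as np

def standardize(X):
    # zero-center each feature, then scale to unit std
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def pca_whiten(X, eps=1e-5):
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]       # D x D covariance
    U, S, _ = np.linalg.svd(cov)       # eigenbasis of the covariance
    Xrot = Xc @ U                      # rotate = decorrelate
    return Xrot / np.sqrt(S + eps)     # unit variance per component
```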
Weight initialization
- forward pass diminishes (activations shrink) if weights are too small
- gradients diminish (activations saturate) if weights are too large
- Xavier initialization - doesn't work with relu
- He variant of Xavier initialization - relu halves the activation variance, so compensate with an extra factor of 2, and it works
- data-driven approaches
Batch normalization
- layer of BN before non-linearity, after fully-connected
- norm(x^k) = (x^k - E[x^k]) / sqrt(Var(x^k)) for every feature x^k, over every mini-batch
- works as regularization: adds noise, because it ties together the images in a batch
- at test time, we use the running mean/variance remembered from training
- parametrized batch-normalization (y' = w*norm(x) + b), w and b - parameters learned by backprop
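A sketch of the training-time forward pass; gamma and beta are the learned scale/shift (the w and b above, names assumed):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # learned scale and shift
```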
Sanity checks:
- at the implementation step - numerical gradient vs analytic (gradient check)
- check if initialization is correct - we should expect normal distribution over outputs, so we can calculate it by hand
- try to overfit small dataset, to check if backprop is working
- see update_scale / weight_scale - should be ~0.001
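The gradient check from the first bullet, sketched with a centered difference (function name is mine):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    # centered finite difference, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fp = f(x)
        x.flat[i] = old - h; fm = f(x)
        x.flat[i] = old                  # restore
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# example: f(x) = sum(x^2), whose analytic gradient is 2x
x = np.array([1.0, -2.0, 3.0])
num = numerical_grad(lambda v: np.sum(v ** 2), x)
```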
Hyperparameter search
- sample the learning rate from log space
- grid search is worse than random search, because when one parameter matters more than another, random search tries many more distinct values of the important one
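Sampling from log space in one line, as a sketch (function name and range are mine):

```python
import numpy as np

def sample_lr(low=1e-5, high=1e-1):
    # uniform in log10 space, so 1e-5..1e-4 is as likely as 1e-2..1e-1
    return 10 ** np.random.uniform(np.log10(low), np.log10(high))
```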
First order optimization methods:
- Basic gradient descent
- Momentum - exponentially weighted average of gradients
- Nesterov momentum - computing the gradient one step ahead (there exists a change-of-variables trick so you don't have to keep two gradients)
- Adagrad - keeping a cache - the sum of squares of gradients - then scaling each parameter's update according to its own cache
- RMSProp - the cache leaks exponentially (a decaying average instead of a sum)
- Adam - RMSProp + Momentum
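Sketch of the momentum and Adam update rules above for a single parameter (function names and hyperparameter defaults are mine):

```python
import numpy as np

def momentum_step(x, dx, v, lr=1e-2, mu=0.9):
    v = mu * v - lr * dx                  # exponentially weighted average of gradients
    return x + v, v

def adam_step(x, dx, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * dx            # momentum part
    v = b2 * v + (1 - b2) * dx ** 2       # RMSProp part: leaky cache of squares
    mhat = m / (1 - b1 ** t)              # bias correction for early steps
    vhat = v / (1 - b2 ** t)
    return x - lr * mhat / (np.sqrt(vhat) + eps), m, v
```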
Second order optimization methods:
- Taylor series expansion - requires storing the matrix of second-order derivatives (the Hessian) and then inverting it
- BFGS - doesn't require inverting the matrix, but still has to store it
- L-BFGS - doesn't store or invert the matrix; works really well in non-noisy environments (full batches) and when there is plenty of memory
Ensembles almost always add something. Just training more networks on the same data can help. Constructing additional models from checkpointed weights can also do the trick.
Dropout, explanations of why it works:
- Decorrelates neurons
- Forces redundant representations
- An ensemble of 2^n models that share weights, each trained on one mini-batch
- There is a problem of differing expected values of outputs between train and test:
- During training, we can compensate by multiplying by 1/(keep probability) - inverted dropout, or
- During test time, we can compensate by multiplying by the keep probability
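The inverted-dropout option sketched out, with keep probability p (names are mine); scaling by 1/p at training time means test time needs no change:

```python
import numpy as np

def dropout_train(x, p=0.5):
    # keep each unit with probability p, and rescale so E[output] = x
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

def dropout_test(x):
    return x  # nothing to do: expectations already match
```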
Residual nets:
- Lots of layers, but operating on downsampled feature maps
- Layers add their outputs into the stream; identity connections bypass the normal layers
- Makes it possible to train much deeper networks
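The key idea as a sketch with fully-connected layers standing in for the conv layers of a real resnet (names W1, W2 are mine):

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): the "normal" layers the block has to learn
    h = np.maximum(0.0, x @ W1)
    # identity bypass: output = x + F(x), so the block only learns a correction
    return x + h @ W2
```

With zero weights the block is exactly the identity, which is why very deep stacks of such blocks remain trainable.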
Practical notes:
- No need for dropout when using batch norm
- These days it's just [conv - batch norm - max pool] * N
- No need for FC layers at the end; average pooling works about as well
- Extracting features from a convnet, e.g. from the FC7 layer
- 3.6 error took 2 weeks on 8 GPUs
Detection is easy: just attach a second network.
There are complicated ways to do multi-object localization, but it seems the best & simplest to use is YOLO: pjreddie.com/darknet/yolo/