NNs are just computational graphs.
Each node is responsible for computing its own forward pass and backward pass.
The graph object is just a thin abstraction over passing data between nodes.
In principle everything follows from the chain rule and Jacobian matrices, but in practice it is crucial to optimize and exploit sparsity.
We arrange neurons in layers because of the computational gains.
Bigger is almost always better; fight overfitting with regularization rather than by shrinking the network.
Think of it as a kernel trick: differentiable space bending.
Different kinds of non-linearities:
- tanh
- sigm(z) = 1/(1 + e^-z), d sigm/dz (z) = (1 - sigm(z)) * sigm(z)
- relu(z) = max(0, z), d relu/dz (z) = {1 if z > 0 else 0}
- elu, leaky relu
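A minimal sketch of the two main non-linearities above and their derivatives (function names are mine):

```python
import numpy as np

def sigmoid(z):
    # squashes to (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return (1.0 - s) * s          # (1 - sigm(z)) * sigm(z)

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)  # 1 if z > 0 else 0
```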
People used sigm for historical reasons: it's squashing and represents a firing rate.
Problems of sigm:
- saturating neurons problem
- outputs are not zero-centered - a problem, because if all inputs to a neuron are positive, all its weight gradients have the same sign - slower, zig-zagging convergence
- exp is expensive
tanh, proposed by LeCun:
- is zero centered
relu:
- doesn't saturate (in the positive region)
- computationally cheap
- not zero-centered output
- kills the gradient - doesn't update its ancestors if it isn't activated
- not really differentiable at 0, but that doesn't matter in practice
- dead neurons - outside the data cloud - never activated, never trained
- initialize biases with small positive values (e.g. 0.01) to help
- converges ~6x faster than sigm and tanh
leaky relu:
- max(0.01x, x)
- neurons don't die
parametric rectifier:
- max(ax, x) - where a can be learned by backprop
elu:
- x if x > 0 else a*(exp(x) - 1)
- closer to zero-mean
- exp is expensive
maxout:
- max(W_1x + b_1 , W_2x + b_2)
- doubles the parameter count
- doesn't saturate
- linear in nature, computationally easy
In normal networks, use relu, in LSTMs use sigmoid (TODO)
Basic preprocessing
- Mean/std normalization, or min/max scaling
- PCA whitening - decorrelates the data, sets variances to 1
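A sketch of both preprocessing steps, assuming X is an N x D data matrix (function names are mine):

```python
import numpy as np

def standardize(X):
    # zero-center each feature, then scale to unit std
    return (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

def pca_whiten(X, eps=1e-5):
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / X.shape[0]       # D x D covariance
    U, S, _ = np.linalg.svd(cov)       # eigenbasis of the covariance
    Xrot = Xc @ U                      # rotate = decorrelate
    return Xrot / np.sqrt(S + eps)     # unit variance per component
```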
Weight initialization
- forward pass diminishes (activations shrink) if weights are too small
- gradients diminish (activations saturate) if weights are too large
- Xavier initialization - doesn't work with relu
- He variant of Xavier initialization - relu halves the activation variance, so compensate with an extra factor of 2, and it works
- data-driven approaches
Batch normalization
- layer of BN before non-linearity, after fully-connected
- norm(x^k) = (x^k - E[x^k]) / sqrt(Var(x^k)) for every feature x^k, over every mini-batch
- works as regularization: adds noise, because it ties together the images in a batch
- at test time, we use the running mean/variance remembered from training
- parametrized batch-normalization (y' = w*norm(x) + b), w and b - parameters learned by backprop
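A sketch of the training-time forward pass; gamma and beta are the learned scale/shift (the w and b above, names assumed):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=0)                    # per-feature mean over the mini-batch
    var = x.var(axis=0)                    # per-feature variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # learned scale and shift
```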
Sanity checks:
- at the implementation step - numerical gradient vs analytic (gradient check)
- check if initialization is correct - we should expect normal distribution over outputs, so we can calculate it by hand
- try to overfit small dataset, to check if backprop is working
- see update_scale / weight_scale - should be ~0.001
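The gradient check from the first bullet, sketched with a centered difference (function name is mine):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    # centered finite difference, one coordinate at a time
    grad = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + h; fp = f(x)
        x.flat[i] = old - h; fm = f(x)
        x.flat[i] = old                  # restore
        grad.flat[i] = (fp - fm) / (2 * h)
    return grad

# example: f(x) = sum(x^2), whose analytic gradient is 2x
x = np.array([1.0, -2.0, 3.0])
num = numerical_grad(lambda v: np.sum(v ** 2), x)
```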
Hyperparameter search
- sample the learning rate from log space
- grid search is worse than random search, because when one parameter matters more than another, random search tries many more distinct values of the important one
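Sampling from log space in one line, as a sketch (function name and range are mine):

```python
import numpy as np

def sample_lr(low=1e-5, high=1e-1):
    # uniform in log10 space, so 1e-5..1e-4 is as likely as 1e-2..1e-1
    return 10 ** np.random.uniform(np.log10(low), np.log10(high))
```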
First order optimization methods:
- Basic gradient descent
- Momentum - exponentially weighted average of gradients
- Nesterov momentum - computing the gradient one step ahead (there exists a change-of-variables trick so you don't have to keep two gradients)
- Adagrad - keeping a cache - the sum of squares of gradients - then scaling each parameter's update according to its own cache
- RMSProp - the cache leaks exponentially (a decaying average instead of a sum)
- Adam - RMSProp + Momentum
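Sketch of the momentum and Adam update rules above for a single parameter (function names and hyperparameter defaults are mine):

```python
import numpy as np

def momentum_step(x, dx, v, lr=1e-2, mu=0.9):
    v = mu * v - lr * dx                  # exponentially weighted average of gradients
    return x + v, v

def adam_step(x, dx, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * dx            # momentum part
    v = b2 * v + (1 - b2) * dx ** 2       # RMSProp part: leaky cache of squares
    mhat = m / (1 - b1 ** t)              # bias correction for early steps
    vhat = v / (1 - b2 ** t)
    return x - lr * mhat / (np.sqrt(vhat) + eps), m, v
```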
Second order optimization methods:
- Taylor series expansion - requires storing the matrix of second-order derivatives (the Hessian) and then inverting it
- BFGS - doesn't require inverting the matrix, but still has to store it
- L-BFGS - doesn't store or invert the matrix; works really well in non-noisy environments (full batches) and when there is plenty of memory
Ensembles almost always add something. Just training more networks on the same data can help. Constructing additional models from checkpointed weights can also do the trick.
Dropout, explanations of why it works:
- Decorrelates neurons
- Forces redundant representations
- An ensemble of 2^n models that share weights, each trained on one mini-batch
- There is a problem of differing expected values of outputs between train and test:
- During training, we can compensate by multiplying by 1/(keep probability) - inverted dropout, or
- During test time, we can compensate by multiplying by the keep probability
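The inverted-dropout option sketched out, with keep probability p (names are mine); scaling by 1/p at training time means test time needs no change:

```python
import numpy as np

def dropout_train(x, p=0.5):
    # keep each unit with probability p, and rescale so E[output] = x
    mask = (np.random.rand(*x.shape) < p) / p
    return x * mask

def dropout_test(x):
    return x  # nothing to do: expectations already match
```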
Residual nets:
- Lots of layers, but operating on downsampled feature maps
- Layers add their outputs into the stream; identity connections bypass the normal layers
- Makes it possible to train much deeper networks
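The key idea as a sketch with fully-connected layers standing in for the conv layers of a real resnet (names W1, W2 are mine):

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): the "normal" layers the block has to learn
    h = np.maximum(0.0, x @ W1)
    # identity bypass: output = x + F(x), so the block only learns a correction
    return x + h @ W2
```

With zero weights the block is exactly the identity, which is why very deep stacks of such blocks remain trainable.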
Practical notes:
- No need for dropout when using batch norm
- These days it's just [conv - batch norm - max pool] * N
- No need for FC layers at the end; average pooling works about as well
- Extracting features from a convnet, e.g. from the FC7 layer
- 3.6 error took 2 weeks on 8 GPUs
Detection is easy: just attach a second network.
There are complicated ways to do multi-object localization, but it seems the best & simplest to use is YOLO: pjreddie.com/darknet/yolo/