Notes from Karpathy 231n 2016 course

Thoughts from CS231n 2016:


Lecture 4:

NNs are just computational graphs.

Each node is responsible for computing its own forward pass and backward pass.

The Graph object is just a thin abstraction for passing data between nodes.

In principle, everything follows from the chain rule and Jacobian matrices, but in practice it is crucial to optimize and exploit sparsity (e.g., the Jacobian of an elementwise op is diagonal).
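
A minimal sketch of the node/gate abstraction (class and method names are my own, in the spirit of the lecture's gate examples):

```python
class MultiplyGate:
    """One node of a computational graph: z = x * y."""
    def forward(self, x, y):
        # Cache the inputs; the backward pass needs them.
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # Chain rule: local gradient times the gradient flowing in from above.
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
out = gate.forward(3.0, -4.0)   # forward pass: -12.0
dx, dy = gate.backward(1.0)     # backward pass: dx = -4.0, dy = 3.0
```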

We arrange neurons in layers because of computational gains (vectorized matrix operations).

Bigger networks are almost always better; control overfitting with stronger regularization rather than by shrinking the network.

One way to think about it: a kernel-trick-like, differentiable bending of the input space.

Lecture 5:

Activations

Different kinds of non-linearities (a small numpy sketch follows this list):

  • tanh
  • sigm(z) = 1/(1 + e^(-z)), d sigm/dz = sigm(z) * (1 - sigm(z))
  • relu(z) = max(0, z), d relu/dz = 1 if z > 0 else 0
  • elu, leaky relu
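
A numpy sketch of these non-linearities and some of their derivatives (function names are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def d_sigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def d_tanh(z):
    return 1.0 - np.tanh(z) ** 2

def relu(z):
    return np.maximum(0.0, z)

def d_relu(z):
    return (z > 0).astype(float)

def leaky_relu(z, alpha=0.01):
    return np.maximum(alpha * z, z)

def elu(z, a=1.0):
    return np.where(z > 0, z, a * (np.exp(z) - 1.0))

z = np.linspace(-3, 3, 7)
print(sigmoid(z), d_sigmoid(z))   # sigmoid saturates at both ends
print(relu(z), d_relu(z))         # relu kills the gradient for z < 0
```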

People used sigmoid for historical reasons: it squashes to (0, 1) and can be interpreted as a firing rate.

Problems of sigm:

  • saturated neurons kill the gradient
  • outputs are not zero-centered - a problem, because if all inputs to a neuron are positive, the gradients on its weights all have the same sign, which slows convergence (zig-zag updates)
  • exp is expensive

tanh, proposed by LeCun:

  • is zero-centered (but still saturates)

relu:

  • doesn't saturate
  • computationally cheap
  • not zero-centered output
  • kills the gradient when not activated - the weights feeding into it get no update
  • not differentiable at 0, but this doesn't matter in practice
  • dead neurons - a neuron that falls outside the data cloud is never activated and never trained
  • initialize biases with small positive values (e.g. 0.01) to make dead neurons less likely
  • converges roughly 6x faster than sigmoid and tanh

leaky relu:

  • max(0.01x, x)
  • neurons don't die

parametric rectifier (PReLU):

  • max(ax, x) - where a can be learned by backprop

elu:

  • x if x > 0 else a*(exp(x) - 1)
  • closer to zero-mean
  • exp is expensive

maxout:

  • max(W_1x + b_1 , W_2x + b_2)
  • doubles the number of parameters
  • doesn't saturate
  • piecewise linear, computationally cheap

In normal feed-forward networks use relu; in LSTMs, sigmoids are still used for the gates (TODO)

Preprocessing

Basic preprocessing

  • subtract the mean and divide by the std, or rescale by min/max
  • PCA whitening - decorrelating the data, scaling variances to 1 (sketched below)
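
A rough numpy sketch of these preprocessing steps (X, eps and the function name are illustrative; X is assumed to be an N x D data matrix):

```python
import numpy as np

def preprocess(X, eps=1e-5):
    """X: (N, D) data matrix, one example per row."""
    X = X - X.mean(axis=0)               # zero-center each feature
    X = X / (X.std(axis=0) + eps)        # unit standard deviation per feature

    # PCA whitening: decorrelate the features and scale variances to 1.
    cov = X.T @ X / X.shape[0]           # (D, D) covariance matrix
    U, S, _ = np.linalg.svd(cov)         # eigenbasis of the covariance
    X_rot = X @ U                        # rotate into the decorrelated basis
    X_white = X_rot / np.sqrt(S + eps)   # divide by sqrt of the eigenvalues
    return X_white

X_white = preprocess(np.random.randn(100, 5))
```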

Weight initialization

  • activations diminish through the forward pass if the weights are too small
  • activations saturate (with tanh) and gradients vanish if the weights are too large
  • Xavier initialization - scale weights by 1/sqrt(fan_in); doesn't work well with relu
  • He's variant of Xavier initialization - relu halves the variance, so add a factor of 2 (scale by sqrt(2/fan_in)) and it works (see the sketch after this list)
  • data-driven approaches
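
A quick sketch of Xavier vs. He initialization for one fully-connected layer (layer sizes are illustrative):

```python
import numpy as np

fan_in, fan_out = 512, 256   # illustrative layer sizes

# Xavier initialization: keeps activation variance roughly constant (good for tanh).
W_xavier = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)

# He initialization: relu zeroes half the units, so compensate with a factor of 2.
W_he = np.random.randn(fan_in, fan_out) * np.sqrt(2.0 / fan_in)
```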

Batch normalization

  • insert a BN layer after the fully-connected (or conv) layer, before the non-linearity
  • norm(x^k) = (x^k - E[x^k]) / sqrt(Var[x^k]) for every feature x^k, computed per mini-batch
  • works as regularization - adds noise, because it ties together the examples within a batch
  • at test time, we have to use the mean/variance values remembered from training
  • parametrized batch normalization (y' = w * norm(x) + b), where w and b (usually called gamma and beta) are parameters learned by backprop - sketched below
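
A training-time batch-norm sketch (function name and eps are mine; gamma/beta are the learned scale and shift mentioned above):

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) mini-batch of activations, normalized per feature."""
    mu = x.mean(axis=0)                   # per-feature mean over the mini-batch
    var = x.var(axis=0)                   # per-feature variance over the mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    # At test time, mu and var are replaced by running averages kept during training.
    return gamma * x_hat + beta

x = np.random.randn(32, 100)
y = batchnorm_forward(x, gamma=np.ones(100), beta=np.zeros(100))
```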

Sanity checks:

  • at the implementation step - check the numerical gradient against the analytic gradient
  • check that the loss at initialization is correct - with small random weights the scores are diffuse, so the expected loss can be computed by hand (e.g. -log(1/num_classes) for softmax)
  • try to overfit a small dataset, to check that backprop is working
  • watch update_scale / weight_scale - the ratio should be ~0.001 (see the sketch after this list)
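
Two of these checks as small numpy sketches (the function names and the loss callable f are placeholders I made up):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-5):
    """Centered-difference gradient of a scalar function f at x, for gradient checking."""
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=['multi_index'])
    while not it.finished:
        i = it.multi_index
        old = x[i]
        x[i] = old + h; fp = f(x)
        x[i] = old - h; fm = f(x)
        x[i] = old
        grad[i] = (fp - fm) / (2.0 * h)
        it.iternext()
    return grad

def update_to_weight_ratio(W, dW, learning_rate):
    """Ratio of update magnitude to weight magnitude; should come out around 1e-3."""
    update = -learning_rate * dW
    return np.linalg.norm(update) / np.linalg.norm(W)
```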

Hyperparameter search

  • sample the learning rate (and regularization strength) from log space, as sketched below
  • random search beats grid search, because one parameter usually matters more than another, and random sampling covers more distinct values of the important one
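
Sampling in log space looks roughly like this (the ranges are illustrative and train_and_evaluate is a hypothetical call):

```python
import numpy as np

results = []
for _ in range(100):
    # Sample the exponent uniformly, so values are spread evenly in log space.
    lr = 10 ** np.random.uniform(-6, -3)   # learning rate
    reg = 10 ** np.random.uniform(-5, 5)   # regularization strength
    # val_acc = train_and_evaluate(lr, reg)  # hypothetical training call
    # results.append((val_acc, lr, reg))
```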

Lecture 6

Optimization methods

First order optimization methods:

  • Basic gradient descent
  • Momentum - exponentially weighted average of gradients
  • Nesterov momentum - computing the gradient one step ahead, at the look-ahead point (there exists a change-of-variables trick so you don't have to keep two sets of parameters)
  • Adagrad - keep a cache - the sum of squared gradients - and scale each parameter's update by its own cache
  • RMSProp - the cache leaks exponentially (a decaying Adagrad)
  • Adam - RMSProp + Momentum (all of these update rules are sketched after this list)
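
The update rules side by side as numpy sketches (x is a parameter array, dx its gradient; the hyperparameter defaults are the commonly used ones, not taken verbatim from the slides):

```python
import numpy as np

def sgd(x, dx, lr):
    return x - lr * dx

def momentum(x, dx, v, lr, mu=0.9):
    # v is an exponentially weighted average of past gradients.
    v = mu * v - lr * dx
    return x + v, v

def adagrad(x, dx, cache, lr, eps=1e-8):
    # Per-parameter scaling by the running sum of squared gradients.
    cache = cache + dx ** 2
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def rmsprop(x, dx, cache, lr, decay=0.99, eps=1e-8):
    # Same idea as Adagrad, but the cache leaks exponentially.
    cache = decay * cache + (1 - decay) * dx ** 2
    return x - lr * dx / (np.sqrt(cache) + eps), cache

def adam(x, dx, m, v, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    # RMSProp-style cache plus a momentum-style smoothed gradient
    # (the bias-correction terms of full Adam are omitted in this sketch).
    m = beta1 * m + (1 - beta1) * dx
    v = beta2 * v + (1 - beta2) * dx ** 2
    return x - lr * m / (np.sqrt(v) + eps), m, v
```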

Second order optimization methods:

  • second-order Taylor expansion (Newton's method) - requires storing the matrix of second-order derivatives (the Hessian) and then inverting it
  • BFGS - doesn't require inverting the matrix, but still has to store it
  • L-BFGS - doesn't store or invert the full matrix; works really well in non-noisy environments (full batches) and when the full batch fits in memory

Ensembles

Ensembles almost always add something. Just training more networks on the same data can help. Constructing additional models from checkpoint weights of a single run can also do the trick.

Dropout

Explanations:

  • Decorrelates neurons
  • Forces redundant representations
  • Ensemble of 2^n models that share weights, each trained on one mini-batch
  • There is a problem of different expected output values at train vs. test time; with keep probability p we can either:
  • during training, compensate by multiplying by 1/p - inverted dropout (sketched below), or
  • at test time, compensate by multiplying by p
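
An inverted-dropout sketch for one layer (the function name is mine; p is the keep probability):

```python
import numpy as np

p = 0.5  # probability of KEEPING a unit active

def dropout_forward(x, train=True):
    if train:
        # Inverted dropout: drop units and rescale by 1/p during training...
        mask = (np.random.rand(*x.shape) < p) / p
        return x * mask
    # ...so the test-time forward pass needs no extra scaling.
    return x

h = dropout_forward(np.random.randn(4, 10))
```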

Lecture 7

Residual nets:

  • Lots of layers, but operating on a downsampled (shrunken) image
  • Layers add their outputs into the stream; identity connections bypass the normal layers
  • This makes very deep networks trainable (a minimal sketch of a residual block follows)
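
A minimal residual-block sketch in pure numpy, ignoring convolutions (F stands for whatever the stacked layers compute; all names are illustrative):

```python
import numpy as np

def residual_block(x, F):
    """x: input activations; F: the stacked layers, as a function of x.
    The identity path lets gradients bypass F, which helps train very deep nets."""
    return F(x) + x

# Toy example with a made-up two-layer relu transform:
W1 = np.random.randn(64, 64) * 0.01
W2 = np.random.randn(64, 64) * 0.01
F = lambda x: np.maximum(0.0, x @ W1) @ W2
out = residual_block(np.random.randn(8, 64), F)
```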

Practical notes:

  • No need for dropout when using batch norm
  • These days it's just (conv - batch norm - max pool) * N
  • No need for an FC layer at the end - average pooling works about as well
  • Features can be extracted from a convnet (e.g. the FC7 layer) and reused
  • 3.6% (top-5 ImageNet) error took 2 weeks of training on 8 GPUs

Lecture 8

Localization is easy: just attach a second (box-regression) network/head.

There are complicated ways to do multi-object localization, but the best & simplest to use seems to be YOLO: pjreddie.com/darknet/yolo/

Lecture 9
