Created April 2, 2018 00:42
Transfer Learning Papers

How transferable are features in deep neural networks? - Yosinski et al.

  • Transfer Learning

    • Train a base network on a base dataset, then reuse and adjust its learned features for a target network trained on a new dataset.
    • Notes from CS231N.
  • Tries to figure out how much information we can transfer between networks trained on different datasets.

  • Quantifies the transferability by layer.

  • Hypothesis:

    • The first few layers are general (Gabor-filter-like features) and transfer well.
    • The last few layers are specific to the dataset.


  • Two disjoint datasets, baseA and baseB, are created from ImageNet (500 classes each).
  • Selffer Network: Train a network B on baseB with 8 layers. Copy its first n layers and freeze them. Randomly initialize the remaining layers, and train them on baseB again. This is a BnB network (an AnA network is the same, except everything is trained on baseA).
  • Transfer Network: Train a network A on baseA with 8 layers. Copy and freeze its first n layers, randomly initialize the remaining layers, and train them on baseB. This is an AnB network.
  • BnB+: Same as BnB, except that none of the layers are frozen.
  • AnB+: Same as AnB, except that none of the layers are frozen.
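The setup above can be sketched as plain data. This is a minimal illustration only, not the authors' code: the function names (`split_classes`, `layer_plan`, `variant_name`) and the dictionary layout are assumptions made for this example, and it only describes which layers are copied, frozen, or retrained rather than training anything.

```python
import random

NUM_LAYERS = 8  # the notes describe 8-layer networks


def split_classes(class_ids, seed=0):
    """Randomly split class ids into two disjoint halves (for baseA and baseB)."""
    ids = list(class_ids)
    random.Random(seed).shuffle(ids)
    half = len(ids) // 2
    return sorted(ids[:half]), sorted(ids[half:])


def layer_plan(source, n, fine_tune=False, num_layers=NUM_LAYERS):
    """Describe each layer of a network that is trained on baseB.

    source: "A" gives AnB (transfer), "B" gives BnB (selffer).
    n: number of leading layers copied from the base network.
    fine_tune: True gives the "+" variants, where copied layers keep training.
    """
    if source not in ("A", "B"):
        raise ValueError("source must be 'A' or 'B'")
    if not 1 <= n < num_layers:
        raise ValueError("n must be between 1 and num_layers - 1")
    plan = []
    for layer in range(1, num_layers + 1):
        if layer <= n:
            # Copied layer: frozen unless this is a "+" variant.
            plan.append({"layer": layer, "init": "base" + source, "trainable": fine_tune})
        else:
            # Remaining layers always start from random weights and are trained.
            plan.append({"layer": layer, "init": "random", "trainable": True})
    return plan


def variant_name(source, n, fine_tune=False):
    """Build a label such as 'A3B' or 'B5B+' (hypothetical helper)."""
    return f"{source}{n}B" + ("+" if fine_tune else "")
```

For example, `layer_plan("A", 3)` describes an A3B network: layers 1-3 are copied from the baseA network and frozen, and layers 4-8 are randomly initialized and trained on baseB.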


  • Copying the first 1-2 layers transfers well in both AnB and BnB networks; performance is similar to a regular B network.
  • In a BnB network, performance degrades when n is in the range 3-5, probably because of co-adaptation of weights between neighboring layers: the frozen layers were trained jointly with the layers above them, and gradient descent cannot relearn those relationships by retraining the upper layers alone. Performance recovers at layers 6 & 7, where only one or two layers remain to be retrained and gradient descent is able to find a good solution for the weights.
  • Performance degrades further at higher layers in an AnB network, as the later layers are specific to the base dataset (baseA), and freezing them leaves features that fit the target dataset (baseB) poorly.
  • However, AnB+ networks perform very well, and even perform better than training directly on the target dataset. This is possibly because the features learned on the base dataset help generalization even after fine-tuning. This effect is not seen in BnB+ networks, which means that this improvement is not linked to longer training times.
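One way to read the frozen-layer results is as two separate gaps at a given n: the drop from the base network to BnB (attributed above to lost co-adaptation) and the further drop from BnB to AnB (attributed to layer specificity). The helper below is a sketch of that arithmetic only; the function name and its arguments are assumptions for illustration, and no accuracy values from the paper are reproduced here.

```python
def decompose_transfer_drop(base_acc, selffer_acc, transfer_acc):
    """Split the accuracy gap between the base network and a frozen AnB network.

    base_acc: accuracy of the network trained directly on baseB.
    selffer_acc: accuracy of the frozen BnB network at some n.
    transfer_acc: accuracy of the frozen AnB network at the same n.
    """
    return {
        # Gap explained by breaking co-adapted layers apart (base vs. BnB).
        "co_adaptation": base_acc - selffer_acc,
        # Extra gap from copied features being specific to baseA (BnB vs. AnB).
        "specificity": selffer_acc - transfer_acc,
        # Overall gap between the base network and the transfer network.
        "total": base_acc - transfer_acc,
    }
```

The two components always sum to the total gap, so the split is only as meaningful as the assumption that BnB isolates the co-adaptation effect.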