Notes for Curriculum Learning paper

Curriculum Learning

Introduction

  • Curriculum Learning - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
  • Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
  • Link to the paper.

Contributions of the paper

  • Explore cases that show that curriculum learning benefits machine learning.
  • Offer hypotheses about when and why this happens.
  • Explore the relation of curriculum learning to other machine learning approaches.

Experiments with convex criteria

  • Training a perceptron where some of the inputs are irrelevant (not predictive of the target class).
  • Difficulty can be defined in terms of the number of irrelevant inputs or the margin from the separating hyperplane.
  • The curriculum-trained model outperforms the no-curriculum baseline.
  • Surprisingly, when difficulty is defined in terms of the number of irrelevant inputs, the anti-curriculum strategy also outperforms the no-curriculum strategy.
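
The three orderings compared above can be sketched as follows. This is a minimal illustration, not the paper's code: it assumes difficulty is the number of irrelevant inputs left non-zero, and the helper names (`make_example`, `order_examples`, `train_perceptron`) and the sizes used are hypothetical.

```python
import random

def make_example(w, n_irrelevant, rng):
    """Return (x, y, difficulty); difficulty = number of active irrelevant inputs (assumed)."""
    relevant = [rng.uniform(0, 1) for _ in w]
    n_active = rng.randint(0, n_irrelevant)
    # Irrelevant inputs never influence the label; inactive ones are zeroed out.
    irrelevant = [rng.uniform(0, 1) if i < n_active else 0.0 for i in range(n_irrelevant)]
    y = 1 if sum(wi * xi for wi, xi in zip(w, relevant)) >= 0 else -1
    return relevant + irrelevant, y, n_active

def order_examples(examples, strategy, rng):
    """Order the training set: easiest first, hardest first, or shuffled."""
    if strategy == "curriculum":
        return sorted(examples, key=lambda e: e[2])
    if strategy == "anti-curriculum":
        return sorted(examples, key=lambda e: e[2], reverse=True)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return shuffled

def train_perceptron(examples, n_inputs):
    """One online pass of the classic perceptron update rule."""
    weights = [0.0] * n_inputs
    for x, y, _ in examples:
        activation = sum(wi * xi for wi, xi in zip(weights, x))
        if y * activation <= 0:  # misclassified (or on the boundary): update
            weights = [wi + y * xi for wi, xi in zip(weights, x)]
    return weights

rng = random.Random(0)
true_w = [rng.gauss(0, 1) for _ in range(5)]      # hypothetical target weights
data = [make_example(true_w, 5, rng) for _ in range(200)]
curriculum = order_examples(data, "curriculum", rng)
weights = train_perceptron(curriculum, n_inputs=10)
```

Only the presentation order differs between the strategies; the training set and the update rule stay the same.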

Experiments on shape recognition with datasets having different variability in shapes

  • Standard (target) dataset - Images of rectangles, ellipses, and triangles.
  • Easy dataset - Images of squares, circles, and equilateral triangles.
  • Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the switch epoch).
  • For no-curriculum learning, the first epoch is the switch epoch.
  • As the switch epoch increases, the classification error decreases, with the best performance when the switch epoch is half the total number of epochs.
  • The paper does not report results for higher values of the switch epoch.
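
The switch-epoch schedule can be written down in a few lines. This is a sketch under one assumption: epochs are 1-indexed and training on the target dataset begins at the switch epoch itself, so a switch epoch of 1 reproduces the no-curriculum case. The function names are illustrative only.

```python
def dataset_for_epoch(epoch, switch_epoch):
    """Pick the dataset for a 1-indexed epoch: easy before the switch, target from it onwards."""
    return "easy" if epoch < switch_epoch else "target"

def schedule(total_epochs, switch_epoch):
    """List the dataset used at every epoch of a run."""
    return [dataset_for_epoch(e, switch_epoch) for e in range(1, total_epochs + 1)]

no_curriculum = schedule(total_epochs=4, switch_epoch=1)
half_way = schedule(total_epochs=4, switch_epoch=3)
```

Here `no_curriculum` uses the target dataset throughout, while `half_way` spends the first two of four epochs on the easy dataset.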

Experiments on language modeling

  • Standard dataset - All windows of 5 consecutive words from Wikipedia text in which every word is among the 20000 most frequent words.
  • Easy dataset - Only those windows in which every word is among the 5000 most frequent words in the vocabulary.
  • Each word from the vocabulary is embedded into a d-dimensional feature space using a matrix W (to be learnt).
  • The model predicts a score for the next word, given a window of words.
  • The expected value of a ranking loss function is minimized to learn W.
  • The curriculum-based model overtakes the other model soon after switching to the target vocabulary, indicating that the curriculum-based model quickly learns new words.
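
The easy and standard datasets differ only in the vocabulary cutoff, which can be illustrated with a toy token list. This is a rough sketch with hypothetical helper names. The notes do not give the form of the ranking loss, so the pairwise hinge used in `ranking_loss` is an assumption.

```python
from collections import Counter

def top_k_vocab(tokens, k):
    """Return the set of the k most frequent words."""
    return {word for word, _ in Counter(tokens).most_common(k)}

def windows_in_vocab(tokens, vocab, size=5):
    """Yield every window of `size` consecutive words whose words are all in `vocab`."""
    for start in range(len(tokens) - size + 1):
        window = tokens[start:start + size]
        if all(word in vocab for word in window):
            yield tuple(window)

def ranking_loss(score_true, score_corrupted, margin=1.0):
    """Pairwise hinge ranking loss (assumed form): the true window should outscore
    a corrupted one by at least `margin`."""
    return max(0.0, margin - score_true + score_corrupted)

# Toy corpus; the real experiment uses Wikipedia text, size 5, and cutoffs of 5000 / 20000.
tokens = ["a", "b", "a", "b", "a", "c", "a", "b"]
easy_vocab = top_k_vocab(tokens, 2)
easy_windows = list(windows_in_vocab(tokens, easy_vocab, size=3))
all_windows = list(windows_in_vocab(tokens, set(tokens), size=3))
```

Every easy window is also a standard window, so switching to the target vocabulary only adds windows that contain rarer words.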

Curriculum as a continuation method

  • Continuation methods start with a smoothed objective function and gradually move to less smoothed versions of it.
  • Useful in the case where the objective function is non-convex.
  • Consider a family of cost functions Cλ(θ) such that C0(θ) can be easily optimized and C1(θ) is the actual objective function.
  • Start with C0(θ) and increase λ, keeping θ at a local minimum of Cλ(θ).
  • The idea is to move θ towards a dominant (if not global) minimum of C1(θ).
  • Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.
  • The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
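
The weight-function view can be made concrete with a small discrete example. This is a sketch, assuming the training distribution at step λ is the target distribution P(z) reweighted by W_λ(z) and renormalised, with all weights equal to 1 at λ = 1. The difficulty scores and the hard-threshold weight function are hypothetical choices for illustration.

```python
import math

def reweight(target_probs, weights):
    """Q_lambda(z) proportional to W_lambda(z) * P(z), normalised to sum to 1."""
    unnormalised = [w * p for w, p in zip(weights, target_probs)]
    total = sum(unnormalised)
    return [u / total for u in unnormalised]

def entropy(probs):
    """Shannon entropy in nats; zero-probability entries contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)

target = [0.25, 0.25, 0.25, 0.25]   # P(z) over four examples
difficulty = [0.1, 0.4, 0.7, 1.0]   # hypothetical difficulty score per example

def threshold_weights(lam):
    """Assumed weight function: admit an example once lambda reaches its difficulty."""
    return [1.0 if d <= lam else 0.0 for d in difficulty]

stages = [reweight(target, threshold_weights(lam)) for lam in (0.25, 0.5, 1.0)]
entropies = [entropy(q) for q in stages]
```

As λ grows, more examples receive non-zero weight, so the entropy of the training distribution increases until it matches the target distribution at λ = 1.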

Advantages of Curriculum Learning

  • Faster training in the online setting, as the learner does not waste time on difficult examples before it is ready for them.
  • Guiding training towards better local minima in parameter space, specifically useful for non-convex methods.

Relation to other machine learning approaches

  • Unsupervised preprocessing - Both have a regularizing effect and lower the generalization error for the same training error.
  • Active learning - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy.
  • Boosting algorithms - Difficult examples are gradually emphasised, though the curriculum starts with a focus on easier examples and the training criteria do not change.
  • Transfer learning and Life-long learning - Initial tasks are used to guide the optimisation problem.

Criticism

  • Curriculum Learning is not well understood, making it difficult to define the curriculum.
  • In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.