Notes for Curriculum Learning paper
Introduction
- Curriculum Learning - When training machine learning models, start with easier subtasks and gradually increase the difficulty level of the tasks.
- Motivation comes from the observation that humans and animals seem to learn better when trained with a curriculum-like strategy.
- Link to the paper.
Contributions of the paper
- Explore cases that show that curriculum learning benefits machine learning.
- Offer hypotheses about when and why it helps.
- Explore the relation of curriculum learning to other machine learning approaches.
Experiments with convex criteria
- Training a perceptron where some of the input data is irrelevant (not predictive of the target class).
- Difficulty can be defined in terms of the number of irrelevant inputs or the margin from the separating hyperplane.
- The curriculum-trained model outperforms the no-curriculum approach.
- Surprisingly, when difficulty is defined in terms of the number of irrelevant inputs, the anti-curriculum strategy also outperforms the no-curriculum strategy.
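The two notions of difficulty in the perceptron experiment above can be sketched as scoring functions used to order the training examples. This is a minimal illustration, not the paper's code: the helper names (`irrelevant_count`, `margin`, `order_examples`), the toy data, and the convention that a non-zero irrelevant input counts towards difficulty are all assumptions made for the example.

```python
def irrelevant_count(x, relevant_dims):
    """Difficulty score: number of non-zero inputs outside the relevant dimensions
    (assumed convention for this sketch)."""
    return sum(1 for i, v in enumerate(x) if i not in relevant_dims and v != 0)


def margin(x, w):
    """Unnormalised margin |w . x|; a smaller value means a harder example."""
    return abs(sum(wi * xi for wi, xi in zip(w, x)))


def order_examples(examples, difficulty, anti=False):
    """Sort easiest-first (curriculum) or hardest-first (anti-curriculum)."""
    return sorted(examples, key=difficulty, reverse=anti)


# Toy data (hypothetical): dimensions 0 and 1 are relevant, 2 and 3 are irrelevant.
relevant = {0, 1}
examples = [
    (0.9, 0.2, 0.5, 0.7),
    (0.4, 0.8, 0.0, 0.0),
    (0.1, 0.6, 0.0, 0.3),
]

curriculum = order_examples(examples, lambda x: irrelevant_count(x, relevant))
anti_curriculum = order_examples(
    examples, lambda x: irrelevant_count(x, relevant), anti=True
)

# Margin-based ordering: a larger margin is treated as easier, so negate it.
w_true = (1.0, -1.0, 0.0, 0.0)  # hypothetical separating hyperplane
by_margin = order_examples(examples, lambda x: -margin(x, w_true))
```

A no-curriculum baseline would simply present `examples` in random order; the perceptron update itself is unchanged in all three cases.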
Experiments on shape recognition with datasets having different variability in shapes
- Standard (target) dataset - Images of rectangles, ellipses, and triangles.
- Easy dataset - Images of squares, circles, and equilateral triangles.
- Start performing gradient descent on the easy dataset and switch to the target dataset at a particular epoch (called the switch epoch).
- For no-curriculum learning, the first epoch is the switch epoch.
- As the switch epoch increases, the classification error decreases, with the best performance when the switch epoch is half the total number of epochs.
- The paper does not report results for higher values of the switch epoch.
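The switch-epoch schedule described above can be sketched as a small helper that decides which dataset each epoch trains on. This is only an illustration: the function names and the 1-indexed epoch convention are assumptions, chosen so that a switch epoch of 1 reproduces the no-curriculum case.

```python
def dataset_for_epoch(epoch, switch_epoch):
    """Pick the dataset for a 1-indexed epoch: easy before the switch, target from it onwards."""
    return "easy" if epoch < switch_epoch else "target"


def schedule(total_epochs, switch_epoch):
    """List the dataset used at every epoch of a run."""
    return [dataset_for_epoch(e, switch_epoch) for e in range(1, total_epochs + 1)]


# Switching at the first epoch means the easy dataset is never used (no curriculum).
no_curriculum = schedule(total_epochs=6, switch_epoch=1)

# Switching part-way through gives an even split between easy and target epochs.
even_split = schedule(total_epochs=6, switch_epoch=4)
```

In a real training loop the returned label would select the data loader for that epoch, with the model parameters carried over unchanged across the switch.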
Experiments on language modeling
- The standard dataset is the set of all size-5 text windows from Wikipedia in which every word is among the 20,000 most frequent words.
- The easy dataset keeps only those windows in which every word is among the 5,000 most frequent words in the vocabulary.
- Each word in the vocabulary is embedded into a d-dimensional feature space using a matrix W (to be learnt).
- The model predicts a score for the next word, given a window of words.
- W is learnt by minimizing the expected value of a ranking loss function.
- The curriculum-based model overtakes the other model soon after switching to the target vocabulary, indicating that it quickly learns the new words.
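The construction of the easy and target datasets above amounts to filtering text windows by vocabulary rank, which can be sketched as follows. The function names, the toy corpus, and the tiny vocabulary sizes are hypothetical. The `ranking_loss` helper assumes a hinge-style loss that compares the score of an observed window against scores of corrupted windows; the notes above do not spell out the exact form, so treat it as an illustrative assumption.

```python
from collections import Counter


def most_frequent_words(tokens, k):
    """Return the set of the k most frequent tokens."""
    return {word for word, _ in Counter(tokens).most_common(k)}


def windows_within_vocab(tokens, vocab, size=5):
    """All contiguous windows of `size` tokens whose words are all in `vocab`."""
    return [
        tuple(tokens[i:i + size])
        for i in range(len(tokens) - size + 1)
        if all(word in vocab for word in tokens[i:i + size])
    ]


def ranking_loss(score_observed, scores_corrupted, margin=1.0):
    """Hinge-style ranking loss averaged over corrupted windows (assumed form)."""
    penalties = [max(0.0, margin - score_observed + s) for s in scores_corrupted]
    return sum(penalties) / len(penalties)


# Toy corpus (hypothetical); the notes describe windows of size 5 over Wikipedia.
tokens = "the cat sat on the mat the cat ate".split()

easy_vocab = most_frequent_words(tokens, k=2)                   # stands in for the 5,000-word vocabulary
target_vocab = most_frequent_words(tokens, k=len(set(tokens)))  # stands in for the 20,000-word vocabulary

easy_windows = windows_within_vocab(tokens, easy_vocab, size=2)
target_windows = windows_within_vocab(tokens, target_vocab, size=2)

loss = ranking_loss(2.0, [0.0, 1.5, 3.0])
```

Because the easy vocabulary is a subset of the target vocabulary, every easy window is also a target window, so the curriculum only ever adds training examples when it switches.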
Curriculum as a continuation method
- Continuation methods start with a smoothed objective function and gradually move to less smoothed versions of it.
- Useful in the case where the objective function is non-convex.
- Consider a family of cost functions C_λ(θ) such that C_0(θ) can be easily optimized and C_1(θ) is the actual objective function.
- Start with C_0(θ) and gradually increase λ, keeping θ at a local minimum of C_λ(θ).
- The idea is to move θ towards a dominant (if not global) minimum of C_1(θ).
- Curriculum learning can be seen as a sequence of training criteria starting with an easy-to-optimise objective and moving all the way to the actual objective.
- The paper provides a mathematical formulation of curriculum learning in terms of a target training distribution and a weight function (to model the probability of selecting any one training example at any step).
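The continuation idea above can be illustrated on a one-dimensional toy problem. Everything specific here is an assumption made for the sketch: the double-well target cost, the convex surrogate θ² used as the easy objective, the linear blend between the two, and the step sizes. It is not the paper's construction, only an example of tracking a minimum while λ goes from 0 to 1.

```python
def target_cost(theta):
    """C_1: a non-convex double well; the left well (near -1) is the deeper one."""
    return (theta ** 2 - 1.0) ** 2 + 0.3 * theta


def target_grad(theta):
    return 4.0 * theta * (theta ** 2 - 1.0) + 0.3


def easy_grad(theta):
    """Gradient of C_0 = theta^2, a convex surrogate (assumed for this sketch)."""
    return 2.0 * theta


def blended_grad(theta, lam):
    """Gradient of C_lam = (1 - lam) * C_0 + lam * C_1 (assumed linear blend)."""
    return (1.0 - lam) * easy_grad(theta) + lam * target_grad(theta)


def gradient_descent(grad, theta, steps=500, lr=0.01):
    for _ in range(steps):
        theta -= lr * grad(theta)
    return theta


def continuation(theta, num_stages=21):
    """Increase lam from 0 to 1, warm-starting each stage at the previous solution."""
    for i in range(num_stages):
        lam = i / (num_stages - 1)
        theta = gradient_descent(lambda t: blended_grad(t, lam), theta)
    return theta


start = 2.0
theta_plain = gradient_descent(target_grad, start)  # optimise C_1 directly
theta_cont = continuation(start)                     # follow C_lam from 0 to 1
```

From this starting point, plain gradient descent settles in the shallower right-hand well, while the continuation path ends in the deeper left-hand well. The outcome depends on the choice of surrogate and schedule, so it should be read as an illustration of the mechanism rather than a guarantee.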
Advantages of Curriculum Learning
- Faster training in the online setting, as the learner does not try to learn difficult examples before it is ready for them.
- Guides training towards better local minima in parameter space, which is especially useful for non-convex objectives.
Relation to other machine learning approaches
- Unsupervised preprocessing - Both have a regularizing effect and lower the generalization error for the same training error.
- Active learning - The learner would benefit most from the examples that are close to the learner's frontier of knowledge and are neither too hard nor too easy.
- Boosting algorithms - Difficult examples are gradually emphasised in both, but a curriculum starts with a focus on easier examples, and in boosting the training criterion does not change.
- Transfer learning and Life-long learning - Initial tasks are used to guide the optimisation problem.
Criticism
- Curriculum Learning is not well understood, making it difficult to define the curriculum.
- In one of the examples, anti-curriculum performs better than no-curriculum. Given that curriculum learning is modeled on the idea that learning benefits when examples are presented in order of increasing difficulty, one would expect anti-curriculum to perform worse.