scikit-learn proposal for GSoC 2017

Improve online learning for linear models

Sub-org

Scikit-learn

Personal Information

  • Name: Konstantinos Katrioplas
  • Email: konst.katrioplas@gmail.com
  • Telephone: +30 6976354054
  • Github: kkatrio
  • University: Aristotle University of Thessaloniki, Thessaloniki, Greece
  • Course: Master’s Degree in Computational Physics
  • Expected Graduation date: June 2017
  • Timezone: EEST (GMT +3)
  • Blog: https://kkatrio.github.io/

Project Proposal

Abstract

Online learning with stochastic gradient descent can be improved for multi-class classification problems by minimizing the multinomial logistic loss. Furthermore, dynamic adaptation of the learning rate with the AdaGrad algorithm and optimization with the Adam method can improve convergence.

Deliverables

At the end of this project, scikit-learn Stochastic Gradient Descent will be enhanced with the following:

  • Implementation of softmax regression (multinomial logistic loss) for multi-class classification.
  • Implementation of the AdaGrad algorithm to improve convergence.
  • Implementation of the Adam optimizer for the linear model.

Implementation

The implementation language will be Python. Core parts of the work will be done in Cython.

Unit tests will be added to validate the new code against the existing multi-class classification methods (a sketch of such a test is given at the end of this section). Benchmarks will be run on the iris dataset. Documentation and examples will be added to the user guide and the API reference.

  • The cross entropy loss function will be added in linear_model/sgd_fast.pyx
  • The AdaGrad and Adam optimizations will be implemented in a new linear_model/sgd_opt.pyx
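As an illustration of such a validation test, the sketch below checks that for K = 2 classes the new multinomial loss agrees with the existing binary logistic loss. The loss option name "multinomial_log" is hypothetical and used here only for illustration; the actual option name will be decided during the project.

```python
# Sketch of a validation test: for K = 2 the multinomial logistic loss
# should reduce to binary logistic regression.
# The option name "multinomial_log" is hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def test_multinomial_reduces_to_binary_log_loss():
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    binary = SGDClassifier(loss="log", random_state=0).fit(X, y)
    multi = SGDClassifier(loss="multinomial_log", random_state=0).fit(X, y)
    # For two classes the decision functions should approximately agree.
    np.testing.assert_allclose(binary.decision_function(X),
                               multi.decision_function(X), rtol=1e-2)
```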

Description

Currently, multi-class classification with Stochastic Gradient Descent (SGD) is performed by combining multiple binary classifiers in a "one versus all" scheme: each binary classifier is trained to discriminate between one class and the remaining K − 1 classes. To achieve that, the loss function that is minimized can be the hinge loss of a linear Support Vector Machine, a smoothed hinge loss, or the binary logistic regression loss.
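The snippet below illustrates the current behaviour on the iris dataset: with K = 3 classes, SGDClassifier fits one binary classifier per class, so coef_ holds one weight vector per class.

```python
# Current one-versus-all behaviour of SGDClassifier on iris (K = 3 classes).
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
print(clf.coef_.shape)  # (3, 4): one weight vector per class
```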

Softmax regression

In addition to the existing loss functions, the multinomial logistic loss (also known as the cross entropy loss) will be implemented as an option for the SGD algorithm to perform multi-class classification. The multinomial logistic loss is a direct generalization of the binary logistic regression loss and can be validated against it for K = 2 classes.

The convergence and stability of the new method will be explored, since it is expected to improve both stability and performance on multi-class classification problems.
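A minimal NumPy sketch of the loss to be implemented, for a single sample with integer label y and weight matrix W of shape (n_classes, n_features); this is plain Python for exposition only, while the actual implementation will live in Cython:

```python
import numpy as np

def softmax(scores):
    # shift by the maximum score for numerical stability
    e = np.exp(scores - scores.max())
    return e / e.sum()

def multinomial_log_loss(W, x, y):
    p = softmax(W @ x)            # class probabilities
    loss = -np.log(p[y])          # multinomial logistic (cross entropy) loss
    grad = np.outer(p, x)         # dL/dW_k = (p_k - [k == y]) * x
    grad[y] -= x
    return loss, grad
```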

AdaGrad optimization

AdaGrad is an algorithm that chooses the learning rate dynamically by adapting to the data. It computes a different learning rate for every feature, which is then used to normalize the update step. As a result, features with large gradients have their effective learning rate reduced, while features with small or infrequent gradients have their effective learning rate increased.

The algorithm will be implemented by calculating the covariance matrix of the gradients. It will be investigated whether using only the diagonal of the covariance matrix yields sufficient results, in order to reduce the computational cost for large problems.
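A sketch of the diagonal variant of the update, which keeps only a per-feature accumulator of squared gradients instead of the full covariance matrix:

```python
import numpy as np

def adagrad_step(w, grad, sq_grad_sum, eta=0.01, eps=1e-8):
    # accumulate squared gradients per feature (diagonal of the covariance)
    sq_grad_sum += grad ** 2
    # features with large accumulated gradients get a smaller effective step
    w -= eta * grad / (np.sqrt(sq_grad_sum) + eps)
    return w, sq_grad_sum
```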

Adam optimization

Adam is currently implemented in pure Python for the Multi-Layer Perceptron. The method introduces a "smoothed" version of the AdaGrad covariance matrix, in which the gradients are averaged with exponential decay controlled by two, usually constant, hyperparameters. The smoothed gradients have a beneficial equalizing effect, since the updates do not become monotonically smaller. Furthermore, a "momentum" term is used when updating the gradient vectors, which causes the parameters to be updated in directions where the gradient is smooth.
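A sketch of the Adam update, with beta1 and beta2 as the two hyperparameters mentioned above (the momentum and smoothing decay rates) and the usual bias correction:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum: decayed gradient average
    v = beta2 * v + (1 - beta2) * grad ** 2   # smoothed squared-gradient average
    m_hat = m / (1 - beta1 ** t)              # bias correction (t = step count, >= 1)
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```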

Learning rate annealing

A learning rate annealing tool will be explored, allowing specific values of the learning rate to be set for a chosen number of epochs.
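A purely illustrative sketch of what such a schedule could look like; the interface here is hypothetical and only meant to show the idea of holding the learning rate fixed for a few epochs before decaying it:

```python
def annealed_eta(eta0, epoch, hold_epochs=5, decay=0.5):
    # keep eta0 for the first hold_epochs epochs, then decay geometrically
    if epoch < hold_epochs:
        return eta0
    return eta0 * decay ** (epoch - hold_epochs)
```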

Timeline

  • Preparation Period: 1 May - 28 May

    • Getting involved in open issues relevant to the linear model.
    • Experiment with Cython.
  • Week 1: 29 May - 4 June

    • Set up simple multi-class classification examples using the existing loss functions.
    • Implement the cross entropy loss in a Python module.
    • Test softmax regression on simple examples.
  • Week 2: 5 June - 11 June

    • Implement cross entropy function with Cython in linear_model/sgd_fast.pyx.
  • Week 3: 12 June - 18 June

    • Generalize Cython implementation.
  • Week 4: 19 June - 25 June

    • Test for K=2 classes.
    • Test on Iris dataset.
    • Compare against the current multi-class classification loss functions.
    • Add documentation and examples.
  • Week 5: 26 June - 2 July

    • Start working on AdaGrad implementation in a Python module.
    • Calculation of covariance matrix in Python for small problems.
  • Week 6: 3 July - 9 July

    • Add the new Cython file linear_model/sgd_opt.pyx.
    • Implement AdaGrad in linear_model/sgd_opt.pyx.
  • Week 7: 10 July - 16 July

    • Work on the Cython implementation: investigate the option of computing only the diagonal of the covariance matrix.
    • Test AdaGrad with Stochastic Gradient Descent.
  • Week 8: 17 July - 23 July

    • Add benchmarks and unit tests for AdaGrad.
    • Test on randomly generated datasets.
  • Week 9: 24 July - 30 July

    • Add documentation and examples for AdaGrad.
  • Week 10: 31 July - 6 August

    • Implementation of Adam optimizer in linear_model/sgd_opt.pyx.
  • Week 11: 7 August - 13 August

    • Test Adam.
    • Add example comparing Adam and AdaGrad.
  • Week 12: 14 August - 20 August

    • Exploration of learning rate annealing tool.
  • Last week: 21 August - 31 August

    Buffer week: finalize the project for evaluation.

I am fully available during the summer and can work 40 hours per week.

Contributions

References

  1. Softmax regression

  2. C. M. Bishop, Pattern Recognition and Machine Learning

  3. J. Duchi, E. Hazan, Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

  4. D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization
