scikit-learn proposal for GSoC 2017

Improve online learning for linear models

Sub-org

Scikit-learn

Personal Information

  • Name: Konstantinos Katrioplas
  • Email: konst.katrioplas@gmail.com
  • Telephone: +30 6976354054
  • Github: kkatrio
  • University: Aristotle University of Thessaloniki, Thessaloniki, Greece
  • Course: Master’s Degree in Computational Physics
  • Expected Graduation date: June 2017
  • Timezone: EEST (GMT +3)
  • Blog: https://kkatrio.github.io/

Project Proposal

Abstract

Online learning with stochastic gradient descent can be improved for multi-class classification problems by minimizing the multinomial logistic loss. Furthermore, dynamic adaptation of the learning rate with the AdaGrad algorithm and optimization with the Adam method can improve convergence.

Deliverables

At the end of this project, scikit-learn Stochastic Gradient Descent will be enhanced with the following:

  • Implementation of softmax regression (multinomial logistic loss) for multi-class classification.
  • Implementation of the AdaGrad algorithm to improve convergence.
  • Implementation of the Adam optimizer for the linear model.

Implementation

The implementation language will be Python. Core parts of the work will be done in Cython.

Unit tests will be added to validate the new code against the existing multi-class classification methods (a sketch of such a test is given at the end of this section). Benchmarks will be run on the iris dataset. Documentation and examples will be added to the user guide and the API reference.

  • The cross entropy loss function will be added in linear_model/sgd_fast.pyx
  • The AdaGrad and Adam optimizations will be implemented in a new linear_model/sgd_opt.pyx
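As an illustration of such a validation test, the sketch below checks that for K = 2 classes the new multinomial loss agrees with the existing binary logistic loss. The loss option name "multinomial_log" is hypothetical and used here only for illustration; the actual option name will be decided during the project.

```python
# Sketch of a validation test: for K = 2 the multinomial logistic loss
# should reduce to binary logistic regression.
# The option name "multinomial_log" is hypothetical.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

def test_multinomial_reduces_to_binary_log_loss():
    X, y = make_classification(n_samples=200, n_features=10, random_state=0)
    binary = SGDClassifier(loss="log", random_state=0).fit(X, y)
    multi = SGDClassifier(loss="multinomial_log", random_state=0).fit(X, y)
    # For two classes the decision functions should approximately agree.
    np.testing.assert_allclose(binary.decision_function(X),
                               multi.decision_function(X), rtol=1e-2)
```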

Description

Currently, multi-class classification with Stochastic Gradient Descent (SGD) is performed by combining multiple binary classifiers in a "one versus all" scheme: each binary classifier is trained to discriminate between one class and the remaining K − 1 classes. To achieve that, the loss function that is minimized can be the hinge loss of a linear Support Vector Machine, a smoothed hinge loss, or the binary logistic regression loss.
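The snippet below illustrates the current behaviour on the iris dataset: with K = 3 classes, SGDClassifier fits one binary classifier per class, so coef_ holds one weight vector per class.

```python
# Current one-versus-all behaviour of SGDClassifier on iris (K = 3 classes).
from sklearn.datasets import load_iris
from sklearn.linear_model import SGDClassifier

X, y = load_iris(return_X_y=True)
clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)
print(clf.coef_.shape)  # (3, 4): one weight vector per class
```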

Softmax regression

In addition to the existing loss functions, the multinomial logistic loss (also known as the cross entropy loss) will be implemented as an option for the SGD algorithm to perform multi-class classification. The multinomial logistic loss is a direct generalization of the binary logistic regression loss and can be validated against it for K = 2 classes.

The convergence and stability of the new method will be explored, since it is expected to improve both stability and performance on multi-class classification problems.
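A minimal NumPy sketch of the loss to be implemented, for a single sample with integer label y and weight matrix W of shape (n_classes, n_features); this is plain Python for exposition only, while the actual implementation will live in Cython:

```python
import numpy as np

def softmax(scores):
    # shift by the maximum score for numerical stability
    e = np.exp(scores - scores.max())
    return e / e.sum()

def multinomial_log_loss(W, x, y):
    p = softmax(W @ x)            # class probabilities
    loss = -np.log(p[y])          # multinomial logistic (cross entropy) loss
    grad = np.outer(p, x)         # dL/dW_k = (p_k - [k == y]) * x
    grad[y] -= x
    return loss, grad
```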

AdaGrad optimization

AdaGrad is an algorithm that chooses the learning rate dynamically by adapting to the data. It computes a different learning rate for every feature, which is then used to normalize the update step. As a result, features with large gradients have their effective learning rate reduced, while features with small or infrequent gradients have their effective learning rate increased.

The algorithm will be implemented by calculating the covariance matrix of the gradients. It will be investigated whether using only the diagonal of the covariance matrix yields sufficient results, in order to reduce the computational cost for large problems.
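A sketch of the diagonal variant of the update, which keeps only a per-feature accumulator of squared gradients instead of the full covariance matrix:

```python
import numpy as np

def adagrad_step(w, grad, sq_grad_sum, eta=0.01, eps=1e-8):
    # accumulate squared gradients per feature (diagonal of the covariance)
    sq_grad_sum += grad ** 2
    # features with large accumulated gradients get a smaller effective step
    w -= eta * grad / (np.sqrt(sq_grad_sum) + eps)
    return w, sq_grad_sum
```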

Adam optimization

Adam is currently implemented in pure Python for the Multi-Layer Perceptron. The method introduces a "smoothed" version of the AdaGrad covariance matrix, in which the gradients are averaged with exponential decay controlled by two, usually constant, hyperparameters. The smoothed gradients have a beneficial equalizing effect, since the updates do not become monotonically smaller. Furthermore, a "momentum" term is used when updating the gradient vectors, which causes the parameters to be updated in directions where the gradient is smooth.
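A sketch of the Adam update, with beta1 and beta2 as the two hyperparameters mentioned above (the momentum and smoothing decay rates) and the usual bias correction:

```python
import numpy as np

def adam_step(w, grad, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad        # momentum: decayed gradient average
    v = beta2 * v + (1 - beta2) * grad ** 2   # smoothed squared-gradient average
    m_hat = m / (1 - beta1 ** t)              # bias correction (t = step count, >= 1)
    v_hat = v / (1 - beta2 ** t)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v
```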

Learning rate annealing

A learning rate annealing tool will be explored, allowing specific values of the learning rate to be set for a chosen number of epochs.
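A purely illustrative sketch of what such a schedule could look like; the interface here is hypothetical and only meant to show the idea of holding the learning rate fixed for a few epochs before decaying it:

```python
def annealed_eta(eta0, epoch, hold_epochs=5, decay=0.5):
    # keep eta0 for the first hold_epochs epochs, then decay geometrically
    if epoch < hold_epochs:
        return eta0
    return eta0 * decay ** (epoch - hold_epochs)
```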

Timeline

  • Preparation Period: 1 May - 28 May

    • Getting involved in open issues relevant to the linear model.
    • Experiment with Cython.
  • Week 1: 29 May - 4 June

    • Set up simple multi-class classification examples using the existing loss functions.
    • Implement the cross entropy loss in a Python module.
    • Test softmax regression on simple examples.
  • Week 2: 5 June - 11 June

    • Implement cross entropy function with Cython in linear_model/sgd_fast.pyx.
  • Week 3: 12 June - 18 June

    • Generalize Cython implementation.
  • Week 4: 19 June - 25 June

    • Test for K=2 classes.
    • Test on Iris dataset.
    • Compare against the current multi-class classification loss functions.
    • Add documentation and examples.
  • Week 5: 26 June - 2 July

    • Start working on AdaGrad implementation in a Python module.
    • Calculation of covariance matrix in Python for small problems.
  • Week 6: 3 July - 9 July

    • Add the new Cython file linear_model/sgd_opt.pyx.
    • Implement AdaGrad in linear_model/sgd_opt.pyx.
  • Week 7: 10 July - 16 July

    • Work on the Cython implementation: investigate the option of computing only the diagonal of the covariance matrix.
    • Test AdaGrad with Stochastic Gradient Descent.
  • Week 8: 17 July - 23 July

    • Add benchmarks and unit tests for AdaGrad.
    • Test on randomly generated datasets.
  • Week 9: 24 July - 30 July

    • Add documentation and examples for AdaGrad.
  • Week 10: 31 July - 6 August

    • Implementation of Adam optimizer in linear_model/sgd_opt.pyx.
  • Week 11: 7 August - 13 August

    • Test Adam.
    • Add example comparing Adam and AdaGrad.
  • Week 12: 14 August - 20 August

    • Exploration of learning rate annealing tool.
  • Last week: 21 August - 31 August

    Buffer week: finalize the project for evaluation.

I am fully available during the summer and can work 40 hours per week.

Contributions

References

  1. Softmax regression

  2. C. M. Bishop, Pattern Recognition and Machine Learning

  3. J. Duchi, E. Hazan, Y. Singer, Adaptive Subgradient Methods for Online Learning and Stochastic Optimization

  4. D. P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization
