Quick notes on SGD vs batch optimization approaches for deep learning.

Deep Learning Optimization

Currently, the literature generally supports mini-batch SGD (particularly with adaptive learning rate variants such as RMSprop or Adam) as achieving the best performance, in terms of both optimization metrics and computation time, for large deep learning models.
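For concreteness, here is a minimal NumPy sketch of a mini-batch loop using the Adam update rule. The `grad_fn`, `data`, and hyperparameter defaults are placeholders for illustration, not taken from any of the papers below.

```python
import numpy as np

def adam_minibatch_sgd(w, grad_fn, data, lr=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, batch_size=64, epochs=10):
    """Mini-batch SGD with the Adam adaptive learning rate update."""
    m = np.zeros_like(w)  # first-moment (mean) estimate of the gradient
    v = np.zeros_like(w)  # second-moment (uncentered variance) estimate
    t = 0
    for _ in range(epochs):
        np.random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = grad_fn(w, batch)          # stochastic gradient on this mini-batch
            t += 1
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)   # bias correction for the zero init
            v_hat = v / (1 - beta2 ** t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```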

The "Neural Networks Part 3: Learning and Evaluation" lecture of the CS231n course is a good reference -- see the "In practice..." and "Additional References" portions of the "Second-order methods" section. The first paper linked in that latter section, "Large Scale Distributed Deep Networks", is Google's 2012 paper introducing their distributed "DistBelief" system (replaced by TensorFlow after assessing its shortcomings), along with comparisons of a distributed, asynchronous SGD ("Downpour SGD") vs a distributed L-BFGS ("Sandblaster L-BFGS") algorithm. It was shown that the mini-batch SGD approach (particularly, the "adagrad" variant of SGD) is able to achieve a better overall test accuracy than their distributed L-BFGS, and it is able to do it faster.

Then, a Google follow-up in 2016 (https://arxiv.org/abs/1604.00981) found that a distributed, synchronous mini-batch SGD approach with backup workers achieves better test accuracy than the earlier distributed asynchronous SGD, and does so in less wall-clock time.
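The backup-workers idea is roughly: dispatch each step to more workers than needed, aggregate the first N gradients to arrive, and drop the stragglers so the synchronous step is not held up by the slowest machine. The schematic sketch below simulates this with random completion times; it is not the paper's implementation.

```python
import numpy as np

def sync_sgd_step_with_backups(w, worker_grad_fns, n_required, lr=0.1):
    """One synchronous step: use gradients from the first `n_required`
    workers to finish; the slower "backup" workers are ignored this step."""
    # Simulate each worker computing a gradient with a random completion time.
    results = [(np.random.rand(), grad_fn(w)) for grad_fn in worker_grad_fns]
    results.sort(key=lambda r: r[0])               # fastest workers first
    grads = [g for _, g in results[:n_required]]   # drop the stragglers
    w = w - lr * np.mean(grads, axis=0)            # apply the averaged gradient
    return w
```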

TL;DR: When training large deep learning models, favor a synchronous, adaptive-learning-rate SGD approach distributed over larger mini-batches, with backup workers to mitigate stragglers.
