Quick notes on SGD vs batch optimization approaches for deep learning.

Deep Learning Optimization

Currently, the literature generally supports mini-batch SGD (particularly with adaptive learning rate variants such as RMSprop or Adam) as achieving the best performance, in terms of both optimization metrics and computation time, for large deep learning models.
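For concreteness, here is a minimal NumPy sketch of a mini-batch loop using the Adam update rule. The `grad_fn`, `data`, and hyperparameter defaults are placeholders for illustration, not taken from any of the papers below.

```python
import numpy as np

def adam_minibatch_sgd(w, grad_fn, data, lr=1e-3, beta1=0.9, beta2=0.999,
                       eps=1e-8, batch_size=64, epochs=10):
    """Mini-batch SGD with the Adam adaptive learning rate update."""
    m = np.zeros_like(w)  # first-moment (mean) estimate of the gradient
    v = np.zeros_like(w)  # second-moment (uncentered variance) estimate
    t = 0
    for _ in range(epochs):
        np.random.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            g = grad_fn(w, batch)          # stochastic gradient on this mini-batch
            t += 1
            m = beta1 * m + (1 - beta1) * g
            v = beta2 * v + (1 - beta2) * g * g
            m_hat = m / (1 - beta1 ** t)   # bias correction for the zero init
            v_hat = v / (1 - beta2 ** t)
            w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w
```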

The "Neural Networks Part 3: Learning and Evaluation" lecture of the CS231n course is a good reference -- see the "In practice..." and "Additional References" portions of the "Second-order methods" section. The first paper linked in that latter section, "Large Scale Distributed Deep Networks", is Google's 2012 paper introducing their distributed "DistBelief" system (replaced by TensorFlow after assessing its shortcomings), along with comparisons of a distributed, asynchronous SGD ("Downpour SGD") vs a distributed L-BFGS ("Sandblaster L-BFGS") algorithm. It was shown that the mini-batch SGD approach (particularly, the "adagrad" variant of SGD) is able to achieve a better overall test accuracy than their distributed L-BFGS, and it is able to do it faster.

Then, a Google follow-up in 2016 (https://arxiv.org/abs/1604.00981) found that a distributed, synchronous mini-batch SGD approach with backup workers achieves better test accuracy than the earlier distributed asynchronous SGD, and does so in less wall-clock time.
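The backup-workers idea is roughly: dispatch each step to more workers than needed, aggregate the first N gradients to arrive, and drop the stragglers so the synchronous step is not held up by the slowest machine. The schematic sketch below simulates this with random completion times; it is not the paper's implementation.

```python
import numpy as np

def sync_sgd_step_with_backups(w, worker_grad_fns, n_required, lr=0.1):
    """One synchronous step: use gradients from the first `n_required`
    workers to finish; the slower "backup" workers are ignored this step."""
    # Simulate each worker computing a gradient with a random completion time.
    results = [(np.random.rand(), grad_fn(w)) for grad_fn in worker_grad_fns]
    results.sort(key=lambda r: r[0])               # fastest workers first
    grads = [g for _, g in results[:n_required]]   # drop the stragglers
    w = w - lr * np.mean(grads, axis=0)            # apply the averaged gradient
    return w
```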

TL;DR: When training large deep learning models, favor a synchronous, adaptive-learning-rate SGD approach distributed over larger mini-batches, with backup workers to mitigate stragglers.
