
Because scaling (or normalizing) inputs makes gradient descent converge faster. Let’s see why this is the case via a simple linear regression example.

Take this toy dataset for instance:

A line that fits this data well will have two parameters: w₀ (bias) and w₁ (slope):

y = w₀ + w₁x
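
To make this concrete, here's a minimal sketch in Python. The dataset is a stand-in (an assumption, not the gist's actual data), constructed so its best-fit line lands near w₀ = 2 and w₁ = 10 to match the plots described below:

```python
import numpy as np

# Stand-in toy dataset (assumed, not the gist's actual data), constructed
# so the best-fit line lands near w0 = 2, w1 = 10
x = np.array([0.1, 0.4, 0.9, 1.3, 1.8, 2.2])
y = np.array([3.0, 6.1, 10.8, 15.2, 19.9, 24.3])

def model(w0, w1, x):
    # Linear model: y_hat = w0 + w1 * x
    return w0 + w1 * x

def least_squares(w0, w1):
    # Mean squared residual over the dataset
    return np.mean((model(w0, w1, x) - y) ** 2)
```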

Now let's plot the contours of the Least Squares cost function associated with this dataset (in 2D), with darker blue regions corresponding to higher values of the cost and lighter regions to lower values.
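
A sketch of how such a contour plot might be produced with matplotlib, continuing from the snippet above (the grid ranges are assumptions):

```python
import matplotlib.pyplot as plt

# Evaluate the cost on a grid of (w0, w1) values; ranges chosen to frame the minimum
w0_grid, w1_grid = np.meshgrid(np.linspace(-10, 14, 200), np.linspace(0, 20, 200))
cost_grid = np.vectorize(least_squares)(w0_grid, w1_grid)

plt.contour(w0_grid, w1_grid, cost_grid, levels=30, cmap="Blues")
plt.xlabel("$w_0$")
plt.ylabel("$w_1$")
plt.show()
```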

Can you see the optimal point here?

It’s at the center of all the concentric ellipses, somewhere near w₀ = 2 and w₁ = 10.

Unfortunately, gradient descent lacks human perception and cannot “see” this minimum. Instead, it starts at some (random) initial point and follows the direction of the negative gradient until it reaches the minimum, tracing a path like the one shown below:
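
A minimal gradient descent loop that records its path, continuing from the snippets above (the initial point and step size are assumptions; the step size is deliberately chosen large enough that the zig-zag is visible):

```python
def gradient(w0, w1):
    # Analytic gradient of the mean squared error cost above
    r = model(w0, w1, x) - y
    return np.array([2 * np.mean(r), 2 * np.mean(r * x)])

w = np.array([-8.0, 1.0])  # arbitrary initial point (an assumption)
alpha = 0.3                # step size, hand-tuned so the zig-zag shows
path = [w.copy()]
for _ in range(50):
    w = w - alpha * gradient(w[0], w[1])
    path.append(w.copy())

path = np.array(path)
plt.plot(path[:, 0], path[:, 1], "r.-")  # overlay on the contour plot above
```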

Do you see how gradient descent “zig-zags” its way toward the center (minimum)? The reason for this behavior is that the gradient descent direction always points perpendicular to the contours of a cost function (proof left to the reader!).

This means that, when applied to minimize a cost function with elongated elliptical contours like the one here, the gradient descent direction (while still a descent direction) does not point directly at the global minimum of the function.

Do you see how the gradient direction points further away from the minimum as the contours become more “elliptical”? In other words, with more circular contours a gradient descent step takes us closer to the actual minimum.

What’s the role of scaling (or normalizing) inputs here?

Well, a GIF is worth 1,000 (times the number of its frames) words. This animation shows what happens to the contours of the original Least Squares cost function after input normalization.
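
Standard (z-score) normalization itself is a one-liner; a sketch, continuing from the snippets above:

```python
# Standard (z-score) normalization: zero mean, unit standard deviation
x_normalized = (x - x.mean()) / x.std()

# Re-running the cost/contour code above with x_normalized in place of x
# produces noticeably more circular contours (the location of the minimum
# shifts too, since w0 and w1 are now in units of the normalized input).
```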

Note that this is not limited to linear regression. Here’s a logistic regression (classification) dataset (top) along with the contours of its cost function (bottom).
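
The same contour recipe carries over. Here's a sketch of the logistic regression (cross-entropy) cost on a hypothetical binary-labelled dataset (an assumption, not the gist's data):

```python
# Hypothetical binary-labelled dataset (assumed, not the gist's data)
x_clf = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0])
y_clf = np.array([0, 0, 0, 1, 1, 1])

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(w0, w1):
    # Cross-entropy cost of the model sigmoid(w0 + w1 * x)
    p = sigmoid(w0 + w1 * x_clf)
    eps = 1e-12  # guard against log(0)
    return -np.mean(y_clf * np.log(p + eps) + (1 - y_clf) * np.log(1 - p + eps))

# The same meshgrid/contour recipe used above works here with log_loss
```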

And, again, this is what happens to the geometry of the cost function after input normalization:

Summary: normalizing the data tempers elongated elliptical contours, transforming them into more circular ones. With more circular contours the gradient descent direction points more directly toward the cost function's minimum, making each gradient descent step much more effective.

To read more and tinker with the code that generated these plots, see Feature scaling via standard normalization.

It’s worth noting that batch normalization, which speeds up training in neural networks, is a natural extension of what’s discussed here. See Batch normalization if you’d like to know more.
