Machine learning gives computers the ability to learn without being explicitly programmed.
Linear regression is a simple model that finds the line of best fit for some data.
Example: you have a dataset with commute time, sleep time, salary, and happiness (1-10). From the three features (commute time, sleep time, and salary) you want to predict the happiness level.
How well does this line predict the data? We measure it with a cost function: essentially the sum of the squared differences between what the line predicts at each point and the real value. We want to minimize that cost. How do we find the minimum? Gradient descent; once it converges, we know we have a good fit.
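A minimal sketch in Python, assuming the standard squared-error cost for a line y = w*x + b (the variable names and the toy data below are illustrative, not from the notes):

```python
def cost(w, b, xs, ys):
    """Mean squared error between the line w*x + b and the data points."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / (2 * n)

# Toy data, made up for illustration: hours slept vs. happiness.
xs = [4, 6, 7, 8, 9]
ys = [3, 5, 6, 7, 8]
print(cost(1.0, -1.0, xs, ys))  # lower cost means a better fit; here it is 0.0
```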
The lines in this picture are contour lines, each marking a certain height. Imagine walking up one of these hills, taking each step along the steepest path (where the contour lines are closest together). That's essentially gradient ascent (descent if we walk downhill instead).
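Here is one way that downhill walk could look in code: a sketch of plain gradient descent on the squared-error cost above, with an assumed learning rate and step count.

```python
# Gradient descent for the line y = w*x + b, assuming the squared-error
# cost J(w, b) = (1/2n) * sum((w*x + b - y)^2).

xs = [4, 6, 7, 8, 9]  # toy feature values (illustrative)
ys = [3, 5, 6, 7, 8]  # toy targets; the best fit here is y = x - 1

def gradient_step(w, b, lr=0.01):
    """Take one step against the gradient of the cost, i.e. one step downhill."""
    n = len(xs)
    dw = sum((w * x + b - y) * x for x, y in zip(xs, ys)) / n  # dJ/dw
    db = sum((w * x + b - y) for x, y in zip(xs, ys)) / n      # dJ/db
    return w - lr * dw, b - lr * db

w, b = 0.0, 0.0
for _ in range(20_000):
    w, b = gradient_step(w, b)
print(w, b)  # approaches the best-fit values w = 1, b = -1
```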
Going back to our example: commute time is on a scale of 10 minutes to 2 hours, while salary is in the 5-6 figure range, so the features are obviously not on comparable scales. Feature scaling is a way of normalizing our values.
Using x' = (x - min) / (max - min), with x = 15, min = 0, max = 120, we get x' = (15 - 0) / (120 - 0) = 0.125.
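The same formula in code (a one-liner sketch; the 0-120 bounds are the ones the notes pick for commute minutes):

```python
def min_max_scale(x, lo, hi):
    """Rescale x from the range [lo, hi] to [0, 1]."""
    return (x - lo) / (hi - lo)

# The worked example from above: a 15-minute commute on the 0-120 minute scale.
print(min_max_scale(15, 0, 120))  # 0.125
```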
Feature scaling speeds up convergence: without it, features on very different scales stretch the cost surface into a long, narrow valley, and gradient descent zig-zags down it instead of heading straight for the minimum.