Linear Regression Gradient Descent - Intuition, Math and Code
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Problem Setup\n",
"\n",
"Lets assume a hypothetical experiment in which students of agriculture science collected some data from different farms/green houses through out many years. Structure of the collected data is as follow\n",
"\n",
"<table>\n",
" <tr>\n",
" <th>Average Temperature ($^0c$)</th>\n",
" <th>Average Nitrite In Soil(PPM)</th>\n",
" <th>Harvest Yield (Kg/Sq Meter)</th>\n",
" </tr>\n",
" <tr>\n",
" <td>23.4</td>\n",
" <td>0.01</td>\n",
" <td>7</td>\n",
" </tr>\n",
" <tr>\n",
" <td>21.7</td>\n",
" <td>0.02</td>\n",
" <td>6.5</td>\n",
" </tr>\n",
" <tr>\n",
" <td>22.1</td>\n",
" <td>0.015</td>\n",
" <td>6</td>\n",
" </tr>\n",
"</table>\n",
"\n",
"What we want to do is predict harvests of different farms/green houses given Average Temperature and Average Nitrite In Soil. For this we have made some assumptions based on initial analysis of the data\n",
"- Relation of Temperature and Average Nitrite In Soil is somewhat linear with yeild\n",
"- For simplicity we also assume that other factors like water in soil, light etc. either don't affect the yield or are kept fairly similar across all the farms\n",
"\n",
"Lets represent\n",
"Average Temperature ($^0c$) as $x_1$\n",
"Average Nitrite In Soil as $x_2$\n",
"Harvest Yield as $y$\n",
"\n",
"Then based on the above assumption we can say that there exists a linear equation\n",
"\n",
"$y=w_0+w_1*x_1+w_2*x_{2} \\tag {1.1}$\n",
"\n",
"Which will fairly describe relationship between our independent variables $x_1,x_2\\ and\\ y$. This relationship is also called **hypothesis equation**. We'll also refer to this as a model.\n",
"\n",
"where **$w_0$ is a bias term** and **$w_1$ and $w_2$ are weights** for $x_1$ and $x_2$ respectively\n",
"\n",
"Now our aim is to find value of $w_0,w_1$ and $w_2$ which will best describes the relationship\n",
"\n",
"\n"
]
},
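{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we derive anything, here is a minimal sketch of what hypothesis equation (1.1) looks like in code. The function name and weight values below are made up purely for illustration; finding good values for the weights is the whole point of the rest of this post.\n",
"\n",
"``` python\n",
"# Hypothesis equation (1.1): y = w0 + w1*x1 + w2*x2\n",
"# The weights below are arbitrary placeholders, not fitted values\n",
"w0, w1, w2 = 0.5, 0.1, 0.08\n",
"\n",
"def predict_yield(temperature, nitrite):\n",
"    return w0 + w1*temperature + w2*nitrite\n",
"\n",
"# predicted yield for the first row of the table above\n",
"print(predict_yield(23.4, 0.01))\n",
"```"
]
},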
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Cost Function\n",
"Now lets find the way to measure how good the model is. One straight forward way is to measure how much error the model makes in predictions. Lets say\n",
"\n",
"$Y = \\{y_1,y_2,....,y_m\\}$ \n",
"$P = \\{p_1,p_2,....,p_m\\}$ \n",
"$err = \\{y_1-p_1, y_2-p_2, ...., y_m-p_m\\}$\n",
"\n",
"Where\n",
"Y is target values of known data in our example \"yield\" \n",
"P is corresponding predicted values and \n",
"err is errors made in each predictions \n",
"\n",
"\n",
"But, this is array of values, we need a single numerical value to compare which model is better. \n",
"\n",
"We need to combine those value to give single value, straight sum or average will not work because positive and negative error will cancel each other.\n",
"\n",
"So we take average of square of individual error. We are taking average because we don't want the metric to depend on number of items on training data. This value is called **Mean Square Error(MSE)**, this is our cost function which we need to minimize.\n",
"\n",
"\n",
"$Mean\\ Square\\ Error(MSE) = \\frac{1}{m} \\sum_{i=1}^m (y_i-p_i)^2 \\tag {2.1}$\n",
"\n",
"\n",
"Replacing $p_i$ in above equation with our hypothesis function we get\n",
"\n",
"$MSE = \\frac{1}{m} \\sum_{i=1}^m (y_i-(w_0+w_1*x_{i,1}+w_2*x_{i,2}))^2 \\tag {2.2}$\n",
"\n",
"we added subscript i to represent $i_{th}$ row of data. So, $x_{i,1}$ means 1st feature of $i^{th}$ row of training data\n",
"\n",
"Obviously another valid cost function will be average of absolute value of individual error also known as **Mean Absolute Error(MAE)** but we will use **MSE** for now due to its simplicity."
]
},
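{
"cell_type": "markdown",
"metadata": {},
"source": [
"As a quick illustration of equation (2.1), here is a small sketch that computes the MSE for a handful of made-up target and predicted values (the numbers are arbitrary):\n",
"\n",
"``` python\n",
"import numpy as np\n",
"\n",
"Y = np.array([7.0, 6.5, 6.0])  # targets (yield)\n",
"P = np.array([6.8, 6.7, 6.1])  # predictions from some model\n",
"\n",
"mse = np.mean((Y - P)**2)      # equation (2.1)\n",
"print(mse)                     # approximately 0.03\n",
"```"
]
},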
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Finding Optimum Model (Bias and Weights)\n",
"Now we got our cost function as equation(2.2) above, lets look at it in detail\n",
"- $MSE = MSE(w_0,w_1,w_2)$ is equation with three variables $w_0,w_1,w_2$, as they are the unknown values. Whereas the $y_i,x_{i,1}, x_{i,2}$ are constants from training data\n",
"- We want to find $w_0,w_1w_2$ which will give minimum value of $MSE(w_0,w_1,w_2)$\n",
"\n",
"How would we do it?\n",
"\n",
"**Side Note**\n",
"There is a straight forward way to do it using calculus which need finding partial derivatives of the function and equating them to zero. As at lowest point of function with continuous derivative, tangent will be equal to zero (i.e at this point tangent change from positive to negative or vice versa). Then solving the equations to find the value of variables.\n",
"\n",
"But this method is not always practical due to following reasons\n",
"- Process will be different for different type of equation\n",
"- If cost function is very complex then it is very hard to find symbolic partial derivative or solve the partial derivative equations\n",
"**Side Note END**\n",
"\n",
"\n",
"\n",
"\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Gradient Descent\n",
"There is a general algorithm to minimize any function (\\*\\*function must be convex else it will only find local minima) known as gradient descent. We can summarize gradient descent algorithm as \n",
"\n",
"\n",
"1. Choose an arbitrary point lets say $W_a$, i.e arbitrary values of $w_0,w_1 and w_2$\n",
"2. Find small change in $w_0,w_1\\ and\\ w_2$ which will result in fastest decrease/descent of the MSE.\n",
" - For finding the small change we'll find rate of change of $MSE$ in direction of $w_0,w_1 and w_2$ separately, these are called partial derivatives. The combined vector of rate of change in each direction is called gradient. Then we multiply the gradient by a small number.\n",
" \n",
" **$\\delta MSE$ =** $\\eta * \\nabla MSE_{W_a}$\n",
" \n",
" - where $\\eta$ is a small step size and $\\nabla MSE$ is gradient of $MSE$ at point $W_a$. $\\nabla MSE_{W_a}$ represents the direction in which $MSE_{W_a}$ changes fastest at point ${W_a}$ and its magnitude give rate of change in that direction.\n",
"3. Now find new point using following equation \n",
" $W_b = W_a-\\eta * \\nabla MSE_{W_a} \\tag{3.1}$ \n",
"which will have a lower $MSE$ value. Then replace $W_a$ with $W_b$ and repeat from step 2, either for a specified number of iterations or until the change in $MSE$ at each step becomes very small. (A minimal one-variable sketch of this loop follows the thought experiment below.)\n",
"\n",
"\n",
"\n",
"\n",
"#### Thought Experiment\n",
"To understand the algorithm visually/intuitively, let's do a thought experiment. \n",
"\n",
"Lets say you are operating a nano robot in a temperature field (temperature varies in 3D space and is function of location) and your objective it to take the robot to lowest temperature as fast as possible else you risk damaging it. You can ask robot temperature at its current position and move robot in any direction in 3D space, how will you guide the robot to lowest temperature?\n",
"\n",
"\n",
"\n",
"**Some points to note**\n",
"- **Here temperature is analogous to cost** and **location of robot is analogous to point $w_0,w_1,w_2$** **your objective is to find the point where cost is lowest.**\n",
"- Lets say $w_0, w_1$ and $w_2T$ represents left-right, top-bottom and front-back axis\n",
"- It is not possible to plot MSE against $w_0,w_1,w_2$ at once because it will be in 4D plane. So we need to visualize it by plotting MSE against $w_0,w_1,w_2$ one at a time.\n",
"- If we plot temperature against only $w_0$ keeping $w_1\\ and \\ w_2$ at some fixed value, we'll see curve like below. Which is a quadratic curve which has convex shape.\n",
"![MSE against w2](images/quadratic_curve.jpg)\n",
"\n",
"\n",
"\n",
"You can follow the following steps\n",
"\n",
"1. Lets say the robot starts at it current location $W_a$\n",
"2. To find rate of change in temperature in left-right direction \n",
" - move robot to right little bit get temperature there and comeback to previous position. \n",
" - Rate of change in temperature is **change in temperature** divided by **the distance moved.**\n",
" - this is called the partial derivative with respect to $w_0$ because we kept $w_1\\ and\\ w_2$ constant\n",
"3. Similarly you can find rate of change in temperature in top-bottom and front-back direction\n",
"4. Combining the rate of change in all 3 directions as vector give the gradient of temperature in field. Which is a vector has direction where there is fastest change in temperature and its magnitude give the rate of change. Lets represent it by $\\nabla T$\n",
"5. Now we need to move the robot in direction opposite of $\\nabla T$ one step. The step size should be small enough so that it doesn't jump to other side of curve where temperature could be higher (see curve above). So new location of the robot will be \n",
"$ W_b = W_a - \\eta * \\nabla T$\n",
"6. Now repeat the steps from 2 until either you reach a satisfactory temperature point or you run out of battery (i.e. steps)\n"
]
},
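{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make the steps above concrete before applying them to the MSE, here is a minimal sketch of gradient descent on a single-variable convex function $f(w) = (w - 3)^2$, whose minimum we already know is at $w = 3$. The function, starting point and step size are made up for illustration.\n",
"\n",
"``` python\n",
"def f(w):\n",
"    return (w - 3)**2\n",
"\n",
"def gradient(w):\n",
"    # derivative of (w - 3)^2 with respect to w\n",
"    return 2*(w - 3)\n",
"\n",
"w = 10.0     # step 1: arbitrary starting point\n",
"eta = 0.1    # small step size\n",
"for _ in range(100):\n",
"    w = w - eta*gradient(w)  # step 3: move against the gradient\n",
"\n",
"print(w)     # very close to 3, the minimum of f\n",
"```"
]
},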
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Partial Derivative\n",
"If we take partial derivative of equation(1.1) w.r.t $w_j$, where j is coefficient index\n",
"\n",
"$\\frac{\\partial MSE}{\\partial w_j} = \\frac{\\partial \\frac{1}{m} \\sum_{i=1}^m (y_i-p_i)^2}{\\partial w_j}$ \n",
"\n",
"using chain rule \n",
"\n",
"$\\frac{\\partial MSE}{\\partial w_j} = \\frac{2}{m} \\sum_{i=1}^m (y_i-p_i)*-\\frac{\\partial p_i}{\\partial w_j}$ \n",
"\n",
"\n",
"from equation(1.1) we know \n",
"\n",
"$p_i = w_0+w_1*x_{i,1}+w_2*x_{i,2}$\n",
"\n",
"\n",
"$p_i = w_0*x_{i,0}+w_1*x_{i,1}+w_2*x_{i,2}$ \n",
"\n",
"where $x_{i,0}$ is always 1 called bias term\n",
"\n",
"then if we take partial derivative of $p_i$ w.r.t $w_j$ all other terms except $w_j$ is constant. The result will be\n",
"\n",
"$\\frac{\\partial p_i}{\\partial w_j} = x_{i,j}$\n",
"\n",
"\n",
"replacing this on above equation gives\n",
"\n",
"$\\frac{\\partial MSE}{\\partial w_j} = -\\frac{2}{m} \\sum_{i=1}^m (y_i-p_i)*x_{i,j}$ \n",
"\n",
"Rearranging\n",
"\n",
"$\\frac{\\partial MSE}{\\partial w_j} = \\frac{2}{m} \\sum_{i=1}^m (p_i-y_i)*x_{i,j}\\tag {4.1}$\n",
"\n"
]
},
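{
"cell_type": "markdown",
"metadata": {},
"source": [
"One way to sanity-check equation (4.1) is to compare it against a numerical (finite difference) derivative of the MSE on a tiny made-up dataset. The data and weights below are arbitrary; the only point is that the analytic and numerical values should roughly agree.\n",
"\n",
"``` python\n",
"import numpy as np\n",
"\n",
"# tiny made-up dataset: columns are bias (always 1), x1, x2\n",
"X = np.array([[1.0, 23.4, 0.01],\n",
"              [1.0, 21.7, 0.02],\n",
"              [1.0, 22.1, 0.015]])\n",
"Y = np.array([7.0, 6.5, 6.0])\n",
"W = np.array([0.5, 0.1, 0.08])  # arbitrary weights\n",
"m = len(Y)\n",
"\n",
"def mse(W):\n",
"    P = X.dot(W)\n",
"    return np.mean((Y - P)**2)\n",
"\n",
"# analytic partial derivative w.r.t. w1 using equation (4.1)\n",
"j = 1\n",
"analytic = (2/m)*np.sum((X.dot(W) - Y)*X[:, j])\n",
"\n",
"# numerical derivative: nudge w1 a tiny bit and see how the MSE changes\n",
"eps = 1e-6\n",
"W_plus = W.copy()\n",
"W_plus[j] += eps\n",
"numerical = (mse(W_plus) - mse(W))/eps\n",
"\n",
"print(analytic, numerical)  # the two values should be nearly equal\n",
"```"
]
},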
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Matrix Representation\n",
"Matrix operations like matrix multiplication are optimized and are more efficient than doing simple array calculations. Also matrix representation make the equations more readable. So we'll convert above equations to matrix operations\n",
"\n",
"### Prediction using matrix multiplication\n",
"We can represent equation 1.1 for all data points at once as\n",
"\n",
"$X=\\left [ \\matrix{x_{0,0} & x_{0,1} & x_{0,2}\\\\\n",
"x_{1,0} & x_{1,1}& x_{1,2}\\\\\n",
"\\vdots & & \\vdots\\\\\n",
"x_{i,0} & x_{i,1}& x_{m,2}\\\\\n",
"\\vdots & & \\vdots\\\\\n",
"x_{m,0} & x_{m,1}& x_{m,2}\\\\\n",
"} \\right ]$\n",
"\n",
"$W = \\left[ \\matrix{w_0\\\\w_1\\\\w_2} \\right ]$\n",
"\n",
"\n",
"$\\boldsymbol {P = X.W} \\tag {5.1}$\n",
"\n",
"**Here:** \n",
"- **P** is prediction which is a column vector of length m\n",
"- **X** is matrix with m rows and n columns, m is number of data points and n number of features plus one bias term which is always 1\n",
"- **W** is column vector of length n\n",
"\n",
"\n",
"\n",
"### Gradient using matrix operations\n",
"\n",
"In equation (4.1) we found partial derivative of MSE w.r.t $w_j$ which is $j_{th}$ coefficient of regression model, which is $j_{th}$ component of gradient vector. Based on equation(4.1) Whole gradient can be represented using matrix operation like\n",
"\n",
"\n",
"$X^T = \\left [ \\matrix{\n",
"x_{0,0} & x_{1,0} & \\dots & x_{i,0} & \\dots & x_{m,0}\\\\\n",
"x_{0,1} & x_{1,1} & \\dots & x_{i,1} &\\dots & x_{m,1}\\\\\n",
"x_{0,2} & x_{1,2} & \\dots & x_{i,2} &\\dots & x_{m,2}\n",
"}\\right]$\n",
"\n",
"$P-Y = \\left[\\matrix{\n",
" p_0 - y_0 \\\\ p_1 - y_1 \\\\ \\vdots \\\\ p_i - y_i \\\\ \\vdots \\\\ p_m - y_m\n",
"}\n",
"\\right]\n",
"$\n",
"\n",
"\n",
"$\\boldsymbol{\\nabla MSE= \\frac{2}{m} X^T.(P-Y)}\\tag {5.2}$\n",
"\n",
"**Here:**\n",
"- $X^T$ is transpose of matrix X. It will have n rows and m columns\n",
"\n",
"\n",
"### Training step\n",
"\n",
"$\\boldsymbol{W_{new} = W - \\eta * \\nabla MSE} \\tag {5.3}$"
]
},
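{
"cell_type": "markdown",
"metadata": {},
"source": [
"The sketch below shows equations (5.1) and (5.2) as NumPy matrix operations on a tiny made-up X, Y and W, mainly to make the shapes concrete: with m = 3 data points and n = 3 columns (bias plus two features), P has length m and the gradient has length n.\n",
"\n",
"``` python\n",
"import numpy as np\n",
"\n",
"X = np.array([[1.0, 23.4, 0.01],   # m = 3 rows, n = 3 columns (bias, x1, x2)\n",
"              [1.0, 21.7, 0.02],\n",
"              [1.0, 22.1, 0.015]])\n",
"Y = np.array([7.0, 6.5, 6.0])\n",
"W = np.array([0.5, 0.1, 0.08])     # arbitrary weights\n",
"m = len(Y)\n",
"\n",
"P = np.matmul(X, W)                   # equation (5.1): predictions, length m\n",
"grad = (2/m)*np.matmul(X.T, P - Y)    # equation (5.2): gradient, length n\n",
"print(P.shape, grad.shape)            # (3,) (3,)\n",
"```"
]
},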
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Coding Time"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this part we'll implement linear regression using basic python and numpy library. numpy library is the fundamental package for scientific computing in python. Its make computation on large array of data easy and efficient. \n",
"\n",
"### Numpy Introduction\n",
"Please don't get overwhelmed by numpy, we'll only use some basic functions of it. It will be explained on comment what each function does.\n",
"\n",
"**Installing Numpy**\n",
"If you have installed python using conda or miniconda it should be already installed. If its not installed in you system you can use pip to install it\n",
"``` command\n",
"pip install numpy\n",
"```\n",
"Then you can import numpy library in you jupyter notebook or python file as follow\n",
"``` python\n",
"import numpy as np\n",
"```\n",
"\n",
"**Numpy Array Operations**\n",
"Numpy provides shorthand notations to performs arithmetic operation on array, some examples are as follow\n",
"\n",
"```arr1+arr2```: will returns new array which is element wise sum of two arrays \n",
"```2*arr1```: will multiply each element of arr1 by 2 \n",
"```arr1*arr2```: will return new array which is element wise multiplication of two array \n",
"\n"
]
},
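{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here is a small runnable example of the three operations above (the array values are arbitrary):\n",
"\n",
"``` python\n",
"import numpy as np\n",
"\n",
"arr1 = np.array([1.0, 2.0, 3.0])\n",
"arr2 = np.array([10.0, 20.0, 30.0])\n",
"\n",
"print(arr1 + arr2)  # [11. 22. 33.]  element-wise sum\n",
"print(2*arr1)       # [2. 4. 6.]     each element multiplied by 2\n",
"print(arr1*arr2)    # [10. 40. 90.]  element-wise product\n",
"```"
]
},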
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lets Generate Data"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"few instances of generated data\n",
" [[12.15862374 26.41754744]\n",
" [ 7.7100876 26.58160233]\n",
" [12.39040209 27.5565463 ]\n",
" [14.89781833 26.14161419]\n",
" [12.89437241 29.96356333]]\n",
"\n",
"few instances of generated targets\n",
" [4.37946646 3.41541999 3.41400057 4.3290903 3.70089309]\n"
]
}
],
"source": [
"import numpy as np\n",
"\n",
"# np.random.randn(100) will give array of 100 normally distributed random \n",
"# numbers with mean 0 and std-dev 1\n",
"\n",
"# 2*np.random.randn(100)+12 changes the normal distrubution to have mean 12 and std-dev 2\n",
"nitrate = 2*np.random.randn(100)+12\n",
"temperature = 4*np.random.rand(100) + 26\n",
"\n",
"# np.c_ concatinates (joins) two array column wise\n",
"x_farm = np.c_[nitrate,temperature]\n",
"\n",
"# This is imaginary equation describing relation between yeild, nitrate and temperature.\n",
"# Obiously this is not know in real world problem. We are using it to generate dummy data.\n",
"yeild_ideal = .1*nitrate + .08*temperature +.6\n",
"\n",
"# adding some noise on the ideal equation. \n",
"# The noise is normally distributed with 0 mean and std-dev 0.4\n",
"yeild = yeild_ideal + .4*np.random.randn(100)\n",
"\n",
"print(\"few instances of generated data\\n\", x_farm[:5])\n",
"print(\"\\nfew instances of generated targets\\n\", yeild[:5] )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Lets Define Some Functions"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"# Our equations above need bais term in x which should always be 1 \n",
"def add_bais_term(x):\n",
" \n",
" # np.ones(n) will give new array of length n whose all elements are 1 \n",
" # np.c_ concatinates two array column wise\n",
" return np.c_[np.ones(len(x)),x]\n",
"\n",
"# Root mean square cost function\n",
"def rmse_cost_func(P,Y):\n",
" ## model is array with bais and coffecients values\n",
" return np.sqrt(np.mean((P-Y)**2))\n",
"\n",
"\n",
"# Finds gradient of cost function using eq(5.2) above\n",
"def gradient_of_cost(x,y,model):\n",
" preds = predict(x,model)\n",
" \n",
" error_term = preds-y\n",
" \n",
" # np.matmul performs matrix multiplication\n",
" # x.T is transpose of matrix x\n",
" xt_dot_error_term = np.matmul(x.T,error_term)/len(x)\n",
" return xt_dot_error_term\n",
"\n",
"# Do prediction using eq(5.1) above\n",
"def predict(x,model):\n",
" #np.matmul performs matrix multiplication\n",
" return np.matmul(x,model)"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"# find optimum values of bais and coefficient using gradient descent\n",
"def find_linear_regression_model(x,y):\n",
" n_epochs = 10000\n",
" neta = 0.001\n",
" \n",
" # Initialize all parameters(wj's) to zero\n",
" model = np.zeros(len(x[0]))\n",
" \n",
" # do n_epochs iteration\n",
" for _ in range(n_epochs):\n",
" grad = gradient_of_cost(x,y,model)\n",
" \n",
" # move parameters closer to optimum solution in every step \n",
" next_model = model - neta*grad\n",
" model = next_model\n",
" return model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predict and Evaluate"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First 3 data with bias\n",
" [[ 1. 12.15862374 26.41754744]\n",
" [ 1. 7.7100876 26.58160233]\n",
" [ 1. 12.39040209 27.5565463 ]]\n",
"\n",
"model(w0,w1,w2)\n",
" [0.03759733 0.09930154 0.10164468]\n"
]
}
],
"source": [
"x_farm_with_bais = add_bais_term(x_farm)\n",
"print(\"First 3 data with bias\\n\",x_farm_with_bais[:3])\n",
"model = find_linear_regression_model(x_farm_with_bais, yeild)\n",
"print(\"\\nmodel(w0,w1,w2)\\n\", model)"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([3.93017065, 3.50509946, 4.06895977, 4.17412975, 4.36366529,\n",
" 3.79386452, 3.74566146, 3.81563257, 4.19976404, 3.90262785,\n",
" 4.21451357, 4.05253678, 4.34120954, 4.06686309, 4.24635708])"
]
},
"execution_count": 34,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# predict yeild for \n",
"yeild_predicted = predict(x_farm_with_bais, model)\n",
"yeild_predicted[:15]"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.3679644024562277\n"
]
}
],
"source": [
"cost = rmse_cost_func(yeild_predicted,yeild)\n",
"print(cost)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Hurray!!**\n",
"\n",
"We have now implemented linear regression and used it to make predictions. We also measured RMSE, which is equivalent to mean of absolute error made when doing a prediction. The mean error is close to std-dev of random error added which mean the model is working quite good.\n",
"\n",
"### Disclamer\n",
"Above examples are only for illustration purpose to understand the working of linear regression from inside. Following are few things to be noted \n",
"- In real world we will not know the relationship of dependent and independent variables in advance as we do now. Or even we don't know if relationship is linear, we can infer something about the relation using scatter plot or other data analysis.\n",
"- We usually don't evaluate performance of a model using training data because a model can memorize the training data to get high performance in training data but may not work well with new unseen data. To avoid this we usually separate data into three sets, training, validation and testing. Validation set is used to validate model performance and used to improve model. Testing set is used only after final model is found to estimate performance of the model on unseen data.\n",
"- This is a very basic implementation of linear regression to understand the core concepts. We hardly ever need to do our own implementation. We'll always use linear regression implementation from popular libraries which will be much more robust, efficient and provides customization parameters"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Summary\n",
"Here is short summary of linear regression algorithms\n",
"- It has linear **hypothesis equation** for predicting result\n",
" $y=w_0+w_1*x_1+w_2*x_{2} \\tag {1.1}$\n",
" \n",
" \n",
"- We used **Root Mean Square(RMSE)** cost function which need to be minimized to find best solution\n",
" $Mean\\ Square\\ Error(MSE) = \\frac{1}{m} \\sum_{i=1}^m (y_i-p_i)^2 \\tag {2.1}$\n",
" \n",
" \n",
"- For finding best solution we started at random parameter values and gradually moved towards reducing cost function until we reached the minimum. This is called **gradient descent** and it can be summarized by equation\n",
" $W_{new} = W - \\eta * \\nabla MSE \\tag {5.3}$\n",
" \n",
" \n",
"- Then we made **matrix representation** of these operations which are as below\n",
" $\\boldsymbol {P = X.W} \\tag {5.1}$ for doing predictions \n",
" $\\boldsymbol{\\nabla MSE= \\frac{2}{m} X^T.(P-Y)}\\tag {5.2}$ for finding gradient \n",
" $\\boldsymbol{W_{new} = W - \\eta * \\nabla MSE} \\tag {5.3}$ training step to reduce cost\n",
" \n",
" \n",
"- We implemented the algorithm using python and numpy library "
]
}
],
"metadata": {
"gist": {
"data": {
"description": "linear_regression_blog.ipynb",
"public": true
},
"id": ""
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}