@pb111
Created November 17, 2018 15:45
SLR Project
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Simple Linear Regression Project\n",
"\n",
"\n",
"## Modelling the linear relationship between Sales and Advertising\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Project overview\n",
"\n",
"\n",
"In this project, I build a Simple Linear Regression model to study the linear relationship between Sales and Advertising expenditure for a dietary weight control product.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Linear Regression\n",
"\n",
"\n",
"Linear Regression is a statistical technique used to find the linear relationship between a dependent variable and one or more independent variables. It applies to supervised learning regression problems, where we try to predict a continuous variable.\n",
"\n",
"\n",
"Linear Regression can be further classified into two types – Simple and Multiple Linear Regression. In this project, I use Simple Linear Regression, which has one independent and one dependent variable. It is the simplest form of Linear Regression: we fit a straight line to the data."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Simple Linear Regression (SLR)\n",
"\n",
"Simple Linear Regression (or SLR) is one of the simplest models in machine learning. It models the linear relationship between one independent variable and one dependent variable. \n",
"\n",
"In this project, the independent (input) variable is the Sales data, denoted by X, and the dependent (output) variable is the Advertising data, denoted by y. We want to model the linear relationship between these variables with an equation of the form:\n",
"\t\t\t\t \n",
" \n",
" Y = β0 + β1*X ------------- (1)\n",
" \n",
"\n",
"In this equation, X and Y are called independent and dependent variables respectively,\n",
"\n",
"β1 is the coefficient for independent variable and\n",
"\n",
"β0 is the constant term.\n",
"\n",
"β0 and β1 are called parameters of the model.\n",
" \n",
"\n",
"\n",
"For simplicity, we can compare the above equation with the basic line equation of the form:-\n",
" \n",
" y = ax + b ----------------- (2)\n",
"\n",
"We can see that \n",
"\n",
"slope of the line is given by, a = β1, and\n",
"\n",
"intercept of the line by b = β0. \n",
"\n",
"\n",
"In this Simple Linear Regression model, we want to fit a line which estimates the linear relationship between X and Y. So, the question of fitting reduces to estimating the parameters of the model β0 and β1. \n",
"\n",
" \n",
"\n",
"## Ordinary Least Squares Method\n",
"\n",
"As I have described earlier, the Sales and Advertising data are given by X and y respectively. We can draw a scatter plot between X and y which shows the relationship between them.\n",
"\n",
" \n",
"\n",
"Now, our task is to find a line which best fits this scatter plot. This line will help us to predict the value of any Target variable for any given Feature variable. This line is called **Regression line**. \n",
"\n",
"\n",
"We can define an error function for any line. Then, the regression line is the one which minimizes the error function. Such an error function is also called a **Cost function**. \n",
"\n",
"\n",
"## Cost Function\n",
"\n",
"We want the Regression line to resemble the dataset as closely as possible. In other words, we want the line to be as close to actual data points as possible. It can be achieved by minimizing the vertical distance between the actual data point and fitted line. I calculate the vertical distance between each data point and the line. This distance is called the **residual**. \n",
"\n",
"\n",
"So, in a regression model, we try to minimize the residuals by finding the line of best fit. The residuals are represented by the vertical dotted lines from actual data points to the line.\n",
"\n",
" \n",
"We can try to minimize the sum of the residuals, but then a large positive residual would cancel out a large negative residual. For this reason, we minimize the sum of the squares of the residuals. \n",
"\n",
"\n",
"Mathematically, we denote actual data points by yi and predicted data points by ŷi. So, the residual for a data point i would be given as \n",
"\t\t\t\t\n",
" di = yi - ŷi\n",
"\n",
"Sum of the squares of the residuals is given as:\n",
"\n",
"\t\t\t\tD = Ʃ di**2 for all data points\n",
" \n",
"\n",
"This is the **Cost function**. It denotes the total error in the model, which is the sum of the squared errors of the individual data points. \n",
"\n",
"We can estimate the parameters β0 and β1 by minimizing D. Thus, we can find the regression line given by equation (1).\n",
"\n",
"\n",
"This method of finding the parameters of the model, and thus the regression line, is called the **Ordinary Least Squares Method**.\n"
]
},
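{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: least squares estimates by hand\n",
"\n",
"For a single feature, minimizing D has a well-known closed-form solution: β1 is the sum of products of the deviations of x and y from their means, divided by the sum of squared deviations of x, and β0 = mean(y) - β1 * mean(x). The snippet below is a minimal sketch of that calculation. The arrays `x` and `y` are hypothetical toy values, not the Sales and Advertising data, and the variable names are chosen only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical toy data (assumption: not the Sales/Advertising dataset)\n",
"x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])\n",
"y = np.array([2.0, 4.1, 5.9, 8.2, 9.8])\n",
"\n",
"# Closed-form least squares estimates for one feature\n",
"x_mean, y_mean = x.mean(), y.mean()\n",
"beta1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)\n",
"beta0 = y_mean - beta1 * x_mean\n",
"\n",
"# Cost function D: sum of squared residuals at the fitted line\n",
"residuals = y - (beta0 + beta1 * x)\n",
"D = np.sum(residuals ** 2)\n",
"\n",
"print('slope:', beta1, 'intercept:', beta0, 'D:', D)\n",
"```\n",
"\n",
"Any other choice of slope and intercept gives a larger value of D on the same data, which is what makes this line the regression line."
]
},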
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## The problem statement\n",
"\n",
"The aim of building a machine learning model is to solve a problem, and a metric is needed to measure how well the model solves it. \n",
"\n",
"The problem is to model and investigate the linear relationship between Sales and Advertising for a dietary weight control product. \n",
"\n",
"I use two performance metrics, RMSE (Root Mean Square Error) and the R2 score, to measure model performance.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Software information\n",
"\n",
"I did this project using Jupyter Notebook (notebook server 5.5.0).\n",
"\n",
"The server runs on Python 3.6.5 from the Anaconda distribution.\n",
" \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Python libraries\n",
"\n",
"I have Anaconda Python distribution installed on my system. It comes with most of the standard Python libraries I need for this project. The basic Python libraries used in this project are:-\n",
"\n",
" •\tNumpy – It provides a fast numerical array structure and operating functions.\n",
" \n",
" •\tpandas – It provides tools for data storage, manipulation and analysis tasks.\n",
" \n",
" •\tScikit-Learn – The required machine learning library in Python.\n",
" \n",
" •\tMatplotlib – It is the basic plotting library in Python. It provides tools for making plots. \n"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Import necessary libraries\n",
"\n",
"import numpy as np\n",
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"%matplotlib inline\n",
"\n",
"# The above command sets the backend of matplotlib to the 'inline' backend. \n",
"# It means the output of plotting commands is displayed inline."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## About the dataset\n",
"\n",
"The data set has been imported from the econometrics website with the following url:-\n",
"\n",
"http://www.econometrics.com/intro/sales.htm\n",
"\n",
"This data set contains Sales and Advertising expenditures for a dietary weight control product. It contains monthly data for 36 months. The variables in this data set are Sales and Advertising.\n",
"\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"# Import the data\n",
"\n",
"url = \"C:/project_datasets/SALES.txt\"\n",
"df = pd.read_csv(url, sep='\\t', header=None)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Exploratory data analysis\n",
"\n",
"\n",
"First, I import the dataset into a dataframe with the pandas read_csv() function and assign it to the df variable. Then, I conduct exploratory data analysis to get a feel for the data.\n",
"\n",
"\n",
"### pandas shape attribute\n",
"\n",
"The shape attribute of the pandas dataframe gives the dimensions of the dataframe."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(36, 2)\n"
]
}
],
"source": [
"# Exploratory data analysis\n",
"\n",
"# View the dimensions of df\n",
"\n",
"print(df.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas head() method\n",
"\n",
"I viewed the top 5 rows of the pandas dataframe with the pandas head() method."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1\n",
"0 12.0 15.0\n",
"1 20.5 16.0\n",
"2 21.0 18.0\n",
"3 15.5 27.0\n",
"4 15.3 21.0\n"
]
}
],
"source": [
"# View the top 5 rows of df\n",
"\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas columns attribute\n",
"\n",
"I renamed the column labels of the dataframe with the columns attribute."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"# Rename columns of df dataframe\n",
"\n",
"df.columns = ['Sales', 'Advertising']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### column names renamed\n",
"\n",
"I viewed the renamed column names."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Sales Advertising\n",
"0 12.0 15.0\n",
"1 20.5 16.0\n",
"2 21.0 18.0\n",
"3 15.5 27.0\n",
"4 15.3 21.0\n"
]
}
],
"source": [
"# View the top 5 rows of df with column names renamed\n",
"\n",
"print(df.head())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas info() method\n",
"\n",
"I viewed the summary of the dataframe with the pandas info() method."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 36 entries, 0 to 35\n",
"Data columns (total 2 columns):\n",
"Sales 36 non-null float64\n",
"Advertising 36 non-null float64\n",
"dtypes: float64(2)\n",
"memory usage: 656.0 bytes\n",
"None\n"
]
}
],
"source": [
"# View dataframe summary\n",
"\n",
"print(df.info())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### pandas describe() method\n",
"\n",
"I look at the descriptive statistics of the dataframe with the pandas describe() method."
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" Sales Advertising\n",
"count 36.000000 36.000000\n",
"mean 24.255556 28.527778\n",
"std 6.185118 18.777625\n",
"min 12.000000 1.000000\n",
"25% 20.300000 15.750000\n",
"50% 24.250000 23.000000\n",
"75% 28.600000 41.000000\n",
"max 36.500000 65.000000\n"
]
}
],
"source": [
"# View descriptive statistics\n",
"\n",
"print(df.describe())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Independent and Dependent Variables\n",
"\n",
"\n",
"In this project, I refer to the independent variable as the feature variable and to the dependent variable as the target variable. These variables also go by other names, as follows:\n",
"\n",
"\n",
"### Independent variable\n",
"\n",
"Independent variable is also called Input variable and is denoted by X. In practical applications, independent variable is also called Feature variable or Predictor variable. We can denote it as:-\n",
"\n",
"Independent or Input variable (X) = Feature variable = Predictor variable \n",
"\n",
"\n",
"### Dependent variable\n",
"\n",
"Dependent variable is also called Output variable and is denoted by y. \n",
"\n",
"The dependent variable is also called the Target variable or Response variable. It can be denoted as follows:\n",
"\n",
"Dependent or Output variable (y) = Target variable = Response variable\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"# Declare feature variable and target variable\n",
"\n",
"X = df['Sales'].values\n",
"y = df['Advertising'].values\n",
"\n",
"# Sales and Advertising data values are given by X and y respectively.\n",
"\n",
"# Values attribute of pandas dataframe returns the numpy arrays."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Visual exploratory data analysis\n",
"\n",
"I visualize the relationship between X and y by plotting a scatterplot between X and y."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot scatter plot between X and y\n",
"\n",
"plt.scatter(X, y, color = 'blue', label='Scatter Plot')\n",
"plt.title('Relationship between Sales and Advertising')\n",
"plt.xlabel('Sales')\n",
"plt.ylabel('Advertising')\n",
"plt.legend(loc=4)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking dimensions of X and y\n",
"\n",
"We need to check the dimensions of X and y to make sure they are in the right format for the Scikit-Learn API. \n",
"\n",
"It is an important precursor to model building. "
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(36,)\n",
"(36,)\n"
]
}
],
"source": [
"# Print the dimensions of X and y\n",
"\n",
"print(X.shape)\n",
"print(y.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reshaping X and y\n",
"\n",
"Since we are working with only one feature variable, we need to reshape the arrays using the NumPy reshape() method, because Scikit-Learn expects the feature array X to be two-dimensional.\n",
"\n",
"The first dimension is given as -1, which means \"unspecified\".\n",
"\n",
"Its value is inferred from the length of the array and the remaining dimensions.\n"
]
},
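{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: what reshape(-1, 1) does\n",
"\n",
"The inference of the -1 dimension can be seen on a small array. This is a minimal sketch: the array `a` holds the first four Sales values printed by head() above, and the names `a` and `col` are used only for illustration.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"a = np.array([12.0, 20.5, 21.0, 15.5])  # 1-D array with 4 elements\n",
"col = a.reshape(-1, 1)                   # -1 lets NumPy infer the 4 rows\n",
"\n",
"print(a.shape)    # shape before reshaping\n",
"print(col.shape)  # shape after reshaping\n",
"```\n",
"\n",
"The values are unchanged; only the layout moves from a flat array to a single column."
]
},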
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"# Reshape X and y\n",
"\n",
"X = X.reshape(-1,1)\n",
"y = y.reshape(-1,1)\n"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(36, 1)\n",
"(36, 1)\n"
]
}
],
"source": [
"# Print the dimensions of X and y after reshaping\n",
"\n",
"print(X.shape)\n",
"print(y.shape)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Difference in dimensions of X and y after reshaping\n",
"\n",
"\n",
"We can see the difference in the dimensions of X and y before and after reshaping: both change from (36,) to (36, 1).\n",
"\n",
"This step is essential because getting the feature and target variables into the right shape is an important precursor to model building."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train test split\n",
"\n",
"\n",
"I split the dataset into two sets, namely the train set and the test set.\n",
"\n",
"The model learns the relationships from the training data and predicts on the test data.\n"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"# Split X and y into training and test data sets\n",
"\n",
"from sklearn.model_selection import train_test_split\n",
"X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"(24, 1)\n",
"(24, 1)\n",
"(12, 1)\n",
"(12, 1)\n"
]
}
],
"source": [
"# Print the dimensions of X_train,X_test,y_train,y_test\n",
"\n",
"print(X_train.shape)\n",
"print(y_train.shape)\n",
"print(X_test.shape)\n",
"print(y_test.shape)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mechanics of the model\n",
"\n",
"\n",
"I split the dataset into two sets – the training set and the test set. Then, I instantiate the regressor lm and fit it on the training set with the fit method. \n",
"\n",
"In this step, the model learns the relationship between X_train and y_train. \n",
"\n",
"Now the model is ready to make predictions on the test data (X_test). Hence, I predict on the test data using the predict method. \n"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [],
"source": [
"# Fit the linear model\n",
"\n",
"# Instantiate the linear regression object lm\n",
"from sklearn.linear_model import LinearRegression\n",
"lm = LinearRegression()\n",
"\n",
"\n",
"# Train the model using training data sets\n",
"lm.fit(X_train,y_train)\n",
"\n",
"\n",
"# Predict on the test data\n",
"y_pred=lm.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Model slope and intercept term\n",
"\n",
"The model slope is given by lm.coef_ and model intercept term is given by lm.intercept_. \n",
"\n",
"The estimated model slope and intercept values are 1.60509347 and -11.16003616.\n",
"\n",
"So, the equation of the fitted regression line is\n",
"\n",
"y = 1.60509347 * x - 11.16003616 \n"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Estimated model slope, a: [[1.60509347]]\n",
"Estimated model intercept, b: [-11.16003616]\n"
]
}
],
"source": [
"# Compute model slope and intercept\n",
"\n",
"a = lm.coef_\n",
"b = lm.intercept_\n",
"print(\"Estimated model slope, a:\" , a)\n",
"print(\"Estimated model intercept, b:\" , b) \n"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"# So, our fitted regression line is \n",
"\n",
"# y = 1.60509347 * x - 11.16003616 \n",
"\n",
"# That is our linear model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Making predictions\n",
"\n",
"\n",
"I predict the Advertising values for the first five Sales values by writing\n",
"\n",
"\n",
"\t\tlm.predict(X) [0:5] \n",
" \n",
"\n",
"If I remove [0:5], then I will get predicted Advertising values for the whole Sales dataset.\n",
"\n",
"\n",
"To make a prediction for an individual Sales value, I write\n",
"\n",
"\n",
"\t\tlm.predict(Xi)\n",
" \n",
"\n",
"where Xi is the Sales value of the ith observation, passed as a two-dimensional array such as [[24]].\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[ 8.10108551],\n",
" [21.74438002],\n",
" [22.54692675],\n",
" [13.71891266],\n",
" [13.39789396]])"
]
},
"execution_count": 20,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Predicting Advertising values\n",
"\n",
"lm.predict(X)[0:5]\n",
"\n",
"# Predicting Advertising values on first five Sales values."
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[[27.36220717]]\n"
]
}
],
"source": [
"# Make an individual prediction using the linear regression model.\n",
"# predict() expects a 2-D array of shape (n_samples, n_features).\n",
"\n",
"print(lm.predict([[24]]))"
]
},
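{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: the same prediction from the line equation\n",
"\n",
"The prediction above can be checked by plugging the Sales value into the fitted line directly. This is a minimal sketch; `slope` and `intercept` are copied from the values printed earlier in this notebook, and the variable names are only illustrative.\n",
"\n",
"```python\n",
"# Coefficients as printed by the notebook for lm.coef_ and lm.intercept_\n",
"slope = 1.60509347\n",
"intercept = -11.16003616\n",
"\n",
"sales_value = 24\n",
"advertising_pred = slope * sales_value + intercept  # y = a*x + b\n",
"\n",
"print(round(advertising_pred, 4))\n",
"```\n",
"\n",
"Up to rounding of the printed coefficients, this agrees with the value returned by lm.predict for a Sales value of 24."
]
},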
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Regression metrics for model performance\n",
"\n",
"\n",
"Now, it is the time to evaluate model performance. \n",
"\n",
"For regression problems, two commonly used metrics of model performance are RMSE (Root Mean Square Error) and the R-squared value. These are explained below:\n",
"\n",
"\n",
"### RMSE\n",
"\n",
"RMSE is the square root of the Mean Squared Error. When the residuals have zero mean, it equals their standard deviation, so it measures the spread of the variation left unexplained by the model.\n",
"RMSE is an absolute measure of fit, expressed in the units of the target variable. The more concentrated the data is around the regression line, the smaller the residuals and the lower the RMSE. So, lower values of RMSE indicate a better fit. \n"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"RMSE value: 11.2273\n"
]
}
],
"source": [
"# Calculate and print Root Mean Square Error(RMSE)\n",
"\n",
"from sklearn.metrics import mean_squared_error\n",
"mse = mean_squared_error(y_test, y_pred)\n",
"rmse = np.sqrt(mse)\n",
"print(\"RMSE value: {:.4f}\".format(rmse))\n"
]
},
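{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: RMSE computed by hand\n",
"\n",
"To make the definition concrete, RMSE can be computed directly from the residuals. This is a minimal sketch; `y_true` and `y_hat` are hypothetical values, not the notebook's test set.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical observed and predicted values (assumption: illustration only)\n",
"y_true = np.array([15.0, 16.0, 18.0, 27.0])\n",
"y_hat = np.array([12.0, 18.0, 21.0, 23.0])\n",
"\n",
"residuals = y_true - y_hat          # d_i = y_i - y_hat_i\n",
"mse = np.mean(residuals ** 2)       # Mean Squared Error\n",
"rmse_manual = np.sqrt(mse)          # Root Mean Square Error\n",
"\n",
"print('RMSE:', rmse_manual)\n",
"```\n",
"\n",
"This is the same quantity as taking np.sqrt of the value returned by mean_squared_error, as done in the cell above."
]
},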
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### R2 Score\n",
"\n",
"\n",
"R2 Score is another metric to evaluate the performance of a regression model. It is also called the coefficient of determination. It gives us an idea of the goodness of fit of a linear regression model. It indicates the proportion of the variance in the dependent variable that is explained by the model. \n",
"\n",
"\n",
"Mathematically, \n",
"\n",
"\n",
"R2 Score = Explained Variation/Total Variation\n",
"\n",
"\n",
"In general, the higher the R2 Score value, the better the model fits the data. Usually, its value ranges from 0 to 1, and we want it to be as close to 1 as possible. Its value can become negative when the model fits worse than a horizontal line at the mean of the target.\n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"R2 Score value: 0.5789\n"
]
}
],
"source": [
"# Calculate and print r2_score\n",
"\n",
"from sklearn.metrics import r2_score\n",
"print (\"R2 Score value: {:.4f}\".format(r2_score(y_test, y_pred)))\n"
]
},
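{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: R2 score computed by hand\n",
"\n",
"The r2_score function computes 1 minus the ratio of the residual sum of squares to the total sum of squares. The snippet below is a minimal sketch of that formula; `y_true` and `y_hat` are hypothetical values, not the notebook's test set.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Hypothetical observed and predicted values (assumption: illustration only)\n",
"y_true = np.array([15.0, 16.0, 18.0, 27.0])\n",
"y_hat = np.array([12.0, 18.0, 21.0, 23.0])\n",
"\n",
"ss_res = np.sum((y_true - y_hat) ** 2)          # unexplained variation\n",
"ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation\n",
"r2_manual = 1.0 - ss_res / ss_tot\n",
"\n",
"print('R2:', r2_manual)\n",
"```\n",
"\n",
"If the predictions were worse than always predicting the mean of y_true, ss_res would exceed ss_tot and the score would be negative."
]
},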
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Interpretation and Conclusion\n",
"\n",
"\n",
"The RMSE value has been found to be 11.2273. It means the typical size of the prediction error is about 11.2273 units of Advertising; some predictions will be off by more than this and others by less. Compared with the Advertising standard deviation of 18.7776 shown by describe(), this error is fairly large. So, the model is not a good fit to the data. \n",
"\n",
"\n",
"As a rule of thumb, I take 0.7 as the benchmark for the R2 score value. It means if the R2 score value >= 0.7, then the model is good enough to deploy on unseen data, whereas if the R2 score value < 0.7, then the model is not good enough to deploy. Our R2 score value has been found to be 0.5789. It means that this model explains 57.89 % of the variance in our dependent variable. So, by this benchmark, the model is not good enough to deploy because it does not provide a good fit to the data.\n"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plot the Regression Line\n",
"\n",
"\n",
"plt.scatter(X, y, color = 'blue', label='Scatter Plot')\n",
"plt.plot(X_test, y_pred, color = 'black', linewidth=3, label = 'Regression Line')\n",
"plt.title('Relationship between Sales and Advertising')\n",
"plt.xlabel('Sales')\n",
"plt.ylabel('Advertising')\n",
"plt.legend(loc=4)\n",
"plt.show()\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Residual analysis\n",
"\n",
"\n",
"\n",
"A linear regression model may not represent the data appropriately. The model may be a poor fit to the data. So, we should validate our model by defining and examining residual plots.\n",
"\n",
"The difference between the observed value of the dependent variable (yi) and the predicted value (ŷi) is called the residual and is denoted by e. The scatter plot of these residuals is called a residual plot.\n",
"\n",
"If the data points in a residual plot are randomly dispersed around the horizontal axis with an approximately zero mean, a linear regression model may be appropriate for the data. Otherwise, a non-linear model may be more appropriate.\n",
"\n",
"If we take a look at the generated ‘Residual errors’ plot, we can see that the train data plot pattern is non-random. The same is the case with the test data plot pattern.\n",
"So, it suggests that a non-linear model would fit better. \n",
"\n"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Plotting residual errors\n",
"\n",
"plt.scatter(lm.predict(X_train), lm.predict(X_train) - y_train, color = 'red', label = 'Train data')\n",
"plt.scatter(lm.predict(X_test), lm.predict(X_test) - y_test, color = 'blue', label = 'Test data')\n",
"plt.hlines(xmin = 0, xmax = 50, y = 0, linewidth = 3)\n",
"plt.title('Residual errors')\n",
"plt.legend(loc = 4)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checking for Overfitting and Underfitting\n",
"\n",
"\n",
"I calculate the training set score as 0.2861 and the test set score as 0.5789. \n",
"The training set score is very poor. So, the model does not learn the relationship appropriately from the training data and performs poorly on it. This is a clear sign of underfitting. Hence, it supports my finding that the linear regression model does not provide a good fit to the data. \n",
"\n",
"\n",
"Underfitting means our model performs poorly on the training data. It means the model does not capture the relationship between the feature and the target in the training data. This problem can be addressed by increasing model complexity, for example with Polynomial regression. \n"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Training set score: 0.2861\n",
"Test set score: 0.5789\n"
]
}
],
"source": [
"# Checking for Overfitting or Underfitting the data\n",
"\n",
"print(\"Training set score: {:.4f}\".format(lm.score(X_train,y_train)))\n",
"\n",
"print(\"Test set score: {:.4f}\".format(lm.score(X_test,y_test)))"
]
},
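{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Sketch: a straight line versus a polynomial fit\n",
"\n",
"The idea of increasing model complexity can be illustrated on curved data. This is a minimal sketch under stated assumptions: `x_demo` and `y_demo` are synthetic values generated here, not the Sales and Advertising data, degree 2 is an arbitrary illustrative choice, and `r2` is a small helper defined only for this example. It uses np.polyfit to keep the example short; in Scikit-Learn a similar model could be built with PolynomialFeatures followed by LinearRegression.\n",
"\n",
"```python\n",
"import numpy as np\n",
"\n",
"# Synthetic curved data (assumption: illustration only)\n",
"rng = np.random.RandomState(0)\n",
"x_demo = np.linspace(0, 4, 40)\n",
"y_demo = 1.0 + 0.5 * x_demo ** 2 + rng.normal(scale=0.2, size=40)\n",
"\n",
"def r2(y_true, y_hat):\n",
"    # Coefficient of determination: 1 - SS_res / SS_tot\n",
"    ss_res = np.sum((y_true - y_hat) ** 2)\n",
"    ss_tot = np.sum((y_true - y_true.mean()) ** 2)\n",
"    return 1.0 - ss_res / ss_tot\n",
"\n",
"# Degree 1 is a straight line; degree 2 adds a squared term\n",
"line = np.polyval(np.polyfit(x_demo, y_demo, 1), x_demo)\n",
"curve = np.polyval(np.polyfit(x_demo, y_demo, 2), x_demo)\n",
"\n",
"r2_line = r2(y_demo, line)\n",
"r2_curve = r2(y_demo, curve)\n",
"print('straight line R2:', r2_line)\n",
"print('degree 2 R2:', r2_curve)\n",
"```\n",
"\n",
"On the data it was fitted to, the more flexible model cannot score lower than the straight line, so any gain should be confirmed on a held-out test set before concluding that it generalizes."
]
},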
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['lm_regressor.pkl']"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Save model for future use\n",
"\n",
"import joblib  # sklearn.externals.joblib was removed in scikit-learn 0.23\n",
"joblib.dump(lm, 'lm_regressor.pkl')\n",
"\n",
"# To load the model\n",
"\n",
"# lm2=joblib.load('lm_regressor.pkl')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Simple Linear Regression - Model Assumptions\n",
"\n",
"\n",
"\n",
"The Linear Regression Model is based on several assumptions which are listed below:-\n",
"\n",
"i.\tLinear relationship\n",
"ii.\tMultivariate normality\n",
"iii.\tNo or little multicollinearity\n",
"iv.\tNo auto-correlation\n",
"v.\tHomoscedasticity\n",
"\n",
"\n",
"### i.\tLinear relationship\n",
"\n",
"\n",
"The relationship between response and feature variables should be linear. This linear relationship assumption can be tested by plotting a scatter-plot between response and feature variables.\n",
"\n",
"\n",
"### ii.\tMultivariate normality\n",
"\n",
"Linear regression is often described as requiring all variables to be multivariate normal; for valid inference, what matters most is that the residuals are approximately normally distributed. A multivariate normal distribution means a vector of multiple normally distributed variables, where any linear combination of the variables is also normally distributed.\n",
"\n",
"\n",
"### iii.\tNo or little multicollinearity\n",
"\n",
"It is assumed that there is little or no multicollinearity in the data. Multicollinearity occurs when the features (or independent variables) are highly correlated.\n",
"\n",
"\n",
"### iv.\tNo auto-correlation\n",
"\n",
"Also, it is assumed that there is little or no auto-correlation in the data. Autocorrelation occurs when the residual errors are not independent from each other.\n",
"\n",
"\n",
"### v.\tHomoscedasticity\n",
"\n",
"Homoscedasticity describes a situation in which the variance of the error term (that is, the noise in the model) is the same across all values of the independent variables. It means the spread of the residuals is the same along the regression line. It can be checked by looking at a scatter plot of the residuals.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## References\n",
"\n",
"\n",
"The concepts and ideas in this project have been taken from the following websites and books:-\n",
"\n",
"i.\tMachine learning notes by Andrew Ng\n",
"\n",
"ii.\thttps://en.wikipedia.org/wiki/Linear_regression\n",
"\n",
"iii.\thttps://en.wikipedia.org/wiki/Simple_linear_regression\n",
"\n",
"iv.\thttps://en.wikipedia.org/wiki/Ordinary_least_squares\n",
"\n",
"v.\thttps://en.wikipedia.org/wiki/Root-mean-square_deviation\n",
"\n",
"vi.\thttps://en.wikipedia.org/wiki/Coefficient_of_determination\n",
"\n",
"vii.\thttps://www.statisticssolutions.com/assumptions-of-linear-regression/\n",
"\n",
"viii.\tPython Data Science Handbook by Jake VanderPlas\n",
"\n",
"ix.\tHands-On Machine Learning with Scikit-Learn and TensorFlow by Aurélien Géron\n",
"\n",
"x.\tIntroduction to Machine Learning with Python by Andreas C. Müller and Sarah Guido\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.5"
}
},
"nbformat": 4,
"nbformat_minor": 2
}