@tkhan0
Last active October 21, 2019 23:58
Multiple Linear Regression
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 1. Import the data into a DataFrame using pandas"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"df = pd.read_csv('/Admission_Predict_Ver1.2.csv',encoding = 'utf-8')"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Serial No.</th>\n",
" <th>GRE Score</th>\n",
" <th>TOEFL Score</th>\n",
" <th>University Rating</th>\n",
" <th>SOP</th>\n",
" <th>LOR</th>\n",
" <th>CGPA</th>\n",
" <th>Research</th>\n",
" <th>Admit</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1</td>\n",
" <td>337</td>\n",
" <td>118</td>\n",
" <td>4</td>\n",
" <td>4.5</td>\n",
" <td>4.5</td>\n",
" <td>9.65</td>\n",
" <td>1</td>\n",
" <td>0.92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2</td>\n",
" <td>324</td>\n",
" <td>107</td>\n",
" <td>4</td>\n",
" <td>4.0</td>\n",
" <td>4.5</td>\n",
" <td>8.87</td>\n",
" <td>1</td>\n",
" <td>0.76</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>3</td>\n",
" <td>316</td>\n",
" <td>104</td>\n",
" <td>3</td>\n",
" <td>3.0</td>\n",
" <td>3.5</td>\n",
" <td>8.00</td>\n",
" <td>1</td>\n",
" <td>0.72</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>4</td>\n",
" <td>322</td>\n",
" <td>110</td>\n",
" <td>3</td>\n",
" <td>3.5</td>\n",
" <td>2.5</td>\n",
" <td>8.67</td>\n",
" <td>1</td>\n",
" <td>0.80</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5</td>\n",
" <td>314</td>\n",
" <td>103</td>\n",
" <td>2</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>8.21</td>\n",
" <td>0</td>\n",
" <td>0.65</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \\\n",
"0 1 337 118 4 4.5 4.5 9.65 \n",
"1 2 324 107 4 4.0 4.5 8.87 \n",
"2 3 316 104 3 3.0 3.5 8.00 \n",
"3 4 322 110 3 3.5 2.5 8.67 \n",
"4 5 314 103 2 2.0 3.0 8.21 \n",
"\n",
" Research Admit \n",
"0 1 0.92 \n",
"1 1 0.76 \n",
"2 1 0.72 \n",
"3 1 0.80 \n",
"4 0 0.65 "
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 500 entries, 0 to 499\n",
"Data columns (total 9 columns):\n",
"Serial No. 500 non-null int64\n",
"GRE Score 500 non-null int64\n",
"TOEFL Score 500 non-null int64\n",
"University Rating 500 non-null int64\n",
"SOP 500 non-null float64\n",
"LOR 500 non-null float64\n",
"CGPA 500 non-null float64\n",
"Research 500 non-null int64\n",
"Admit 500 non-null float64\n",
"dtypes: float64(4), int64(5)\n",
"memory usage: 35.2 KB\n"
]
}
],
"source": [
"df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. It is good practice to shuffle the data to remove any order effects."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.utils import shuffle"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"df_shuffled = shuffle(df,random_state = 42)"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Serial No.</th>\n",
" <th>GRE Score</th>\n",
" <th>TOEFL Score</th>\n",
" <th>University Rating</th>\n",
" <th>SOP</th>\n",
" <th>LOR</th>\n",
" <th>CGPA</th>\n",
" <th>Research</th>\n",
" <th>Admit</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>361</th>\n",
" <td>362</td>\n",
" <td>334</td>\n",
" <td>116</td>\n",
" <td>4</td>\n",
" <td>4.0</td>\n",
" <td>3.5</td>\n",
" <td>9.54</td>\n",
" <td>1</td>\n",
" <td>0.93</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>74</td>\n",
" <td>314</td>\n",
" <td>108</td>\n",
" <td>4</td>\n",
" <td>4.5</td>\n",
" <td>4.0</td>\n",
" <td>9.04</td>\n",
" <td>1</td>\n",
" <td>0.84</td>\n",
" </tr>\n",
" <tr>\n",
" <th>374</th>\n",
" <td>375</td>\n",
" <td>315</td>\n",
" <td>105</td>\n",
" <td>2</td>\n",
" <td>2.0</td>\n",
" <td>2.5</td>\n",
" <td>7.65</td>\n",
" <td>0</td>\n",
" <td>0.39</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>156</td>\n",
" <td>312</td>\n",
" <td>109</td>\n",
" <td>3</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>8.69</td>\n",
" <td>0</td>\n",
" <td>0.77</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>105</td>\n",
" <td>326</td>\n",
" <td>112</td>\n",
" <td>3</td>\n",
" <td>3.5</td>\n",
" <td>3.0</td>\n",
" <td>9.05</td>\n",
" <td>1</td>\n",
" <td>0.74</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Serial No. GRE Score TOEFL Score University Rating SOP LOR CGPA \\\n",
"361 362 334 116 4 4.0 3.5 9.54 \n",
"73 74 314 108 4 4.5 4.0 9.04 \n",
"374 375 315 105 2 2.0 2.5 7.65 \n",
"155 156 312 109 3 3.0 3.0 8.69 \n",
"104 105 326 112 3 3.5 3.0 9.05 \n",
"\n",
" Research Admit \n",
"361 1 0.93 \n",
"73 1 0.84 \n",
"374 0 0.39 \n",
"155 0 0.77 \n",
"104 1 0.74 "
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_shuffled.head()"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"DV = 'Admit '"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Splitting the DataFrame df_shuffled into the feature variables (X) and the dependent variable (y)"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"X = df_shuffled.drop(['Admit ','Serial No.'], axis=1)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>GRE Score</th>\n",
" <th>TOEFL Score</th>\n",
" <th>University Rating</th>\n",
" <th>SOP</th>\n",
" <th>LOR</th>\n",
" <th>CGPA</th>\n",
" <th>Research</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>361</th>\n",
" <td>334</td>\n",
" <td>116</td>\n",
" <td>4</td>\n",
" <td>4.0</td>\n",
" <td>3.5</td>\n",
" <td>9.54</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73</th>\n",
" <td>314</td>\n",
" <td>108</td>\n",
" <td>4</td>\n",
" <td>4.5</td>\n",
" <td>4.0</td>\n",
" <td>9.04</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>374</th>\n",
" <td>315</td>\n",
" <td>105</td>\n",
" <td>2</td>\n",
" <td>2.0</td>\n",
" <td>2.5</td>\n",
" <td>7.65</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>155</th>\n",
" <td>312</td>\n",
" <td>109</td>\n",
" <td>3</td>\n",
" <td>3.0</td>\n",
" <td>3.0</td>\n",
" <td>8.69</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>104</th>\n",
" <td>326</td>\n",
" <td>112</td>\n",
" <td>3</td>\n",
" <td>3.5</td>\n",
" <td>3.0</td>\n",
" <td>9.05</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" GRE Score TOEFL Score University Rating SOP LOR CGPA Research\n",
"361 334 116 4 4.0 3.5 9.54 1\n",
"73 314 108 4 4.5 4.0 9.04 1\n",
"374 315 105 2 2.0 2.5 7.65 0\n",
"155 312 109 3 3.0 3.0 8.69 0\n",
"104 326 112 3 3.5 3.0 9.05 1"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X.head()"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [],
"source": [
"y = df_shuffled[DV]"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"361 0.93\n",
"73 0.84\n",
"374 0.39\n",
"155 0.77\n",
"104 0.74\n",
"Name: Admit , dtype: float64"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 4. Split X and y into training and testing sets"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = 0.33, random_state = 42)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>GRE Score</th>\n",
" <th>TOEFL Score</th>\n",
" <th>University Rating</th>\n",
" <th>SOP</th>\n",
" <th>LOR</th>\n",
" <th>CGPA</th>\n",
" <th>Research</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>443</th>\n",
" <td>321</td>\n",
" <td>114</td>\n",
" <td>5</td>\n",
" <td>4.5</td>\n",
" <td>4.5</td>\n",
" <td>9.16</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>497</th>\n",
" <td>330</td>\n",
" <td>120</td>\n",
" <td>5</td>\n",
" <td>4.5</td>\n",
" <td>5.0</td>\n",
" <td>9.56</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>124</th>\n",
" <td>301</td>\n",
" <td>106</td>\n",
" <td>4</td>\n",
" <td>2.5</td>\n",
" <td>3.0</td>\n",
" <td>8.47</td>\n",
" <td>0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>50</th>\n",
" <td>313</td>\n",
" <td>98</td>\n",
" <td>3</td>\n",
" <td>2.5</td>\n",
" <td>4.5</td>\n",
" <td>8.30</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>331</th>\n",
" <td>311</td>\n",
" <td>105</td>\n",
" <td>2</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" <td>8.12</td>\n",
" <td>1</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" GRE Score TOEFL Score University Rating SOP LOR CGPA Research\n",
"443 321 114 5 4.5 4.5 9.16 1\n",
"497 330 120 5 4.5 5.0 9.56 1\n",
"124 301 106 4 2.5 3.0 8.47 0\n",
"50 313 98 3 2.5 4.5 8.30 1\n",
"331 311 105 2 3.0 2.0 8.12 1"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(335, 7)"
]
},
"execution_count": 18,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Instantiating the Linear Regression model and fitting the model"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression()"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train[['GRE Score']],y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Extracting the intercept(c) and the coefficient(m)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"intercept = model.intercept_"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-2.6151678753807004"
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"intercept"
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [],
"source": [
"coefficient = model.coef_"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.01054397])"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coefficient"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. Printing the equation (y = c + mx) in terms of Admit (y), intercept (c), coefficient (m) and GRE Score (x)"
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Admit = -2.62 + (0.01 x GRE Score)\n"
]
}
],
"source": [
"print('Admit = {0:0.2f} + ({1:0.2f} x GRE Score)'.format(intercept,coefficient[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Predicted Value"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [],
"source": [
"# Uses the rounded values printed above, so these predictions are approximate\n",
"Admit = -2.62 + (0.01 * X_train['GRE Score'])"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"443 0.59\n",
"497 0.68\n",
"124 0.39\n",
"50 0.51\n",
"331 0.49\n",
"Name: GRE Score, dtype: float64"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Admit.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 8. Generate predictions for the test data"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [],
"source": [
"predictions = model.predict(X_test[['GRE Score']])"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(165,)"
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions.shape"
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.56911022, 0.62183005, 0.69563782, 0.55856625, 0.63237402,\n",
" 0.59019815, 0.82216543, 0.74835766, 0.65346195, 0.81162146,\n",
" 0.75890163, 0.59019815, 0.79053353, 0.79053353, 0.81162146,\n",
" 0.70618179, 0.85379733, 0.54802228, 0.68509386, 0.63237402,\n",
" 0.90651717, 0.8327094 , 0.77998956, 0.61128609, 0.8327094 ,\n",
" 0.54802228, 0.56911022, 0.8327094 , 0.71672576, 0.67454989,\n",
" 0.72726972, 0.67454989, 0.69563782, 0.50584641, 0.70618179,\n",
" 0.49530245, 0.70618179, 0.57965418, 0.69563782, 0.59019815,\n",
" 0.59019815, 0.70618179, 0.72726972, 0.88542923, 0.79053353,\n",
" 0.84325336, 0.90651717, 0.71672576, 0.63237402, 0.66400592,\n",
" 0.75890163, 0.66400592, 0.75890163, 0.52693435, 0.48475848,\n",
" 0.75890163, 0.67454989, 0.52693435, 0.72726972, 0.60074212,\n",
" 0.9276051 , 0.52693435, 0.74835766, 0.79053353, 0.60074212,\n",
" 0.70618179, 0.77998956, 0.93814907, 0.90651717, 0.79053353,\n",
" 0.69563782, 0.50584641, 0.8010775 , 0.72726972, 0.94869304,\n",
" 0.67454989, 0.52693435, 0.73781369, 0.76944559, 0.73781369,\n",
" 0.62183005, 0.64291799, 0.8010775 , 0.56911022, 0.84325336,\n",
" 0.74835766, 0.69563782, 0.8010775 , 0.8010775 , 0.96978097,\n",
" 0.59019815, 0.79053353, 0.8643413 , 0.81162146, 0.53747832,\n",
" 0.76944559, 0.54802228, 0.68509386, 0.62183005, 0.50584641,\n",
" 0.68509386, 0.71672576, 0.51639038, 0.82216543, 0.75890163,\n",
" 0.959237 , 0.59019815, 0.66400592, 0.84325336, 0.75890163,\n",
" 0.9276051 , 0.8327094 , 0.87488527, 0.55856625, 0.82216543,\n",
" 0.63237402, 0.59019815, 0.61128609, 0.85379733, 0.66400592,\n",
" 0.8010775 , 0.76944559, 0.79053353, 0.72726972, 0.71672576,\n",
" 0.67454989, 0.71672576, 0.77998956, 0.71672576, 0.54802228,\n",
" 0.74835766, 0.96978097, 0.85379733, 0.96978097, 0.8327094 ,\n",
" 0.73781369, 0.59019815, 0.8327094 , 0.54802228, 0.53747832,\n",
" 0.8010775 , 0.73781369, 0.66400592, 0.75890163, 0.66400592,\n",
" 0.87488527, 0.8327094 , 0.53747832, 0.63237402, 0.49530245,\n",
" 0.85379733, 0.66400592, 0.55856625, 0.88542923, 0.76944559,\n",
" 0.96978097, 0.67454989, 0.54802228, 0.74835766, 0.55856625,\n",
" 0.71672576, 0.65346195, 0.8010775 , 0.67454989, 0.72726972])"
]
},
"execution_count": 31,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9. Using the Pearson correlation coefficient to determine the direction and strength of the linear relationship between the dependent variable y (y_test) and its predicted values (the variable predictions in our case)\n",
"\n",
"ex: correlation_coeff, p_value = pearsonr(x,y). In the code below we access index 0 because correlation_coeff is the first element returned by pearsonr(x,y)."
]
},
{
"cell_type": "code",
"execution_count": 32,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"Text(0.5,1,'Predicted vs Actual Values (r =0.80)')"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import matplotlib.pyplot as plt\n",
"from scipy.stats import pearsonr\n",
"plt.scatter(y_test,predictions)\n",
"plt.xlabel('Y test (True values)')\n",
"plt.ylabel('Predicted Values')\n",
"plt.title('Predicted vs Actual Values (r ={0:0.2f})'.format(pearsonr(y_test,predictions)[0]))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot above shows the predicted values against the actual values (Pearson r = 0.80). \n",
"\n",
"This shows that there is a strong, positive linear correlation between the predicted and the actual values. For a perfect model all the points would lie on a straight line and the Pearson r value would be 1.0."
]
},
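{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how `pearsonr` returns its two values. The arrays `actual` and `predicted` are made-up values for illustration and are not taken from the admissions data. Unpacking the result gives the same coefficient as indexing `[0]`."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import pearsonr\n",
"\n",
"# Hypothetical values for illustration only (not from the admissions data)\n",
"actual = np.array([0.92, 0.76, 0.72, 0.80, 0.65])\n",
"predicted = np.array([0.90, 0.78, 0.70, 0.75, 0.68])\n",
"\n",
"# pearsonr returns the correlation coefficient first and the p-value second\n",
"correlation_coeff, p_value = pearsonr(actual, predicted)\n",
"\n",
"# Indexing [0] gives the same coefficient as unpacking\n",
"same_coeff = pearsonr(actual, predicted)[0]\n",
"print('r = {0:0.2f}'.format(correlation_coeff))"
]
},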
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 10. Now we will compute the residuals (the differences between the true and predicted values). Normally distributed residuals are a sign that the model fits the data well, and we can test for normality using the Shapiro-Wilk test.\n",
"\n",
"The Shapiro-Wilk test is a common test of normality. Its null hypothesis is that the data were drawn from a normal distribution.\n",
"\n",
"A p-value (probability value) < 0.05 leads us to reject the null hypothesis and treat the distribution as non-normal, while a p-value > 0.05 means we have no evidence against normality.\n",
"\n",
"In the code below we index element [1] of shapiro(y_test - predictions) because shapiro returns the test statistic at index [0] and the p-value at index [1].\n",
"\n",
"ex: shap_w, shap_p = shapiro(y_test - predictions)\n",
"\n",
"#### Let's take an example to better understand this:\n",
"\n",
"Suppose a food delivery app claims that its delivery times are 30 minutes or less on average, but you believe they are longer. To check, you carry out a hypothesis test. The null hypothesis is that the mean delivery time is at most 30 minutes. \n",
"\n",
"The alternative hypothesis is that the mean time is greater than 30 minutes. You randomly sample some delivery times and run them through the hypothesis test, and the p-value turns out to be 0.001, which is much less than 0.05. \n",
"\n",
"In real terms, if the app's claim were true, a sample at least as extreme as yours would occur with a probability of only about 0.001. Since we typically reject the null hypothesis when the p-value is less than 0.05, you conclude that the app's claim is wrong."
]
},
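{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of how to read the Shapiro-Wilk output, using simulated samples rather than the model residuals. The sample names, the sample size and the random seed are arbitrary assumptions made for illustration."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from scipy.stats import shapiro\n",
"\n",
"rng = np.random.RandomState(42)  # arbitrary seed for reproducibility\n",
"\n",
"# Simulated samples (assumptions for illustration, not the model residuals)\n",
"normal_sample = rng.normal(loc=0.0, scale=1.0, size=200)\n",
"skewed_sample = rng.exponential(scale=1.0, size=200)\n",
"\n",
"for name, sample in [('normal', normal_sample), ('skewed', skewed_sample)]:\n",
"    # shapiro returns the W statistic first and the p-value second\n",
"    shap_w, shap_p = shapiro(sample)\n",
"    decision = 'reject normality' if shap_p < 0.05 else 'fail to reject normality'\n",
"    print('{0}: W = {1:0.3f}, p = {2:0.3g} -> {3}'.format(name, shap_w, shap_p, decision))"
]
},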
{
"cell_type": "code",
"execution_count": 33,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\tkhan050\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\scipy\\stats\\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n",
" return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"from scipy.stats import shapiro\n",
"sns.distplot((y_test - predictions),bins = 50)\n",
"plt.xlabel('Residuals')\n",
"plt.ylabel('density')\n",
"plt.title('Histogram of residuals (Shapiro W p-value = {0:0.3f})'.format(shapiro(y_test-predictions)[1]))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram shows us that the residuals are negatively skewed and the value of the Shapiro W p-value in the title tells us that the distribution is not normal. This gives us further evidence that our model has room for improvement."
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [],
"source": [
"shap_w, shap_p = shapiro(y_test - predictions)"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"2.519790541555267e-06"
]
},
"execution_count": 35,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"shap_p"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 11. Computing the mean absolute error, mean squared error, root mean squared error, and R-squared, and putting them into a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import metrics"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"metrics_df = pd.DataFrame({'Metric':['MAE','MSE','RMSE','R-Squared'],\n",
" 'Value':[metrics.mean_absolute_error(y_test,predictions),\n",
" metrics.mean_squared_error(y_test,predictions),\n",
" np.sqrt(metrics.mean_squared_error(y_test,predictions)),\n",
" metrics.explained_variance_score(y_test,predictions)]}).round(3)"
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Metric</th>\n",
" <th>Value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MAE</td>\n",
" <td>0.059</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>MSE</td>\n",
" <td>0.006</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>RMSE</td>\n",
" <td>0.080</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>R-Squared</td>\n",
" <td>0.629</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Metric Value\n",
"0 MAE 0.059\n",
"1 MSE 0.006\n",
"2 RMSE 0.080\n",
"3 R-Squared 0.629"
]
},
"execution_count": 38,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"***Mean absolute error (MAE)*** is the average absolute difference between the predicted values and the actual values. \n",
"\n",
"***Mean squared error (MSE)*** is the average of the squared differences between the predicted and actual values. \n",
"\n",
"***Root mean squared error (RMSE)*** is the square root of the MSE. \n",
"\n",
"***R-squared*** tells us the proportion of variance in the dependent variable that can be explained by the model. Thus, in this simple linear regression model, **GRE Score** explained 62.9% of the variance in **Admit**. Note that the value above is computed with explained_variance_score, which equals R-squared only when the residuals average to zero. Additionally, the MAE shows that our predictions were off by 0.059 **Admit** points on average."
]
},
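{
"cell_type": "markdown",
"metadata": {},
"source": [
"The definitions above can be checked by hand. This is a minimal sketch with made-up arrays (`y_true` and `y_pred` are hypothetical, not the notebook's test data). It also shows that the explained variance score and R-squared differ when the residuals do not average to zero."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"from sklearn import metrics\n",
"\n",
"# Hypothetical values for illustration only\n",
"y_true = np.array([0.92, 0.76, 0.72, 0.80, 0.65])\n",
"y_pred = np.array([0.88, 0.80, 0.70, 0.74, 0.66])\n",
"\n",
"residuals = y_true - y_pred\n",
"mae = np.mean(np.abs(residuals))\n",
"mse = np.mean(residuals ** 2)\n",
"rmse = np.sqrt(mse)\n",
"# R-squared: 1 - (residual sum of squares / total sum of squares)\n",
"r2 = 1 - np.sum(residuals ** 2) / np.sum((y_true - y_true.mean()) ** 2)\n",
"# Explained variance ignores any constant bias in the residuals\n",
"explained_var = 1 - np.var(residuals) / np.var(y_true)\n",
"\n",
"# Each hand-computed value should agree with the scikit-learn function\n",
"print(np.isclose(mae, metrics.mean_absolute_error(y_true, y_pred)))\n",
"print(np.isclose(mse, metrics.mean_squared_error(y_true, y_pred)))\n",
"print(np.isclose(r2, metrics.r2_score(y_true, y_pred)))\n",
"print(np.isclose(explained_var, metrics.explained_variance_score(y_true, y_pred)))"
]
},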
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Exercise: Fitting Multiple Linear Regression Model and determining the Intercept and Coefficient"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 5. Instantiating the Multiple Linear Regression model and fitting the model"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {},
"outputs": [],
"source": [
"model = LinearRegression()"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False)"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"model.fit(X_train,y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 6. Extracting the model intercept and the regression coefficients"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"intercept = model.intercept_"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"-1.4242541443027852"
]
},
"execution_count": 43,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"intercept"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [],
"source": [
"coefficients = model.coef_"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.00217167, 0.00294465, 0.00431416, 0.00161238, 0.01659515,\n",
" 0.12281766, 0.02050198])"
]
},
"execution_count": 45,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"coefficients"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 7. Printing the equation using the coefficients we got above"
]
},
{
"cell_type": "code",
"execution_count": 75,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Admit_Predict = -1.4243 + (0.0022 x GRE Score) + (0.0029 x TOEFL Score) + (0.0043 x University Rating) + (0.0016 x SOP) +(0.0166 x LOR) + (0.1228 x CGPA) +(0.0205 x Research)\n"
]
}
],
"source": [
"print('Admit_Predict = {0:0.4f} + ({1:0.4f} x GRE Score) + ({2:0.4f} x TOEFL Score) + ({3:0.4f} x University Rating) + ({4:0.4f} x SOP) +({5:0.4f} x LOR) + ({6:0.4f} x CGPA) +({7:0.4f} x Research)'.format(intercept, \n",
" coefficients[0], \n",
" coefficients[1], \n",
" coefficients[2], \n",
" coefficients[3], \n",
" coefficients[4], \n",
" coefficients[5], \n",
" coefficients[6]))"
]
},
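{
"cell_type": "markdown",
"metadata": {},
"source": [
"As an alternative to indexing each coefficient by hand, the equation can be assembled by pairing coefficients with column names. This is a minimal sketch on synthetic data. `X_demo`, `y_demo`, `demo_model` and the column names `x1` and `x2` are hypothetical and are not part of the admissions data. It also shows that the intercept plus the dot product of features and coefficients reproduces `predict` at full precision."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"# Synthetic, noise-free data for illustration only (not the admissions data)\n",
"rng = np.random.RandomState(0)\n",
"X_demo = pd.DataFrame({'x1': rng.rand(50), 'x2': rng.rand(50)})\n",
"y_demo = 1.5 + 2.0 * X_demo['x1'] - 0.5 * X_demo['x2']\n",
"\n",
"demo_model = LinearRegression().fit(X_demo, y_demo)\n",
"\n",
"# Pair each coefficient with its column name instead of indexing by hand\n",
"terms = ['({0:0.4f} x {1})'.format(coef, col)\n",
"         for coef, col in zip(demo_model.coef_, X_demo.columns)]\n",
"equation = 'y = {0:0.4f} + '.format(demo_model.intercept_) + ' + '.join(terms)\n",
"print(equation)\n",
"\n",
"# intercept + X . coefficients reproduces model.predict at full precision\n",
"manual = demo_model.intercept_ + X_demo.values.dot(demo_model.coef_)\n",
"print(np.allclose(manual, demo_model.predict(X_demo)))"
]
},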
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 8. Implementing the above equation and predicting the Admit Scores"
]
},
{
"cell_type": "code",
"execution_count": 117,
"metadata": {},
"outputs": [],
"source": [
"Admit_Predict = -1.4243 + (0.0022 * X_train['GRE Score']) + (0.0029 * X_train['TOEFL Score']) + (0.0043 * X_train['University Rating']) + (0.0016 * X_train['SOP']) +(0.0166 * X_train['LOR ']) + (0.1228 * X_train['CGPA']) +(0.0205 * X_train['Research'])"
]
},
{
"cell_type": "code",
"execution_count": 118,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"443 0.861248\n",
"497 0.955868\n",
"124 0.656416\n",
"50 0.679840\n",
"331 0.628636\n",
"dtype: float64"
]
},
"execution_count": 118,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"Admit_Predict.head()"
]
},
{
"cell_type": "code",
"execution_count": 119,
"metadata": {},
"outputs": [],
"source": [
"predictions = model.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 120,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(165,)"
]
},
"execution_count": 120,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions.shape"
]
},
{
"cell_type": "code",
"execution_count": 121,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([0.54539918, 0.55363752, 0.78000432, 0.597551 , 0.64566267,\n",
" 0.68833295, 0.82664645, 0.68428919, 0.65744214, 0.81637454,\n",
" 0.80592038, 0.69639124, 0.73953556, 0.72629543, 0.91058275,\n",
" 0.65096547, 0.86424101, 0.55253423, 0.56833614, 0.69443069,\n",
" 0.93981603, 0.82809052, 0.7057749 , 0.68265075, 0.84866986,\n",
" 0.41677175, 0.45426869, 0.78282754, 0.74439537, 0.62941571,\n",
" 0.60821993, 0.71991221, 0.65369032, 0.47574244, 0.56190713,\n",
" 0.41525819, 0.70733081, 0.61018855, 0.5510159 , 0.59136738,\n",
" 0.58473005, 0.64840503, 0.66023105, 0.8882523 , 0.86243497,\n",
" 0.86309401, 0.91663565, 0.72721531, 0.69744991, 0.71971577,\n",
" 0.68922378, 0.56502979, 0.71907273, 0.48716266, 0.46391183,\n",
" 0.58724119, 0.63811268, 0.52989804, 0.69304485, 0.52932209,\n",
" 0.96703953, 0.50234456, 0.760185 , 0.80683706, 0.6320383 ,\n",
" 0.70421576, 0.65733996, 0.95995125, 0.90553398, 0.81891241,\n",
" 0.63092162, 0.51734089, 0.81072184, 0.64351458, 0.93423053,\n",
" 0.68180096, 0.59762578, 0.70679619, 0.81197215, 0.70090992,\n",
" 0.59868914, 0.61671976, 0.75784717, 0.69352287, 0.91258705,\n",
" 0.73889149, 0.62336392, 0.84694287, 0.78719785, 0.95083272,\n",
" 0.58962115, 0.83550833, 0.90836346, 0.78785687, 0.53057778,\n",
" 0.81483149, 0.6040547 , 0.64881483, 0.635637 , 0.50640703,\n",
" 0.63748851, 0.63647359, 0.56320363, 0.88665176, 0.85762558,\n",
" 0.9594266 , 0.59228023, 0.6218097 , 0.85438238, 0.84801528,\n",
" 0.96605969, 0.75338374, 0.89920837, 0.59894699, 0.85988345,\n",
" 0.68324326, 0.50029916, 0.60548738, 0.88691487, 0.66408934,\n",
" 0.83845589, 0.73074958, 0.78803455, 0.71693481, 0.7619414 ,\n",
" 0.6623658 , 0.7788172 , 0.79581755, 0.56030919, 0.61204924,\n",
" 0.73125575, 0.97052884, 0.90841142, 1.00860231, 0.79695337,\n",
" 0.7402222 , 0.60401769, 0.77472074, 0.59145471, 0.52849543,\n",
" 0.86786918, 0.72693087, 0.66230199, 0.7716777 , 0.68760897,\n",
" 0.9152402 , 0.78942901, 0.65033575, 0.64887332, 0.50990954,\n",
" 0.86208432, 0.66961578, 0.57715668, 0.83828783, 0.80576419,\n",
" 0.89029024, 0.70912519, 0.591096 , 0.65530893, 0.62941468,\n",
" 0.63998591, 0.61842808, 0.7481594 , 0.74215304, 0.61427862])"
]
},
"execution_count": 121,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predictions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 9. Plotting the predicted versus actual values on a scatterplot using the following code:"
]
},
{
"cell_type": "code",
"execution_count": 122,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "code",
"execution_count": 123,
"metadata": {},
"outputs": [],
"source": [
"from scipy.stats import pearsonr"
]
},
{
"cell_type": "code",
"execution_count": 124,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(y_test,predictions)\n",
"plt.xlabel('Y test(True values)')\n",
"plt.ylabel('Predicted Values')\n",
"plt.title('Predicted vs Actual value(r = {0:0.2f})'.format(pearsonr(y_test,predictions)[0]))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|Strength of Association|\tPositive Coefficient, r\t|Negative Coefficient, r|\n",
"|-------|-----------|------------|\n",
"|Small\t|.1 to .3\t|-0.1 to -0.3|\n",
"|Medium\t|.3 to .5\t|-0.3 to -0.5|\n",
"|Large\t|.5 to 1.0\t|-0.5 to -1.0|\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 10. Plotting the residuals\n",
"\n",
"As discussed in the Introduction section, A **Residual** in simple terms is the difference between the Actual and Predicted value of the dependent variable."
]
},
{
"cell_type": "code",
"execution_count": 125,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"from scipy.stats import shapiro"
]
},
{
"cell_type": "code",
"execution_count": 126,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\tkhan050\\AppData\\Local\\Continuum\\anaconda3\\lib\\site-packages\\scipy\\stats\\stats.py:1713: FutureWarning: Using a non-tuple sequence for multidimensional indexing is deprecated; use `arr[tuple(seq)]` instead of `arr[seq]`. In the future this will be interpreted as an array index, `arr[np.array(seq)]`, which will result either in an error or a different result.\n",
" return np.add.reduce(sorted[indexer] * weights, axis=axis) / sumval\n"
]
},
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"sns.distplot((y_test- predictions),bins=50)\n",
"plt.xlabel('Residuals')\n",
"plt.ylabel('Density')\n",
"plt.title('Histograms of residuals (Shapiro W p-value = {0:03f})'.format(shapiro(y_test-predictions)[1]))\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The histogram shows us that the residuals are negatively skewed and the value of the Shapiro W p-value in the title tells us that the distribution is not normal. This gives us further evidence that our model has room for improvement."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 11. Computing the metrics for mean absolute error, mean squared error, root mean squared error, and R-squared to determine the model performance"
]
},
{
"cell_type": "code",
"execution_count": 127,
"metadata": {},
"outputs": [],
"source": [
"from sklearn import metrics"
]
},
{
"cell_type": "code",
"execution_count": 130,
"metadata": {
"code_folding": []
},
"outputs": [],
"source": [
"metrics_df = pd.DataFrame({'Metric':['MAE','MSE','RMSE','R-Squared'],\n",
" 'Value':[metrics.mean_absolute_error(y_test,predictions),\n",
" metrics.mean_squared_error(y_test,predictions),\n",
" np.sqrt(metrics.mean_squared_error(y_test,predictions)),\n",
" metrics.explained_variance_score(y_test,predictions)]}).round(3)"
]
},
{
"cell_type": "code",
"execution_count": 131,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Metric</th>\n",
" <th>Value</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>MAE</td>\n",
" <td>0.041</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>MSE</td>\n",
" <td>0.003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>RMSE</td>\n",
" <td>0.058</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>R-Squared</td>\n",
" <td>0.809</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Metric Value\n",
"0 MAE 0.041\n",
"1 MSE 0.003\n",
"2 RMSE 0.058\n",
"3 R-Squared 0.809"
]
},
"execution_count": 131,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"metrics_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"|Metric|Simple Linear Regression|Mutiple Linear Regression|\n",
"|-------|-----------|------------|\n",
"|MAE\t|0.059\t|0.041|\n",
"|MSE\t|0.006\t|0.003|\n",
"|RMSE\t|0.080|0.058|\n",
"|R-Squared|0.629|0.809|"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"_draft": {
"nbviewer_url": "https://gist.github.com/f5aa427cb52d033cf1204d9687d087bd"
},
"gist": {
"data": {
"description": "GRE School.ipynb",
"public": true
},
"id": "f5aa427cb52d033cf1204d9687d087bd"
},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.0"
}
},
"nbformat": 4,
"nbformat_minor": 2
}