{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fire up pandas and matplotlib"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Load some house value vs. crime rate data\n",
"\n",
"The dataset is from the Philadelphia, PA area and includes the average house sale price in a number of neighborhoods. The attributes we have for each neighborhood include the crime rate ('CrimeRate'), miles from Center City ('MilesPhila'), town name ('Name'), and county name ('County')."
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"sales = pd.read_csv('Philadelphia_Crime_Rate_noNA.csv')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>HousePrice</th>\n",
" <th>HsPrc ($10,000)</th>\n",
" <th>CrimeRate</th>\n",
" <th>MilesPhila</th>\n",
" <th>PopChg</th>\n",
" <th>Name</th>\n",
" <th>County</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>0</td>\n",
" <td>140463</td>\n",
" <td>14.0463</td>\n",
" <td>29.7</td>\n",
" <td>10.0</td>\n",
" <td>-1.0</td>\n",
" <td>Abington</td>\n",
" <td>Montgome</td>\n",
" </tr>\n",
" <tr>\n",
" <td>1</td>\n",
" <td>113033</td>\n",
" <td>11.3033</td>\n",
" <td>24.1</td>\n",
" <td>18.0</td>\n",
" <td>4.0</td>\n",
" <td>Ambler</td>\n",
" <td>Montgome</td>\n",
" </tr>\n",
" <tr>\n",
" <td>2</td>\n",
" <td>124186</td>\n",
" <td>12.4186</td>\n",
" <td>19.5</td>\n",
" <td>25.0</td>\n",
" <td>8.0</td>\n",
" <td>Aston</td>\n",
" <td>Delaware</td>\n",
" </tr>\n",
" <tr>\n",
" <td>3</td>\n",
" <td>110490</td>\n",
" <td>11.0490</td>\n",
" <td>49.4</td>\n",
" <td>25.0</td>\n",
" <td>2.7</td>\n",
" <td>Bensalem</td>\n",
" <td>Bucks</td>\n",
" </tr>\n",
" <tr>\n",
" <td>4</td>\n",
" <td>79124</td>\n",
" <td>7.9124</td>\n",
" <td>54.1</td>\n",
" <td>19.0</td>\n",
" <td>3.9</td>\n",
" <td>Bristol B.</td>\n",
" <td>Bucks</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" HousePrice HsPrc ($10,000) CrimeRate MilesPhila PopChg Name \\\n",
"0 140463 14.0463 29.7 10.0 -1.0 Abington \n",
"1 113033 11.3033 24.1 18.0 4.0 Ambler \n",
"2 124186 12.4186 19.5 25.0 8.0 Aston \n",
"3 110490 11.0490 49.4 25.0 2.7 Bensalem \n",
"4 79124 7.9124 54.1 19.0 3.9 Bristol B. \n",
"\n",
" County \n",
"0 Montgome \n",
"1 Montgome \n",
"2 Delaware \n",
"3 Bucks \n",
"4 Bucks "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"sales.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Exploring the data "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The house price in a town is correlated with the crime rate of that town: low-crime towns tend to be associated with higher house prices, and vice versa."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.scatter(x=sales['CrimeRate'],y=sales['HousePrice'])\n",
"plt.title(\"Scatter Plot\")\n",
"plt.ylabel('HousePrice')\n",
"plt.xlabel('CrimeRate')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Create a Regression Model"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"crime_model = LinearRegression()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# split data for training and testing"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
"target = sales['HousePrice']\n",
"features = sales['CrimeRate']"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"x_train, x_test, y_train, y_test = train_test_split(features,target, test_size = 0.20, random_state=2)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Fit the regression model using crime as the feature"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"crime_model.fit(x_train.values.reshape(-1,1),y_train)"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0.05552690736364807\n"
]
}
],
"source": [
"score = crime_model.score(x_test.values.reshape(-1,1),y_test)\n",
"print(score)"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"170541.69260352003\n"
]
}
],
"source": [
"print(crime_model.intercept_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Make predictions"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[159940.63992411 162679.24519962 150708.88988246 157732.08728257\n",
" 149295.41619187 162723.41625246 161928.3373015 157599.57412407\n",
" 162811.75835812 155656.04779952 164799.45573551 150841.40304095\n",
" 160779.8899279 150355.52145981 157422.88991275 159543.10044863\n",
" 148721.19250507 163253.46888643 139666.12667474 151636.48199191]\n"
]
}
],
"source": [
"y_pred = crime_model.predict(x_test.values.reshape(-1,1)) \n",
"print(y_pred)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Compare the actual values in y_test with the predicted values"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Actual</th>\n",
" <th>Predicted</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td>93</td>\n",
" <td>152624</td>\n",
" <td>159940.639924</td>\n",
" </tr>\n",
" <tr>\n",
" <td>30</td>\n",
" <td>389302</td>\n",
" <td>162679.245200</td>\n",
" </tr>\n",
" <tr>\n",
" <td>56</td>\n",
" <td>28000</td>\n",
" <td>150708.889882</td>\n",
" </tr>\n",
" <tr>\n",
" <td>24</td>\n",
" <td>114233</td>\n",
" <td>157732.087283</td>\n",
" </tr>\n",
" <tr>\n",
" <td>16</td>\n",
" <td>104923</td>\n",
" <td>149295.416192</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Actual Predicted\n",
"93 152624 159940.639924\n",
"30 389302 162679.245200\n",
"56 28000 150708.889882\n",
"24 114233 157732.087283\n",
"16 104923 149295.416192"
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred}) \n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Let's see what our fit looks like"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Matplotlib is a Python plotting library. If it is not already installed, you can install it with:\n",
"\n",
"`pip install matplotlib`"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(sales['CrimeRate'],sales['HousePrice'],'.',\n",
" sales['CrimeRate'],crime_model.predict(sales['CrimeRate'].values.reshape(-1,1)),'-')\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(x_test,y_test,'.',\n",
" x_test,y_pred,'-')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above: the dots are the original data, and the line is the fit predicted by the simple regression."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Remove Center City and redo the analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Center City is the one observation with an extremely high crime rate, yet its house prices are not very low. This point does not follow the trend of the rest of the data very well. A natural question is how much including Center City influences our fit on the other data points. Let's remove this data point and see what happens."
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [],
"source": [
"sales_noCC = sales[sales['MilesPhila'] != 0.0] "
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.scatter( x=sales_noCC[\"CrimeRate\"], y=sales_noCC[\"HousePrice\"])\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Refit our simple regression model on this modified dataset:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 21,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"target_noCC = sales_noCC['HousePrice']\n",
"features_noCC = sales_noCC['CrimeRate']\n",
"\n",
"x_train, x_test, y_train, y_test = train_test_split(features_noCC,target_noCC, test_size = 0.20, random_state=2)\n",
"\n",
"crime_noCC_model = LinearRegression()\n",
"crime_noCC_model.fit(x_train.values.reshape(-1,1),y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The plot below shows the new model, which was trained on the data with Center City (the observation with the extremely high crime rate) removed."
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(sales_noCC['CrimeRate'],sales_noCC['HousePrice'],'.',\n",
" sales_noCC['CrimeRate'],crime_noCC_model.predict(sales_noCC['CrimeRate'].values.reshape(-1,1)),'-')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Look at the original model's fit on the data with Center City removed:"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [
{
"data": {
"image/png": "\n",
"text/plain": [
"<Figure size 432x288 with 1 Axes>"
]
},
"metadata": {
"needs_background": "light"
},
"output_type": "display_data"
}
],
"source": [
"plt.plot(sales_noCC['CrimeRate'],sales_noCC['HousePrice'],'.',\n",
" sales_noCC['CrimeRate'],crime_model.predict(sales_noCC['CrimeRate'].values.reshape(-1,1)),'-')\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Compare coefficients for full-data fit versus no-Center-City fit"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visually, the fit seems different, but let's quantify this by examining the estimated coefficients of our original fit and that of the modified dataset with Center City removed."
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-441.71052831]\n"
]
}
],
"source": [
"print(crime_model.coef_)"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[-2002.58015335]\n"
]
}
],
"source": [
"print(crime_noCC_model.coef_)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Above: We see that for the \"no Center City\" version, the predicted decrease in house price per unit increase in crime rate is about 2,003. In contrast, for the original dataset, the drop is only about 442 per unit increase in crime rate. That is a substantial difference!"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### High leverage points: \n",
"Center City is said to be a \"high leverage\" point because it sits at an extreme x value where there are no other observations. As a result, recalling the closed-form solution for simple regression, this point has the *potential* to dramatically change the least squares line, since the center of mass of x is heavily influenced by this one point and the least squares line will try to fit close to that outlying (in x) point. If a high leverage point follows the trend of the other data, it might not have much effect. On the other hand, if this point somehow differs, it can be strongly influential in the resulting fit.\n",
"\n",
"### Influential observations: \n",
"An influential observation is one whose removal substantially changes the fit. As discussed above, high leverage points are good candidates for being influential observations, but need not be. Observations that are *not* leverage points can also be influential (e.g., strongly outlying in y even if x is a typical value)."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.4"
}
},
"nbformat": 4,
"nbformat_minor": 1
}