# navidfrb/model-development.ipynb

Created Jan 26, 2020
Created on Cognitive Class Labs
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " \n", " \n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Data Analysis with Python

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Module 4: Model Development

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

In this section, we will develop several models that will predict the price of the car using the variables or features. This is just an estimate but should give us an objective idea of how much the car should cost.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some questions we want to ask in this module\n", "
\n", "
• How do I know if the dealer is offering fair value for my trade-in?
\n", "
• How do I know if I put a fair value on my car?
\n", "
\n", "

In Data Analytics, we often use Model Development to help us predict future observations from the data we have.

\n", "\n", "

A Model will help us understand the exact relationship between different variables and how these variables are used to predict the result.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Setup

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Import libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "load data and store in dataframe df:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset was hosted on IBM Cloud object click HERE for free storage." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "
| | symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | Medium | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | Medium | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.848630 | ... | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | Medium | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.848630 | ... | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | Medium | 0 | 1 |

5 rows × 29 columns

\n", "" ], "text/plain": [ " symboling normalized-losses make aspiration num-of-doors \\\n", "0 3 122 alfa-romero std two \n", "1 3 122 alfa-romero std two \n", "2 1 122 alfa-romero std two \n", "3 2 164 audi std four \n", "4 2 164 audi std four \n", "\n", " body-style drive-wheels engine-location wheel-base length ... \\\n", "0 convertible rwd front 88.6 0.811148 ... \n", "1 convertible rwd front 88.6 0.811148 ... \n", "2 hatchback rwd front 94.5 0.822681 ... \n", "3 sedan fwd front 99.8 0.848630 ... \n", "4 sedan 4wd front 99.4 0.848630 ... \n", "\n", " compression-ratio horsepower peak-rpm city-mpg highway-mpg price \\\n", "0 9.0 111.0 5000.0 21 27 13495.0 \n", "1 9.0 111.0 5000.0 21 27 16500.0 \n", "2 9.0 154.0 5000.0 19 26 16500.0 \n", "3 10.0 102.0 5500.0 24 30 13950.0 \n", "4 8.0 115.0 5500.0 18 22 17450.0 \n", "\n", " city-L/100km horsepower-binned diesel gas \n", "0 11.190476 Medium 0 1 \n", "1 11.190476 Medium 0 1 \n", "2 12.368421 Medium 0 1 \n", "3 9.791667 Medium 0 1 \n", "4 13.055556 Medium 0 1 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# path of data \n", "path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'\n", "df = pd.read_csv(path)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

1. Linear Regression and Multiple Linear Regression

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Linear Regression

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

One example of a Data Model that we will be using is

\n", "Simple Linear Regression.\n", "\n", "
\n", "

Simple Linear Regression is a method to help us understand the relationship between two variables:

\n", "
\n", "
• The predictor/independent variable (X)
\n", "
• The response/dependent variable that we want to predict (Y)
\n", "
\n", "\n", "

The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.

\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", " Y: Response \\ Variable\\\\\n", " X: Predictor \\ Variables\n", "\$\$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Linear function:\n", "\$\$\n", "Yhat = a + b X\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
• a refers to the intercept of the regression line, in other words: the value of Y when X is 0
\n", "
• b refers to the slope of the regression line, in other words: the amount by which Y changes when X increases by 1 unit
\n", "
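As a minimal sketch of how the intercept a and the slope b can be obtained, the least-squares estimates for a single predictor can be computed directly with NumPy. The data below is made up for illustration and is not taken from the car dataset; the variable names `x`, `y`, `a`, `b` and `yhat` are assumptions of this example.

```python
import numpy as np

# Made-up data lying exactly on Y = 2 + 3X (illustration only)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Least-squares slope: co-variation of x and y divided by the variation of x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Intercept: the fitted line passes through (mean of x, mean of y)
a = y.mean() - b * x.mean()

# Predicted values from the linear function Yhat = a + b X
yhat = a + b * x
print(a, b)
```

Because the made-up points lie exactly on a line, the estimates recover the intercept 2 and the slope 3 used to build them. With real data the points scatter around the line and the estimates only approximate the underlying trend.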
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Let's load the modules for linear regression

" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Create the linear regression object

" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm = LinearRegression()\n", "lm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

How could Highway-mpg help us predict car price?

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we want to look at how highway-mpg can help us predict car price.\n", "Using simple linear regression, we will create a linear function with \"highway-mpg\" as the predictor variable and the \"price\" as the response variable." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [], "source": [ "X = df[['highway-mpg']]\n", "Y = df['price']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the linear model using highway-mpg." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.fit(X,Y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " We can output a prediction " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "array([16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ,\n", " 20345.17153508])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Yhat=lm.predict(X)\n", "Yhat[0:5] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What is the value of the intercept (a)?

" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "38423.3058581574" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What is the value of the Slope (b)?

" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false }, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([-821.73337832])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What is the final estimated linear model we get?

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw above, we should get a final linear model with the structure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "Yhat = a + b X\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plugging in the actual values we get:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "price = 38423.31 - 821.73 x highway-mpg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 a):

\n", "\n", "Create a linear regression object?\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 10, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm1 = LinearRegression()\n", "lm1" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 b):

\n", "\n", "Train the model using 'engine-size' as the independent variable and 'price' as the dependent variable?\n", "
" ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm1.fit(df[['highway-mpg']], df[['price']])\n", "lm1 " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 c):

\n", "\n", "Find the slope and intercept of the model?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Slope

" ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "array([[-821.73337832]])" ] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm1.coef_\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Intercept

" ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "array([38423.30585816])" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm1.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 d):

\n", "\n", "What is the equation of the predicted line. You can use x and yhat or 'engine-size' or 'price'?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# You can type you answer here\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Multiple Linear Regression

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What if we want to predict car price using more than one variable?

\n", "\n", "

If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression.\n", "Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables.\n", "Most of the real-world regression models involve multiple predictors. We will illustrate the structure by using four predictor variables, but these results can generalize to any integer:

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "Y: Response \\ Variable\\\\\n", "X_1 :Predictor\\ Variable \\ 1\\\\\n", "X_2: Predictor\\ Variable \\ 2\\\\\n", "X_3: Predictor\\ Variable \\ 3\\\\\n", "X_4: Predictor\\ Variable \\ 4\\\\\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "a: intercept\\\\\n", "b_1 :coefficients \\ of\\ Variable \\ 1\\\\\n", "b_2: coefficients \\ of\\ Variable \\ 2\\\\\n", "b_3: coefficients \\ of\\ Variable \\ 3\\\\\n", "b_4: coefficients \\ of\\ Variable \\ 4\\\\\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The equation is given by" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

From the previous section we know that other good predictors of price could be:

\n", "
\n", "
• Horsepower
\n", "
• Curb-weight
\n", "
• Engine-size
\n", "
• Highway-mpg
\n", "
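The equation Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4 can be fitted by ordinary least squares, which is the kind of fit scikit-learn's LinearRegression performs. The sketch below does the same thing with plain NumPy on made-up data, so the recovered values can be compared with the ones used to build the data. The names `X`, `Y`, `true_a`, `true_b` and `design` are assumptions of this example and none of the numbers come from the car dataset.

```python
import numpy as np

# Made-up predictors X_1..X_4 (illustration only, not the car dataset)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))

# Made-up response built from known values of a and b_1..b_4
true_a = 1.0
true_b = np.array([2.0, -1.0, 0.5, 3.0])
Y = true_a + X @ true_b

# Prepend a column of ones so the intercept is estimated together with the slopes
design = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(design, Y, rcond=None)
a, b = params[0], params[1:]

# Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4
Yhat = a + X @ b
print(a, b)
```

Each entry of `b` is read the same way as the slope in simple linear regression: the change in the prediction when that predictor increases by one unit while the others are held fixed.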
\n", "

Question #2 a):

\n", "Create and train a Multiple Linear Regression model \"lm2\" where the response variable is price, and the predictor variable is 'normalized-losses' and 'highway-mpg'.\n", "
" ] }, { "cell_type": "code", "execution_count": 19, "metadata": { "collapsed": false, "jupyter": { "outputs_hidden": false } }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm2=LinearRegression()\n", "lm2.fit(df[['normalized-losses','highway-mpg']], df['price'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #2 b):

\n", "Find the coefficient of the model?\n", "
" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1.49789586, -820.45434016])" ] }, "execution_count": 20, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm2.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

2) Model Evaluation using Visualization

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've developed some models, how do we evaluate our models and how do we choose the best one? One way to do this is by using visualization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "import the visualization package: seaborn" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [], "source": [ "# import the visualization package: seaborn\n", "import seaborn as sns\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Regression Plot

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.

\n", "\n", "

This plot shows a combination of scattered data points (a scatter plot) and the fitted linear regression line going through the data. It gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and its direction (positive or negative).
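The two things a regression plot conveys, the fitted line and the direction and strength of the correlation, can also be checked numerically. The sketch below is only an illustration: it uses the first five rows of highway-mpg and price printed by df.head() in this notebook, which is far too small a sample to draw conclusions from, and the variable names are assumptions of this example.

```python
import numpy as np

# First five rows of highway-mpg and price as shown by df.head() (tiny sample)
highway_mpg = np.array([27, 27, 26, 30, 22], dtype=float)
price = np.array([13495.0, 16500.0, 16500.0, 13950.0, 17450.0])

# Straight line through the points: price ~ intercept + slope * highway_mpg
slope, intercept = np.polyfit(highway_mpg, price, deg=1)

# Pearson correlation: the sign gives the direction, the magnitude the strength
r = np.corrcoef(highway_mpg, price)[0, 1]

print(f'slope = {slope:.1f}, r = {r:.2f}')
```

A negative slope together with a negative correlation coefficient corresponds to a regression line that falls from left to right in the plot.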

| | peak-rpm | highway-mpg | price |
|---|---|---|---|
| peak-rpm | 1.000000 | -0.058598 | -0.101616 |
| highway-mpg | -0.058598 | 1.000000 | -0.704692 |
| price | -0.101616 | -0.704692 | 1.000000 |
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "" ], "text/plain": [ " peak-rpm highway-mpg price\n", "peak-rpm 1.000000 -0.058598 -0.101616\n", "highway-mpg -0.058598 1.000000 -0.704692\n", "price -0.101616 -0.704692 1.000000" ] }, "execution_count": 27, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "\"The highway-mpg has stronger corrolation\"\n", "df[[\"peak-rpm\",\"highway-mpg\",\"price\"]].corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Residual Plot

\n", "\n", "

A good way to visualize the variance of the data is to use a residual plot.

\n", "\n", "

What is a residual?

\n", "\n", "

The difference between the observed value (y) and the predicted value (Yhat) is called the residual (e). When we look at a regression plot, the residual is the distance from the data point to the fitted regression line.
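As a small worked example of this definition, the residuals for the first five cars can be computed from the rounded fitted line reported earlier in this notebook, price = 38423.31 - 821.73 x highway-mpg. The observed values are the first five rows shown by df.head(); the variable names are assumptions of this sketch.

```python
import numpy as np

# First five rows shown by df.head(): highway-mpg and observed price
highway_mpg = np.array([27, 27, 26, 30, 22], dtype=float)
price = np.array([13495.0, 16500.0, 16500.0, 13950.0, 17450.0])

# Rounded fitted line from this notebook: price = 38423.31 - 821.73 x highway-mpg
a, b = 38423.31, -821.73
yhat = a + b * highway_mpg

# Residual e = observed value - predicted value
residuals = price - yhat
print(np.round(residuals, 2))
```

A negative residual means the line predicts a higher price than the one observed, and a positive residual means the line predicts a lower price.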

\n", "\n", "

So what is a residual plot?

\n", "\n", "

A residual plot is a graph that shows the residuals on the vertical y-axis and the independent variable on the horizontal x-axis.

\n", "\n", "

What do we pay attention to when looking at a residual plot?

\n", "\n", "

We look at the spread of the residuals:

\n", "\n", "

- If the points in a residual plot are randomly spread out around the x-axis, then a linear model is appropriate for the data. Why is that? Randomly spread-out residuals mean that the variance is constant, and thus the linear model is a good fit for this data.
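To make the contrast concrete, the sketch below fits a straight line to two made-up data sets, one truly linear and one curved, and compares the residuals. The data and the helper `line_residuals` are hypothetical and exist only for this illustration.

```python
import numpy as np

# Made-up data (illustration only): one linear trend and one curved trend
x = np.linspace(0.0, 10.0, 50)
y_linear = 2.0 + 3.0 * x
y_curved = x ** 2

def line_residuals(x, y):
    # Fit a straight line and return observed minus predicted values
    slope, intercept = np.polyfit(x, y, deg=1)
    return y - (intercept + slope * x)

res_linear = line_residuals(x, y_linear)
res_curved = line_residuals(x, y_curved)

# Linear data: the residuals are essentially zero everywhere
print(np.abs(res_linear).max())

# Curved data: positive at both ends and negative in the middle
print(res_curved[0], res_curved[25], res_curved[-1])
```

The systematic sign change in the second case is the kind of pattern that, in a residual plot, suggests a straight line is not an adequate model for the data.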

| | horsepower | curb-weight | engine-size | highway-mpg |
|---|---|---|---|---|
| 0 | 111.0 | 2548 | 130 | 27 |
| 1 | 111.0 | 2548 | 130 | 27 |
| 2 | 154.0 | 2823 | 152 | 26 |
| 3 | 102.0 | 2337 | 109 | 30 |
| 4 | 115.0 | 2824 | 136 | 22 |
| ... | ... | ... | ... | ... |
| 196 | 114.0 | 2952 | 141 | 28 |
| 197 | 160.0 | 3049 | 141 | 25 |
| 198 | 134.0 | 3012 | 173 | 23 |
| 199 | 106.0 | 3217 | 145 | 27 |
| 200 | 114.0 | 3062 | 141 | 25 |

201 rows × 4 columns