
# patrickthoreson/model-development.ipynb Created Feb 16, 2019

Created on Cognitive Class Labs
 { "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", " \n", " \n", " \n", "
\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "\n", "

Data Analysis with Python

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Module 4: Model Development

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

In this section, we will develop several models that predict the price of a car from its variables (features). Each prediction is only an estimate, but it should give us an objective idea of how much the car should cost.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Some questions we want to ask in this module\n", "
\n", "
• How do I know if the dealer is offering fair value for my trade-in?
• \n", "
• How do I know if I put a fair value on my car?
• \n", "
\n", "

In Data Analytics, we often use Model Development to help us predict future observations from the data we have.

\n", "\n", "

A Model will help us understand the exact relationship between different variables and how these variables are used to predict the result.

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Setup

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Import libraries" ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "load data and store in dataframe df:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This dataset was hosted on IBM Cloud object click HERE for free storage." ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
\n", "\n", "\n", "
|   | symboling | normalized-losses | make | aspiration | num-of-doors | body-style | drive-wheels | engine-location | wheel-base | length | ... | compression-ratio | horsepower | peak-rpm | city-mpg | highway-mpg | price | city-L/100km | horsepower-binned | diesel | gas |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 13495.0 | 11.190476 | Medium | 0 | 1 |
| 1 | 3 | 122 | alfa-romero | std | two | convertible | rwd | front | 88.6 | 0.811148 | ... | 9.0 | 111.0 | 5000.0 | 21 | 27 | 16500.0 | 11.190476 | Medium | 0 | 1 |
| 2 | 1 | 122 | alfa-romero | std | two | hatchback | rwd | front | 94.5 | 0.822681 | ... | 9.0 | 154.0 | 5000.0 | 19 | 26 | 16500.0 | 12.368421 | Medium | 0 | 1 |
| 3 | 2 | 164 | audi | std | four | sedan | fwd | front | 99.8 | 0.848630 | ... | 10.0 | 102.0 | 5500.0 | 24 | 30 | 13950.0 | 9.791667 | Medium | 0 | 1 |
| 4 | 2 | 164 | audi | std | four | sedan | 4wd | front | 99.4 | 0.848630 | ... | 8.0 | 115.0 | 5500.0 | 18 | 22 | 17450.0 | 13.055556 | Medium | 0 | 1 |
\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "\n", "

5 rows × 29 columns

\n", "" ], "text/plain": [ " symboling normalized-losses make aspiration num-of-doors \\\n", "0 3 122 alfa-romero std two \n", "1 3 122 alfa-romero std two \n", "2 1 122 alfa-romero std two \n", "3 2 164 audi std four \n", "4 2 164 audi std four \n", "\n", " body-style drive-wheels engine-location wheel-base length ... \\\n", "0 convertible rwd front 88.6 0.811148 ... \n", "1 convertible rwd front 88.6 0.811148 ... \n", "2 hatchback rwd front 94.5 0.822681 ... \n", "3 sedan fwd front 99.8 0.848630 ... \n", "4 sedan 4wd front 99.4 0.848630 ... \n", "\n", " compression-ratio horsepower peak-rpm city-mpg highway-mpg price \\\n", "0 9.0 111.0 5000.0 21 27 13495.0 \n", "1 9.0 111.0 5000.0 21 27 16500.0 \n", "2 9.0 154.0 5000.0 19 26 16500.0 \n", "3 10.0 102.0 5500.0 24 30 13950.0 \n", "4 8.0 115.0 5500.0 18 22 17450.0 \n", "\n", " city-L/100km horsepower-binned diesel gas \n", "0 11.190476 Medium 0 1 \n", "1 11.190476 Medium 0 1 \n", "2 12.368421 Medium 0 1 \n", "3 9.791667 Medium 0 1 \n", "4 13.055556 Medium 0 1 \n", "\n", "[5 rows x 29 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# path of data \n", "path = 'https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'\n", "df = pd.read_csv(path)\n", "df.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

1. Linear Regression and Multiple Linear Regression

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Linear Regression

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "

One example of a Data Model that we will be using is

\n", "Simple Linear Regression.\n", "\n", "
\n", "

Simple Linear Regression is a method to help us understand the relationship between two variables:

\n", "
\n", "
• The predictor/independent variable (X)
• \n", "
• The response/dependent variable (that we want to predict)(Y)
• \n", "
\n", "\n", "

The result of Linear Regression is a linear function that predicts the response (dependent) variable as a function of the predictor (independent) variable.
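
As a minimal sketch of what such a linear function looks like in practice, the snippet below fits a line of the form Yhat = a + b X by ordinary least squares using NumPy only. The data values are invented for illustration and are not taken from the automobile dataset; the variable names are likewise assumptions.

```python
import numpy as np

# Made-up data for illustration only (not the automobile dataset).
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([3.1, 4.9, 7.2, 8.8, 11.0])

# Closed-form ordinary least squares for a single predictor:
#   b = sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)
#   a = mean(y) - b * mean(x)
x_mean, y_mean = x.mean(), y.mean()
b = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
a = y_mean - b * x_mean

# The fitted linear function gives one predicted response per x value.
yhat = a + b * x
print('intercept a =', round(a, 3), 'slope b =', round(b, 3))
```

The same intercept and slope are what a library such as scikit-learn estimates when it fits a simple linear regression.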

\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", " Y: Response \\ Variable\\\\\n", " X: Predictor \\ Variables\n", "\$\$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " Linear function:\n", "\$\$\n", "Yhat = a + b X\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "
• a refers to the intercept of the regression line, in other words: the value of Y when X is 0
• \n", "
• b refers to the slope of the regression line, in other words: the amount by which Y changes when X increases by 1 unit
• \n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Let's load the modules for linear regression

" ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [], "source": [ "from sklearn.linear_model import LinearRegression" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Create the linear regression object

" ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 4, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm = LinearRegression()\n", "lm" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

How could Highway-mpg help us predict car price?

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For this example, we want to look at how highway-mpg can help us predict car price.\n", "Using simple linear regression, we will create a linear function with \"highway-mpg\" as the predictor variable and the \"price\" as the response variable." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "collapsed": false }, "outputs": [], "source": [ "X = df[['highway-mpg']]\n", "Y = df['price']" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Fit the linear model using highway-mpg." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 6, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.fit(X,Y)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ " We can output a prediction " ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([16236.50464347, 16236.50464347, 17058.23802179, 13771.3045085 ,\n", " 20345.17153508])" ] }, "execution_count": 7, "metadata": {}, "output_type": "execute_result" } ], "source": [ "Yhat=lm.predict(X)\n", "Yhat[0:5] " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What is the value of the intercept (a)?

" ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "38423.305858157386" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What is the value of the Slope (b)?

" ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "data": { "text/plain": [ "array([-821.73337832])" ] }, "execution_count": 9, "metadata": {}, "output_type": "execute_result" } ], "source": [ "lm.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What is the final estimated linear model we get?

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "As we saw above, we should get a final linear model with the structure:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "Yhat = a + b X\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Plugging in the actual values we get:" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "price = 38423.31 - 821.73 x highway-mpg" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 a):

\n", "\n", "Create a linear regression object?\n", "
" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lrm1 = LinearRegression()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 b):

\n", "\n", "Train the model using 'engine-size' as the independent variable and 'price' as the dependent variable?\n", "
" ] }, { "cell_type": "code", "execution_count": 23, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lrm1.fit(df[['engine-size']],df['price'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 c):

\n", "\n", "Find the slope and intercept of the model?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Slope

" ] }, { "cell_type": "code", "execution_count": 24, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "array([166.86001569])" ] }, "execution_count": 24, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lrm1.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Intercept

" ] }, { "cell_type": "code", "execution_count": 21, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "-7963.338906281049" ] }, "execution_count": 21, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lrm1.intercept_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #1 d):

\n", "\n", "What is the equation of the predicted line. You can use x and yhat or 'engine-size' or 'price'?\n", "
" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "# You can type you answer here\n", "yhat = 166.86001569 * x -7963.338906281049\n", "
\n", "price = (166.86001569 * engine_size) -7963.338906281049" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Multiple Linear Regression

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

What if we want to predict car price using more than one variable?

\n", "\n", "

If we want to use more variables in our model to predict car price, we can use Multiple Linear Regression.\n", "Multiple Linear Regression is very similar to Simple Linear Regression, but this method is used to explain the relationship between one continuous response (dependent) variable and two or more predictor (independent) variables.\n", "Most real-world regression models involve multiple predictors. We will illustrate the structure using four predictor variables, but the results generalize to any number of predictors:

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "Y: Response \\ Variable\\\\\n", "X_1 :Predictor\\ Variable \\ 1\\\\\n", "X_2: Predictor\\ Variable \\ 2\\\\\n", "X_3: Predictor\\ Variable \\ 3\\\\\n", "X_4: Predictor\\ Variable \\ 4\\\\\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "a: intercept\\\\\n", "b_1 :coefficients \\ of\\ Variable \\ 1\\\\\n", "b_2: coefficients \\ of\\ Variable \\ 2\\\\\n", "b_3: coefficients \\ of\\ Variable \\ 3\\\\\n", "b_4: coefficients \\ of\\ Variable \\ 4\\\\\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The equation is given by" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\$\$\n", "Yhat = a + b_1 X_1 + b_2 X_2 + b_3 X_3 + b_4 X_4\n", "\$\$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

From the previous section we know that other good predictors of price could be:

\n", "
\n", "
• Horsepower
• \n", "
• Curb-weight
• \n", "
• Engine-size
• \n", "
• Highway-mpg
• \n", "
\n", "
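
A minimal sketch of fitting the four-predictor equation above with scikit-learn follows. It uses a synthetic stand-in for the automobile data: the column names match the list above, but every value, the true coefficients, and the names Z, price and lm_multi are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the automobile data; all values are invented.
rng = np.random.default_rng(0)
n = 50
Z = pd.DataFrame({
    'horsepower': rng.uniform(50, 250, n),
    'curb-weight': rng.uniform(1500, 4000, n),
    'engine-size': rng.uniform(60, 330, n),
    'highway-mpg': rng.uniform(16, 54, n),
})

# Price is generated from known coefficients so the fit can be checked.
true_a = 1000.0
true_b = np.array([50.0, 4.0, 80.0, -30.0])
price = true_a + Z.to_numpy() @ true_b

lm_multi = LinearRegression()
lm_multi.fit(Z, price)

# intercept_ estimates a; coef_ holds b_1 ... b_4 in column order.
print(lm_multi.intercept_, lm_multi.coef_)
```

Because the synthetic price contains no noise, the fitted intercept and coefficients should match the values used to generate it; with real data they would only be estimates.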

Question #2 a):

\n", "Create and train a Multiple Linear Regression model \"lm2\" where the response variable is 'price' and the predictor variables are 'normalized-losses' and 'highway-mpg'.\n", "
" ] }, { "cell_type": "code", "execution_count": 32, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/plain": [ "LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,\n", " normalize=False)" ] }, "execution_count": 32, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm2 = LinearRegression()\n", "PV = df[['normalized-losses','highway-mpg']]\n", "lm2.fit(PV,df['price'])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "
\n", "

Question #2 b):

\n", "Find the coefficient of the model?\n", "
" ] }, { "cell_type": "code", "execution_count": 34, "metadata": {}, "outputs": [ { "data": { "text/plain": [ "array([ 1.49789586, -820.45434016])" ] }, "execution_count": 34, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Write your code below and press Shift+Enter to execute \n", "lm2.coef_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Double-click here for the solution.\n", "\n", "" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

2) Model Evaluation using Visualization

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that we've developed some models, how do we evaluate our models and how do we choose the best one? One way to do this is by using visualization." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "import the visualization package: seaborn" ] }, { "cell_type": "code", "execution_count": 35, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# import the visualization package: seaborn\n", "import seaborn as sns\n", "%matplotlib inline " ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Regression Plot

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

When it comes to simple linear regression, an excellent way to visualize the fit of our model is by using regression plots.

\n", "\n", "

This plot shows a combination of scattered data points (a scatter plot) and the fitted linear regression line going through the data. This gives us a reasonable estimate of the relationship between the two variables, the strength of the correlation, and its direction (positive or negative correlation).
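
The strength and direction that a regression plot conveys visually can also be summarized numerically. The sketch below, which uses invented values rather than the automobile dataset, computes the Pearson correlation coefficient and the slope of the fitted line with NumPy; the variable names are assumptions.

```python
import numpy as np

# Invented values that mimic a negative relationship
# (higher highway mpg, lower price).
highway_mpg = np.array([22.0, 26.0, 27.0, 30.0, 34.0, 38.0])
price = np.array([17500.0, 16500.0, 14000.0, 13900.0, 9000.0, 7500.0])

# Pearson correlation: the sign gives the direction,
# the magnitude (between 0 and 1) gives the strength.
r = np.corrcoef(highway_mpg, price)[0, 1]

# Slope and intercept of the straight line a regression plot would draw.
slope, intercept = np.polyfit(highway_mpg, price, deg=1)

direction = 'negative' if r < 0 else 'positive'
print('r =', round(r, 3), 'direction:', direction)
```

A downward-sloping line in the plot corresponds to a negative r and a negative slope here.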

|   | horsepower | curb-weight | engine-size | highway-mpg |
|---|---|---|---|---|
| 0 | 111.0 | 2548 | 130 | 27 |
| 1 | 111.0 | 2548 | 130 | 27 |
| 2 | 154.0 | 2823 | 152 | 26 |
| 3 | 102.0 | 2337 | 109 | 30 |
| 4 | 115.0 | 2824 | 136 | 22 |

Pipeline

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Data Pipelines simplify the steps of processing the data. We use the module Pipeline to create a pipeline. We also use StandardScaler as a step in our pipeline.
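
As a rough sketch of how these two pieces fit together, the snippet below chains StandardScaler and LinearRegression in a scikit-learn Pipeline. The feature matrix, the target, and the step names 'scale' and 'model' are invented for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented features (two columns on very different scales) and target.
rng = np.random.default_rng(1)
features = np.column_stack([
    rng.uniform(50, 250, 40),
    rng.uniform(1500, 4000, 40),
])
target = 500.0 + 60.0 * features[:, 0] + 5.0 * features[:, 1]

# Each step is a (name, estimator) pair; steps run in the order listed.
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('model', LinearRegression()),
])

# fit() standardizes the features, then trains the regression on the result.
pipe.fit(features, target)

# predict() applies the same scaling before calling the model.
predictions = pipe.predict(features)
print(predictions[:3])
```

Bundling the scaler with the model means the scaling learned during fit is reused automatically at prediction time.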

\n", "

Question #5:

\n", "Create a pipeline that standardizes the data, then performs prediction with a linear regression model using the features Z and target y\n", "

Part 4: Measures for In-Sample Evaluation

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

When evaluating our models, not only do we want to visualize the results, but we also want a quantitative measure to determine how accurate the model is.

\n", "\n", "

Two very important measures that are often used in Statistics to determine the accuracy of a model are:

\n", "
\n", "
• R^2 / R-squared
• \n", "
• Mean Squared Error (MSE)
• \n", "
\n", " \n", "R-squared\n", "\n", "

R squared, also known as the coefficient of determination, is a measure to indicate how close the data is to the fitted regression line.

\n", " \n", "

The value of the R-squared is the percentage of variation of the response variable (y) that is explained by a linear model.
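
A minimal sketch of the calculation, using the standard definition R^2 = 1 - SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the total sum of squares around the mean of y. The actual and predicted values below are invented for illustration.

```python
import numpy as np

# Invented actual values and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
yhat = np.array([2.8, 5.3, 7.1, 8.7, 11.1])

# Variation left unexplained by the model.
ss_res = np.sum((y - yhat) ** 2)

# Total variation of y around its mean.
ss_tot = np.sum((y - y.mean()) ** 2)

# Share of the variation in y that the model explains.
r_squared = 1.0 - ss_res / ss_tot
print('R^2 =', round(r_squared, 3))
```

A value close to 1 means the predictions account for almost all of the variation in y.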

\n", "\n", "\n", "\n", "Mean Squared Error (MSE)\n", "\n", "

The Mean Squared Error is the average of the squared errors, where each error is the difference between the actual value (y) and the estimated value (ŷ).
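
The definition above can be sketched directly with NumPy. The actual and predicted values are invented for illustration.

```python
import numpy as np

# Invented actual values and model predictions.
y = np.array([3.0, 5.0, 7.0, 9.0, 11.0])
yhat = np.array([2.8, 5.3, 7.1, 8.7, 11.1])

# Error for each observation: actual minus estimated value.
errors = y - yhat

# Mean Squared Error: average of the squared errors.
mse = np.mean(errors ** 2)
print('MSE =', round(mse, 4))
```

Because the errors are squared, the MSE is expressed in squared units of the response variable and penalizes large errors more heavily than small ones.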

" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "

Model 1: Simple Linear Regression