Skip to content

Instantly share code, notes, and snippets.

Show Gist options
  • Star 0 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save MikeTange/699571d9b9d4e5715739eb90917ce840 to your computer and use it in GitHub Desktop.
Save MikeTange/699571d9b9d4e5715739eb90917ce840 to your computer and use it in GitHub Desktop.
Created on Cognitive Class Labs
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
" <a href=\"https://cocl.us/DA0101EN_edx_link_Notebook_link_top\">\n",
" <img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/Images/TopAd.png\" width=\"750\" align=\"center\">\n",
" </a>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href=\"https://www.bigdatauniversity.com\"><img src = \"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/Images/CCLog.png\" width = 300, align = \"center\"></a>\n",
"\n",
"<h1 align=center><font size = 5>Data Analysis with Python</font></h1>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Exploratory Data Analysis"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Welcome!</h3>\n",
"In this section, we will explore several methods to see if certain characteristics or features can be used to predict car price. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Table of content</h2>\n",
"\n",
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"<ol>\n",
" <li><a href=\"#import_data\">Import Data from Module</a></li>\n",
" <li><a href=\"#pattern_visualization\">Analyzing Individual Feature Patterns using Visualization</a></li>\n",
" <li><a href=\"#discriptive_statistics\">Descriptive Statistical Analysis</a></li>\n",
" <li><a href=\"#basic_grouping\">Basics of Grouping</a></li>\n",
" <li><a href=\"#correlation_causation\">Correlation and Causation</a></li>\n",
" <li><a href=\"#anova\">ANOVA</a></li>\n",
"</ol>\n",
" \n",
"Estimated Time Needed: <strong>30 min</strong>\n",
"</div>\n",
" \n",
"<hr>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>What are the main characteristics which have the most impact on the car price?</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"import_data\">1. Import Data from Module 2</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Setup</h4>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Import libraries "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import pandas as pd\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" load data and store in dataframe df:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This dataset was hosted on IBM Cloud object click <a href=\"https://cocl.us/edx_DA0101EN_objectstorage\">HERE</a> for free storage."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'\n",
"df = pd.read_csv(path)\n",
"df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"pattern_visualization\">2. Analyzing Individual Feature Patterns using Visualization</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To install seaborn we use the pip which is the python package manager."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"! pip install seaborn"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Import visualization packages \"Matplotlib\" and \"Seaborn\", don't forget about \"%matplotlib inline\" to plot in a Jupyter notebook."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>How to choose the right visualization method?</h4>\n",
"<p>When visualizing individual variables, it is important to first understand what type of variable you are dealing with. This will help us find the right visualization method for that variable.</p>\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# list the data types for each column\n",
"print(df.dtypes)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h3>Question #1:</h3>\n",
"\n",
"<b>What is the data type of the column \"peak-rpm\"? </b>\n",
"</div>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click <b>here</b> for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"\n",
"float64\n",
"\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"for example, we can calculate the correlation between variables of type \"int64\" or \"float64\" using the method \"corr\":"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The diagonal elements are always one; we will study correlation more precisely Pearson correlation in-depth at the end of the notebook."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Question #2: </h1>\n",
"\n",
"<p>Find the correlation between the following columns: bore, stroke,compression-ratio , and horsepower.</p>\n",
"<p>Hint: if you would like to select those columns use the following syntax: df[['bore','stroke' ,'compression-ratio','horsepower']]</p>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"# Write your code below and press Shift+Enter to execute \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click <b>here</b> for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"\n",
"df[['bore', 'stroke', 'compression-ratio', 'horsepower']].corr() \n",
"\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2>Continuous numerical variables:</h2> \n",
"\n",
"<p>Continuous numerical variables are variables that may contain any value within some range. Continuous numerical variables can have the type \"int64\" or \"float64\". A great way to visualize these variables is by using scatterplots with fitted lines.</p>\n",
"\n",
"<p>In order to start understanding the (linear) relationship between an individual variable and the price. We can do this by using \"regplot\", which plots the scatterplot plus the fitted regression line for the data.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's see several examples of different linear relationships:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Positive linear relationship</h4>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's find the scatterplot of \"engine-size\" and \"price\" "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"# Engine size as potential predictor variable of price\n",
"sns.regplot(x=\"engine-size\", y=\"price\", data=df)\n",
"plt.ylim(0,)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>As the engine-size goes up, the price goes up: this indicates a positive direct correlation between these two variables. Engine size seems like a pretty good predictor of price since the regression line is almost a perfect diagonal line.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We can examine the correlation between 'engine-size' and 'price' and see it's approximately 0.87"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[[\"engine-size\", \"price\"]].corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Highway mpg is a potential predictor variable of price "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sns.regplot(x=\"highway-mpg\", y=\"price\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>As the highway-mpg goes up, the price goes down: this indicates an inverse/negative relationship between these two variables. Highway mpg could potentially be a predictor of price.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can examine the correlation between 'highway-mpg' and 'price' and see it's approximately -0.704"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[['highway-mpg', 'price']].corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Weak Linear Relationship</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's see if \"Peak-rpm\" as a predictor variable of \"price\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"sns.regplot(x=\"peak-rpm\", y=\"price\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Peak rpm does not seem like a good predictor of the price at all since the regression line is close to horizontal. Also, the data points are very scattered and far from the fitted line, showing lots of variability. Therefore it's it is not a reliable variable.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can examine the correlation between 'peak-rpm' and 'price' and see it's approximately -0.101616 "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df[['peak-rpm','price']].corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" <div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1> Question 3 a): </h1>\n",
"\n",
"<p>Find the correlation between x=\"stroke\", y=\"price\".</p>\n",
"<p>Hint: if you would like to select those columns use the following syntax: df[[\"stroke\",\"price\"]] </p>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Write your code below and press Shift+Enter to execute\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click <b>here</b> for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"\n",
"#The correlation is 0.0823, the non-diagonal elements of the table.\n",
"#code:\n",
"df[[\"stroke\",\"price\"]].corr() \n",
"\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1>Question 3 b):</h1>\n",
"\n",
"<p>Given the correlation results between \"price\" and \"stroke\" do you expect a linear relationship?</p> \n",
"<p>Verify your results using the function \"regplot()\".</p>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Write your code below and press Shift+Enter to execute \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click <b>here</b> for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"\n",
"#There is a weak correlation between the variable 'stroke' and 'price.' as such regression will not work well. We #can see this use \"regplot\" to demonstrate this.\n",
"\n",
"#Code: \n",
"sns.regplot(x=\"stroke\", y=\"price\", data=df)\n",
"\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Categorical variables</h3>\n",
"\n",
"<p>These are variables that describe a 'characteristic' of a data unit, and are selected from a small group of categories. The categorical variables can have the type \"object\" or \"int64\". A good way to visualize categorical variables is by using boxplots.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's look at the relationship between \"body-style\" and \"price\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"sns.boxplot(x=\"body-style\", y=\"price\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>We see that the distributions of price between the different body-style categories have a significant overlap, and so body-style would not be a good predictor of price. Let's examine engine \"engine-location\" and \"price\":</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"sns.boxplot(x=\"engine-location\", y=\"price\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Here we see that the distribution of price between these two engine-location categories, front and rear, are distinct enough to take engine-location as a potential good predictor of price.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's examine \"drive-wheels\" and \"price\"."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"# drive-wheels\n",
"sns.boxplot(x=\"drive-wheels\", y=\"price\", data=df)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Here we see that the distribution of price between the different drive-wheels categories differs; as such drive-wheels could potentially be a predictor of price.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"discriptive_statistics\">3. Descriptive Statistical Analysis</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Let's first take a look at the variables by utilizing a description method.</p>\n",
"\n",
"<p>The <b>describe</b> function automatically computes basic statistics for all continuous variables. Any NaN values are automatically skipped in these statistics.</p>\n",
"\n",
"This will show:\n",
"<ul>\n",
" <li>the count of that variable</li>\n",
" <li>the mean</li>\n",
" <li>the standard deviation (std)</li> \n",
" <li>the minimum value</li>\n",
" <li>the IQR (Interquartile Range: 25%, 50% and 75%)</li>\n",
" <li>the maximum value</li>\n",
"<ul>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We can apply the method \"describe\" as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" The default setting of \"describe\" skips variables of type object. We can apply the method \"describe\" on the variables of type 'object' as follows:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"df.describe(include=['object'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Value Counts</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Value-counts is a good way of understanding how many units of each characteristic/variable we have. We can apply the \"value_counts\" method on the column 'drive-wheels'. Don’t forget the method \"value_counts\" only works on Pandas series, not Pandas Dataframes. As a result, we only include one bracket \"df['drive-wheels']\" not two brackets \"df[['drive-wheels']]\".</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['drive-wheels'].value_counts()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can convert the series to a Dataframe as follows :"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['drive-wheels'].value_counts().to_frame()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's repeat the above steps but save the results to the dataframe \"drive_wheels_counts\" and rename the column 'drive-wheels' to 'value_counts'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"drive_wheels_counts = df['drive-wheels'].value_counts().to_frame()\n",
"drive_wheels_counts.rename(columns={'drive-wheels': 'value_counts'}, inplace=True)\n",
"drive_wheels_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Now let's rename the index to 'drive-wheels':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"drive_wheels_counts.index.name = 'drive-wheels'\n",
"drive_wheels_counts"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can repeat the above process for the variable 'engine-location'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# engine-location as variable\n",
"engine_loc_counts = df['engine-location'].value_counts().to_frame()\n",
"engine_loc_counts.rename(columns={'engine-location': 'value_counts'}, inplace=True)\n",
"engine_loc_counts.index.name = 'engine-location'\n",
"engine_loc_counts.head(10)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Examining the value counts of the engine location would not be a good predictor variable for the price. This is because we only have three cars with a rear engine and 198 with an engine in the front, this result is skewed. Thus, we are not able to draw any conclusions about the engine location.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"basic_grouping\">4. Basics of Grouping</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>The \"groupby\" method groups data by different categories. The data is grouped based on one or several variables and analysis is performed on the individual groups.</p>\n",
"\n",
"<p>For example, let's group by the variable \"drive-wheels\". We see that there are 3 different categories of drive wheels.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df['drive-wheels'].unique()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>If we want to know, on average, which type of drive wheel is most valuable, we can group \"drive-wheels\" and then average them.</p>\n",
"\n",
"<p>We can select the columns 'drive-wheels', 'body-style' and 'price', then assign it to the variable \"df_group_one\".</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"df_group_one = df[['drive-wheels','body-style','price']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can then calculate the average price for each of the different categories of data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# grouping results\n",
"df_group_one = df_group_one.groupby(['drive-wheels'],as_index=False).mean()\n",
"df_group_one"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>From our data, it seems rear-wheel drive vehicles are, on average, the most expensive, while 4-wheel and front-wheel are approximately the same in price.</p>\n",
"\n",
"<p>You can also group with multiple variables. For example, let's group by both 'drive-wheels' and 'body-style'. This groups the dataframe by the unique combinations 'drive-wheels' and 'body-style'. We can store the results in the variable 'grouped_test1'.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# grouping results\n",
"df_gptest = df[['drive-wheels','body-style','price']]\n",
"grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()\n",
"grouped_test1"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>This grouped data is much easier to visualize when it is made into a pivot table. A pivot table is like an Excel spreadsheet, with one variable along the column and another along the row. We can convert the dataframe to a pivot table using the method \"pivot \" to create a pivot table from the groups.</p>\n",
"\n",
"<p>In this case, we will leave the drive-wheel variable as the rows of the table, and pivot body-style to become the columns of the table:</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')\n",
"grouped_pivot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Often, we won't have data for some of the pivot cells. We can fill these missing cells with the value 0, but any other value could potentially be used as well. It should be mentioned that missing data is quite a complex subject and is an entire course on its own.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"grouped_pivot = grouped_pivot.fillna(0) #fill missing values with 0\n",
"grouped_pivot"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-danger alertdanger\" style=\"margin-top: 20px\">\n",
"<h1>Question 4:</h1>\n",
"\n",
"<p>Use the \"groupby\" function to find the average \"price\" of each car based on \"body-style\" ? </p>\n",
"</div>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# Write your code below and press Shift+Enter to execute \n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Double-click <b>here</b> for the solution.\n",
"\n",
"<!-- The answer is below:\n",
"\n",
"# grouping results\n",
"df_gptest2 = df[['body-style','price']]\n",
"grouped_test_bodystyle = df_gptest2.groupby(['body-style'],as_index= False).mean()\n",
"grouped_test_bodystyle\n",
"\n",
"-->"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you did not import \"pyplot\" let's do it again. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt\n",
"%matplotlib inline "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>Variables: Drive Wheels and Body Style vs Price</h4>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's use a heat map to visualize the relationship between Body Style vs Price."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": false
},
"outputs": [],
"source": [
"#use the grouped results\n",
"plt.pcolor(grouped_pivot, cmap='RdBu')\n",
"plt.colorbar()\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>The heatmap plots the target variable (price) proportional to colour with respect to the variables 'drive-wheel' and 'body-style' in the vertical and horizontal axis respectively. This allows us to visualize how the price is related to 'drive-wheel' and 'body-style'.</p>\n",
"\n",
"<p>The default labels convey no useful information to us. Let's change that:</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"fig, ax = plt.subplots()\n",
"im = ax.pcolor(grouped_pivot, cmap='RdBu')\n",
"\n",
"#label names\n",
"row_labels = grouped_pivot.columns.levels[1]\n",
"col_labels = grouped_pivot.index\n",
"\n",
"#move ticks and labels to the center\n",
"ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)\n",
"ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)\n",
"\n",
"#insert labels\n",
"ax.set_xticklabels(row_labels, minor=False)\n",
"ax.set_yticklabels(col_labels, minor=False)\n",
"\n",
"#rotate label if too long\n",
"plt.xticks(rotation=90)\n",
"\n",
"fig.colorbar(im)\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Visualization is very important in data science, and Python visualization packages provide great freedom. We will go more in-depth in a separate Python Visualizations course.</p>\n",
"\n",
"<p>The main question we want to answer in this module, is \"What are the main characteristics which have the most impact on the car price?\".</p>\n",
"\n",
"<p>To get a better measure of the important characteristics, we look at the correlation of these variables with the car price, in other words: how is the car price dependent on this variable?</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"correlation_causation\">5. Correlation and Causation</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p><b>Correlation</b>: a measure of the extent of interdependence between variables.</p>\n",
"\n",
"<p><b>Causation</b>: the relationship between cause and effect between two variables.</p>\n",
"\n",
"<p>It is important to know the difference between these two and that correlation does not imply causation. Determining correlation is much simpler the determining causation as causation may require independent experimentation.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p3>Pearson Correlation</p>\n",
"<p>The Pearson Correlation measures the linear dependence between two variables X and Y.</p>\n",
"<p>The resulting coefficient is a value between -1 and 1 inclusive, where:</p>\n",
"<ul>\n",
" <li><b>1</b>: Total positive linear correlation.</li>\n",
" <li><b>0</b>: No linear correlation, the two variables most likely do not affect each other.</li>\n",
" <li><b>-1</b>: Total negative linear correlation.</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Pearson Correlation is the default method of the function \"corr\". Like before we can calculate the Pearson Correlation of the of the 'int64' or 'float64' variables.</p>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"df.corr()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" sometimes we would like to know the significant of the correlation estimate. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<b>P-value</b>: \n",
"<p>What is this P-value? The P-value is the probability value that the correlation between these two variables is statistically significant. Normally, we choose a significance level of 0.05, which means that we are 95% confident that the correlation between the variables is significant.</p>\n",
"\n",
"By convention, when the\n",
"<ul>\n",
" <li>p-value is $<$ 0.001: we say there is strong evidence that the correlation is significant.</li>\n",
" <li>the p-value is $<$ 0.05: there is moderate evidence that the correlation is significant.</li>\n",
" <li>the p-value is $<$ 0.1: there is weak evidence that the correlation is significant.</li>\n",
" <li>the p-value is $>$ 0.1: there is no evidence that the correlation is significant.</li>\n",
"</ul>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We can obtain this information using \"stats\" module in the \"scipy\" library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from scipy import stats"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Wheel-base vs Price</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's calculate the Pearson Correlation Coefficient and P-value of 'wheel-base' and 'price'. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['wheel-base'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P =\", p_value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"<p>Since the p-value is $<$ 0.001, the correlation between wheel-base and price is statistically significant, although the linear relationship isn't extremely strong (~0.585)</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Horsepower vs Price</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's calculate the Pearson Correlation Coefficient and P-value of 'horsepower' and 'price'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P = \", p_value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"\n",
"<p>Since the p-value is $<$ 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.809, close to 1)</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Length vs Price</h3>\n",
"\n",
"Let's calculate the Pearson Correlation Coefficient and P-value of 'length' and 'price'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['length'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P = \", p_value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"<p>Since the p-value is $<$ 0.001, the correlation between length and price is statistically significant, and the linear relationship is moderately strong (~0.691).</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Width vs Price</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's calculate the Pearson Correlation Coefficient and P-value of 'width' and 'price':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['width'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P =\", p_value ) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Conclusion:\n",
"\n",
"Since the p-value is < 0.001, the correlation between width and price is statistically significant, and the linear relationship is quite strong (~0.751)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Curb-weight vs Price"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's calculate the Pearson Correlation Coefficient and P-value of 'curb-weight' and 'price':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['curb-weight'], df['price'])\n",
"print( \"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P = \", p_value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"<p>Since the p-value is $<$ 0.001, the correlation between curb-weight and price is statistically significant, and the linear relationship is quite strong (~0.834).</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Engine-size vs Price</h3>\n",
"\n",
"Let's calculate the Pearson Correlation Coefficient and P-value of 'engine-size' and 'price':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['engine-size'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P =\", p_value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"\n",
"<p>Since the p-value is $<$ 0.001, the correlation between engine-size and price is statistically significant, and the linear relationship is very strong (~0.872).</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Bore vs Price</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's calculate the Pearson Correlation Coefficient and P-value of 'bore' and 'price':"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['bore'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P = \", p_value ) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"<p>Since the p-value is $<$ 0.001, the correlation between bore and price is statistically significant, but the linear relationship is only moderate (~0.521).</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We can relate the process for each 'City-mpg' and 'Highway-mpg':"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>City-mpg vs Price</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['city-mpg'], df['price'])\n",
"print(\"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P = \", p_value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h5>Conclusion:</h5>\n",
"<p>Since the p-value is $<$ 0.001, the correlation between city-mpg and price is statistically significant, and the coefficient of ~ -0.687 shows that the relationship is negative and moderately strong.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Highway-mpg vs Price</h3>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"pearson_coef, p_value = stats.pearsonr(df['highway-mpg'], df['price'])\n",
"print( \"The Pearson Correlation Coefficient is\", pearson_coef, \" with a P-value of P = \", p_value ) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### Conclusion:\n",
"Since the p-value is < 0.001, the correlation between highway-mpg and price is statistically significant, and the coefficient of ~ -0.705 shows that the relationship is negative and moderately strong."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 id=\"anova\">6. ANOVA</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>ANOVA: Analysis of Variance</h3>\n",
"<p>The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:</p>\n",
"\n",
"<p><b>F-test score</b>: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from the assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.</p>\n",
"\n",
"<p><b>P-value</b>: P-value tells how statistically significant is our calculated score value.</p>\n",
"\n",
"<p>If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Drive Wheels</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>Since ANOVA analyzes the difference between different groups of the same variable, the groupby function will come in handy. Because the ANOVA algorithm averages the data automatically, we do not need to take the average before hand.</p>\n",
"\n",
"<p>Let's see if different types 'drive-wheels' impact 'price', we group the data.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's see if different types 'drive-wheels' impact 'price', we group the data."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"grouped_test2=df_gptest[['drive-wheels', 'price']].groupby(['drive-wheels'])\n",
"grouped_test2.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_gptest"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" We can obtain the values of the method group using the method \"get_group\". "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"grouped_test2.get_group('4wd')['price']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"we can use the function 'f_oneway' in the module 'stats' to obtain the <b>F-test score</b> and <b>P-value</b>."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"# ANOVA\n",
"f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price']) \n",
" \n",
"print( \"ANOVA results: F=\", f_val, \", P =\", p_val) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This is a great result, with a large F test score showing a strong correlation and a P value of almost 0 implying almost certain statistical significance. But does this mean all three tested groups are all this highly correlated? "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Separately: fwd and rwd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price']) \n",
" \n",
"print( \"ANOVA results: F=\", f_val, \", P =\", p_val )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
" Let's examine the other groups "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### 4wd and rwd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false,
"scrolled": true
},
"outputs": [],
"source": [
"f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('rwd')['price']) \n",
" \n",
"print( \"ANOVA results: F=\", f_val, \", P =\", p_val) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h4>4wd and fwd</h4>"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"collapsed": false
},
"outputs": [],
"source": [
"f_val, p_val = stats.f_oneway(grouped_test2.get_group('4wd')['price'], grouped_test2.get_group('fwd')['price']) \n",
" \n",
"print(\"ANOVA results: F=\", f_val, \", P =\", p_val) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Conclusion: Important Variables</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<p>We now have a better idea of what our data looks like and which variables are important to take into account when predicting the car price. We have narrowed it down to the following variables:</p>\n",
"\n",
"Continuous numerical variables:\n",
"<ul>\n",
" <li>Length</li>\n",
" <li>Width</li>\n",
" <li>Curb-weight</li>\n",
" <li>Engine-size</li>\n",
" <li>Horsepower</li>\n",
" <li>City-mpg</li>\n",
" <li>Highway-mpg</li>\n",
" <li>Wheel-base</li>\n",
" <li>Bore</li>\n",
"</ul>\n",
" \n",
"Categorical variables:\n",
"<ul>\n",
" <li>Drive-wheels</li>\n",
"</ul>\n",
"\n",
"<p>As we now move into building machine learning models to automate our analysis, feeding the model with variables that meaningfully affect our target variable will improve our model's prediction performance.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h1>Thank you for completing this notebook</h1>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<div class=\"alert alert-block alert-info\" style=\"margin-top: 20px\">\n",
"\n",
" <p><a href=\"https://cocl.us/DA0101EN_edx_link_Notebook_bottom\"><img src=\"https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/Images/BottomAd.png\" width=\"750\" align=\"center\"></a></p>\n",
"</div>\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>About the Authors:</h3>\n",
"\n",
"This notebook was written by <a href=\"https://www.linkedin.com/in/mahdi-noorian-58219234/\" target=\"_blank\">Mahdi Noorian PhD</a>, <a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a>, Bahare Talayian, Eric Xiao, Steven Dong, Parizad, Hima Vsudevan and <a href=\"https://www.linkedin.com/in/fiorellawever/\" target=\"_blank\">Fiorella Wenver</a> and <a href=\" https://www.linkedin.com/in/yi-leng-yao-84451275/ \" target=\"_blank\" >Yi Yao</a>.\n",
"\n",
"<p><a href=\"https://www.linkedin.com/in/joseph-s-50398b136/\" target=\"_blank\">Joseph Santarcangelo</a> is a Data Scientist at IBM, and holds a PhD in Electrical Engineering. His research focused on using Machine Learning, Signal Processing, and Computer Vision to determine how videos impact human cognition. Joseph has been working for IBM since he completed his PhD.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<hr>\n",
"<p>Copyright &copy; 2018 IBM Developer Skills Network. This notebook and its source code are released under the terms of the <a href=\"https://cognitiveclass.ai/mit-license/\">MIT License</a>.</p>"
]
}
],
"metadata": {
"anaconda-cloud": {},
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.7"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment