Last active
October 15, 2019 22:30
{ | |
"cells": [ | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# REG200 - Regression 200 Homework example" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"By the end of this module, the participant will be able to:\n", | |
"- Measure the strength of correlation between two variables\n", | |
"- Determine if a correlation coefficient is statistically significant\n", | |
"- Measure the correlation between multiple variables\n" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## How to use this notebook\n", | |
"\n", | |
"Try running the cell below:\n", | |
"\n", | |
"1. Click into the code portion of the cell near `3 + 8`\n", | |
"2. Press Shift and Enter at the same time to run the cell and move to the next.\n", | |
"3. Press Shift+Enter two more times to set the value of a and print a message.\n", | |
"\n", | |
"The last value in a cell is shown as output." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"3 + 8" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"a = 4" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"print(f\"I have {a} chicken wings.\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Repeatedly run a cell\n", | |
"\n", | |
"Click into the cell below and repeatedly press Ctrl+Enter. Each run adds 2 to `a`, because the value is kept between runs." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"a += 2\n", | |
"print(f\"I have {a} chicken wings.\")" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## For the rest of the notebook, run all of the code cells\n", | |
"\n", | |
"You can edit values or code and rerun cells, but I recommend going straight through all of the cells once first." | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Import libraries for using data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# for spreadsheets that we call DataFrames\n", | |
"import pandas as pd\n", | |
"\n", | |
"# for plotting\n", | |
"import matplotlib.pyplot as plt" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"## Read in the data" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# emulate a .txt file using text, typically you just provide a filename to pandas\n", | |
"from io import StringIO\n", | |
"\n", | |
"ab = StringIO(\n", | |
"\"\"\"A\tB\n", | |
"1\t3.8380\n", | |
"3\t10.2051\n", | |
"16\t29.4932\n", | |
"4\t14.0697\n", | |
"5\t15.7511\n", | |
"19\t44.7980\n", | |
"6\t14.2803\n", | |
"4\t16.9219\n", | |
"7\t15.8849\n", | |
"3\t10.5358\n", | |
"22\t53.2528\n", | |
"7\t22.3758\n", | |
"2\t15.8896\n", | |
"1\t-2.1355\n", | |
"4\t21.5362\n", | |
"6\t25.0148\n", | |
"12\t25.1035\n", | |
"4\t3.2217\n", | |
"23\t55.7904\n", | |
"16\t29.5169\n", | |
"14\t41.6494\n", | |
"13\t37.2478\n", | |
"19\t41.1936\n", | |
"18\t53.9627\n", | |
"21\t63.9905\n", | |
"24\t51.5585\n", | |
"17\t34.6849\n", | |
"5\t8.4110\n", | |
"10\t24.9118\n", | |
"14\t32.9547\"\"\")\n", | |
"\n", | |
"df_ab = pd.read_table(ab)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Show the first 4 rows of your dataframe" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_ab.head(4)" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"plt.scatter(x=df_ab['A'], y=df_ab['B'])\n", | |
"plt.show()" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": { | |
"scrolled": true | |
}, | |
"outputs": [], | |
"source": [ | |
"xy = StringIO(\n", | |
"\"\"\"X\tY\n", | |
"-12.3355\t91.6910\n", | |
"-10.5935\t56.0191\n", | |
"-8.9611\t50.3329\n", | |
"-9.1287\t34.4467\n", | |
"-7.8851\t38.4195\n", | |
"-7.5763\t25.3774\n", | |
"-5.4090\t9.4484\n", | |
"-4.2056\t13.7673\n", | |
"-3.8530\t16.9343\n", | |
"-2.9034\t3.7421\n", | |
"-1.1896\t19.8859\n", | |
"-1.9951\t-2.7608\n", | |
"-0.9127\t-5.8631\n", | |
"-0.6323\t-0.3808\n", | |
"2.3055\t2.6454\n", | |
"2.8040\t9.6105\n", | |
"5.6442\t18.5256\n", | |
"5.5861\t9.7630\n", | |
"4.6719\t20.6125\n", | |
"5.9124\t38.4381\n", | |
"8.8544\t56.0579\n", | |
"8.2864\t46.5456\n", | |
"10.4703\t49.9356\n", | |
"11.5479\t86.3035\n", | |
"11.5704\t79.8768\n", | |
"-13.4738\t83.3070\n", | |
"-11.6391\t57.1737\n", | |
"-10.9981\t52.5305\n", | |
"-9.6959\t39.9723\n", | |
"-5.8497\t39.7582\n", | |
"-7.2250\t23.0322\n", | |
"-4.5866\t22.4866\n", | |
"-3.5534\t-0.6284\n", | |
"-2.8923\t21.9397\n", | |
"-0.9984\t18.4569\n", | |
"-1.7183\t21.7793\n", | |
"-0.4028\t4.4911\n", | |
"0.7777\t-6.4068\n", | |
"1.9729\t-2.6752\n", | |
"2.3569\t18.0500\n", | |
"1.4580\t10.1399\n", | |
"4.9599\t0.4283\n", | |
"6.7307\t0.3959\n", | |
"5.8546\t18.0307\n", | |
"7.8785\t30.2044\n", | |
"6.4566\t34.3580\n", | |
"9.3001\t60.2244\n", | |
"8.8975\t62.0833\n", | |
"11.2928\t56.0161\n", | |
"12.4169\t88.8742\n", | |
"-13.7014\t65.4625\n", | |
"-10.1969\t72.7266\n", | |
"-9.8106\t65.2517\n", | |
"-8.5776\t43.2419\n", | |
"-7.5841\t35.3953\n", | |
"-9.4148\t39.6577\n", | |
"-5.8556\t19.6736\n", | |
"-4.6215\t24.8778\n", | |
"-4.3202\t13.1955\n", | |
"-4.5850\t-5.1053\n", | |
"-2.0614\t10.0775\n", | |
"-0.7651\t-3.1292\n", | |
"0.0748\t-11.3666\n", | |
"0.3974\t18.5064\n", | |
"1.5747\t13.3974\n", | |
"2.7116\t25.6631\n", | |
"4.8122\t11.5316\n", | |
"4.1023\t27.5848\n", | |
"6.2293\t39.5805\n", | |
"7.5989\t34.2157\n", | |
"7.4117\t50.3946\n", | |
"9.1963\t38.8196\n", | |
"9.5195\t56.1779\n", | |
"12.1518\t68.0794\n", | |
"11.6490\t69.7307\"\"\")\n", | |
"df_xy = pd.read_table(xy)\n", | |
"df_xy.head(5)" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Get some basic summary statistics about each variable" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"df_xy.describe()" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Calculate the correlation between each pair of variables" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# pearson correlation matrix\n", | |
"\"\"\"\n", | |
"df.corr(method='pearson')\n", | |
"\n", | |
"Compute pairwise correlation of columns\n", | |
"\n", | |
"method : {'pearson', 'kendall', 'spearman'}\n", | |
"\"\"\"\n", | |
"df_xy.corr()" | |
] | |
}, | |
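The docstring in the cell above lists `'pearson'` and `'spearman'` as methods. As a rough sketch of what those two coefficients measure, the plain-Python version below spells out the arithmetic on the first five rows of the A/B table. The helper names `pearson_r`, `average_ranks`, and `spearman_r` are hypothetical and not part of pandas, and the tie handling (average ranks) is an assumption about how ties are usually treated.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next value is tied with the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1  # mean of the 1-based positions in the run
        for i in order[start:end + 1]:
            ranks[i] = shared
        start = end + 1
    return ranks


def spearman_r(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson_r(average_ranks(xs), average_ranks(ys))


# First five rows of the A/B table read in earlier in the notebook
a = [1, 3, 16, 4, 5]
b = [3.8380, 10.2051, 29.4932, 14.0697, 15.7511]

r_pearson = pearson_r(a, b)
r_spearman = spearman_r(a, b)
print(f"pearson:  {r_pearson:.3f}")
print(f"spearman: {r_spearman:.3f}")
```

Pearson responds to how close the points are to a straight line, while Spearman only looks at whether the ordering of one variable matches the ordering of the other.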
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"plt.scatter(x=df_xy['X'], y=df_xy['Y']);" | |
] | |
}, | |
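The X/Y values plotted above appear to follow a U shape: Y is large at both ends of the X range and small near zero. A Pearson coefficient can be close to zero in that situation even though the two variables are clearly related. The sketch below illustrates the effect with synthetic data (`y = x ** 2` on a symmetric range), which is an assumption for illustration and not the notebook's data; `pearson_r` is a hypothetical helper.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Synthetic illustration: y is fully determined by x, but not linearly
xs = list(range(-5, 6))
ys = [x ** 2 for x in xs]
r_full = pearson_r(xs, ys)

# Keeping only x >= 0 leaves a monotonic relationship
right_x = [x for x in xs if x >= 0]
right_y = [x ** 2 for x in right_x]
r_right = pearson_r(right_x, right_y)

print(f"full range:  r = {r_full:.3f}")
print(f"x >= 0 only: r = {r_right:.3f}")
```

The left and right halves of the curve cancel each other out in the full-range calculation, which is why a scatter plot is worth checking alongside the coefficient.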
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"multicorr = StringIO(\"\"\"Feature\tHigh Speed (km)\tSquareness (%)\tEPI\tApplied OW Tension\tE Diam Change\n", | |
"C\t293.7\t84.4\t28\t20\t0.00\n", | |
"C\t276.0\t86.0\t28\t20\t0.00\n", | |
"F1\t255.0\t91.5\t24\t30\t0.00\n", | |
"F1\t270.0\t92.0\t24\t30\t0.00\n", | |
"F2\t240.0\t92.5\t22\t50\t-0.15\n", | |
"F2\t232.0\t94.5\t22\t50\t-0.15\n", | |
"F3\t285.3\t90.4\t24\t40\t-0.40\n", | |
"F3\t264.6\t92.5\t24\t40\t-0.40\n", | |
"F4\t267.0\t85.1\t20\t30\t-0.30\n", | |
"F4\t252.0\t86.4\t20\t30\t-0.30\n", | |
"F5\t274.5\t82.0\t22\t20\t0.00\n", | |
"F5\t285.0\t81.5\t22\t20\t0.00\n", | |
"\"\"\")\n", | |
"\n", | |
"import pandas as pd\n", | |
"df_multicorr = pd.read_table(multicorr)\n", | |
"df_multicorr" | |
] | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Calculate the correlation again, this time with more variables" | |
] | |
}, | |
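{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# keep only the numeric columns; recent pandas versions raise an error on the text column Feature\n", | |
"corr = df_multicorr.select_dtypes('number').corr()\n", | |
"corr" | |
] | |
}, | |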
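A correlation matrix with several variables can be hard to scan. One possible way to read it, sketched below in plain Python, is to compute one coefficient per pair of columns and list the pairs from strongest to weakest. The values are the numeric columns of the table above; `pearson_r` and the `pairs` list are hypothetical names used only for this illustration.

```python
from itertools import combinations
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Numeric columns of the table above (the text column Feature is left out)
columns = {
    "High Speed (km)": [293.7, 276.0, 255.0, 270.0, 240.0, 232.0,
                        285.3, 264.6, 267.0, 252.0, 274.5, 285.0],
    "Squareness (%)": [84.4, 86.0, 91.5, 92.0, 92.5, 94.5,
                       90.4, 92.5, 85.1, 86.4, 82.0, 81.5],
    "EPI": [28, 28, 24, 24, 22, 22, 24, 24, 20, 20, 22, 22],
    "Applied OW Tension": [20, 20, 30, 30, 50, 50, 40, 40, 30, 30, 20, 20],
    "E Diam Change": [0.00, 0.00, 0.00, 0.00, -0.15, -0.15,
                      -0.40, -0.40, -0.30, -0.30, 0.00, 0.00],
}

# One coefficient per unordered pair of columns
pairs = [
    (name_a, name_b, pearson_r(columns[name_a], columns[name_b]))
    for name_a, name_b in combinations(columns, 2)
]

# Strongest relationships first, regardless of sign
pairs.sort(key=lambda item: abs(item[2]), reverse=True)

for name_a, name_b, r in pairs:
    print(f"{r:+.3f}  {name_a} vs {name_b}")
```

Sorting by absolute value keeps strong negative relationships near the top of the list instead of hiding them at the bottom.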
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Correlation matrix as heatmap\n", | |
"\n", | |
"Note the strong positive correlation between Applied OW Tension and Squareness." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"import seaborn as sns\n", | |
"# correlate only the numeric columns so the text column Feature is ignored\n", | |
"sns.heatmap(\n", | |
" df_multicorr.select_dtypes('number').corr(),\n", | |
" cmap='Blues',\n", | |
" vmin=-1.0,\n", | |
" vmax=1.0,\n", | |
" );" | |
] | |
}, | |
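The heatmap shows the size of each coefficient, but with only 12 rows a large value could still arise by chance. The notebook does not say which significance test the module uses, so the sketch below swaps in a permutation test as one possible approach: shuffle one column many times and count how often the shuffled data produces a coefficient at least as extreme as the observed one. The helper names, the number of permutations, and the seed are all assumptions. A t-test on r, for example through `scipy.stats.pearsonr`, is a common alternative, although SciPy is not among the packages pinned for this notebook.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided p-value: share of shuffles whose |r| reaches the observed |r|."""
    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real pairing between xs and ys
        # Small tolerance so rounding does not hide shuffles that tie the observed value
        if abs(pearson_r(xs, shuffled)) >= observed - 1e-12:
            hits += 1
    # Add one to both counts so the estimate is never exactly zero
    return (hits + 1) / (n_permutations + 1)


# Applied OW Tension and Squareness (%) columns from the table above
tension = [20, 20, 30, 30, 50, 50, 40, 40, 30, 30, 20, 20]
squareness = [84.4, 86.0, 91.5, 92.0, 92.5, 94.5,
              90.4, 92.5, 85.1, 86.4, 82.0, 81.5]

r_observed = pearson_r(tension, squareness)
p_value = permutation_p_value(tension, squareness)
print(f"r = {r_observed:.3f}, permutation p-value = {p_value:.4f}")
```

A small p-value means that random pairings of the two columns rarely produce a coefficient this large, which supports treating the correlation as more than noise.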
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"# Crossplot each variable against the others\n", | |
"\n", | |
"This is a more graphical way to interpret correlation.\n", | |
"\n", | |
"Again, look at the plot of Applied OW Tension against Squareness." | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# https://seaborn.pydata.org/generated/seaborn.pairplot.html\n", | |
"sns.pairplot(df_multicorr);" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"execution_count": null, | |
"metadata": {}, | |
"outputs": [], | |
"source": [ | |
"# Another slightly nicer version\n", | |
"# https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n", | |
"\n", | |
"import numpy as np\n", | |
"\n", | |
"# Set up the matplotlib figure\n", | |
"f, ax = plt.subplots(figsize=(11, 9))\n", | |
"\n", | |
"# Generate a mask for the upper triangle\n", | |
"# (plain bool works on every NumPy version; the np.bool alias was removed in NumPy 1.24)\n", | |
"mask = np.zeros_like(corr, dtype=bool)\n", | |
"mask[np.triu_indices_from(mask)] = True\n", | |
"\n", | |
"# Generate a custom diverging colormap\n", | |
"cmap = sns.diverging_palette(220, 10, as_cmap=True)\n", | |
"\n", | |
"# Draw the heatmap, hiding the masked upper triangle\n", | |
"sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,\n", | |
" square=True, linewidths=.5, cbar_kws={\"shrink\": .5});" | |
] | |
} | |
], | |
"metadata": { | |
"kernelspec": { | |
"display_name": "Python 3", | |
"language": "python", | |
"name": "python3" | |
}, | |
"language_info": { | |
"codemirror_mode": { | |
"name": "ipython", | |
"version": 3 | |
}, | |
"file_extension": ".py", | |
"mimetype": "text/x-python", | |
"name": "python", | |
"nbconvert_exporter": "python", | |
"pygments_lexer": "ipython3", | |
"version": "3.6.9" | |
} | |
}, | |
"nbformat": 4, | |
"nbformat_minor": 4 | |
} |
matplotlib==3.1.1 | |
numpy==1.16.4 | |
pandas==0.25.0 | |
seaborn==0.9.0 | |
jupyterlab==1.0.2 |
python-3.6 |