@y2kbugger
Last active October 15, 2019 22:30
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# REG200 - Regression 200 Homework example"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"By the end of this module, the participant will be able to:\n",
"- Measure the strength of correlation between two variables\n",
"- Determine if a correlation coefficient is statistically significant\n",
"- Measure the correlation between multiple variables\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## How to use this notebook\n",
"\n",
"Try running the cell below:\n",
"\n",
"1. Click into the code portion of the cell near `3 + 8`\n",
"2. Press Shift and Enter at the same time to run the cell and move to the next.\n",
"3. Press Shift+Enter two more times to set the value of a and print a message.\n",
"\n",
"The last value in a cell is shown as output."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"3 + 8"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a = 4"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(f\"I have {a} chicken wings.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Repeatedly run a cell\n",
"\n",
"Click into the cell below and repeatedly press Ctrl+Enter."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"a += 2\n",
"print(f\"I have {a} chicken wings.\")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## For the rest of the notebook, run all of the code cells\n",
"\n",
"You can edit values or code and rerun, but I recommend going straight through all of the cells once first."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import libraries for using data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# for spreadsheets that we call DataFrames\n",
"import pandas as pd\n",
"\n",
"# for plotting\n",
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Read in the data"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# emulate a tab-separated .txt file with an in-memory string; typically you just pass a filename to pandas\n",
"from io import StringIO\n",
"\n",
"ab = StringIO(\n",
"\"\"\"A\tB\n",
"1\t3.8380\n",
"3\t10.2051\n",
"16\t29.4932\n",
"4\t14.0697\n",
"5\t15.7511\n",
"19\t44.7980\n",
"6\t14.2803\n",
"4\t16.9219\n",
"7\t15.8849\n",
"3\t10.5358\n",
"22\t53.2528\n",
"7\t22.3758\n",
"2\t15.8896\n",
"1\t-2.1355\n",
"4\t21.5362\n",
"6\t25.0148\n",
"12\t25.1035\n",
"4\t3.2217\n",
"23\t55.7904\n",
"16\t29.5169\n",
"14\t41.6494\n",
"13\t37.2478\n",
"19\t41.1936\n",
"18\t53.9627\n",
"21\t63.9905\n",
"24\t51.5585\n",
"17\t34.6849\n",
"5\t8.4110\n",
"10\t24.9118\n",
"14\t32.9547\"\"\")\n",
"\n",
"df_ab = pd.read_table(ab)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Show the first 4 rows of your dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_ab.head(4)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(x=df_ab['A'], y=df_ab['B'])\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"xy = StringIO(\n",
"\"\"\"X\tY\n",
"-12.3355\t91.6910\n",
"-10.5935\t56.0191\n",
"-8.9611\t50.3329\n",
"-9.1287\t34.4467\n",
"-7.8851\t38.4195\n",
"-7.5763\t25.3774\n",
"-5.4090\t9.4484\n",
"-4.2056\t13.7673\n",
"-3.8530\t16.9343\n",
"-2.9034\t3.7421\n",
"-1.1896\t19.8859\n",
"-1.9951\t-2.7608\n",
"-0.9127\t-5.8631\n",
"-0.6323\t-0.3808\n",
"2.3055\t2.6454\n",
"2.8040\t9.6105\n",
"5.6442\t18.5256\n",
"5.5861\t9.7630\n",
"4.6719\t20.6125\n",
"5.9124\t38.4381\n",
"8.8544\t56.0579\n",
"8.2864\t46.5456\n",
"10.4703\t49.9356\n",
"11.5479\t86.3035\n",
"11.5704\t79.8768\n",
"-13.4738\t83.3070\n",
"-11.6391\t57.1737\n",
"-10.9981\t52.5305\n",
"-9.6959\t39.9723\n",
"-5.8497\t39.7582\n",
"-7.2250\t23.0322\n",
"-4.5866\t22.4866\n",
"-3.5534\t-0.6284\n",
"-2.8923\t21.9397\n",
"-0.9984\t18.4569\n",
"-1.7183\t21.7793\n",
"-0.4028\t4.4911\n",
"0.7777\t-6.4068\n",
"1.9729\t-2.6752\n",
"2.3569\t18.0500\n",
"1.4580\t10.1399\n",
"4.9599\t0.4283\n",
"6.7307\t0.3959\n",
"5.8546\t18.0307\n",
"7.8785\t30.2044\n",
"6.4566\t34.3580\n",
"9.3001\t60.2244\n",
"8.8975\t62.0833\n",
"11.2928\t56.0161\n",
"12.4169\t88.8742\n",
"-13.7014\t65.4625\n",
"-10.1969\t72.7266\n",
"-9.8106\t65.2517\n",
"-8.5776\t43.2419\n",
"-7.5841\t35.3953\n",
"-9.4148\t39.6577\n",
"-5.8556\t19.6736\n",
"-4.6215\t24.8778\n",
"-4.3202\t13.1955\n",
"-4.5850\t-5.1053\n",
"-2.0614\t10.0775\n",
"-0.7651\t-3.1292\n",
"0.0748\t-11.3666\n",
"0.3974\t18.5064\n",
"1.5747\t13.3974\n",
"2.7116\t25.6631\n",
"4.8122\t11.5316\n",
"4.1023\t27.5848\n",
"6.2293\t39.5805\n",
"7.5989\t34.2157\n",
"7.4117\t50.3946\n",
"9.1963\t38.8196\n",
"9.5195\t56.1779\n",
"12.1518\t68.0794\n",
"11.6490\t69.7307\"\"\")\n",
"df_xy = pd.read_table(xy)\n",
"df_xy.head(5)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Get some basic summary statistics about each variable"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df_xy.describe()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate the correlation between each pair of variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Pearson correlation matrix\n",
"#\n",
"# DataFrame.corr(method='pearson') computes the pairwise correlation of columns.\n",
"# method : {'pearson', 'kendall', 'spearman'}\n",
"df_xy.corr()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(x=df_xy['X'], y=df_xy['Y']);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"multicorr = StringIO(\"\"\"Feature\tHigh Speed (km)\tSquareness (%)\tEPI\tApplied OW Tension\tE Diam Change\n",
"C\t293.7\t84.4\t28\t20\t0.00\n",
"C\t276.0\t86.0\t28\t20\t0.00\n",
"F1\t255.0\t91.5\t24\t30\t0.00\n",
"F1\t270.0\t92.0\t24\t30\t0.00\n",
"F2\t240.0\t92.5\t22\t50\t-0.15\n",
"F2\t232.0\t94.5\t22\t50\t-0.15\n",
"F3\t285.3\t90.4\t24\t40\t-0.40\n",
"F3\t264.6\t92.5\t24\t40\t-0.40\n",
"F4\t267.0\t85.1\t20\t30\t-0.30\n",
"F4\t252.0\t86.4\t20\t30\t-0.30\n",
"F5\t274.5\t82.0\t22\t20\t0.00\n",
"F5\t285.0\t81.5\t22\t20\t0.00\n",
"\"\"\")\n",
"\n",
"import pandas as pd\n",
"df_multicorr = pd.read_table(multicorr)\n",
"df_multicorr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Calculate the correlation again, this time with more variables"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"corr = df_multicorr.corr()\n",
"corr"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Correlation matrix as heatmap\n",
"\n",
"Note the strong positive correlation between Applied OW Tension and Squareness."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"sns.heatmap(\n",
"    df_multicorr.corr(),\n",
"    cmap='Blues',\n",
"    vmin=-1.0,\n",
"    vmax=1.0,\n",
");"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Crossplot each variable against the others\n",
"\n",
"This is a more graphical way to interpret correlation.\n",
"\n",
"Again, check out the plot of Applied OW Tension against Squareness."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# https://seaborn.pydata.org/generated/seaborn.pairplot.html\n",
"sns.pairplot(df_multicorr);"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Another slightly nicer version\n",
"# https://seaborn.pydata.org/examples/many_pairwise_correlations.html\n",
"\n",
"import numpy as np\n",
"\n",
"# Set up the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(11, 9))\n",
"\n",
"# Generate a mask for the upper triangle (the builtin bool works across NumPy versions)\n",
"mask = np.zeros_like(corr, dtype=bool)\n",
"mask[np.triu_indices_from(mask)] = True\n",
"\n",
"# Generate a custom diverging colormap\n",
"cmap = sns.diverging_palette(220, 10, as_cmap=True)\n",
"sns.heatmap(corr, mask=mask, cmap=cmap, vmin=-1, vmax=1, center=0,\n",
"            square=True, linewidths=.5, cbar_kws={\"shrink\": .5});"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
matplotlib==3.1.1
numpy==1.16.4
pandas==0.25.0
seaborn==0.9.0
jupyterlab==1.0.2