{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Effective Data Visualization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## PyCon 2021"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Husni Almoubayyed [https://husni.space]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Intro on packages: \n",
"\n",
"- **Matplotlib and Seaborn**: The main plotting package in Python is matplotlib. Seaborn is another package that builds on top of it. We will use Seaborn when possible, as it makes most things much easier and produces plots with sensible defaults in significantly fewer lines of code. We will still use matplotlib for some things, and it is important to understand that every time Seaborn creates a plot it is calling matplotlib in the background (it also sometimes calls other libraries, such as statsmodels, for statistical calculations).\n",
"\n",
"Matplotlib and Seaborn calls usually look like plt.*typeofgraph*(arguments) or sns.*typeofgraph*(arguments) (see the short example cell at the end of this introduction), \n",
"where the arguments are usually X and Y coordinates (or the names of X and Y columns in a dataframe), colors, sizes, etc.\n",
"\n",
"- **Pandas** is a library that handles [Pan]el [Da]ta; essentially, it lets us manipulate tabular data much more easily. \n",
"\n",
"- **Numpy** is a Python library that contains all the standard numerical operations you might want to perform.\n",
"\n",
"- **Scikit-Learn (sklearn)** is a widely used library that covers most common non-deep machine learning methods.\n",
"\n",
"## Intro to datasets: \n",
"We will use a few standard datasets throughout this tutorial. These can be imported from seaborn as will be shown later:\n",
" - **diamonds**: data on diamonds with prices, carats, color, clarity, cut, etc.\n",
" - **flights**: the number of airline passengers in each month, for each of a few years around the 1950s\n",
" - **iris**: famous biology dataset that quantifies the morphologic variation of Iris flowers of three related species\n",
" - **titanic**: data on Titanic passengers, including survival, age, ticket fare paid, etc.\n",
" - **anscombe**: composed of four different datasets that share the same first and second moments but look dramatically different when plotted\n",
" - **digits**: images of handwritten digits, used widely in machine learning\n",
" \n",
"Other datasets that are not directly imported from seaborn:\n",
" - **financial data**: this will be requested in real time from Yahoo Finance using pandas-datareader.\n",
" - **CoViD-19 data**: https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_confirmed_usafacts.csv from usafacts.org\n",
"\n",
"\n"
]
},
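{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example, a minimal sketch of this call pattern (it assumes the packages are already installed; the installation cell below takes care of that):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a minimal sketch of the plt./sns. call pattern described above\n",
"import numpy as np\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"\n",
"x = np.linspace(0, 10, 50)\n",
"y = np.sin(x)\n",
"\n",
"plt.plot(x, y)          # matplotlib: plt.typeofgraph(arguments)\n",
"sns.lineplot(x=x, y=y)  # seaborn: sns.typeofgraph(arguments), calling matplotlib underneath"
]
},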
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Installation Instructions"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"uncomment and run the following (if you have not done so already)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#!pip install --upgrade matplotlib numpy scipy scikit-learn seaborn plotly statsmodels plotly-geo pandas_datareader geopandas shapely pyshp pandas"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You might need to restart the kernel at this point to use any newly installed packages"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Alternatively, you can go to https://bit.ly/PyConViz20 to use a Colab hosted version of this notebook. If you're looking for the Colab-hosted Solutions version of the notebook, you can find it at https://bit.ly/PyConViz20Full"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import numpy and matplotlib and set up inline plotting by running:\n",
"%pylab inline"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import seaborn as sns\n",
"import pandas as pd\n",
"sns.set_style('darkgrid')\n",
"sns.set_context('notebook', font_scale=1.5)\n",
"sns.set_palette('colorblind')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# switch the inline figure format to a higher-resolution option (useful on high-DPI/Retina displays):\n",
"%config InlineBackend.figure_format = 'retina'\n",
"\n",
"# set larger figure size for the rest of this notebook\n",
"matplotlib.rcParams['figure.figsize'] = 12, 8"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 1. Density Estimation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Often when we make plots, we are trying to estimate the underlying distribution from which the data were randomly drawn; this is known as density estimation in statistics. The simplest density estimator that does not make particular assumptions about the distribution of the data (we call this nonparametric) is the histogram."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Histograms"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# import our first dataset, an example from biology\n",
"iris = sns.load_dataset('iris')\n",
"iris.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"data = iris['sepal_length']\n",
"plt.hist(data)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = iris['sepal_length']\n",
"f, (ax1, ax2) = plt.subplots(1, 2)\n",
"ax1.hist(data, bins=5, density=True);\n",
"ax2.hist(data, bins=100, density=True);"
]
},
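{
"cell_type": "markdown",
"metadata": {},
"source": [
"How many bins should we use? A common rule of thumb (Scott's rule) sets the bin width to roughly 3.5 times the sample standard deviation times $n^{-1/3}$; the derivation below shows where that scaling comes from. A minimal sketch, reusing the `data` variable from above:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: a rule-of-thumb bin width that shrinks like n**(-1/3) (Scott's rule)\n",
"n = len(data)\n",
"h = 3.5 * np.std(data) * n ** (-1 / 3)\n",
"print('rule-of-thumb bin width:', h)\n",
"\n",
"# numpy implements the same rule directly\n",
"edges = np.histogram_bin_edges(data, bins='scott')\n",
"plt.hist(data, bins=edges, density=True);"
]
},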
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Formally, the histogram estimator is $$ \\hat{p}(x) = \\frac{\\hat{\\theta}_j}{h} $$ where $$ \\hat{\\theta}_j = \\frac{1}{n} \\sum_{i=1}^n I(X_i \\in B_j ) $$\n",
"\n",
"We can calculate the mean squared error, a metric that tells us how good our estimator is; it turns out to be: $$MSE(x) = bias^2(x) + Var(x) = Ch^2 + \\frac{C}{nh} $$\n",
"minimized by choosing $h = (\\frac{C}{n})^{1/3}$, resulting in a risk (the expected value of the MSE) of:\n",
"$$ R = \\mathcal{O}(\\frac{1}{n})^{2/3}$$\n",
"This means that \n",
" - There is a bias-variance tradeoff when choosing the bin width: a smaller width ($h$) means less bias but more variance, and there is no choice of $h$ that optimizes both.\n",
" - The risk goes down at a fairly slow rate as the number of datapoints increases, which raises the question: is there a better estimator that converges more quickly? The answer is yes, and it is achieved by:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Kernel Density Estimation"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Kernels follow the conditions:\n",
"$$ K(x) \\geq 0, \\int K(x) dx = 1, \\int x K(x) dx = 0$$"
]
},
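{
"cell_type": "markdown",
"metadata": {},
"source": [
"To make this concrete, here is a hand-rolled KDE with a Gaussian kernel, a minimal sketch reusing the `data` variable from above; the seaborn call in the next cell does the same job with many more conveniences:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: KDE by hand, p_hat(x) = (1 / (n*h)) * sum_i K((x - X_i) / h) with a Gaussian K\n",
"def gaussian_kde_by_hand(x_grid, samples, h):\n",
"    u = (x_grid[:, None] - np.asarray(samples)[None, :]) / h\n",
"    K = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)\n",
"    return K.sum(axis=1) / (len(samples) * h)\n",
"\n",
"x_grid = np.linspace(data.min() - 1, data.max() + 1, 200)\n",
"plt.plot(x_grid, gaussian_kde_by_hand(x_grid, data, h=0.3));"
]
},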
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.kdeplot(x=data, bw_method=0.3);"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So how is this better than the histogram?\n",
"We can again calculate the MSE, which turns out to be:\n",
"\n",
"$$MSE(x) = bias^2(x) + Var(x) = C_1h^4 + \\frac{C_2}{nh}$$\n",
"\n",
"minimized by choosing $ h = (\\frac{C_2}{4nC_1})^{1/5} $, giving a risk of:\n",
"\n",
"$$ R_{KDE} = \\mathcal{O}(\\frac{1}{n})^{4/5} < R_{histogram}$$\n",
"\n",
"This still has a bias-variance tradeoff, but the estimator converges faster than in the case of histograms. Can we do even better? The answer is no: minimax lower bounds in nonparametric statistics show that, for this problem, no estimator can converge at a faster rate."
]
},
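{
"cell_type": "markdown",
"metadata": {},
"source": [
"The bandwidth plays the same role as the bin width, with the same bias-variance tradeoff; a quick comparison (again reusing `data` from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: small bandwidth -> wiggly (high variance), large bandwidth -> oversmoothed (high bias)\n",
"for bw in [0.1, 0.3, 1.0]:\n",
"    sns.kdeplot(x=data, bw_method=bw, label=f'bw_method={bw}')\n",
"plt.legend();"
]
},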
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise**: Instead of using just one variable, consider a 2D distribution with the two axes being petal length and petal width. Plot the raw distribution, the histogram of the distribution, and the KDE of the distribution. Make sure you play around with the number of bins and the bandwidth to get a reasonably satisfying plot."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
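{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach to the exercise above (a sketch; many variations work):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one possible approach: the raw points, a 2D histogram, and a 2D KDE side by side\n",
"f, (ax1, ax2, ax3) = plt.subplots(1, 3, figsize=(18, 6))\n",
"ax1.scatter(iris['petal_length'], iris['petal_width'], s=10)\n",
"sns.histplot(data=iris, x='petal_length', y='petal_width', bins=20, ax=ax2)\n",
"sns.kdeplot(data=iris, x='petal_length', y='petal_width', bw_adjust=0.8, ax=ax3)"
]
},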
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.displot(x=iris['petal_length'], kde=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 2. Data Exploration"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Anscombe's quartet: four datasets with nearly identical summary statistics but very different shapes\n",
"anscombe = sns.load_dataset('anscombe')\n",
"for dataset in ['I','II','III','IV']:\n",
" print(anscombe[anscombe['dataset']==dataset].describe())"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.lmplot(x='x', y='y', data=anscombe[anscombe['dataset']=='I'], height=8)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.lmplot(x='x', y='y', data=anscombe[anscombe['dataset']=='II'], height=8, order=2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.lmplot(x='x', y='y', data=anscombe[anscombe['dataset']=='III'], height=8, robust=True, ci=None)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"iris = sns.load_dataset('iris')\n",
"iris.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.scatterplot(x='petal_length', y='petal_width', hue='species', data=iris)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** On the same plot, fit 3 linear models for the 3 different iris species, sharing the same x and y axes."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
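{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch): `lmplot` fits a separate regression for each hue level."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one possible approach: hue='species' fits one linear model per species on shared axes\n",
"sns.lmplot(x='petal_length', y='petal_width', hue='species', data=iris, height=8)"
]
},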
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.jointplot(x='petal_length', y='petal_width', data=iris, height=8, kind='kde')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.pairplot(iris, height=8, hue='species')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How about categorical data? \n",
"\n",
"We can make boxplots and violin plots simply by running:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.catplot(y='petal_width', x='species',\n",
" kind='violin', data=iris, height=8)"
]
},
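{
"cell_type": "markdown",
"metadata": {},
"source": [
"And the boxplot version of the same comparison:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the same categorical comparison as a boxplot instead of a violin plot\n",
"sns.catplot(y='petal_width', x='species',\n",
"            kind='box', data=iris, height=8)"
]
},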
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Load up the flights dataset, and plot a linear model of the number of passengers as a function of year, one for each month of the year."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = sns.load_dataset('flights')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
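{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one possible approach: one linear fit per month, passengers as a function of year\n",
"sns.lmplot(x='year', y='passengers', hue='month', data=df, height=8)"
]
},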
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Load up the diamonds dataset from seaborn. Plot the price as a function of carat, with different color grades colored differently. Choose a small marker size and set the transparency (alpha argument) to a value smaller than 1. Add some jitter to the x values to make them clearer."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = sns.load_dataset('diamonds')\n",
"df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.lmplot(x='carat', y='price', hue='color', \n",
" data=df, height=20, markers='.', \n",
" x_jitter=0.2, scatter_kws={'alpha':0.3})\n",
"plt.xlim((0, 4))\n",
"plt.ylim((2e2, 2e4))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Load up the Titanic dataset from seaborn. Make a boxplot of the fare of the ticket paid against whether a person survived or not."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"df = sns.load_dataset('titanic')\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
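{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one possible approach: fare distribution split by survival (0 = did not survive, 1 = survived)\n",
"sns.catplot(x='survived', y='fare', kind='box', data=df, height=8)"
]
},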
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Polar coordinates"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.quiver??"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# random positions (X, Y) with constant unit direction vectors (U, V)\n",
"X = np.random.uniform(0, 10, 100)\n",
"Y = np.random.uniform(0, 1, 100)\n",
"\n",
"U = np.ones_like(X)\n",
"V = np.ones_like(Y)\n",
"\n",
"f = plt.figure()\n",
"ax = f.add_subplot(111)\n",
"# headlength=0 and headaxislength=0 turn the arrows into plain sticks\n",
"ax.quiver(X, Y, U, V, headlength=0, headaxislength=0, color='steelblue')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"theta = np.linspace(0, 2*np.pi, 100)\n",
"r = np.linspace(0, 1, 100)\n",
"\n",
"# radial and tangential components of each stick\n",
"dr = 1\n",
"dt = 0\n",
"\n",
"# rotate the (dr, dt) components at angle theta into Cartesian (U, V) for quiver\n",
"U = dr * cos(theta) - dt * sin(theta)\n",
"V = dr * sin(theta) + dt * cos(theta)\n",
"\n",
"f = plt.figure()\n",
"ax = f.add_subplot(111, polar=True)\n",
"ax.quiver(theta, r, U, V, headlength=0, headaxislength=0, color='steelblue')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"theta = np.linspace(0,2*np.pi,100)\n",
"r = np.random.uniform(0, 1, 100)\n",
"\n",
"dr = 1\n",
"dt = 0\n",
"\n",
"\n",
"U = dr * cos(theta) - dt * sin (theta)\n",
"V = dr * sin(theta) + dt * cos(theta)\n",
"\n",
"\n",
"f = plt.figure()\n",
"ax = f.add_subplot(111, polar=True)\n",
"ax.quiver(theta, r, U, V, headlength=0, headaxislength=0, color='steelblue')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 1:** Make a radial plot with all sticks starting at a radius of 1."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 2:** Make a radial plot where all the sticks are horizontal."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise 3:** Use a 'mollweide' projection using the projection argument of add_subplot(). Use horizontal sticks now but make sure your sticks span the entire space. "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 3. Visualizing High Dimensional Datasets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import load_digits\n",
"digits = load_digits()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# 1797 samples, each an 8x8 image flattened into a 64-dimensional vector\n",
"digits['data'].shape"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.imshow(digits['images'][0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Principal Component Analysis"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# a toy 2D dataset where all the points lie on the line y = x\n",
"x = np.linspace(0, 9, 10)\n",
"y = np.linspace(0, 9, 10)\n",
"sns.scatterplot(x=x, y=y)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"PCA computes the linear projections of greatest variance from the top eigenvectors of the data covariance matrix\n"
]
},
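{
"cell_type": "markdown",
"metadata": {},
"source": [
"To connect this to the math, a minimal sketch (using the toy `x`, `y` data from above) that recovers the direction of greatest variance straight from the eigenvectors of the covariance matrix:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# sketch: the top principal component is the top eigenvector of the data covariance matrix\n",
"X2d = np.column_stack([x, y])                # each row is one 2D point\n",
"X2d_centered = X2d - X2d.mean(axis=0)\n",
"cov = np.cov(X2d_centered, rowvar=False)     # 2x2 covariance matrix\n",
"eigvals, eigvecs = np.linalg.eigh(cov)       # eigh returns eigenvalues in ascending order\n",
"print('direction of greatest variance:', eigvecs[:, -1])\n",
"print('fraction of variance explained:', eigvals[-1] / eigvals.sum())"
]
},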
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.decomposition import PCA"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(1)\n",
"# stack x and y as columns so each row is one 2D point, then project onto the top component\n",
"principal_components = pca.fit_transform(np.array([x, y]).T)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# the points now live along a single (principal) axis\n",
"sns.scatterplot(x=principal_components[:, 0], y=[0]*len(principal_components))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Check out some more cool visualizations of PCA at https://setosa.io/ev/principal-component-analysis/ and read more about the math and applications at https://www.cs.cmu.edu/~bapoczos/other_presentations/PCA_24_10_2009.pdf "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(2)\n",
"projected = pca.fit_transform(digits['data'])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(projected[:, 0], projected[:, 1],\n",
" c=digits.target, edgecolor='none', alpha=0.5,\n",
" cmap=plt.cm.get_cmap('Paired', 10))\n",
"plt.colorbar()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.datasets import make_swiss_roll\n",
"X, t = make_swiss_roll(1000, noise=0.05)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import mpl_toolkits.mplot3d as p3\n",
"from sklearn.cluster import AgglomerativeClustering\n",
"from sklearn.neighbors import kneighbors_graph\n",
"ward = AgglomerativeClustering(n_clusters=5, \n",
" connectivity=kneighbors_graph(X, n_neighbors=5, include_self=False),\n",
" linkage='ward').fit(X)\n",
"labels = ward.labels_"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = plt.figure()\n",
"ax = p3.Axes3D(fig)\n",
"ax.view_init(7, -80)\n",
"for label in np.unique(labels):\n",
" ax.scatter(X[labels == label, 0], X[labels == label, 1], X[labels == label, 2])"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pca = PCA(2)\n",
"projected = pca.fit_transform(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"for label in np.unique(labels):\n",
" sns.scatterplot(x=projected[labels == label, 0], y=projected[labels == label, 1],\n",
" color=plt.cm.jet(float(label) / np.max(labels + 1)), marker='.')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## t-Distributed Stochastic Neighbor Embedding\n",
"\n",
"t-SNE converts similarities between data points into joint probabilities and minimizes the Kullback-Leibler (KL) divergence between the joint probabilities of the low-dimensional embedding and of the high-dimensional data.\n",
"Concretely, it first constructs a probability distribution over pairs of high-dimensional objects such that similar objects have a high probability of being picked while dissimilar points have an extremely small probability of being picked. It then defines a similar probability distribution over the points in the low-dimensional map, and minimizes the KL divergence between the two distributions with respect to the locations of the points in the map.\n",
"\n",
"For more details on t-SNE, check out the original paper http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.manifold import TSNE\n",
"tSNE = TSNE(learning_rate=10,\n",
" perplexity=30)\n",
"projected = tSNE.fit_transform(X)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(projected[:, 0], projected[:, 1],\n",
" c=labels, alpha=0.3,\n",
" cmap=plt.cm.get_cmap('Paired', 5))\n",
"#plt.colorbar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** Plot the result of a t-SNE model fit to digits['data']."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
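{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch; t-SNE on all 1797 digits can take a little while to run):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one possible approach: embed the 64-dimensional digits in 2D and color by the true digit\n",
"digits_projected = TSNE(learning_rate=10, perplexity=30).fit_transform(digits['data'])\n",
"plt.scatter(digits_projected[:, 0], digits_projected[:, 1],\n",
"            c=digits.target, alpha=0.5, cmap=plt.cm.get_cmap('Paired', 10))\n",
"plt.colorbar()"
]
},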
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4. Interactive Visualization\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas_datareader.data as web\n",
"import datetime\n",
"import plotly.figure_factory as ff\n",
"import plotly.graph_objs as go\n",
"import plotly.express as px"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"start = datetime.datetime(2008, 1, 1)\n",
"end = datetime.datetime(2018, 1, 1)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# This fetches the price history of SPY (an S&P 500 ETF) for the dates we selected, from Yahoo Finance.\n",
"spy_df = web.DataReader('SPY', 'yahoo', start, end).reset_index()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"spy_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"data = go.Scatter(x=spy_df.Date, y=spy_df.Close)\n",
"\n",
"go.Figure(data)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** A candlestick chart is a powerful chart in finance that shows the opening price, closing price, highest price, and lowest price of each trading day. Create a candlestick chart of the first 90 days of the data. You can find Candlestick in the 'go' module."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
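{
"cell_type": "markdown",
"metadata": {},
"source": [
"One possible approach (a sketch, using the column names returned by the Yahoo reader above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# one possible approach: a candlestick chart of the first 90 trading days\n",
"first_90 = spy_df.iloc[:90]\n",
"candles = go.Candlestick(x=first_90.Date,\n",
"                         open=first_90.Open, high=first_90.High,\n",
"                         low=first_90.Low, close=first_90.Close)\n",
"go.Figure(candles)"
]
},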
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"aapl_df = web.DataReader('AAPL', 'yahoo', start, end).reset_index()\n",
"data = [go.Scatter(x=spy_df.Date, y=spy_df.Close, name='SPY'),\n",
" go.Scatter(x=aapl_df.Date, y=aapl_df.Close, name='AAPL')]\n",
"\n",
"\n",
"layout = dict(title='Yahoo Finance Data')\n",
"\n",
"go.Figure({'data':data, 'layout':layout})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"aapl_df = web.DataReader('AAPL', 'yahoo', start, end).reset_index()\n",
"data = [go.Scatter(x=spy_df.Date, y=spy_df.Close, name='SPY'),\n",
" go.Scatter(x=aapl_df.Date, y=aapl_df.Close, name='AAPL') ]\n",
"\n",
"\n",
"# updatemenus defines buttons and what they do\n",
"updatemenus = list([\n",
" dict(active = -1, # initialize with no buttons active\n",
" buttons = list([ # buttons is a list of dictionaries \n",
" dict(label = 'SPY',\n",
" method = 'update',\n",
" args = [{'visible': [True, False]},\n",
" {'title': 'S&P 500'}]),\n",
" dict(label = 'AAPL',\n",
" method = 'update',\n",
" args = [{'visible': [False, True]},\n",
" {'title': 'Apple'}]),\n",
" dict(label = 'SPY + AAPL',\n",
" method = 'update',\n",
" args = [{'visible': [True, True]},\n",
" {'title': 'Apple vs S&P 500'}])\n",
" ]),\n",
" )\n",
"])\n",
"\n",
"# layout here is not optional if you would like the buttons to show up\n",
"layout = dict(title='Yahoo Finance', updatemenus=updatemenus)\n",
"\n",
"fig = dict(data=data, layout=layout)\n",
"go.Figure(fig)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**Exercise:** It's hard to compare AAPL to SPY when viewed as is. Can you plot this again in a way that makes the returns of AAPL more easily comparable to the returns of the benchmark SPY?"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"aapl_df = web.DataReader('AAPL', 'yahoo', start, end).reset_index()\n",
"data = [go.Scatter(x=spy_df.Date, y=spy_df.Close/spy_df.iloc[0]['Close'], name='SPY'),\n",
" go.Scatter(x=aapl_df.Date, y=aapl_df.Close/aapl_df.iloc[0]['Close'], name='AAPL') ]\n",
"\n",
"\n",
"# updatemenus tells plotly what to show and what to hide when a button is pressed\n",
"updatemenus = list([\n",
" dict(active = -1, # initialize with no buttons active\n",
" buttons = list([ # buttons is a list of dictionaries \n",
" dict(label = 'SPY',\n",
" method = 'update',\n",
" args = [{'visible': [True, False]},\n",
" {'title': 'S&P 500'}]),\n",
" dict(label = 'AAPL',\n",
" method = 'update',\n",
" args = [{'visible': [False, True]},\n",
" {'title': 'Apple'}]),\n",
" dict(label = 'Both',\n",
" method = 'update',\n",
" args = [{'visible': [True, True]},\n",
" {'title': 'Apple vs S&P 500'}])\n",
" ]),\n",
" )\n",
"])\n",
"\n",
"# layout here is not optional if you would like the buttons to show up\n",
"layout = dict(title='Yahoo Finance', showlegend=True,\n",
" updatemenus=updatemenus)\n",
"\n",
"fig = dict(data=data, layout=layout)\n",
"go.Figure(fig)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"!wget https://usafactsstatic.blob.core.windows.net/public/data/covid-19/covid_confirmed_usafacts.csv"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"covidf = pd.read_csv('covid_confirmed_usafacts.csv', \n",
" dtype={\"countyFIPS\": str})"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"covidf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# ColorBrewer Blues: used as the colorscale and, via its length, to set the number of bins\n",
"colorscale = [\"#f7fbff\",\"#deebf7\",\"#c6dbef\",\"#9ecae1\",\n",
" \"#6baed6\",\"#4292c6\",\"#2171b5\",\"#08519c\",\"#08306b\"]\n",
"endpts = list(np.logspace(1, 5, len(colorscale) - 1))\n",
"\n",
"fig = ff.create_choropleth(\n",
" fips=covidf['countyFIPS'], values=covidf['2021-04-09'],\n",
" binning_endpoints=endpts, colorscale=colorscale,\n",
" title_text = 'CoViD-19 Confirmed cases',\n",
" legend_title = '# of cases'\n",
")\n",
"go.Figure(fig)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# keep only the date columns and difference them to get new daily cases per county\n",
"newdf = covidf[covidf.columns.difference(['countyFIPS', 'County Name', 'State', 'StateFIPS'], \n",
" sort=False)].diff(axis=1).drop(['2020-01-22'], axis=1)\n",
"# aggregate the daily counts up to the state level\n",
"newdf['State'] = covidf['State']\n",
"newdf = newdf.groupby('State').sum()\n",
"newdf = newdf.reset_index()\n",
"# melt into long format: one row per (State, Date) pair, ready for plotly express\n",
"newdf = pd.melt(newdf, id_vars='State', var_name='Date', value_name='Cases')\n",
"newdf.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"fig = px.scatter(newdf, x=\"State\", y=\"Cases\", animation_frame=\"Date\", range_y=[0, 50000],\n",
" color=newdf[\"Cases\"], color_continuous_scale=px.colors.sequential.Reds)\n",
"fig.layout.updatemenus[0].buttons[0].args[1][\"frame\"][\"duration\"] = 100\n",
"fig.update(layout_coloraxis_showscale=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Many more types of plotly charts are available, with examples, at https://plotly.com/python/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 5. Effective Communication through Plotting "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# a simple linear ramp (10 rows x 100 columns); note how 'jet' adds artificial-looking bands\n",
"image = [[i for i in range(100)]]*10\n",
"sns.heatmap(image, cmap='jet', square=True)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Color"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# code snippet from Jake Vandeplas https://jakevdp.github.io/blog/2014/10/16/how-bad-is-your-colormap/\n",
"def grayify_cmap(cmap):\n",
" \"\"\"Return a grayscale version of the colormap\"\"\"\n",
" cmap = plt.cm.get_cmap(cmap)\n",
" colors = cmap(np.arange(cmap.N))\n",
" \n",
" # convert RGBA to perceived greyscale luminance\n",
" # cf. http://alienryderflex.com/hsp.html\n",
" RGB_weight = [0.299, 0.587, 0.114]\n",
" luminance = np.sqrt(np.dot(colors[:, :3] ** 2, RGB_weight))\n",
" colors[:, :3] = luminance[:, np.newaxis]\n",
" \n",
" return cmap.from_list(cmap.name + \"_grayscale\", colors, cmap.N)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"flights = sns.load_dataset(\"flights\").pivot(\"month\", \"year\", \"passengers\")\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.heatmap(flights, cmap='jet')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.heatmap(flights, cmap=grayify_cmap('jet'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## 3 Types of Viable Color palettes/colormaps:\n",
"### 1. Perceptually uniform sequential"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.heatmap(flights, cmap='viridis')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.heatmap(flights, cmap='Purples')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 2. Diverging "
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pit_climate_df = pd.DataFrame(\n",
" dict(Month = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', \n",
" 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],\n",
" High = [2, 3, 10, 16, 22, 27, 28, 28, 24, 17, 10, 5],\n",
" Low = [-7, -5, 0, 5, 11, 16, 18, 18, 14,7, 3, -2])\n",
")"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"pit_climate_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"sns.heatmap(pit_climate_df[['High', 'Low']].T,\n",
" cmap='coolwarm', \n",
" center=0,#np.mean(pit_climate_df[['High', 'Low']].mean().mean()), \n",
" square=True,\n",
" xticklabels=pit_climate_df['Month'])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### 3. Categorical"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"example from before:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"plt.scatter(projected[:, 0], projected[:, 1],\n",
" c=digits.target, alpha=0.3,\n",
" cmap=plt.cm.get_cmap('Paired', 10))\n",
"plt.colorbar()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can also specify a color palette to use for the rest of a notebook or script by running"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.set_palette('colorblind')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Other things to consider: \n",
"* Use salient marker types, full list at https://matplotlib.org/3.2.1/api/markers_api.html"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"d1 = np.random.uniform(-2.5, 2.5, (100, 100))\n",
"d2 = np.random.randn(5,5)"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"sns.scatterplot(x=d1[:,0], y=d1[:,1], marker='+')\n",
"sns.scatterplot(x=d2[:,0], y=d2[:,1], facecolor='steelblue')"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"sns.lmplot(x='petal_length', y='petal_width', data=iris, \n",
" height=10, \n",
" hue='species', \n",
" markers=['1','2','3'],\n",
" fit_reg=False)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Even though a screen only has 2 spatial dimensions, we can encode more than 2 axes of data on it. Can you think of ways to include more axes? \n",
"We can map each of the following to an axis (see the sketch after this list):\n",
"- color\n",
"- size (for numerical data)\n",
"- shape (for categorical data)\n",
"- literally making a 3D plot (as in the swiss roll, useful in the case of 3 spatial dimensions)"
]
},
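{
"cell_type": "markdown",
"metadata": {},
"source": [
"A quick sketch combining several of these encodings on the iris data: color and marker shape encode species, and marker size encodes sepal length."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# hue (color), style (marker shape) and size each act as an extra axis\n",
"sns.scatterplot(data=iris, x='petal_length', y='petal_width',\n",
"                hue='species', style='species', size='sepal_length')"
]
},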
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Visual attributes, ranked by importance\n",
"\n",
"| Quantitative | Sequential | Categorical |\n",
"|----------|:-------------:|------:|\n",
"| Position | Position | Position |\n",
"| Length | Density | Color Hue |\n",
"| Angle | Color Saturation | Texture |\n",
"| Slope | Color Hue | Connection |\n",
"| Area | Texture | Containment |\n",
"| Volume | Connection | Density |"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Read more on choosing colors at:\n",
"\n",
"* https://seaborn.pydata.org/tutorial/color_palettes.html\n",
"* https://matplotlib.org/3.1.0/tutorials/colors/colormaps.html\n",
"\n",
"One of my favorite resources on clarity in plotting:\n",
"\n",
"* http://blogs.nature.com/methagora/2013/07/data-visualization-points-of-view.html"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A newer package that we don't have time for today, but that is definitely worth mentioning, is Altair (https://altair-viz.github.io), which makes visualization more intuitive by taking a declarative approach."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "visualization_tutorial",
"language": "python",
"name": "visualization_tutorial"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.2"
}
},
"nbformat": 4,
"nbformat_minor": 2
}