@sarguido
Created July 21, 2014 20:13
{
"metadata": {
"name": "",
"signature": "sha256:22b1564adb80f4e0f58087a205bc2e95bbb1263c76ba2d04faa87481e0e7ca49"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"%pylab inline"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# matplotlib (code file matplotlib_lessons.py)\n",
"\n",
"## The importance of communicating your results\n",
"\n",
"So, you've done your preprocessing, and you've doen your analysis. Now, you have to communicate your results. Often, you're going to be communicating with people who aren't programmers. Data visualization is a great way to communicate your analyses in clear, meaningful ways. It also allows you to literally see what's going on in your data. (I should note that data visualization is often a great exploratory step, because again, you get to see what's going on in your data.)\n",
"\n",
"One of the most important parts of visualization is choosing the best plot for your data. A great visualization shouldn't hide the analysis you've made but enhance it. It doesn't have to fancy or complicated, just clear.\n",
"\n",
"## What is matplotlib?\n",
"\n",
"Matplotlib is a 2-dimensional visualization library that is very very widely used. You can create a wide variety of plots with it, and you can easily feed in data from other modules in the scientific Python stack, like NumPy, SciPy, scikit-learn, statsmodels, etc. There are a lot of resources online for learning matplotlib, including the matplotlib documentation.\n",
"\n",
"## Why matplotlib?\n",
"\n",
"Matplotlib is, like I mentioned, very widely used. It's very comprehensive, there's a lot of community support for it, and it's in constant development. It is not my personal favorite, but due to its popularity and comprehensiveness, I would be remiss if I didn't talk about it in this tutorial.\n",
"\n",
"## This section\n",
"\n",
"All of the examples I'll show here are fairly simple, uncomplicated examples. After trying out a few different plots, we'll switch back over to our census data and see what we can do with that.\n",
"\n",
"## Plotting\n",
"\n",
"Plotting is really easy in matplotlib. Let's make a scatterplot of our wine data, of the abv and color features."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"\n",
"wine = pd.read_csv('../data/wine.csv')\n",
"\n",
"# scatterplot of that data\n",
"\n",
"plt.scatter(wine.abv, wine.color)\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"As you can see, it's very easy to feed data from Pandas into matplotlib.\n",
"\n",
"It's also easy to feed in data from scikit-learn. Let's make a graph of the regression line from the linear regression example. First, we'll create our estimator again."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.cross_validation import train_test_split\n",
"from sklearn.linear_model import LinearRegression\n",
"\n",
"wine_data_mag = wine.loc[:, ['magnesium']]\n",
"wine_data_abv = wine.loc[:, 'abv']\n",
"\n",
"wine_mag_train, wine_mag_test, wine_abv_train, wine_abv_test = train_test_split(wine_data_mag, wine_data_abv)\n",
"\n",
"lr = LinearRegression()\n",
"lr.fit(wine_mag_train, wine_abv_train)"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Then, we'll plot our predicted regression line."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.scatter(wine_data_mag.magnesium, wine_data_abv, color='black')\n",
"plt.plot(wine_mag_test, lr.predict(wine_mag_test), color='green',\n",
" linewidth=3)\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Another easy plot to make is a bar chart. Let's look at the distribution of wine_types in the wine DataFrame. We're going to get the counts using the value_counts() function from Pandas, and then split up the index from those counts."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"wine_type = wine.loc[:, 'wine_type'].value_counts()\n",
"plt.bar(wine_type.index, wine_type.values)\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We have a pretty good value distribution.\n",
"\n",
"Making a histogram is also pretty simple. Let's look at the breakdown of abv across the wine dataset. Histograms group data into counts by values, or into bins. With the hist() function, you can set the number of bins. The default is 10."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"wine_abv = wine.loc[:, 'abv']\n",
"plt.hist(wine_abv, bins=14, color='red')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's plot the predicted values for wine type against the real values for wine type. First, we'll create our classifier again."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"from sklearn.cross_validation import train_test_split\n",
"from sklearn.neighbors import KNeighborsClassifier\n",
"\n",
"wine_data = pd.read_csv('../data/wine_data.csv')\n",
"wine_labels = pd.read_csv('../data/wine_labels.csv', squeeze=True)\n",
"\n",
"wine_data_train, wine_data_test, wine_labels_train, wine_labels_test = train_test_split(wine_data, wine_labels)\n",
"\n",
"knn = KNeighborsClassifier()\n",
"knn.fit(wine_data_train, wine_labels_train)\n",
"pred = list(knn.predict(wine_data_test))"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Now, we'll take the counts of each wine type in each set. We'll also offset the x values for one of the sets, so that we can look at the bars more easily."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"x = [1, 2, 3]\n",
"x2 = [1.1, 2.1, 3.1]\n",
"y = [pred.count(1), pred.count(2), pred.count(3)]\n",
"wine_test = list(wine_labels_test)\n",
"y2 = [wine_test.count(1), wine_test.count(2), wine_test.count(3)]\n",
"\n",
"plt.bar(x2, y, width=0.2)\n",
"plt.bar(x, y2, width=0.2, color='green', align='center')"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Our kNN classifier doesn't look too shabby!\n",
"\n",
"#### Customizing your plots\n",
"\n",
"Let's talk about customizing your plots. matplotlib offers a wide variety of customization, from colors to shapes to line widths to axis labels. We'll go through a few different customizations.\n",
"\n",
"You've already seen that it's easy to change the color, using the 'color' parameter in each of the various plot functions. This parameter takes a string, and that can be a color name like red or blue, or a hex string:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.hist(wine_abv, bins=14, color='#CC00FF')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's add axis labels."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.hist(wine_abv, bins=14, color='#CC00FF')\n",
"plt.xlabel('abv')\n",
"plt.ylabel('counts')\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's take our regression plot and make it look more interesting."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"plt.scatter(wine_data_mag.magnesium, wine_data_abv, color='blue', marker='*')\n",
"plt.plot(wine_mag_test, lr.predict(wine_mag_test), 'g^',\n",
" linewidth=2)\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have blue stars for our plot points and green triangles for our regression line. Let's add a legend that tells us that the stars signify our wine data."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"scat = plt.scatter(wine_data_mag.magnesium, wine_data_abv, color='blue', marker='*')\n",
"plt.plot(wine_mag_test, lr.predict(wine_mag_test), 'g^',\n",
" linewidth=2)\n",
"plt.legend([scat], ['wine data'])\n",
"plt.show()"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Lesson: make some plots!"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Using the auto_mpg DataFrame, come up with a scatter plot for weight and mpg.\n",
"\n",
"auto_mpg = pd.read_csv('../data/auto_mpg.txt', delimiter=\"\\t\")\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Can you add weight and mpg axis labels to your plot?\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Can you make a histogram of mpg? Can you change the color to something other than that awful blue?\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# Make a plot of anything you want.\n",
"\n"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# For those using IPython Notebook/Wakari/NBViewer: Go to the data_analysis notebook!\n",
"\n",
"# For those using code files, go to data_analysis.py!"
]
}
],
"metadata": {}
}
]
}