Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
Ipython notebook json
{
"metadata": {
"name": "Pandas DataFrames"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": "import numpy as np\nimport pandas as pd",
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 3
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Creating DataFrames"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "The easiest way to create a DataFrame is using an ndarray. If an index is not specified, a default integer index will be automatically created. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'])\ndframe",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>0</th>\n <td>-1.329538</td>\n <td>-1.252777</td>\n <td>-1.888587</td>\n <td>-0.008552</td>\n </tr>\n <tr>\n <th>1</th>\n <td>-0.601241</td>\n <td> 0.665922</td>\n <td>-0.342204</td>\n <td> 0.707170</td>\n </tr>\n <tr>\n <th>2</th>\n <td> 0.724993</td>\n <td>-0.200555</td>\n <td> 0.226544</td>\n <td> 0.751363</td>\n </tr>\n <tr>\n <th>3</th>\n <td> 0.888530</td>\n <td> 0.292810</td>\n <td>-1.039384</td>\n <td> 1.020441</td>\n </tr>\n <tr>\n <th>4</th>\n <td>-1.330474</td>\n <td> 0.194110</td>\n <td> 1.433966</td>\n <td> 0.335734</td>\n </tr>\n <tr>\n <th>5</th>\n <td> 1.009005</td>\n <td> 0.620863</td>\n <td> 0.360362</td>\n <td>-1.552132</td>\n </tr>\n <tr>\n <th>6</th>\n <td> 1.139423</td>\n <td>-0.888833</td>\n <td> 0.512584</td>\n <td>-1.293783</td>\n </tr>\n <tr>\n <th>7</th>\n <td> 1.152392</td>\n <td> 2.318118</td>\n <td> 1.918517</td>\n <td>-1.064239</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 4,
"text": " A B C D\n0 -1.329538 -1.252777 -1.888587 -0.008552\n1 -0.601241 0.665922 -0.342204 0.707170\n2 0.724993 -0.200555 0.226544 0.751363\n3 0.888530 0.292810 -1.039384 1.020441\n4 -1.330474 0.194110 1.433966 0.335734\n5 1.009005 0.620863 0.360362 -1.552132\n6 1.139423 -0.888833 0.512584 -1.293783\n7 1.152392 2.318118 1.918517 -1.064239"
}
],
"prompt_number": 4
},
{
"cell_type": "markdown",
"metadata": {},
"source": "To create an index just add the index argument. Lenght of the index should match the number of rows, of course. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe = pd.DataFrame(np.random.randn(8,4), columns=['A','B','C','D'], index = list('abcdefgh'))\ndframe",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>a</th>\n <td>-0.088492</td>\n <td> 0.837208</td>\n <td>-0.206235</td>\n <td> 0.308445</td>\n </tr>\n <tr>\n <th>b</th>\n <td> 0.514290</td>\n <td> 1.551145</td>\n <td> 1.129503</td>\n <td> 0.706509</td>\n </tr>\n <tr>\n <th>c</th>\n <td>-0.289319</td>\n <td>-0.665703</td>\n <td>-0.752778</td>\n <td> 1.089550</td>\n </tr>\n <tr>\n <th>d</th>\n <td> 0.241300</td>\n <td>-1.148401</td>\n <td>-0.282621</td>\n <td> 1.000189</td>\n </tr>\n <tr>\n <th>e</th>\n <td>-0.080722</td>\n <td>-0.824329</td>\n <td> 1.102019</td>\n <td> 0.845919</td>\n </tr>\n <tr>\n <th>f</th>\n <td>-1.732742</td>\n <td> 0.901447</td>\n <td>-2.782270</td>\n <td>-0.206729</td>\n </tr>\n <tr>\n <th>g</th>\n <td> 0.632871</td>\n <td> 0.826829</td>\n <td>-0.452925</td>\n <td> 0.435213</td>\n </tr>\n <tr>\n <th>h</th>\n <td> 0.813164</td>\n <td>-0.036353</td>\n <td> 0.462501</td>\n <td>-0.309223</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 5,
"text": " A B C D\na -0.088492 0.837208 -0.206235 0.308445\nb 0.514290 1.551145 1.129503 0.706509\nc -0.289319 -0.665703 -0.752778 1.089550\nd 0.241300 -1.148401 -0.282621 1.000189\ne -0.080722 -0.824329 1.102019 0.845919\nf -1.732742 0.901447 -2.782270 -0.206729\ng 0.632871 0.826829 -0.452925 0.435213\nh 0.813164 -0.036353 0.462501 -0.309223"
}
],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Another convenient way to create a DataFrame is using a python dict. The keys of the dict are used as the column names, and the values as the columns themselves. The values of the dict need to be collections of the same length. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dfdict = pd.DataFrame({'A':[1,2,3,4],'B': [5,6,7,8]},index = ['one','two','three','four'])\ndfdict",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>one</th>\n <td> 1</td>\n <td> 5</td>\n </tr>\n <tr>\n <th>two</th>\n <td> 2</td>\n <td> 6</td>\n </tr>\n <tr>\n <th>three</th>\n <td> 3</td>\n <td> 7</td>\n </tr>\n <tr>\n <th>four</th>\n <td> 4</td>\n <td> 8</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 6,
"text": " A B\none 1 5\ntwo 2 6\nthree 3 7\nfour 4 8"
}
],
"prompt_number": 6
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Selecting rows from DataFrames: "
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Location based selection:"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Selecting by integer index - using the standard numpy/R notation: "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe[0:4]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>a</th>\n <td>-0.088492</td>\n <td> 0.837208</td>\n <td>-0.206235</td>\n <td> 0.308445</td>\n </tr>\n <tr>\n <th>b</th>\n <td> 0.514290</td>\n <td> 1.551145</td>\n <td> 1.129503</td>\n <td> 0.706509</td>\n </tr>\n <tr>\n <th>c</th>\n <td>-0.289319</td>\n <td>-0.665703</td>\n <td>-0.752778</td>\n <td> 1.089550</td>\n </tr>\n <tr>\n <th>d</th>\n <td> 0.241300</td>\n <td>-1.148401</td>\n <td>-0.282621</td>\n <td> 1.000189</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 7,
"text": " A B C D\na -0.088492 0.837208 -0.206235 0.308445\nb 0.514290 1.551145 1.129503 0.706509\nc -0.289319 -0.665703 -0.752778 1.089550\nd 0.241300 -1.148401 -0.282621 1.000189"
}
],
"prompt_number": 7
},
{
"cell_type": "markdown",
"metadata": {},
"source": "But using the .iloc method is more efficient and optimized. iloc stands for 'integer location' and accepts indices for rows and columns as integer ranges: "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe.iloc[0:4,0:2]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>a</th>\n <td>-0.088492</td>\n <td> 0.837208</td>\n </tr>\n <tr>\n <th>b</th>\n <td> 0.514290</td>\n <td> 1.551145</td>\n </tr>\n <tr>\n <th>c</th>\n <td>-0.289319</td>\n <td>-0.665703</td>\n </tr>\n <tr>\n <th>d</th>\n <td> 0.241300</td>\n <td>-1.148401</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 8,
"text": " A B\na -0.088492 0.837208\nb 0.514290 1.551145\nc -0.289319 -0.665703\nd 0.241300 -1.148401"
}
],
"prompt_number": 8
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Label based selection:"
},
{
"cell_type": "raw",
"metadata": {},
"source": "The .loc method provides a way to use the labels of the dataframe effectively. It accepts integers as arguments but these are strictly related to the labels, not the position of a row/column. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe.loc[list('bca'),['A','B']]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>b</th>\n <td> 0.514290</td>\n <td> 1.551145</td>\n </tr>\n <tr>\n <th>c</th>\n <td>-0.289319</td>\n <td>-0.665703</td>\n </tr>\n <tr>\n <th>a</th>\n <td>-0.088492</td>\n <td> 0.837208</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 9,
"text": " A B\nb 0.514290 1.551145\nc -0.289319 -0.665703\na -0.088492 0.837208"
}
],
"prompt_number": 9
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Criteria based selection: "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "How do you select rows based on values satisfying certain criteria? This is likely the part where most time is spent while doing exploratory data analysis: slicing the dataframe based on certain criteria. "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe[dframe['C']>0]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>b</th>\n <td> 0.514290</td>\n <td> 1.551145</td>\n <td> 1.129503</td>\n <td> 0.706509</td>\n </tr>\n <tr>\n <th>e</th>\n <td>-0.080722</td>\n <td>-0.824329</td>\n <td> 1.102019</td>\n <td> 0.845919</td>\n </tr>\n <tr>\n <th>h</th>\n <td> 0.813164</td>\n <td>-0.036353</td>\n <td> 0.462501</td>\n <td>-0.309223</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 10,
"text": " A B C D\nb 0.514290 1.551145 1.129503 0.706509\ne -0.080722 -0.824329 1.102019 0.845919\nh 0.813164 -0.036353 0.462501 -0.309223"
}
],
"prompt_number": 10
},
{
"cell_type": "markdown",
"metadata": {},
"source": "It is possible to combine boolean criteria using '&', '|' and '~', however these <b> must </b> be grouped by using parentheses."
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe[(dframe['C']>0) & (dframe['D']<0)]",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>h</th>\n <td> 0.813164</td>\n <td>-0.036353</td>\n <td> 0.462501</td>\n <td>-0.309223</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 11,
"text": " A B C D\nh 0.813164 -0.036353 0.462501 -0.309223"
}
],
"prompt_number": 11
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": "Applying functions to a group of values:"
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Using map "
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Applying a function to an entire column can be done using the map method: "
},
{
"cell_type": "code",
"collapsed": false,
"input": "dframe['A'] = dframe['A'].map(lambda x: x+1)\ndframe",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>a</th>\n <td> 0.911508</td>\n <td> 0.837208</td>\n <td>-0.206235</td>\n <td> 0.308445</td>\n </tr>\n <tr>\n <th>b</th>\n <td> 1.514290</td>\n <td> 1.551145</td>\n <td> 1.129503</td>\n <td> 0.706509</td>\n </tr>\n <tr>\n <th>c</th>\n <td> 0.710681</td>\n <td>-0.665703</td>\n <td>-0.752778</td>\n <td> 1.089550</td>\n </tr>\n <tr>\n <th>d</th>\n <td> 1.241300</td>\n <td>-1.148401</td>\n <td>-0.282621</td>\n <td> 1.000189</td>\n </tr>\n <tr>\n <th>e</th>\n <td> 0.919278</td>\n <td>-0.824329</td>\n <td> 1.102019</td>\n <td> 0.845919</td>\n </tr>\n <tr>\n <th>f</th>\n <td>-0.732742</td>\n <td> 0.901447</td>\n <td>-2.782270</td>\n <td>-0.206729</td>\n </tr>\n <tr>\n <th>g</th>\n <td> 1.632871</td>\n <td> 0.826829</td>\n <td>-0.452925</td>\n <td> 0.435213</td>\n </tr>\n <tr>\n <th>h</th>\n <td> 1.813164</td>\n <td>-0.036353</td>\n <td> 0.462501</td>\n <td>-0.309223</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 12,
"text": " A B C D\na 0.911508 0.837208 -0.206235 0.308445\nb 1.514290 1.551145 1.129503 0.706509\nc 0.710681 -0.665703 -0.752778 1.089550\nd 1.241300 -1.148401 -0.282621 1.000189\ne 0.919278 -0.824329 1.102019 0.845919\nf -0.732742 0.901447 -2.782270 -0.206729\ng 1.632871 0.826829 -0.452925 0.435213\nh 1.813164 -0.036353 0.462501 -0.309223"
}
],
"prompt_number": 12
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Grouping by a property can be done as follows:"
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Groupby"
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb = dframe.groupby(dframe['A']>0)\ngb.size()",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 14,
"text": "A\nFalse 1\nTrue 7\ndtype: int64"
}
],
"prompt_number": 14
},
{
"cell_type": "markdown",
"metadata": {},
"source": "Aggregate functions can be applied to these groups."
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb.sum()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n <tr>\n <th>A</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>False</th>\n <td>-0.732742</td>\n <td> 0.901447</td>\n <td>-2.782270</td>\n <td>-0.206729</td>\n </tr>\n <tr>\n <th>True </th>\n <td> 8.743092</td>\n <td> 0.540395</td>\n <td> 0.999464</td>\n <td> 4.076603</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 15,
"text": " A B C D\nA \nFalse -0.732742 0.901447 -2.782270 -0.206729\nTrue 8.743092 0.540395 0.999464 4.076603"
}
],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb.median()",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>A</th>\n <th>B</th>\n <th>C</th>\n <th>D</th>\n </tr>\n <tr>\n <th>A</th>\n <th></th>\n <th></th>\n <th></th>\n <th></th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>False</th>\n <td>-0.732742</td>\n <td> 0.901447</td>\n <td>-2.782270</td>\n <td>-0.206729</td>\n </tr>\n <tr>\n <th>True </th>\n <td> 1.241300</td>\n <td>-0.036353</td>\n <td>-0.206235</td>\n <td> 0.706509</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 16,
"text": " A B C D\nA \nFalse -0.732742 0.901447 -2.782270 -0.206729\nTrue 1.241300 -0.036353 -0.206235 0.706509"
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": "gb.groups",
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 17,
"text": "{False: ['f'], True: ['a', 'b', 'c', 'd', 'e', 'g', 'h']}"
}
],
"prompt_number": 17
},
{
"cell_type": "heading",
"level": 3,
"metadata": {},
"source": "Apply"
},
{
"cell_type": "markdown",
"metadata": {},
"source": "If you need to square an entire column of a Dataframe, here is how you would do it with apply:"
},
{
"cell_type": "code",
"collapsed": false,
"input": "def square(x):\n return pd.Series([x, x**2])\n\ndframe['A'].apply(square)",
"language": "python",
"metadata": {},
"outputs": [
{
"html": "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n<table border=\"1\" class=\"dataframe\">\n <thead>\n <tr style=\"text-align: right;\">\n <th></th>\n <th>0</th>\n <th>1</th>\n </tr>\n </thead>\n <tbody>\n <tr>\n <th>a</th>\n <td> 0.911508</td>\n <td> 0.830848</td>\n </tr>\n <tr>\n <th>b</th>\n <td> 1.514290</td>\n <td> 2.293074</td>\n </tr>\n <tr>\n <th>c</th>\n <td> 0.710681</td>\n <td> 0.505067</td>\n </tr>\n <tr>\n <th>d</th>\n <td> 1.241300</td>\n <td> 1.540824</td>\n </tr>\n <tr>\n <th>e</th>\n <td> 0.919278</td>\n <td> 0.845072</td>\n </tr>\n <tr>\n <th>f</th>\n <td>-0.732742</td>\n <td> 0.536911</td>\n </tr>\n <tr>\n <th>g</th>\n <td> 1.632871</td>\n <td> 2.666266</td>\n </tr>\n <tr>\n <th>h</th>\n <td> 1.813164</td>\n <td> 3.287565</td>\n </tr>\n </tbody>\n</table>\n</div>",
"metadata": {},
"output_type": "pyout",
"prompt_number": 19,
"text": " 0 1\na 0.911508 0.830848\nb 1.514290 2.293074\nc 0.710681 0.505067\nd 1.241300 1.540824\ne 0.919278 0.845072\nf -0.732742 0.536911\ng 1.632871 2.666266\nh 1.813164 3.287565"
}
],
"prompt_number": 19
}
],
"metadata": {}
}
]
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment