@dalejung
Created August 13, 2012 20:57
Test nbviewer #notebook-project #inactive
{
"metadata": {
"name": "binning and grouping.ipynb",
"notebook_path": "https://gist.github.com/3344040/binning and grouping.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"import pandas as pd\n",
"import pandas.util.testing as tm"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ind = pd.date_range(start=\"1/1/2000\", freq=\"D\", periods=100)\n",
"df = pd.DataFrame({'high': range(len(ind))}, index=ind)\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 5
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Binning vs Grouping\n",
"===================\n",
"\n",
"Binning with resample and monthly offsets is different from grouping. Binning requires the creation of edges and intervals, and it is easy to get semantically incorrect data with the month-end ('M') and month-start ('MS') offsets. Those offsets only determine where the edges fall. The machinery after them (the closed and label arguments) determines which side of each interval is closed. \n",
"\n",
"Remember, an interval has its edge ON a number. It doesn't necessarily contain that number unless it is closed on that side. Grouping is different: each day belongs to exactly one month, which is semantically simpler to understand. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"end_month_rs = df.resample('M', 'max', label=\"right\", closed=\"right\")\n",
"end_MS_rs = df.resample('MS', 'max', closed='left', label=\"left\")\n",
"tm.assert_almost_equal(end_month_rs, end_MS_rs)\n",
"\n",
"end_month_gb = df.groupby(lambda x: (x.year, x.month)).agg('max')\n",
"tm.assert_almost_equal(end_month_rs, end_month_gb)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 59,
"text": [
"True"
]
}
],
"prompt_number": 59
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Here's a case where things go wrong. In this pandas version, resample's default closed is right. So while MS creates the proper edge on 2000-01-01, the data is wrong unless the interval is closed on the left. Technically, two intervals touch 2000-01-01; the one on the left is either empty or includes just that one day, depending on which side is closed. "
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# This is wrong! With the default closed='right', each month-start day falls into the previous bin.\n",
"df.resample('MS', 'max')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>high</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td><strong>2000-01-01 00:00:00</strong></td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>2000-02-01 00:00:00</strong></td>\n",
" <td> 31</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>2000-03-01 00:00:00</strong></td>\n",
" <td> 60</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>2000-04-01 00:00:00</strong></td>\n",
" <td> 91</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>2000-05-01 00:00:00</strong></td>\n",
" <td> 99</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 64,
"text": [
" high\n",
"2000-01-01 0\n",
"2000-02-01 31\n",
"2000-03-01 60\n",
"2000-04-01 91\n",
"2000-05-01 99"
]
}
],
"prompt_number": 64
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# https://github.com/pydata/pandas/issues/1770\n",
"dtrange = pd.date_range(start='1-1-2009', periods=31, freq='D')\n",
"\n",
"raw = pd.Series(np.arange(len(dtrange)), index=dtrange)\n",
"test1 = raw.resample('M', how='count', closed='left', label='left')\n",
"test2 = raw.resample('M', how='mean', closed='left', label='left')\n",
"\n",
"r_test1 = raw.resample('M', how='count', closed='right', label='right')\n",
"r_test2 = raw.resample('M', how='mean', closed='right', label='right')\n",
"\n",
"# M is Month End\n",
"print pd.tseries.frequencies.to_offset('M')"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"<1 MonthEnd>\n"
]
}
],
"prompt_number": 28
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'count\\n', test1\n",
"print 'mean\\n', test2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"count\n",
"2008-12-31 30\n",
"2009-01-31 1\n",
"mean\n",
"2008-12-31 14.5\n",
"2009-01-31 30.0\n",
"Freq: M\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"print 'count\\n', r_test1\n",
"print 'mean\\n', r_test2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"count\n",
"2009-01-31 31\n",
"mean\n",
"2008-12-31 15\n",
"2009-01-31 -0\n",
"Freq: M\n"
]
}
],
"prompt_number": 27
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# You probably want MS (month start) edges, closed and labelled on the left\n",
"test1 = raw.resample('MS', how='count', closed='left', label='left')\n",
"test2 = raw.resample('MS', how='mean', closed='left', label='left')\n",
"test2"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 34,
"text": [
"2009-01-01 15\n",
"Freq: MS"
]
}
],
"prompt_number": 34
},
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}
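The contrast the notebook draws between binning and grouping can be checked against a recent pandas release. The following is a minimal sketch, not the author's code. It assumes a pandas version where the aggregation is chained as `.max()` rather than passed as `how`, and it passes `pd.offsets.MonthEnd()` directly because the string alias for month end has changed across releases. The variable names are illustrative.

```python
import pandas as pd

# Same data as the notebook: 100 daily rows starting 2000-01-01.
ind = pd.date_range(start="2000-01-01", freq="D", periods=100)
df = pd.DataFrame({"high": range(len(ind))}, index=ind)

# Month-end edges, intervals closed and labelled on the right.
by_month_end = df.resample(
    pd.offsets.MonthEnd(), closed="right", label="right"
).max()

# Month-start edges, intervals closed and labelled on the left.
by_month_start = df.resample("MS", closed="left", label="left").max()

# Grouping: every day belongs to exactly one (year, month) key.
by_group = df.groupby([df.index.year, df.index.month]).max()

# All three agree on the per-month maximum.
monthly_max = by_group["high"].tolist()
assert by_month_end["high"].tolist() == monthly_max
assert by_month_start["high"].tolist() == monthly_max

# The pitfall: month-start edges with right-closed intervals.
# Each bin now ends ON the first day of a month and includes it,
# so every value is the first day of the labelled month, not its max.
mismatched = df.resample("MS", closed="right", label="right").max()
assert mismatched["high"].tolist() != monthly_max
```

In recent releases `MS` appears to default to `closed='left'`, so the mismatch only shows up when `closed='right'` is requested explicitly. That is why the sketch spells out both `closed` and `label` on every call.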