Created
August 13, 2012 20:57
Test nbviewer #notebook-project #inactive
{ | |
"metadata": { | |
"name": "binning and grouping.ipynb", | |
"notebook_path": "https://gist.github.com/3344040/binning and grouping.ipynb" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import numpy as np\n", | |
"import pandas as pd\n", | |
"import pandas.util.testing as tm" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"ind = pd.date_range(start=\"1/1/2000\", freq=\"D\", periods=100)\n", | |
"df = pd.DataFrame({'high': range(len(ind))}, index=ind)\n" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 5 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Binning vs Grouping\n", | |
"===================\n", | |
"\n", | |
"Binning with resample and monthly offsets is different from grouping. Binning requires creating edges and intervals, and it's quite easy to get semantically incorrect data with the month-end (`M`) and month-start (`MS`) offsets. All those offsets do is determine the edges; the machinery that runs afterwards determines which side of each interval is closed. \n", | |
"\n", | |
"Remember, an interval has its edge ON a number. It doesn't necessarily contain that number unless it is closed on that side. Grouping, by contrast, assigns each day to exactly one month, which is semantically simpler to understand. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
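"# Month-end edges closed on the right and month-start edges closed on the left hold the same days\n", | |
"end_month_rs = df.resample('M', 'max', label=\"right\", closed=\"right\")\n", | |
"end_MS_rs = df.resample('MS', 'max', closed='left', label=\"left\")\n", | |
"tm.assert_almost_equal(end_month_rs, end_MS_rs)\n", | |
"\n", | |
"# Grouping by (year, month) gives the same maxima without any edges\n", | |
"end_month_gb = df.groupby(lambda x: (x.year, x.month)).agg('max')\n", | |
"tm.assert_almost_equal(end_month_rs, end_month_gb)" | |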
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 59, | |
"text": [ | |
"True" | |
] | |
} | |
], | |
"prompt_number": 59 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Here's a case where things go wrong. In the pandas version used here, resample's default for closed is right. So while MS creates the proper edge on 2000-01-01, the data is wrong unless the interval is closed on the left. Technically, two intervals touch 2000-01-01; the one on the left is either empty or includes just that one day, depending on which side is closed. " | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# This is wrong! With the default closed='right', each month's first day falls into the previous interval.\n", | |
"df.resample('MS', 'max')" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>high</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <td><strong>2000-01-01 00:00:00</strong></td>\n", | |
" <td> 0</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td><strong>2000-02-01 00:00:00</strong></td>\n", | |
" <td> 31</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td><strong>2000-03-01 00:00:00</strong></td>\n", | |
" <td> 60</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td><strong>2000-04-01 00:00:00</strong></td>\n", | |
" <td> 91</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <td><strong>2000-05-01 00:00:00</strong></td>\n", | |
" <td> 99</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"output_type": "pyout", | |
"prompt_number": 64, | |
"text": [ | |
" high\n", | |
"2000-01-01 0\n", | |
"2000-02-01 31\n", | |
"2000-03-01 60\n", | |
"2000-04-01 91\n", | |
"2000-05-01 99" | |
] | |
} | |
], | |
"prompt_number": 64 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# https://github.com/pydata/pandas/issues/1770\n", | |
"dtrange = pd.date_range(start='1-1-2009', periods=31, freq='D')\n", | |
"\n", | |
"raw = pd.Series(np.arange(len(dtrange)), index=dtrange)\n", | |
"test1 = raw.resample('M', how='count', closed='left', label='left')\n", | |
"test2 = raw.resample('M', how='mean', closed='left', label='left')\n", | |
"\n", | |
"r_test1 = raw.resample('M', how='count', closed='right', label='right')\n", | |
"r_test2 = raw.resample('M', how='mean', closed='right', label='right')\n", | |
"\n", | |
"# M is Month End\n", | |
"print pd.tseries.frequencies.to_offset('M')" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"<1 MonthEnd>\n" | |
] | |
} | |
], | |
"prompt_number": 28 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"print 'count\\n', test1\n", | |
"print 'mean\\n', test2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"count\n", | |
"2008-12-31 30\n", | |
"2009-01-31 1\n", | |
"mean\n", | |
"2008-12-31 14.5\n", | |
"2009-01-31 30.0\n", | |
"Freq: M\n" | |
] | |
} | |
], | |
"prompt_number": 15 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"print 'count\\n', r_test1\n", | |
"print 'mean\\n', r_test2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"count\n", | |
"2009-01-31 31\n", | |
"mean\n", | |
"2008-12-31 15\n", | |
"2009-01-31 -0\n", | |
"Freq: M\n" | |
] | |
} | |
], | |
"prompt_number": 27 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"# You probably want MS (month start) here, closed and labeled on the left\n", | |
"test1 = raw.resample('MS', how='count', closed='left', label='left')\n", | |
"test2 = raw.resample('MS', how='mean', closed='left', label='left')\n", | |
"test2" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 34, | |
"text": [ | |
"2009-01-01 15\n", | |
"Freq: MS" | |
] | |
} | |
], | |
"prompt_number": 34 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [] | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
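
The binning vs grouping comparison in the notebook can be sketched against a more recent pandas, where (as an assumption about the installed version) the `how` argument is gone and the aggregation is chained as a method such as `.max()`. The variable names `binned`, `grouped`, and `right_closed` are illustrative and not part of the original notebook. The last call passes `closed="right"` and `label="right"` explicitly to mimic the defaults the notebook ran under, so it should show the same shifted maxima as the "This is wrong!" cell.

```python
import pandas as pd

# Same data as the notebook: 100 daily rows starting 2000-01-01
ind = pd.date_range(start="2000-01-01", freq="D", periods=100)
df = pd.DataFrame({"high": list(range(len(ind)))}, index=ind)

# Binning: month-start edges, each interval closed on its left edge,
# so 2000-02-01 belongs to the February bin
binned = df.resample("MS", closed="left", label="left").max()

# Grouping: every day maps to exactly one (year, month) key, no edges involved
grouped = df.groupby([df.index.year, df.index.month]).max()

# Binning with correctly closed intervals agrees with grouping
assert binned["high"].tolist() == grouped["high"].tolist()

# Same month-start edges, but closed on the right: the first day of each
# month now lands in the interval that ends on it
right_closed = df.resample("MS", closed="right", label="right").max()

print(binned)
print(right_closed)
```

Spelling out `closed` and `label` on every call avoids depending on defaults, which differ between offsets and between pandas versions.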