dalejung/binning and grouping.ipynb

## binning and grouping.ipynb
{
 "metadata": {
  "name": "binning and grouping.ipynb",
  "notebook_path": "https://gist.github.com/3344040/binning and grouping.ipynb"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "import pandas as pd\n",
      "import pandas.util.testing as tm"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "ind = pd.date_range(start=\"1/1/2000\", freq=\"D\", periods=100)\n",
      "df = pd.DataFrame({'high': range(len(ind))}, index=ind)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 5
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Binning vs Grouping\n",
      "===================\n",
      "\n",
      "Binning with resample and monthly offsets is different from grouping. Binning requires the creation of edges and intervals. It's quite easy to get semantically incorrect data with MonthEnd and MonthStart. All those offsets do is determine the edges. The machinery after that determines on which side each interval is closed.\n",
      "\n",
      "Remember, an interval has its edge ON a number. It doesn't necessarily contain that number unless it is closed on that side. This is opposed to grouping, where a day belongs to exactly one month. Semantically, grouping is simpler to understand."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "end_month_rs = df.resample('M', 'max', label=\"right\", closed=\"right\")\n",
      "end_MS_rs = df.resample('MS', 'max', closed='left', label=\"left\")\n",
      "tm.assert_almost_equal(end_month_rs, end_MS_rs)\n",
      "\n",
      "end_month_gb = df.groupby(lambda x: (x.year, x.month)).agg('max')\n",
      "tm.assert_almost_equal(end_month_rs, end_month_gb)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 59,
       "text": [
        "True"
       ]
      }
     ],
     "prompt_number": 59
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Here's a case where things go wrong. In the pandas version used here, resample's default closed is right. So while MS creates the proper edge on 2000-01-01, the data is wrong unless the interval is closed on the left. Technically, there are two intervals touching 2000-01-01; the one on the left is either empty or includes just one day, depending on which side is closed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# This is wrong!\n",
      "df.resample('MS', 'max')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>high</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <td><strong>2000-01-01 00:00:00</strong></td>\n",
        "      <td>  0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>2000-02-01 00:00:00</strong></td>\n",
        "      <td> 31</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>2000-03-01 00:00:00</strong></td>\n",
        "      <td> 60</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>2000-04-01 00:00:00</strong></td>\n",
        "      <td> 91</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>2000-05-01 00:00:00</strong></td>\n",
        "      <td> 99</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "output_type": "pyout",
       "prompt_number": 64,
       "text": [
        "            high\n",
        "2000-01-01     0\n",
        "2000-02-01    31\n",
        "2000-03-01    60\n",
        "2000-04-01    91\n",
        "2000-05-01    99"
       ]
      }
     ],
     "prompt_number": 64
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# https://github.com/pydata/pandas/issues/1770\n",
      "dtrange = pd.date_range(start='1-1-2009', periods=31, freq='D')\n",
      "\n",
      "raw = pd.Series(np.arange(len(dtrange)), index=dtrange)\n",
      "test1 = raw.resample('M', how='count', closed='left', label='left')\n",
      "test2 = raw.resample('M', how='mean', closed='left', label='left')\n",
      "\n",
      "r_test1 = raw.resample('M', how='count', closed='right', label='right')\n",
      "r_test2 = raw.resample('M', how='mean', closed='right', label='right')\n",
      "\n",
      "# M is Month End\n",
      "print pd.tseries.frequencies.to_offset('M')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "<1 MonthEnd>\n"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'count\\n', test1\n",
      "print 'mean\\n', test2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "count\n",
        "2008-12-31    30\n",
        "2009-01-31     1\n",
        "mean\n",
        "2008-12-31    14.5\n",
        "2009-01-31    30.0\n",
        "Freq: M\n"
       ]
      }
     ],
     "prompt_number": 15
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "print 'count\\n', r_test1\n",
      "print 'mean\\n', r_test2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "count\n",
        "2009-01-31    31\n",
        "mean\n",
        "2008-12-31    15\n",
        "2009-01-31    -0\n",
        "Freq: M\n"
       ]
      }
     ],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# Probably want MS\n",
      "test1 = raw.resample('MS', how='count', closed='left', label='left')\n",
      "test2 = raw.resample('MS', how='mean', closed='left', label='left')\n",
      "test2"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 34,
       "text": [
        "2009-01-01    15\n",
        "Freq: MS"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}
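
The notebook's point, that the offset only places the edges while `closed` decides which bin an edge day falls into, can be sketched against a newer pandas. This is a minimal sketch under assumptions, not the notebook's own code: it uses the method-chaining form `resample(...).max()` instead of the `how=` argument shown above, and it passes `pd.offsets.MonthBegin()` / `pd.offsets.MonthEnd()` objects rather than the `'MS'` / `'M'` strings so that alias changes between pandas versions do not matter. The variable names are hypothetical.

```python
import pandas as pd

# Same toy data as the notebook: 100 daily rows starting 2000-01-01.
ind = pd.date_range(start="2000-01-01", freq="D", periods=100)
df = pd.DataFrame({"high": range(len(ind))}, index=ind)

# Grouping: every day belongs to exactly one (year, month) key.
by_group = df.groupby([df.index.year, df.index.month])["high"].max()

# Binning on month-start edges, closed on the left: [Jan 1, Feb 1), ...
ms_left = df.resample(
    pd.offsets.MonthBegin(), closed="left", label="left"
)["high"].max()

# Binning on month-end edges, closed on the right: (Dec 31, Jan 31], ...
me_right = df.resample(
    pd.offsets.MonthEnd(), closed="right", label="right"
)["high"].max()

# Month-start edges closed on the RIGHT: (Dec 1, Jan 1], (Jan 1, Feb 1], ...
# Each first-of-month row lands in the previous bin, so the maxima shift.
ms_right = df.resample(
    pd.offsets.MonthBegin(), closed="right", label="right"
)["high"].max()

print(by_group.tolist())
print(ms_left.tolist())
print(me_right.tolist())
print(ms_right.tolist())
```

The first three results should agree with each other, while the last one should reproduce the values in the "This is wrong!" cell, because 2000-01-01 sits alone in a bin of its own and every later first-of-month is counted with the preceding month.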

## intraday binning error.ipynb

      