@dalejung
Created August 16, 2012 03:59
Pandas Stuff #notebook-project #inactive
{
"metadata": {
"name": "subclassing pandas objects.ipynb",
"notebook_path": "https://gist.github.com/3366583/subclassing pandas objects.ipynb"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"# http://stackoverflow.com/questions/11979194/subclasses-of-pandas-object-work-differently-from-subclass-of-other-object\n",
"import pandas as pd\n",
"\n",
"class Support(pd.Series):\n",
"    def __new__(cls, *args, **kwargs):\n",
"        # pd.Series, not the bare name Series: only `pd` is imported above\n",
"        arr = pd.Series.__new__(cls, *args, **kwargs)\n",
"        return arr.view(Support)\n",
"\n",
"    def supportMethod1(self):\n",
"        print('I am support method 1')\n",
"    def supportMethod2(self):\n",
"        print('I am support method 2')\n",
"\n",
"class Compute(object):\n",
"    supp = None\n",
"    def test(self):\n",
"        self.supp()\n",
"\n",
"class Config(object):\n",
"    supp = None\n",
"    @classmethod\n",
"    def initializeConfig(cls):\n",
"        cls.supp = Support()\n",
"    @classmethod\n",
"    def setConfig1(cls):\n",
"        Compute.supp = cls.supp.supportMethod1\n",
"    @classmethod\n",
"    def setConfig2(cls):\n",
"        Compute.supp = cls.supp.supportMethod2\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"# adding the __new__ works for this simple demo\n",
"s = Support(range(10))\n",
"assert s.supportMethod1() == Support.supportMethod1(s)\n",
"assert isinstance(s, Support)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"I am support method 1\n",
"I am support method 1\n"
]
}
],
"prompt_number": 46
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The problem is that pandas boxes and unboxes Series data in many places. The data then comes back as a plain Series rather than the subclass, which limits how useful a Series subclass can be.\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"assert not isinstance(s.cumsum(), Support)\n",
"assert not isinstance(s.ix[:5], Support)\n"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 47
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Also, when you add a Series to a DataFrame, you're not really adding the Series to the frame, but adding the *data*. The DataFrame holds the data and boxes it as a Series when you access it.\n",
"\n",
"Whatever speed you're getting from pandas/numpy has a lot to do with how the data is stored. A DataFrame is not a collection of pointers to Series. Its data needs to be consolidated, so each Series just becomes a row in a bigger data set, losing its Series-likeness until it's reboxed."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"s = Support(range(10))\n",
"df = pd.DataFrame({'s': s})\n",
"# df.s is not Support or even the Series. It's the data.\n",
"assert not isinstance(df.s, Support)\n",
"assert id(s) != id(df.s)\n",
"s.ix[0] = 888\n",
"# does not change df\n",
"df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>s</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <td><strong>0</strong></td>\n",
" <td> 0</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>1</strong></td>\n",
" <td> 1</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>2</strong></td>\n",
" <td> 2</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>3</strong></td>\n",
" <td> 3</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>4</strong></td>\n",
" <td> 4</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>5</strong></td>\n",
" <td> 5</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>6</strong></td>\n",
" <td> 6</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>7</strong></td>\n",
" <td> 7</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>8</strong></td>\n",
" <td> 8</td>\n",
" </tr>\n",
" <tr>\n",
" <td><strong>9</strong></td>\n",
" <td> 9</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 103,
"text": [
" s\n",
"0 0\n",
"1 1\n",
"2 2\n",
"3 3\n",
"4 4\n",
"5 5\n",
"6 6\n",
"7 7\n",
"8 8\n",
"9 9"
]
}
],
"prompt_number": 103
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you're working with Frames, your data is being consolidated like so:"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"import numpy as np\n",
"\n",
"s1 = Support(range(10))\n",
"s2 = Support(range(10, 20))\n",
"\n",
"stacked = np.vstack((s1, s2))\n",
"\n",
"s1.ix[0] = 10\n",
"# vstack copied the values, so the stacked array does not see the change\n",
"assert stacked[0][0] != s1[0]\n",
"\n",
"stacked"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 123,
"text": [
"array([[ 0, 1, 2, 3, 4, 5, 6, 7, 8, 9],\n",
" [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])"
]
}
],
"prompt_number": 123
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"After the consolidation, all you really know is that `stacked` is two rows of 10 columns. You could roughly simulate what a DataFrame does like so:\n"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"items = {}\n",
"items['s1'] = 0\n",
"items['s2'] = 1\n",
"def get_support_series(frame, key):\n",
"    ind = items[key]\n",
"    # use the frame argument rather than the global `stacked`\n",
"    row = frame[ind]\n",
"    # box row\n",
"    return Support(row)\n",
"\n",
"s2_copy = get_support_series(stacked, 's2')\n",
"# values are same...\n",
"assert np.all(s2_copy == s2)\n",
"# but...\n",
"assert s2_copy is not s2\n",
"s2[0] = 888\n",
"assert s2_copy[0] != 888\n",
"# they are not the same!"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 124
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"So, to support subclasses, you'd have to keep track of each class type and the additional metadata that class requires. Anything in the `__dict__` would be lost unless that data was stored somewhere to be boxed later. And then there's supporting HDF5 and whatever other persistence layer assumes the data is really just rows and columns."
]
}
],
"metadata": {}
}
]
}
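
The last cell of the notebook says that supporting subclasses would mean tracking each column's class and its `__dict__` so the data can be re-boxed later. A minimal sketch of that bookkeeping in plain Python follows. It is a toy, not how pandas is implemented: `Plain`, `ToyFrame`, `support_method`, and the `label` attribute are all hypothetical names made up for this illustration, and a `list` stands in for the consolidated array.

```python
class Plain(list):
    """Stand-in for a plain Series: a boxed row of values."""


class Support(Plain):
    """Stand-in for the Series subclass from the notebook."""

    def support_method(self):
        return "I am support method 1"


class ToyFrame(object):
    """Hypothetical frame that consolidates rows and remembers how to re-box them."""

    def __init__(self, columns):
        self._index = {}  # column name -> row position
        self._rows = []   # consolidated data: copies, not the original objects
        self._boxes = {}  # column name -> (class, copy of the instance __dict__)
        for pos, (name, col) in enumerate(columns.items()):
            self._index[name] = pos
            self._rows.append(list(col))
            self._boxes[name] = (type(col), dict(vars(col)))

    def __getitem__(self, name):
        # Re-box the stored row with the class and attributes recorded earlier.
        cls, attrs = self._boxes[name]
        boxed = cls(self._rows[self._index[name]])
        vars(boxed).update(attrs)
        return boxed


s = Support(range(10))
s.label = "demo"
frame = ToyFrame({"s": s})
out = frame["s"]
print(type(out).__name__, out.label, out.support_method())
```

Even with this bookkeeping, `out` is a fresh object built from a copy of the data, so later changes to `s` do not show up in the frame. That is the same behavior the notebook demonstrates with `df.s`.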