dalejung/modifying series frames panels.ipynb

## subclassing pandas objects.ipynb
{
 "metadata": {
  "name": "subclassing pandas objects.ipynb",
  "notebook_path": "https://gist.github.com/3366583/subclassing pandas objects.ipynb"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# http://stackoverflow.com/questions/11979194/subclasses-of-pandas-object-work-differently-from-subclass-of-other-object\n",
      "import pandas as pd\n",
      "\n",
      "class Support(pd.Series):\n",
      "    def __new__(cls, *args, **kwargs):\n",
      "        arr = pd.Series.__new__(cls, *args, **kwargs)\n",
      "        return arr.view(Support)\n",
      "    \n",
      "    def supportMethod1(self):\n",
      "        print('I am support method 1')\n",
      "    def supportMethod2(self):\n",
      "        print('I am support method 2')\n",
      "\n",
      "class Compute(object):\n",
      "    supp=None        \n",
      "    def test(self):\n",
      "        self.supp()  \n",
      "\n",
      "class Config(object):\n",
      "    supp=None        \n",
      "    @classmethod\n",
      "    def initializeConfig(cls):\n",
      "        cls.supp=Support()\n",
      "    @classmethod\n",
      "    def setConfig1(cls):\n",
      "        Compute.supp=cls.supp.supportMethod1\n",
      "    @classmethod\n",
      "    def setConfig2(cls):\n",
      "        Compute.supp=cls.supp.supportMethod2            \n",
      "\n",
      "    "
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "# adding the __new__ works for this simple demo\n",
      "s = Support(range(10))\n",
      "assert s.supportMethod1() == Support.supportMethod1(s)\n",
      "assert isinstance(s, Support)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "I am support method 1\n",
        "I am support method 1\n"
       ]
      }
     ],
     "prompt_number": 46
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "The problem is that pandas boxes and unboxes Series data in many places. The result comes back as a plain Series rather than the subclass, which limits how useful a Series subclass can be.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "assert not isinstance(s.cumsum(), Support)\n",
      "assert not isinstance(s.ix[:5], Support)\n"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 47
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Also, when you add a Series to a DataFrame, you're not really adding the Series to the frame, but adding the *data*. The DataFrame holds the data and boxes it as a Series when you access it. \n",
      "\n",
      "Whatever speed you're getting from pandas/numpy has a lot to do with how the data is stored. A DataFrame is not a collection of pointers to Series. Its data needs to be consolidated, so each Series just becomes a row in a bigger data set, losing its Series-likeness until it's reboxed."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "s = Support(range(10))\n",
      "df = pd.DataFrame({'s': s})\n",
      "# df.s is neither a Support nor the original Series. It's a fresh box around the data.\n",
      "assert not isinstance(df.s, Support)\n",
      "assert id(s) != id(df.s)\n",
      "s.ix[0] = 888\n",
      "# does not change df\n",
      "df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>s</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <td><strong>0</strong></td>\n",
        "      <td> 0</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>1</strong></td>\n",
        "      <td> 1</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>2</strong></td>\n",
        "      <td> 2</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>3</strong></td>\n",
        "      <td> 3</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>4</strong></td>\n",
        "      <td> 4</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>5</strong></td>\n",
        "      <td> 5</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>6</strong></td>\n",
        "      <td> 6</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>7</strong></td>\n",
        "      <td> 7</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>8</strong></td>\n",
        "      <td> 8</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <td><strong>9</strong></td>\n",
        "      <td> 9</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "output_type": "pyout",
       "prompt_number": 103,
       "text": [
        "   s\n",
        "0  0\n",
        "1  1\n",
        "2  2\n",
        "3  3\n",
        "4  4\n",
        "5  5\n",
        "6  6\n",
        "7  7\n",
        "8  8\n",
        "9  9"
       ]
      }
     ],
     "prompt_number": 103
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "When you're working with Frames, your data is being consolidated like so:"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import numpy as np\n",
      "\n",
      "s1 = Support(range(10))\n",
      "s2 = Support(range(10, 20))\n",
      "\n",
      "stacked = np.vstack((s1, s2))\n",
      "\n",
      "# modifying s1 afterwards does not touch the stacked copy\n",
      "s1.ix[0] = 10\n",
      "assert stacked[0][0] != s1[0]\n",
      "\n",
      "stacked"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 123,
       "text": [
        "array([[ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9],\n",
        "       [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]])"
       ]
      }
     ],
     "prompt_number": 123
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "After the consolidation, all you really know is that `stacked` is two rows of 10 columns. You could roughly simulate what a DataFrame does like so:\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "items = {}\n",
      "items['s1'] = 0\n",
      "items['s2'] = 1\n",
      "def get_support_series(frame, key):\n",
      "    ind = items[key]\n",
      "    row = frame[ind]\n",
      "    # box the raw row as a new Support\n",
      "    return Support(row)\n",
      "\n",
      "s2_copy = get_support_series(stacked, 's2')\n",
      "# values are same...\n",
      "assert np.all(s2_copy == s2)\n",
      "# but...\n",
      "assert s2_copy is not s2\n",
      "s2[0] = 888\n",
      "assert s2_copy[0] != 888\n",
      "# they are not the same!"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 124
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "So, to support subclasses, you'd have to keep track of each class type and the additional metadata that class requires. Anything in the `__dict__` would be lost unless that data was stored somewhere to be boxed later. And then there's supporting HDF5 and whatever other persistence layer assumes the data is really just rows and columns."
     ]
    }
   ],
   "metadata": {}
  }
 ]
}
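
The consolidate-then-rebox behaviour described in the notebook can be sketched without pandas at all. What follows is a toy model in plain Python, not how pandas is implemented: `Box`, `SupportBox`, `ToyFrame`, and `TrackingFrame` are hypothetical names made up for this illustration. `ToyFrame` copies raw values into one block and boxes them on access, so the subclass and its `__dict__` are lost. `TrackingFrame` shows the extra bookkeeping the last cell says would be needed to get them back.

```python
# Toy model only: all names here are hypothetical, none are pandas APIs.

class Box(list):
    """Plain container standing in for Series."""


class SupportBox(Box):
    """Subclass with extra state in __dict__, standing in for Support."""

    def __init__(self, values, label=None):
        super().__init__(values)
        self.label = label


class ToyFrame:
    """Keeps raw values only and re-boxes them on every access."""

    def __init__(self, columns):
        self._names = list(columns)
        # Consolidation: copy the values out of each container into one block.
        self._block = [list(columns[name]) for name in self._names]

    def __getitem__(self, name):
        row = self._block[self._names.index(name)]
        # Boxing: a fresh base-class container is built from the raw values.
        return Box(row)


class TrackingFrame(ToyFrame):
    """Variant that remembers each column's class and __dict__ for re-boxing."""

    def __init__(self, columns):
        super().__init__(columns)
        self._kinds = {n: type(columns[n]) for n in self._names}
        self._meta = {n: dict(vars(columns[n])) for n in self._names}

    def __getitem__(self, name):
        row = self._block[self._names.index(name)]
        boxed = self._kinds[name](row)         # rebuild the original class
        vars(boxed).update(self._meta[name])   # restore the saved __dict__
        return boxed


s = SupportBox(range(3), label="important")
frame = ToyFrame({"s": s})
tracked = TrackingFrame({"s": s})

col = frame["s"]
print(type(col).__name__)      # Box
print(hasattr(col, "label"))   # False

s[0] = 888                     # the frames hold copies, so they do not change
print(frame["s"][0])           # 0

again = tracked["s"]
print(type(again).__name__)    # SupportBox
print(again.label)             # important
print(again is s)              # False
```

Even with the tracking variant, what comes back is a rebuilt copy, never the original object, which is the same identity problem the `s2_copy` cell demonstrates.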