dalejung/Untitled0.ipynb

## Untitled0.ipynb
{
 "metadata": {
  "name": "lazydataframe   5924008",
  "notebook_path": "https://gist.github.com/5924008"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import pandas as pd\n",
      "import numpy as np\n",
      "import pandas_composition.lazy"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 1
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.DataFrame(np.random.randn(10000, 5))\n",
      "lf = df.lazy()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 50
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<pre>\n",
        "&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
        "Int64Index: 10000 entries, 0 to 9999\n",
        "Data columns (total 5 columns):\n",
        "0    10000  non-null values\n",
        "1    10000  non-null values\n",
        "2    10000  non-null values\n",
        "3    10000  non-null values\n",
        "4    10000  non-null values\n",
        "dtypes: float64(5)\n",
        "</pre>"
       ],
       "output_type": "pyout",
       "prompt_number": 51,
       "text": [
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 10000 entries, 0 to 9999\n",
        "Data columns (total 5 columns):\n",
        "0    10000  non-null values\n",
        "1    10000  non-null values\n",
        "2    10000  non-null values\n",
        "3    10000  non-null values\n",
        "4    10000  non-null values\n",
        "dtypes: float64(5)"
       ]
      }
     ],
     "prompt_number": 51
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Deferred Operations\n",
      "===================\n",
      "Basic math operators such as add, sub, div, mul, and pow are deferred. You can see how the expressions are built by trying some out. Note that `_pobjN` is used as a placeholder for non-scalar values."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lf"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "LazyFrame: \n",
        "_pobj1"
       ],
       "output_type": "pyout",
       "prompt_number": 52,
       "text": [
        "LazyFrame: \n",
        "_pobj1"
       ]
      }
     ],
     "prompt_number": 52
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "lf + 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "LazyFrame: \n",
        "(_pobj1 + 1)"
       ],
       "output_type": "pyout",
       "prompt_number": 53,
       "text": [
        "LazyFrame: \n",
        "(_pobj1 + 1)"
       ]
      }
     ],
     "prompt_number": 53
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "(lf + 1) / lf - 1"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "LazyFrame: \n",
        "(((_pobj1 + 1) / _pobj2) - 1)"
       ],
       "output_type": "pyout",
       "prompt_number": 58,
       "text": [
        "LazyFrame: \n",
        "(((_pobj1 + 1) / _pobj2) - 1)"
       ]
      }
     ],
     "prompt_number": 58
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Ordering\n",
      "========\n",
      "LazyFrame builds its expressions through regular Python evaluation, so it follows Python's order of operations."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "expr = (lf + 3) ** 2 * 2\n",
      "expr"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "LazyFrame: \n",
        "(((_pobj1 + 3) ** 2) * 2)"
       ],
       "output_type": "pyout",
       "prompt_number": 33,
       "text": [
        "LazyFrame: \n",
        "(((_pobj1 + 3) ** 2) * 2)"
       ]
      }
     ],
     "prompt_number": 33
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "eval\n",
      "====\n",
      "`eval` evaluates the deferred expression by running it through numexpr.\n",
      "\n",
      "It takes an `inplace` parameter, which defaults to `False`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "expr.eval()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<pre>\n",
        "&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
        "Int64Index: 10000000 entries, 0 to 9999999\n",
        "Columns: 5 entries, 0 to 4\n",
        "dtypes: float64(5)\n",
        "</pre>"
       ],
       "output_type": "pyout",
       "prompt_number": 49,
       "text": [
        "<class 'pandas.core.frame.DataFrame'>\n",
        "Int64Index: 10000000 entries, 0 to 9999999\n",
        "Columns: 5 entries, 0 to 4\n",
        "dtypes: float64(5)"
       ]
      }
     ],
     "prompt_number": 49
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "If an operation requires the `LazyFrame` to be evaluated and cannot be deferred, `LazyFrame` evaluates itself automatically. Normally this is triggered by an attribute access that we don't know how to defer. Examples are:\n",
      "\n",
      "* `__array__`\n",
      "* `columns`\n",
      "* `values`"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "expr = lf * 2\n",
      "assert not expr.evaled\n",
      "res = expr > 0\n",
      "assert expr.evaled\n",
      "res.head()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "html": [
        "<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
        "<table border=\"1\" class=\"dataframe\">\n",
        "  <thead>\n",
        "    <tr style=\"text-align: right;\">\n",
        "      <th></th>\n",
        "      <th>0</th>\n",
        "      <th>1</th>\n",
        "      <th>2</th>\n",
        "      <th>3</th>\n",
        "      <th>4</th>\n",
        "    </tr>\n",
        "  </thead>\n",
        "  <tbody>\n",
        "    <tr>\n",
        "      <th>0</th>\n",
        "      <td> False</td>\n",
        "      <td> False</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "      <td> False</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>1</th>\n",
        "      <td> False</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>2</th>\n",
        "      <td> False</td>\n",
        "      <td>  True</td>\n",
        "      <td> False</td>\n",
        "      <td> False</td>\n",
        "      <td> False</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>3</th>\n",
        "      <td> False</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "      <td>  True</td>\n",
        "    </tr>\n",
        "    <tr>\n",
        "      <th>4</th>\n",
        "      <td>  True</td>\n",
        "      <td> False</td>\n",
        "      <td> False</td>\n",
        "      <td> False</td>\n",
        "      <td> False</td>\n",
        "    </tr>\n",
        "  </tbody>\n",
        "</table>\n",
        "</div>"
       ],
       "output_type": "pyout",
       "prompt_number": 48,
       "text": [
        "       0      1      2      3      4\n",
        "0  False  False   True   True  False\n",
        "1  False   True   True   True   True\n",
        "2  False   True  False  False  False\n",
        "3  False   True   True   True   True\n",
        "4   True  False  False  False  False"
       ]
      }
     ],
     "prompt_number": 48
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "Performance\n",
      "===========\n",
      "\n",
      "Because of lazy evaluation and numexpr, certain types of operations run faster when deferred.\n"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "df = pd.DataFrame(np.random.randn(10000000, 5))\n",
      "lf = df.lazy()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 43
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit df + df + df + df + df"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 loops, best of 3: 969 ms per loop\n"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "%timeit (lf + lf + lf + lf + lf).eval()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "stream",
       "stream": "stdout",
       "text": [
        "1 loops, best of 3: 268 ms per loop\n"
       ]
      }
     ],
     "prompt_number": 45
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "correct = df + df + df + df + df\n",
      "test = (lf + lf + lf + lf + lf).eval()"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 46
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "pd.util.testing.assert_almost_equal(correct, test)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "output_type": "pyout",
       "prompt_number": 47,
       "text": [
        "True"
       ]
      }
     ],
     "prompt_number": 47
    }
   ],
   "metadata": {}
  }
 ]
}
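
The notebook's deferred operations can be illustrated with a small, self-contained sketch. This is not the `pandas_composition` implementation: the `Lazy` class and its `expression` method are hypothetical names invented here, plain Python numbers stand in for DataFrames, and evaluation uses the `operator` module rather than numexpr. It only shows how operator overloading can record an expression, number each non-scalar leaf as `_pobjN`, and compute the result on request.

```python
import itertools
import operator

# Maps each recorded operator symbol to the function that applies it.
_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
    "**": operator.pow,
}


class Lazy:
    """Hypothetical toy node: either a leaf holding a value or a recorded binary op."""

    def __init__(self, value=None, op=None, left=None, right=None):
        self.value, self.op, self.left, self.right = value, op, left, right

    def _binary(self, op, other, reflected=False):
        # Reflected operators (e.g. 2 ** lazy) keep the scalar on the left.
        left, right = (other, self) if reflected else (self, other)
        return Lazy(op=op, left=left, right=right)

    def __add__(self, other): return self._binary("+", other)
    def __radd__(self, other): return self._binary("+", other, reflected=True)
    def __sub__(self, other): return self._binary("-", other)
    def __rsub__(self, other): return self._binary("-", other, reflected=True)
    def __mul__(self, other): return self._binary("*", other)
    def __rmul__(self, other): return self._binary("*", other, reflected=True)
    def __truediv__(self, other): return self._binary("/", other)
    def __rtruediv__(self, other): return self._binary("/", other, reflected=True)
    def __pow__(self, other): return self._binary("**", other)
    def __rpow__(self, other): return self._binary("**", other, reflected=True)

    def expression(self):
        """Render the recorded tree, numbering each leaf occurrence as _pobjN."""
        counter = itertools.count(1)

        def render(node):
            if not isinstance(node, Lazy):
                return repr(node)  # scalars are inlined as-is
            if node.op is None:
                return "_pobj%d" % next(counter)
            return "(%s %s %s)" % (render(node.left), node.op, render(node.right))

        return render(self)

    def eval(self):
        """Walk the tree and apply the recorded operators."""
        def run(node):
            if not isinstance(node, Lazy):
                return node
            if node.op is None:
                return node.value
            return _OPS[node.op](run(node.left), run(node.right))

        return run(self)


lf = Lazy(value=4.0)
expr = (lf + 1) / lf - 1
print(expr.expression())  # (((_pobj1 + 1) / _pobj2) - 1)
print(expr.eval())        # 0.25
```

Because the tree is built by Python's own operator dispatch, precedence and parentheses are already resolved by the time a node is recorded. In this sketch `2 ** (lf + 3)` goes through `__rpow__`, so 2 stays the base.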