@dalejung
Created July 4, 2013 00:28
lazydataframe 5924008 #notebook #trtools
{
"metadata": {
"name": "lazydataframe 5924008",
"notebook_path": "https://gist.github.com/5924008"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "code",
"collapsed": false,
"input": [
"import pandas as pd\n",
"import numpy as np\n",
"import pandas_composition.lazy"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = pd.DataFrame(np.random.randn(10000, 5))\n",
"lf = df.lazy()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 50
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<pre>\n",
"&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
"Int64Index: 10000 entries, 0 to 9999\n",
"Data columns (total 5 columns):\n",
"0 10000 non-null values\n",
"1 10000 non-null values\n",
"2 10000 non-null values\n",
"3 10000 non-null values\n",
"4 10000 non-null values\n",
"dtypes: float64(5)\n",
"</pre>"
],
"output_type": "pyout",
"prompt_number": 51,
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 10000 entries, 0 to 9999\n",
"Data columns (total 5 columns):\n",
"0 10000 non-null values\n",
"1 10000 non-null values\n",
"2 10000 non-null values\n",
"3 10000 non-null values\n",
"4 10000 non-null values\n",
"dtypes: float64(5)"
]
}
],
"prompt_number": 51
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Deferred Operations\n",
"===================\n",
"Basic math operators such as add, sub, div, mul, and pow are deferred. You can see how the expressions are built by trying some out. Note that `_pobjN` is used as a placeholder for non-scalar values."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lf"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"LazyFrame: \n",
"_pobj1"
],
"output_type": "pyout",
"prompt_number": 52,
"text": [
"LazyFrame: \n",
"_pobj1"
]
}
],
"prompt_number": 52
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"lf + 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"LazyFrame: \n",
"(_pobj1 + 1)"
],
"output_type": "pyout",
"prompt_number": 53,
"text": [
"LazyFrame: \n",
"(_pobj1 + 1)"
]
}
],
"prompt_number": 53
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"(lf + 1) / lf - 1"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"LazyFrame: \n",
"(((_pobj1 + 1) / _pobj2) - 1)"
],
"output_type": "pyout",
"prompt_number": 58,
"text": [
"LazyFrame: \n",
"(((_pobj1 + 1) / _pobj2) - 1)"
]
}
],
"prompt_number": 58
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Ordering\n",
"========\n",
"LazyFrame uses regular Python evaluation, so it follows Python's order of operations."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"expr = (lf + 3) ** 2 * 2\n",
"expr"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"LazyFrame: \n",
"(((_pobj1 + 3) ** 2) * 2)"
],
"output_type": "pyout",
"prompt_number": 33,
"text": [
"LazyFrame: \n",
"(((_pobj1 + 3) ** 2) * 2)"
]
}
],
"prompt_number": 33
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"eval\n",
"====\n",
"`eval` evaluates the deferred expression by running it through numexpr.\n",
"\n",
"It takes an `inplace` parameter, which defaults to `False`."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"expr.eval()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<pre>\n",
"&lt;class 'pandas.core.frame.DataFrame'&gt;\n",
"Int64Index: 10000000 entries, 0 to 9999999\n",
"Columns: 5 entries, 0 to 4\n",
"dtypes: float64(5)\n",
"</pre>"
],
"output_type": "pyout",
"prompt_number": 49,
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 10000000 entries, 0 to 9999999\n",
"Columns: 5 entries, 0 to 4\n",
"dtypes: float64(5)"
]
}
],
"prompt_number": 49
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If an operation requires the `LazyFrame` to be evaluated and cannot be deferred, `LazyFrame` evaluates itself automatically. This is normally triggered by an attribute access that it does not know how to defer. Examples are:\n",
"\n",
"* `__array__`\n",
"* `columns`\n",
"* `values`"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"expr = lf * 2\n",
"assert not expr.evaled  # still deferred\n",
"res = expr > 0  # the comparison is not deferred, so it forces evaluation\n",
"assert expr.evaled\n",
"res.head()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"html": [
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" <th>3</th>\n",
" <th>4</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td> False</td>\n",
" <td> False</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" <td> False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td> False</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td> False</td>\n",
" <td> True</td>\n",
" <td> False</td>\n",
" <td> False</td>\n",
" <td> False</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td> False</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" <td> True</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td> True</td>\n",
" <td> False</td>\n",
" <td> False</td>\n",
" <td> False</td>\n",
" <td> False</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"output_type": "pyout",
"prompt_number": 48,
"text": [
" 0 1 2 3 4\n",
"0 False False True True False\n",
"1 False True True True True\n",
"2 False True False False False\n",
"3 False True True True True\n",
"4 True False False False False"
]
}
],
"prompt_number": 48
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Performance\n",
"===========\n",
"\n",
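"Because of lazy evaluation and numexpr, certain types of operations run faster when deferred. In the benchmark below, summing five copies of a 10,000,000 x 5 frame takes 969 ms eagerly and 268 ms when deferred.\n"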
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"df = pd.DataFrame(np.random.randn(10000000, 5))\n",
"lf = df.lazy()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 43
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit df + df + df + df + df"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1 loops, best of 3: 969 ms per loop\n"
]
}
],
"prompt_number": 44
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"%timeit (lf + lf + lf + lf + lf).eval()"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1 loops, best of 3: 268 ms per loop\n"
]
}
],
"prompt_number": 45
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"correct = df + df + df + df + df\n",
"test = (lf + lf + lf + lf + lf).eval()"
],
"language": "python",
"metadata": {},
"outputs": [],
"prompt_number": 46
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pd.util.testing.assert_almost_equal(correct, test)"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "pyout",
"prompt_number": 47,
"text": [
"True"
]
}
],
"prompt_number": 47
}
],
"metadata": {}
}
]
}
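
The deferred-operation behaviour shown in the notebook can be illustrated with a small, self-contained sketch. This is not the `pandas_composition.lazy` implementation: the `Lazy` class, its `wrap` constructor, and the `expression` property are hypothetical names invented here, and the sketch evaluates with Python's built-in `eval` rather than numexpr. It only shows the general idea of recording operators as text with one `_pobjN` placeholder per operand occurrence and computing the result on demand.

```python
import numbers


class Lazy:
    """Toy deferred expression (hypothetical, for illustration only)."""

    def __init__(self, template, operands):
        self.template = template        # expression text with "{}" operand slots
        self.operands = list(operands)  # wrapped objects, one per slot

    @classmethod
    def wrap(cls, obj):
        # A freshly wrapped object is an expression with a single slot.
        return cls("{}", [obj])

    def _binary(self, op, other, reflected=False):
        if isinstance(other, Lazy):
            o_tmpl, o_ops = other.template, other.operands
        elif isinstance(other, numbers.Number):
            o_tmpl, o_ops = repr(other), []   # scalars are inlined as text
        else:
            o_tmpl, o_ops = "{}", [other]     # anything else gets a placeholder
        if reflected:
            left, right, ops = o_tmpl, self.template, o_ops + self.operands
        else:
            left, right, ops = self.template, o_tmpl, self.operands + o_ops
        return Lazy("(%s %s %s)" % (left, op, right), ops)

    def __add__(self, other): return self._binary("+", other)
    def __radd__(self, other): return self._binary("+", other, True)
    def __sub__(self, other): return self._binary("-", other)
    def __rsub__(self, other): return self._binary("-", other, True)
    def __mul__(self, other): return self._binary("*", other)
    def __rmul__(self, other): return self._binary("*", other, True)
    def __truediv__(self, other): return self._binary("/", other)
    def __rtruediv__(self, other): return self._binary("/", other, True)
    def __pow__(self, other): return self._binary("**", other)
    def __rpow__(self, other): return self._binary("**", other, True)

    def _names(self):
        return ["_pobj%d" % i for i in range(1, len(self.operands) + 1)]

    @property
    def expression(self):
        # Number the placeholders in order of appearance.
        return self.template.format(*self._names())

    def eval(self):
        # Bind each placeholder to its operand and evaluate the recorded text.
        namespace = dict(zip(self._names(), self.operands))
        return eval(self.expression, {"__builtins__": {}}, namespace)

    def __repr__(self):
        return "Lazy: \n" + self.expression


lf = Lazy.wrap(4.0)
expr = (lf + 1) / lf - 1
print(expr.expression)  # (((_pobj1 + 1) / _pobj2) - 1)
print(expr.eval())      # 0.25
```

Because the expression is built by Python's own operator dispatch, operator precedence and reflected operators such as `2 ** lf` come for free; a real implementation would hand the recorded expression and operand arrays to numexpr instead of calling `eval`.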