Created
July 4, 2013 00:28
lazydataframe 5924008 #notebook #trtools
{ | |
"metadata": { | |
"name": "lazydataframe 5924008", | |
"notebook_path": "https://gist.github.com/5924008" | |
}, | |
"nbformat": 3, | |
"nbformat_minor": 0, | |
"worksheets": [ | |
{ | |
"cells": [ | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"import pandas as pd\n", | |
"import numpy as np\n", | |
"import pandas_composition.lazy" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 1 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"df = pd.DataFrame(np.random.randn(10000, 5))\n", | |
"lf = df.lazy()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 50 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"df" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"<pre>\n", | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"Int64Index: 10000 entries, 0 to 9999\n", | |
"Data columns (total 5 columns):\n", | |
"0 10000 non-null values\n", | |
"1 10000 non-null values\n", | |
"2 10000 non-null values\n", | |
"3 10000 non-null values\n", | |
"4 10000 non-null values\n", | |
"dtypes: float64(5)\n", | |
"</pre>" | |
], | |
"output_type": "pyout", | |
"prompt_number": 51, | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"Int64Index: 10000 entries, 0 to 9999\n", | |
"Data columns (total 5 columns):\n", | |
"0 10000 non-null values\n", | |
"1 10000 non-null values\n", | |
"2 10000 non-null values\n", | |
"3 10000 non-null values\n", | |
"4 10000 non-null values\n", | |
"dtypes: float64(5)" | |
] | |
} | |
], | |
"prompt_number": 51 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Deferred Operations\n", | |
"===================\n", | |
"Basic math operators such as add, sub, div, mul, and pow are deferred. You can see how the expressions are built by trying a few out. Note that `_pobjN` is used as a placeholder for non-scalar values."
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"lf" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"LazyFrame: \n", | |
"_pobj1" | |
], | |
"output_type": "pyout", | |
"prompt_number": 52, | |
"text": [ | |
"LazyFrame: \n", | |
"_pobj1" | |
] | |
} | |
], | |
"prompt_number": 52 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"lf + 1" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"LazyFrame: \n", | |
"(_pobj1 + 1)" | |
], | |
"output_type": "pyout", | |
"prompt_number": 53, | |
"text": [ | |
"LazyFrame: \n", | |
"(_pobj1 + 1)" | |
] | |
} | |
], | |
"prompt_number": 53 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"(lf + 1) / lf - 1" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"LazyFrame: \n", | |
"(((_pobj1 + 1) / _pobj2) - 1)" | |
], | |
"output_type": "pyout", | |
"prompt_number": 58, | |
"text": [ | |
"LazyFrame: \n", | |
"(((_pobj1 + 1) / _pobj2) - 1)" | |
] | |
} | |
], | |
"prompt_number": 58 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Ordering\n", | |
"========\n", | |
"LazyFrame uses regular Python evaluation, so it follows Python's order of operations."
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"expr = 2 ** (lf + 3) * 2\n", | |
"expr" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"LazyFrame: \n", | |
"(((_pobj1 + 3) ** 2) * 2)" | |
], | |
"output_type": "pyout", | |
"prompt_number": 33, | |
"text": [ | |
"LazyFrame: \n", | |
"(((_pobj1 + 3) ** 2) * 2)" | |
] | |
} | |
], | |
"prompt_number": 33 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"eval\n", | |
"====\n", | |
"`eval` evaluates the deferred expression by running it through numexpr.\n",
"\n",
"It takes an `inplace` parameter, which defaults to `False`."
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"expr.eval()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"<pre>\n", | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"Int64Index: 10000000 entries, 0 to 9999999\n", | |
"Columns: 5 entries, 0 to 4\n", | |
"dtypes: float64(5)\n", | |
"</pre>" | |
], | |
"output_type": "pyout", | |
"prompt_number": 49, | |
"text": [ | |
"<class 'pandas.core.frame.DataFrame'>\n", | |
"Int64Index: 10000000 entries, 0 to 9999999\n", | |
"Columns: 5 entries, 0 to 4\n", | |
"dtypes: float64(5)" | |
] | |
} | |
], | |
"prompt_number": 49 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"If something requires the `LazyFrame` to be evaluated and cannot be deferred, LazyFrame will evaluate itself automatically. Normally this is triggered by an attribute access that it doesn't know how to defer. Examples are:\n",
"\n", | |
"* `__array__`\n", | |
"* `columns`\n", | |
"* `values`" | |
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"expr = lf * 2\n", | |
"assert not expr.evaled \n", | |
"res = expr > 0\n", | |
"assert expr.evaled\n", | |
"res.head()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"html": [ | |
"<div style=\"max-height:1000px;max-width:1500px;overflow:auto;\">\n", | |
"<table border=\"1\" class=\"dataframe\">\n", | |
" <thead>\n", | |
" <tr style=\"text-align: right;\">\n", | |
" <th></th>\n", | |
" <th>0</th>\n", | |
" <th>1</th>\n", | |
" <th>2</th>\n", | |
" <th>3</th>\n", | |
" <th>4</th>\n", | |
" </tr>\n", | |
" </thead>\n", | |
" <tbody>\n", | |
" <tr>\n", | |
" <th>0</th>\n", | |
" <td> False</td>\n", | |
" <td> False</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" <td> False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>1</th>\n", | |
" <td> False</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>2</th>\n", | |
" <td> False</td>\n", | |
" <td> True</td>\n", | |
" <td> False</td>\n", | |
" <td> False</td>\n", | |
" <td> False</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>3</th>\n", | |
" <td> False</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" <td> True</td>\n", | |
" </tr>\n", | |
" <tr>\n", | |
" <th>4</th>\n", | |
" <td> True</td>\n", | |
" <td> False</td>\n", | |
" <td> False</td>\n", | |
" <td> False</td>\n", | |
" <td> False</td>\n", | |
" </tr>\n", | |
" </tbody>\n", | |
"</table>\n", | |
"</div>" | |
], | |
"output_type": "pyout", | |
"prompt_number": 48, | |
"text": [ | |
" 0 1 2 3 4\n", | |
"0 False False True True False\n", | |
"1 False True True True True\n", | |
"2 False True False False False\n", | |
"3 False True True True True\n", | |
"4 True False False False False" | |
] | |
} | |
], | |
"prompt_number": 48 | |
}, | |
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"Performance\n", | |
"===========\n", | |
"\n", | |
"Because of lazy evaluation and numexpr, certain types of operations run faster when deferred.\n"
] | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"df = pd.DataFrame(np.random.randn(10000000, 5))\n", | |
"lf = df.lazy()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 43 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"%timeit df + df + df + df + df"
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"1 loops, best of 3: 969 ms per loop\n" | |
] | |
} | |
], | |
"prompt_number": 44 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"%timeit (lf + lf + lf + lf + lf).eval()" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "stream", | |
"stream": "stdout", | |
"text": [ | |
"1 loops, best of 3: 268 ms per loop\n" | |
] | |
} | |
], | |
"prompt_number": 45 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"correct = df + df + df + df + df\n", | |
"test = (lf + lf + lf + lf + lf).eval()"
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [], | |
"prompt_number": 46 | |
}, | |
{ | |
"cell_type": "code", | |
"collapsed": false, | |
"input": [ | |
"pd.util.testing.assert_almost_equal(correct, test)" | |
], | |
"language": "python", | |
"metadata": {}, | |
"outputs": [ | |
{ | |
"output_type": "pyout", | |
"prompt_number": 47, | |
"text": [ | |
"True" | |
] | |
} | |
], | |
"prompt_number": 47 | |
} | |
], | |
"metadata": {} | |
} | |
] | |
} |
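The "Deferred Operations" cells in the notebook show expressions like `(((_pobj1 + 1) / _pobj2) - 1)` being built up instead of computed. Here is a minimal sketch of how that kind of deferral can work using plain operator overloading. It is not the `pandas_composition.lazy` implementation: the `Lazy` class, its `render` method, and the `OPS` table are hypothetical names made up for illustration, and evaluation here walks the tree with the `operator` module rather than handing a string to numexpr.

```python
import operator

# Hypothetical operator table for this sketch (the real library uses numexpr).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul,
       "/": operator.truediv, "**": operator.pow}


class Lazy:
    """Toy deferred expression: a leaf wraps a value, a node records op(left, right)."""

    def __init__(self, value=None, op=None, left=None, right=None):
        self.value, self.op, self.left, self.right = value, op, left, right

    def render(self, counter=None):
        """Build the expression string, numbering each wrapped object as _pobjN."""
        counter = counter if counter is not None else [0]
        if self.op is None:  # leaf: a non-scalar operand gets a placeholder
            counter[0] += 1
            return f"_pobj{counter[0]}"
        parts = [x.render(counter) if isinstance(x, Lazy) else repr(x)
                 for x in (self.left, self.right)]
        return f"({parts[0]} {self.op} {parts[1]})"

    def eval(self):
        """Only now is anything computed, by walking the recorded tree."""
        if self.op is None:
            return self.value
        left, right = (x.eval() if isinstance(x, Lazy) else x
                       for x in (self.left, self.right))
        return OPS[self.op](left, right)


# Attach __add__/__radd__ and friends; each one records a node instead of computing.
for name, sym in [("add", "+"), ("sub", "-"), ("mul", "*"),
                  ("truediv", "/"), ("pow", "**")]:
    setattr(Lazy, f"__{name}__",
            lambda self, other, sym=sym: Lazy(op=sym, left=self, right=other))
    setattr(Lazy, f"__r{name}__",
            lambda self, other, sym=sym: Lazy(op=sym, left=other, right=self))


lf = Lazy(value=3.0)           # a float stands in for the DataFrame
expr = (lf + 1) / lf - 1
print(expr.render())           # (((_pobj1 + 1) / _pobj2) - 1)
print(expr.eval())
```

Because Python itself decides which dunder method gets called and in what order, the recorded tree follows Python's operator precedence without any extra parsing. Note that each appearance of `lf` gets its own placeholder number, the same way the notebook output shows `_pobj1` and `_pobj2` for a single frame.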