Skip to content

Instantly share code, notes, and snippets.

@lgautier
Last active January 18, 2016 21:00
Show Gist options
  • Save lgautier/a76839ba9914863c368d to your computer and use it in GitHub Desktop.
Save lgautier/a76839ba9914863c368d to your computer and use it in GitHub Desktop.
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# dplyr in Python\n",
"Note: a more recent version of this is in the rpy2 documentation.\n",
"\n",
"We need 2 things for this:\n",
"\n",
"1- A data frame (using one of R's demo datasets)."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rpy2.robjects.packages import importr, data\n",
"datasets = importr('datasets')\n",
"mtcars_env = data(datasets).fetch('mtcars')\n",
"mtcars = mtcars_env['mtcars']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"2- dplyr"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"from rpy2.robjects.lib.dplyr import DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"With this we have the choice of chaining (D3-style)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Source: local data frame [2 x 2]\n",
"\n",
" gear mean_ptw\n",
"1 4 1237.127\n",
"2 5 2574.033\n",
"\n"
]
}
],
"source": [
"dataf = (DataFrame(mtcars).\n",
" filter('gear>3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
"\n",
"print(dataf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"or piping (magrittr style)."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Source: local data frame [2 x 2]\n",
"\n",
" gear mean_ptw\n",
"1 4 1237.127\n",
"2 5 2574.033\n",
"\n"
]
}
],
"source": [
"from rpy2.robjects.lib.dplyr import (filter,\n",
" mutate,\n",
" group_by,\n",
" summarize)\n",
"\n",
"dataf = (DataFrame(mtcars) >>\n",
" filter('gear>3') >>\n",
" mutate(powertoweight='hp*36/wt') >>\n",
" group_by('gear') >>\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
"\n",
"print(dataf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The strings passed to the dplyr function are evaluated as expression,\n",
"just like this is happening when using dplyr in R. This means that\n",
"when writing `mean(powertoweight)` the R function `mean()` is used.\n",
"\n",
"Using an Python function is not too difficult though. We can just\n",
"call Python back from R:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Source: local data frame [2 x 3]\n",
"\n",
" gear mean_np_ptw mean_ptw\n",
"1 4 1237.127 1237.127\n",
"2 5 2574.033 2574.033\n",
"\n"
]
}
],
"source": [
"from rpy2.rinterface import rternalize\n",
"@rternalize\n",
"def mean_np(x):\n",
" import numpy\n",
" return numpy.mean(x)\n",
"\n",
"from rpy2.robjects import globalenv\n",
"globalenv['mean_np'] = mean_np\n",
"\n",
"dataf = (DataFrame(mtcars) >>\n",
" filter('gear>3') >>\n",
" mutate(powertoweight='hp*36/wt') >>\n",
" group_by('gear') >>\n",
" summarize(mean_ptw='mean(powertoweight)',\n",
" mean_np_ptw='mean_np(powertoweight)'))\n",
"\n",
"print(dataf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"It is also possible to carry this out without having to\n",
"place the custom function in R's global environment."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {
"collapsed": true
},
"outputs": [],
"source": [
"del(globalenv['mean_np'])"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Source: local data frame [2 x 3]\n",
"\n",
" gear mean_np_ptw mean_ptw\n",
"1 4 1237.127 1237.127\n",
"2 5 2574.033 2574.033\n",
"\n"
]
}
],
"source": [
"from rpy2.robjects.lib.dplyr import StringInEnv\n",
"from rpy2.robjects import Environment\n",
"my_env = Environment()\n",
"my_env['mean_np'] = mean_np\n",
"\n",
"dataf = (DataFrame(mtcars) >>\n",
" filter('gear>3') >>\n",
" mutate(powertoweight='hp*36/wt') >>\n",
" group_by('gear') >>\n",
" summarize(mean_ptw='mean(powertoweight)',\n",
" mean_np_ptw=StringInEnv('mean_np(powertoweight)',\n",
" my_env)))\n",
"\n",
"print(dataf)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**note**: rpy2's interface to dplyr is implementing a fix to the (non-?)issue 1323\n",
"(https://github.com/hadley/dplyr/issues/1323)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The seamless translation of transformations to SQL whenever the\n",
"data are in a table can be used directly. Since we are lifting\n",
"the original implementation of `dplyr`, it *just works*."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Source: sqlite 3.8.6 [/tmp/tmpwd488fg1]\n",
"From: <derived table> [?? x 2]\n",
"\n",
" gear mean_ptw\n",
"1 4 1237.127\n",
"2 5 2574.033\n",
".. ... ...\n",
"\n"
]
}
],
"source": [
"from rpy2.robjects.lib.dplyr import dplyr\n",
"# in-memory SQLite database broken in dplyr's src_sqlite\n",
"# db = dplyr.src_sqlite(\":memory:\")\n",
"import tempfile\n",
"with tempfile.NamedTemporaryFile() as db_fh:\n",
" db = dplyr.src_sqlite(db_fh.name)\n",
" # copy the table to that database\n",
" dataf_db = DataFrame(mtcars).copy_to(db, name=\"mtcars\")\n",
" res = (dataf_db >>\n",
" filter('gear>3') >>\n",
" mutate(powertoweight='hp*36/wt') >>\n",
" group_by('gear') >>\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
" print(res)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since we are manipulating R objects, anything available to R is also available\n",
"to us. If we want to see the SQL code generated that's:"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<SQL> SELECT \"gear\", \"mean_ptw\"\n",
"FROM (SELECT \"gear\", AVG(\"powertoweight\") AS \"mean_ptw\"\n",
"FROM (SELECT \"mpg\", \"cyl\", \"disp\", \"hp\", \"drat\", \"wt\", \"qsec\", \"vs\", \"am\", \"gear\", \"carb\", \"hp\" * 36.0 / \"wt\" AS \"powertoweight\"\n",
"FROM \"mtcars\"\n",
"WHERE \"gear\" > 3.0) AS \"_W1\"\n",
"GROUP BY \"gear\") AS \"_W2\"\n",
"\n"
]
}
],
"source": [
"print(res.rx2(\"query\")[\"sql\"])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"And if the starting point is a pandas data frame,\n",
"do the following and start over again."
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n"
]
}
],
"source": [
"from rpy2.robjects import pandas2ri\n",
"from rpy2.robjects import default_converter\n",
"from rpy2.robjects.conversion import localconverter\n",
"with localconverter(default_converter + pandas2ri.converter) as cv:\n",
" mtcars = mtcars_env['mtcars']\n",
" mtcars = pandas2ri.ri2py(mtcars)\n",
"print(type(mtcars))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Using a local converter let's us also go from the pandas data frame to our dplyr-augmented R data frame."
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"collapsed": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Source: local data frame [3 x 2]\n",
"\n",
" gear mean_ptw\n",
"1 3 1633.990\n",
"2 4 1237.127\n",
"3 5 2574.033\n",
"\n"
]
}
],
"source": [
"with localconverter(default_converter + pandas2ri.converter) as cv:\n",
" dataf = (DataFrame(mtcars).\n",
" filter('gear>=3').\n",
" mutate(powertoweight='hp*36/wt').\n",
" group_by('gear').\n",
" summarize(mean_ptw='mean(powertoweight)'))\n",
"\n",
"print(dataf)"
]
},
{
"cell_type": "markdown",
"metadata": {
"collapsed": false
},
"source": [
"**Reuse. Get things done. Don't reimplement.**"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.4.3"
}
},
"nbformat": 4,
"nbformat_minor": 0
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment