Skip to content

Instantly share code, notes, and snippets.

@etheleon
Created March 17, 2018 10:46
Show Gist options
  • Save etheleon/e66d0bd538b5036987c2739fadaff335 to your computer and use it in GitHub Desktop.
Save etheleon/e66d0bd538b5036987c2739fadaff335 to your computer and use it in GitHub Desktop.
ipython notebook showing common idioms used when manipulating tables in either R or pandas
Display the source blob
Display the rendered blob
Raw
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Introduction\n",
"\n",
"The purpose of this notebook is to help R programmers quickly learn how to do the everyday things they do in R (using tidyverse) in python.\n",
"\n",
"Usually in R, we would import the `tidyverse` package which essentially is a set of packages dplyr, tidyr and ggplot the equivalent in python would be \n",
"pandas and matplotlib coupled together with `seaborn`. More recently I've been toying around with the purrr package as will which makes the idioms less painful like\n",
"\n",
"do.call(rbind,).\n",
"\n",
"in dplyr, Hadley mentions there really is only a few operations which we really use in day to day life they could be summarised into these verbs\n",
"\n",
"\n",
"* select\n",
"* arrange\n",
"* filter\n",
"* group_by\n",
"* summarise\n",
"* mutate\n",
"* merge (left_join, right_join)\n",
"* duplicated\n",
"\n",
"\n",
"The part which i'm still largely unsure about is how do to this:\n",
"\n",
"\n",
"```r\n",
"df %>% group_by(variable) %>% summarise(mean = mean(var2))\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"/Users/uesu/anaconda3/envs/dataAnalysis/lib/python3.6/site-packages/rpy2/rinterface/__init__.py:145: RRuntimeWarning: Error in library(colorout) : there is no package called ‘colorout’\n",
"\n",
" warnings.warn(x, RRuntimeWarning)\n"
]
}
],
"source": [
"import pandas as pd\n",
"import matplotlib.pyplot as plt\n",
"import seaborn as sns\n",
"import numpy as np\n",
"import rpy2\n",
"%load_ext rpy2.ipython"
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>model</th>\n",
" <th>mpg</th>\n",
" <th>cyl</th>\n",
" <th>disp</th>\n",
" <th>hp</th>\n",
" <th>drat</th>\n",
" <th>wt</th>\n",
" <th>qsec</th>\n",
" <th>vs</th>\n",
" <th>am</th>\n",
" <th>gear</th>\n",
" <th>carb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Mazda RX4</td>\n",
" <td>21.0</td>\n",
" <td>6</td>\n",
" <td>160.0</td>\n",
" <td>110</td>\n",
" <td>3.90</td>\n",
" <td>2.620</td>\n",
" <td>16.46</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Mazda RX4 Wag</td>\n",
" <td>21.0</td>\n",
" <td>6</td>\n",
" <td>160.0</td>\n",
" <td>110</td>\n",
" <td>3.90</td>\n",
" <td>2.875</td>\n",
" <td>17.02</td>\n",
" <td>0</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>4</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Datsun 710</td>\n",
" <td>22.8</td>\n",
" <td>4</td>\n",
" <td>108.0</td>\n",
" <td>93</td>\n",
" <td>3.85</td>\n",
" <td>2.320</td>\n",
" <td>18.61</td>\n",
" <td>1</td>\n",
" <td>1</td>\n",
" <td>4</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Hornet 4 Drive</td>\n",
" <td>21.4</td>\n",
" <td>6</td>\n",
" <td>258.0</td>\n",
" <td>110</td>\n",
" <td>3.08</td>\n",
" <td>3.215</td>\n",
" <td>19.44</td>\n",
" <td>1</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>1</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Hornet Sportabout</td>\n",
" <td>18.7</td>\n",
" <td>8</td>\n",
" <td>360.0</td>\n",
" <td>175</td>\n",
" <td>3.15</td>\n",
" <td>3.440</td>\n",
" <td>17.02</td>\n",
" <td>0</td>\n",
" <td>0</td>\n",
" <td>3</td>\n",
" <td>2</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" model mpg cyl disp hp drat wt qsec vs am gear \\\n",
"0 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 \n",
"1 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 \n",
"2 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 \n",
"3 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 \n",
"4 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 \n",
"\n",
" carb \n",
"0 4 \n",
"1 4 \n",
"2 1 \n",
"3 1 \n",
"4 2 "
]
},
"execution_count": 30,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"mtcars = pd.read_csv(\"https://raw.githubusercontent.com/focods/WonderfulML/master/data/mtcars.csv\")\n",
"mtcars.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For example you want to replicate this R snippet: "
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mpg</th>\n",
" <th>cyl</th>\n",
" <th>disp</th>\n",
" <th>hp</th>\n",
" <th>drat</th>\n",
" <th>wt</th>\n",
" <th>qsec</th>\n",
" <th>vs</th>\n",
" <th>am</th>\n",
" <th>gear</th>\n",
" <th>carb</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Mazda RX4</th>\n",
" <td>21.0</td>\n",
" <td>6.0</td>\n",
" <td>160.0</td>\n",
" <td>110.0</td>\n",
" <td>3.90</td>\n",
" <td>2.620</td>\n",
" <td>16.46</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Mazda RX4 Wag</th>\n",
" <td>21.0</td>\n",
" <td>6.0</td>\n",
" <td>160.0</td>\n",
" <td>110.0</td>\n",
" <td>3.90</td>\n",
" <td>2.875</td>\n",
" <td>17.02</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>4.0</td>\n",
" <td>4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Datsun 710</th>\n",
" <td>22.8</td>\n",
" <td>4.0</td>\n",
" <td>108.0</td>\n",
" <td>93.0</td>\n",
" <td>3.85</td>\n",
" <td>2.320</td>\n",
" <td>18.61</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>4.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hornet 4 Drive</th>\n",
" <td>21.4</td>\n",
" <td>6.0</td>\n",
" <td>258.0</td>\n",
" <td>110.0</td>\n",
" <td>3.08</td>\n",
" <td>3.215</td>\n",
" <td>19.44</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Hornet Sportabout</th>\n",
" <td>18.7</td>\n",
" <td>8.0</td>\n",
" <td>360.0</td>\n",
" <td>175.0</td>\n",
" <td>3.15</td>\n",
" <td>3.440</td>\n",
" <td>17.02</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Valiant</th>\n",
" <td>18.1</td>\n",
" <td>6.0</td>\n",
" <td>225.0</td>\n",
" <td>105.0</td>\n",
" <td>2.76</td>\n",
" <td>3.460</td>\n",
" <td>20.22</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mpg cyl disp hp drat wt qsec vs am \\\n",
"Mazda RX4 21.0 6.0 160.0 110.0 3.90 2.620 16.46 0.0 1.0 \n",
"Mazda RX4 Wag 21.0 6.0 160.0 110.0 3.90 2.875 17.02 0.0 1.0 \n",
"Datsun 710 22.8 4.0 108.0 93.0 3.85 2.320 18.61 1.0 1.0 \n",
"Hornet 4 Drive 21.4 6.0 258.0 110.0 3.08 3.215 19.44 1.0 0.0 \n",
"Hornet Sportabout 18.7 8.0 360.0 175.0 3.15 3.440 17.02 0.0 0.0 \n",
"Valiant 18.1 6.0 225.0 105.0 2.76 3.460 20.22 1.0 0.0 \n",
"\n",
" gear carb \n",
"Mazda RX4 4.0 4.0 \n",
"Mazda RX4 Wag 4.0 4.0 \n",
"Datsun 710 4.0 1.0 \n",
"Hornet 4 Drive 3.0 1.0 \n",
"Hornet Sportabout 3.0 2.0 \n",
"Valiant 3.0 1.0 "
]
},
"execution_count": 44,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"%R head(mtcars)"
]
},
{
"cell_type": "code",
"execution_count": 67,
"metadata": {},
"outputs": [],
"source": [
"%%R -o r_mt\n",
"suppressPackageStartupMessages({\n",
" \n",
"library(dplyr)\n",
"#library(ggplot2)\n",
"#library(tidyr)\n",
"})\n",
"\n",
"r_mt = mtcars %>% mutate(model = rownames(mtcars)) %>%\n",
" select(cyl, model, hp, drat) %>%\n",
" filter(cyl < 8) %>%\n",
" group_by(cyl) %>%\n",
" summarise(\n",
" hp_mean = mean(hp), \n",
" drat_mean = mean(drat),\n",
" drat_std = sd(drat),\n",
" diff = max(drat) - min(drat)\n",
" ) %>% \n",
" arrange(drat_mean) %>%\n",
" as.data.frame #tibble converts weirdly"
]
},
{
"cell_type": "code",
"execution_count": 68,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>cyl</th>\n",
" <th>hp_mean</th>\n",
" <th>drat_mean</th>\n",
" <th>drat_std</th>\n",
" <th>diff</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>6.0</td>\n",
" <td>122.285714</td>\n",
" <td>3.585714</td>\n",
" <td>0.476055</td>\n",
" <td>1.16</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.0</td>\n",
" <td>82.636364</td>\n",
" <td>4.070909</td>\n",
" <td>0.365471</td>\n",
" <td>1.24</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" cyl hp_mean drat_mean drat_std diff\n",
"1 6.0 122.285714 3.585714 0.476055 1.16\n",
"2 4.0 82.636364 4.070909 0.365471 1.24"
]
},
"execution_count": 68,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"r_mt"
]
},
{
"cell_type": "code",
"execution_count": 71,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe thead tr:last-of-type th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>hp</th>\n",
" <th colspan=\"4\" halign=\"left\">drat</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>mean</th>\n",
" <th>mean</th>\n",
" <th>std</th>\n",
" <th>transform1</th>\n",
" <th>transform2</th>\n",
" </tr>\n",
" <tr>\n",
" <th>cyl</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>122.285714</td>\n",
" <td>3.585714</td>\n",
" <td>0.476055</td>\n",
" <td>1.16</td>\n",
" <td>8.92</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>82.636364</td>\n",
" <td>4.070909</td>\n",
" <td>0.365471</td>\n",
" <td>1.24</td>\n",
" <td>9.93</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hp drat \n",
" mean mean std transform1 transform2\n",
"cyl \n",
"6 122.285714 3.585714 0.476055 1.16 8.92\n",
"4 82.636364 4.070909 0.365471 1.24 9.93"
]
},
"execution_count": 71,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"def transform1(x):\n",
" return max(x)-min(x)\n",
" \n",
"def transform2(x):\n",
" return max(x)+5\n",
"\n",
"py_mt = (\n",
"mtcars.\n",
" loc[:,[\"cyl\", \"model\", \"hp\", \"drat\"]]. #select\n",
" query(\"cyl < 8\"). #filter\n",
" groupby(\"cyl\"). #group_by\n",
" agg( #summarise, agg is an abbreviation of aggregation\n",
" {\n",
" 'hp':'mean', \n",
" 'drat':['mean', 'std', transform1, transform2] # R wins... this sux for pandas\n",
" }).\n",
" sort_values(by=[(\"drat\", \"mean\")]) #multindex sort (unique to pandas)\n",
")\n",
"py_mt"
]
},
{
"cell_type": "code",
"execution_count": 72,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead tr th {\n",
" text-align: left;\n",
" }\n",
"\n",
" .dataframe thead tr:last-of-type th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr>\n",
" <th></th>\n",
" <th>hp</th>\n",
" </tr>\n",
" <tr>\n",
" <th></th>\n",
" <th>mean</th>\n",
" </tr>\n",
" <tr>\n",
" <th>cyl</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>122.285714</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>82.636364</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" hp\n",
" mean\n",
"cyl \n",
"6 122.285714\n",
"4 82.636364"
]
},
"execution_count": 72,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"new_mtCars.loc[:,[\"hp\"]]"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.4"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment