Instantly share code, notes, and snippets.

Embed
What would you like to do?
R to Python: Data wrangling with dplyr and pandas
R to python useful data wrangling snippets
The dplyr package in R makes data wrangling significantly easier.
The beauty of dplyr is that, by design, the options available are limited.
Specifically, a set of key verbs form the core of the package.
Using these verbs you can solve a wide range of data problems effectively in a shorter timeframe.
Whilse transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R.
The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data using Python (with the pandas package).
dplyr is organised around six key verbs
filter: subset a dataframe according to condition(s) in a variable(s)
select: choose a specific variable or set of variables
arrange: order dataframe by index or variable
group_by: create a grouped dataframe
summarise: reduce variable to summary variable (e.g. mean)
mutate: transform dataframe by adding new variables
The excellent pandas package in Python easily allows you to implement all of these actions (and much, much more!). Below are some snippets to highlight some of the more basic conversions.
I'll update this on a regular basis with more complex snippets.
Thanks!
Conor @Conormacd
-------Filter-------------------
R
filter(df, var > 20000 & var < 30000)
filter(df, var == 'string') # df %>% filter(var != 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)
Python
df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)
-------Select-------------------
R
select(df, var1, var2)
select(df, -var3)
Python
df[['var1', 'var2']]
df.drop('var3', 1)
-------Arrange-------------------
R
arrange(df, var1)
arrange(df, desc(var1))
Python
df.sort_values('var1')
df.sort_values('var1', ascending=False)
-------Grouping------------------
R
df %>% group_by(group)
df %>% group_by(group1, group2)
df %>% ungroup()
Python
df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index() / or when grouping: df.groupby('group1', as_index=False)
------Summarise / Aggregate df by group-----------------------------
R
df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>% group_by(group1, group2) %>% summarise(mean_var1 = mean(var1),
sum_var1 = sum(var1),
count_var1 = n())
Python
df.groupby('group1')['var1'].agg({'mean_col' : np.mean()}) # pass dict to specifiy column name
df.groupby(['group1', 'group2'])['var1]'].agg(['mean', 'sum', 'count']) # for count also consider 'size'. size will return n for NaN values also, whereas 'count' will not.
-------Mutate / transform df by group---------------------------------
R
df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))
Python
df.groupby('group')['var1'].transform(np.mean)
-------Distinct---------------------------------------------------------
# remove duplicate obs from data frame
R
df %>% distinct()
df %>% distinct(col1) # returns dataframe with unique values of col1
Python
df.drop_duplicates()
df.drop_duplicates(subset='col1') # returns dataframe with unique values of col1
------Sample----------------------------------------------------------
# generate random samples of the data by n or by %
R
sample_n(df, 100)
sample_frac(df, 0.5)
Python
df.sample(100)
df.sample(frac=0.5)
@cecilialee

This comment has been minimized.

cecilialee commented Nov 19, 2017

Very helpful! Thanks!

@Success2014

This comment has been minimized.

Success2014 commented Nov 20, 2017

https://github.com/dodger487/dplython
Seem to have a Python version of dplyr now.

@henriquepgomide

This comment has been minimized.

henriquepgomide commented Nov 24, 2017

Great tutorial. Many thanks!

@ravi9884

This comment has been minimized.

ravi9884 commented Jan 27, 2018

@Success2014 that was spot on

@JacksonChou

This comment has been minimized.

JacksonChou commented Feb 25, 2018

@Success2014 Thanks for sharing.

@pez-espada

This comment has been minimized.

pez-espada commented Mar 5, 2018

As an inhabitant of both Python-country a R-land I find this very useful!

@robinsones

This comment has been minimized.

robinsones commented Jun 7, 2018

This is great! Small error: you're missing the var < 3000 on the python side when you compare filter(df, var > 20000 & var < 30000) to df[df['var'] > 20000].

@conormm

This comment has been minimized.

Owner

conormm commented Jun 8, 2018

Thanks! - updated.

All of this code still works in pandas and should ease the transition from R, but for those interested in getting the most out of the package I strongly recommend this series on modern pandas https://tomaugspurger.github.io/modern-1-intro.html

@vincentei

This comment has been minimized.

vincentei commented Jun 11, 2018

Nice work!

line 77:

df.groupby(['group1', 'group2'])['var1]'].agg(['mean', 'sum', 'count'])

should it be?

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count'])

@adrienpacifico

This comment has been minimized.

adrienpacifico commented Jul 10, 2018

The spirit of filter(df, var > 20000 & var < 30000) in R would be closer to df.query("(var>2000) & (var < 3000)") in pandas than to
df[(df['var'] > 20000) & (df['var'] < 30000)] (see pandas comparison with R doc ).

@conormm big thank you for that gist !

@conormm

This comment has been minimized.

Owner

conormm commented Jul 13, 2018

hey @adrienpacifico Yes, agree on .query() - it wasn't a function when I put this together. I'll update this soon.

@adrienpacifico

This comment has been minimized.

adrienpacifico commented Jul 17, 2018

Great, btw it would be nice to have performance comparison between the two solutions (I'm a pandas user, I don't know a lot about R an dplyer ;-) ).

@rfche704

This comment has been minimized.

rfche704 commented Aug 11, 2018

Thanks so much - really helpful post!!!
With this and GGplot for python - a huge amount of the R functionality can be used

@astrowonk

This comment has been minimized.

astrowonk commented Aug 14, 2018

Super useful. Thank you!

@plnnr

This comment has been minimized.

plnnr commented Sep 3, 2018

Thank you so much! I'm learning Python after having just learned R, and this is really helpful to quickly get me off the ground and keep me feeling motivated! Also wow @rfche704, I had no idea ggplot was available in Python :)

@pybokeh

This comment has been minimized.

pybokeh commented Sep 25, 2018

Could always use plotnine https://github.com/has2k1/plotnine Also by same repo, there is dplyr clone: https://github.com/has2k1/plydata

@Dhirendramohanjha

This comment has been minimized.

Dhirendramohanjha commented Sep 27, 2018

Thank You so much! This is very useful us

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment