R to Python: Data wrangling with dplyr and pandas

R to Python data wrangling snippets

The dplyr package in R makes data wrangling significantly easier. The beauty of dplyr is that, by design, the options available are limited. Specifically, a set of key verbs forms the core of the package. Using these verbs you can solve a wide range of data problems effectively and quickly. While transitioning to Python I have greatly missed the ease with which I can think through and solve problems using dplyr in R. The purpose of this document is to demonstrate how to execute the key dplyr verbs when manipulating data with Python (using the pandas package).

dplyr is organised around six key verbs:


  • filter : subset a dataframe according to conditions on one or more variables
  • select : choose a specific variable or set of variables
  • arrange : order a dataframe by index or variable
  • group_by : create a grouped dataframe
  • summarise : reduce a variable to a summary value (e.g. the mean)
  • mutate : transform a dataframe by adding new variables

The excellent pandas package in Python lets you implement all of these actions (and much, much more!). Below are snippets highlighting the more basic conversions.

June 8th 2018 update: All of this code still works in pandas and should ease the transition from R, but for those interested in getting the most out of the package I strongly recommend this series on modern pandas https://tomaugspurger.github.io/modern-1-intro.html
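
The snippets below assume pandas (and numpy) have been imported and that a dataframe df with the relevant columns already exists. A minimal, purely illustrative setup might look like this; the column names and values are hypothetical, chosen only so the later snippets have something to run against:

import numpy as np
import pandas as pd

# toy dataframe with the column names used in the snippets below
df = pd.DataFrame({
    'group':  ['a', 'a', 'b', 'b'],
    'group1': ['x', 'x', 'y', 'y'],
    'group2': ['p', 'q', 'p', 'q'],
    'var':    [15000, 25000, 28000, 31000],
    'var1':   [1.0, 2.0, 3.0, 4.0],
    'var2':   [10, 20, 30, 40],
    'var3':   ['u', 'v', 'w', 'z'],
})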

Filter

R

filter(df, var > 20000 & var < 30000) 
filter(df, var == 'string')
df %>% filter(var != 'string')
df %>% group_by(group) %>% filter(sum(var) > 2000000)

Python

df[(df['var'] > 20000) & (df['var'] < 30000)]
df[df['var'] == 'string']
df[df['var'] != 'string']
df.groupby('group').filter(lambda x: sum(x['var']) > 2000000)
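
As a commenter notes below, df.query() is arguably closer in spirit to dplyr's filter(), since conditions are written against bare column names; a sketch of the first two filters using query:

df.query('var > 20000 & var < 30000')
df.query("var != 'string'")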

Select

R

select(df, var1, var2)
select(df, -var3)

Python

df[['var1', 'var2']]
df.drop('var3', axis=1)  # or df.drop(columns='var3')

Arrange

R

arrange(df, var1)
arrange(df, desc(var1))

Python

df.sort_values('var1')
df.sort_values('var1', ascending=False)
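
To sort by several columns with mixed directions (roughly arrange(df, var1, desc(var2)) in dplyr), ascending accepts a list; a short sketch, assuming both columns exist:

df.sort_values(['var1', 'var2'], ascending=[True, False])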

Grouping

R

df %>% group_by(group) 
df %>% group_by(group1, group2)
df %>% ungroup()

Python

df.groupby('group1')
df.groupby(['group1', 'group2'])
df.reset_index()  # removes the grouping index; or group without setting an index: df.groupby('group1', as_index=False)

Summarise / Aggregate df by group

R

df %>% group_by(group) %>% summarise(mean_var1 = mean(var1))
df %>% 
  group_by(group1, group2) %>% 
  summarise(mean_var1 = mean(var1), 
            sum_var1 = sum(var1), 
            count_var1 = n())
                                              
df %>% 
  group_by(group1, group2) %>% 
  summarise(mean_var1 = mean(var1),
            sum_2 = sum(var2),
            var3 = first(var3))

Python

df.groupby('group1')['var1'].agg(mean_col='mean')  # named aggregation sets the output column name

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count'])  # for count also consider 'size'; 'size' counts NaN values as well, whereas 'count' does not

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({
  "var1" : ["mean"], 
  "var2" : ["sum"], 
  "var3" : ["first"]
  })
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})
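
In more recent pandas (0.25+), named aggregation removes the renaming step entirely (see also the comment from @samukweku below); a sketch of the same aggregation:

df.groupby(['group1', 'group2']).agg(
    var1_mean=('var1', 'mean'),
    var2_sum=('var2', 'sum'),
    var3_first=('var3', 'first'),
)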

Mutate / transform df by group

R

df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))

Python

df.assign(mean_var1 = df.groupby('group')['var1'].transform('mean'))  # grouped dataframes have no .assign(); transform broadcasts the group mean
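
The same transform idiom extends to several columns at once, roughly a multi-assignment mutate() (see also the discussion in the comments below). A sketch, assuming the var2 column also exists:

grp = df.groupby('group')
df.assign(mean_var1=grp['var1'].transform('mean'),
          sum_var2=grp['var2'].transform('sum'))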

Distinct

R

df %>% distinct()
df %>% distinct(col1) # returns dataframe with unique values of col1

Python

df.drop_duplicates()
df.drop_duplicates(subset='col1')  # keeps the first row (all columns) for each unique value of col1; use df[['col1']].drop_duplicates() to return only col1

Sample

R

sample_n(df, 100)
sample_frac(df, 0.5)

Python

df.sample(100)
df.sample(frac=0.5)       

Comments

@cecilialee cecilialee commented Nov 19, 2017

Very helpful! Thanks!

@Success2014 Success2014 commented Nov 20, 2017

https://github.com/dodger487/dplython
Seems there is a Python version of dplyr now.

@henriquepgomide henriquepgomide commented Nov 24, 2017

Great tutorial. Many thanks!

@ravi9884 ravi9884 commented Jan 27, 2018

@Success2014 that was spot on

@JacksonChou JacksonChou commented Feb 25, 2018

@Success2014 Thanks for sharing.

@pez-espada pez-espada commented Mar 5, 2018

As an inhabitant of both Python-country and R-land I find this very useful!

@robinsones robinsones commented Jun 7, 2018

This is great! Small error: you're missing the var < 30000 on the Python side when you compare filter(df, var > 20000 & var < 30000) to df[df['var'] > 20000].

@conormm conormm commented Jun 8, 2018

Thanks! - updated.

All of this code still works in pandas and should ease the transition from R, but for those interested in getting the most out of the package I strongly recommend this series on modern pandas https://tomaugspurger.github.io/modern-1-intro.html

@vincentei vincentei commented Jun 11, 2018

Nice work!

line 77:

df.groupby(['group1', 'group2'])['var1]'].agg(['mean', 'sum', 'count'])

should it be?

df.groupby(['group1', 'group2'])['var1'].agg(['mean', 'sum', 'count'])

@adrienpacifico adrienpacifico commented Jul 10, 2018

The spirit of filter(df, var > 20000 & var < 30000) in R would be closer to df.query("(var > 20000) & (var < 30000)") in pandas than to
df[(df['var'] > 20000) & (df['var'] < 30000)] (see the pandas "comparison with R" docs).

@conormm big thank you for that gist !

@conormm conormm commented Jul 13, 2018

hey @adrienpacifico Yes, agree on .query() - it wasn't a function when I put this together. I'll update this soon.

@adrienpacifico adrienpacifico commented Jul 17, 2018

Great, btw it would be nice to have a performance comparison between the two solutions (I'm a pandas user, I don't know a lot about R and dplyr ;-) ).

@rfche704 rfche704 commented Aug 11, 2018

Thanks so much - really helpful post!!!
With this and ggplot for Python, a huge amount of the R functionality can be used.

@astrowonk astrowonk commented Aug 14, 2018

Super useful. Thank you!

@plnnr plnnr commented Sep 3, 2018

Thank you so much! I'm learning Python after having just learned R, and this is really helpful to quickly get me off the ground and keep me feeling motivated! Also wow @rfche704, I had no idea ggplot was available in Python :)

@pybokeh pybokeh commented Sep 25, 2018

You could always use plotnine: https://github.com/has2k1/plotnine. There is also a dplyr clone by the same author: https://github.com/has2k1/plydata

@Dhirendramohanjha Dhirendramohanjha commented Sep 27, 2018

Thank you so much! This is very useful to us.

@sanjaynagi sanjaynagi commented Dec 12, 2018

As an R user who is learning Python, this is so useful. Thanks.

@JerryWho JerryWho commented Jun 1, 2019

Thank you so much!
How do I summarize several variables at once, e.g.

df %>% group_by(group1, group2) %>% summarise(mean_var1 = mean(var1),
sum_2 = sum(var2),
var3 = first(var3))

@conormm conormm commented Jun 5, 2019

In pandas:

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean"], "var2" : ["sum"], "var3" : ["first"]})
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})

@epetrovski epetrovski commented Nov 26, 2019

df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)

Unfortunately, I don't think this will work since grouped data frames do not have an .assign() method. The same thing could be done with .apply(), however. And if you want to reassign the resulting column to the original data frame - like dplyr does - you would also need to do .reset_index(drop=True).

@ffgama ffgama commented Dec 19, 2019

Congrats

@manuel-zambelli-sdg manuel-zambelli-sdg commented Mar 26, 2020

I love you

@sandropenha sandropenha commented Jun 24, 2020

Thank you!
I'm a R user and this is helping me a lot in python.

@paladinic paladinic commented Jun 28, 2020

thumbs up!!

@stephzhang3 stephzhang3 commented Jul 8, 2020

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:

df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)

But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

@epetrovski epetrovski commented Jul 9, 2020

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:

df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)

But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.

A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

@samukweku samukweku commented Aug 1, 2020

In pandas:

# first perform the aggregation
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean"], "var2" : ["sum"], "var3" : ["first"]})
# second rename the columns by joining the column name with the agg function (e.g. "var1_mean")
group_agg.columns = ["_".join(x) for x in group_agg.columns.ravel()]

# You can also pass multiple functions to aggregate the same column e.g:
group_agg = df.groupby(["group1", "group2"]).agg({"var1" : ["mean", "std", "sum"]})

In Pandas 1.0, you can use named aggregation, reducing the number of steps for new column names:

df.groupby(["group1", "group2"]).agg(var1_mean=("var1","mean"), var2_sum=("var2","sum),var3_first=("var3","first"))

@samukweku samukweku commented Aug 1, 2020

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:
df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)
But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.

A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

You don't need pipe for this; simply transform the column and assign it to the new column name:

   df['mean_var1'] = df.groupby('group').var1.transform('mean')

@epetrovski epetrovski commented Aug 1, 2020

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:
df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)
But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.
A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

You dont need pipe for this, simply transform the column and assign to the new column name :

   df['mean_var1'] = df.groupby('group').var1.transform('mean')

Fair point. But using .pipe() will allow you to transform several columns at once making it analogous to mutate() which the post is about.

@samukweku samukweku commented Aug 1, 2020

Thank you so much! @epetrovski do you think you could give an example of how to group and mutate using .groupby() and .apply()? I'm also running into the issue where .groupby() and .assign() don't work together.
I ended up doing something along these lines:
df['mean_var1']=df.groupby('group').apply(lambda x: np.mean(x.var1)).reset_index(level=0, drop=True)
But am not sure if there is an easier way, and if this is what you had in mind. Thanks!

That would be the way to do it with .apply(). The problem is, though, that it's a bit unsafe since using .reset_index() means that you're assigning back to the data frame without keeping track of the index. This will be a problem if you return fewer rows than you input for some reason.
A safer and faster way that I've found is to use the .pipe() method like this:

df['mean_var1'] = df.groupby('group').pipe(lambda x: x.var1.transform('mean'))

You dont need pipe for this, simply transform the column and assign to the new column name :

   df['mean_var1'] = df.groupby('group').var1.transform('mean')

Fair point. But using .pipe() will allow you to transform several columns at once making it analogous to mutate() which the post is about.

I respectfully disagree. I'll show some examples below of what I believe is the way to do it in pandas; please correct me where the code is wrong or suboptimal:

print(df)

  tag  val1  val2
0   B     0     0
1   A     1     2
2   A     2     4
3   B     3     6
4   B     4     8

grp = df.groupby("tag")

Let's create new columns and assign them to the dataframe:

df.assign(val1_mean=grp["val1"].transform("mean"), 
          val2_sum=grp["val2"].transform("sum")
          )
  tag  val1  val2  val1_mean  val2_sum
0   B     0     0   2.333333        14
1   A     1     2   1.500000         6
2   A     2     4   1.500000         6
3   B     3     6   2.333333        14
4   B     4     8   2.333333        14

If you are applying the same function to several columns, you can assign them all at once:
df[["val1_mean", "val2_mean"]] = grp[["val1", "val2"]].transform("mean")

print(df)

  tag  val1  val2  val1_mean  val2_mean
0   B     0     0   2.333333   4.666667
1   A     1     2   1.500000   3.000000
2   A     2     4   1.500000   3.000000
3   B     3     6   2.333333   4.666667
4   B     4     8   2.333333   4.666667

I do not see the need for pipe in these scenarios. Pipe, in my opinion, comes into play when you need to reuse the groupby object, e.g.:

grp.pipe(lambda x: x.sum() - x.mean())

The groupby object is reused in the sum and the mean before the difference is taken. I don't see the effectiveness of calling transform inside pipe when transform can do the job easily. Of course, you may have a use case where pipe does the job better than transform - I would love to learn about that. Cheers.

@epetrovski epetrovski commented Aug 2, 2020

@samukweku, you say that "Pipe, in my opinion, comes into play when you need to reuse the groupby object."

So in my use case I need to do something like this, where I take the sum of one variable and add to it the minimum of another but using the same grouping:

df['res'] = df.groupby('id').pipe(lambda x: x.var1.transform('sum') + x.var2.transform('min'))

Anyways, this is going into too much depth here. I think it's fair to say that there are several ways of accomplishing a group_by() %>% mutate() in pandas, but df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1), which the post originally suggested, isn't one of them - and that's the most important point.

@samukweku samukweku commented Aug 2, 2020

Ahh! @epetrovski, totally correct there. Your example makes sense. Cheers.

@minhsphuc12 minhsphuc12 commented Nov 20, 2020

Hi
df.groupby('group').assign(mean_var1 = lambda x: np.mean(x.var1)
would result in
AttributeError: 'DataFrameGroupBy' object has no attribute 'assign'
Can you update your code? @conormm

@pwwang pwwang commented Apr 30, 2021

datar ports dplyr and other packages to Python for you.

It follows the original R packages' API design.

@minhsphuc12
With datar, you can do:

from datar import f
from datar.dplyr import group_by, mutate
from datar.base import mean

df >> group_by(f.group) >> mutate(mean_var1 = mean(f.var1))
# compared to the R syntax:
# df %>% group_by(group) %>% mutate(mean_var1 = mean(var1))

@robertog09 robertog09 commented Jun 27, 2021

For "Mutate / transform df by group" I do the following:

# df example
df = pd.DataFrame(data = {'group': ['A', 'A', 'B', 'B'], 'var1': [1, 4, 2, 2]})

# Mutate / transform df by group
df.assign(mean_var1 = lambda x: x.groupby('group').transform('mean')['var1'])