As an R user venturing into Python-land, I often find myself googling 'Python equivalent of X' or 'X in Python' for useful R functions. So when I had to create a programming cheat sheet as part of a course in Data Science Practice, it was a great opportunity to start collecting these functions in one place.
Comments are welcome! What doesn't work? Is there a better way? What is missing?
On 11 June 2014, Hilary Mason sent a question out into the Twittersphere:
Data people: what is the very first thing you do when you get your hands on a new dataset?
Answers came in thick and fast. For example, some tweeters emphasised the importance of figuring out how the dataset was populated in the first place. This is super critical. Understanding the context in which the dataset was created will help to determine how reliable and representative it is. This information might also help to explain anomalies you find later on. Other tweeters raised the necessity of backing up a copy of the dataset. (Definitely back up the raw dataset if you haven't already!)
The majority of responses, however, referenced the need to inspect and visualise the data at a very early stage. This makes intuitive sense -- regardless of the task at hand, you need to know what you're working with. So this section is all about inspecting and visualising data. It's presented as the first section in this guide because it's an ideal (arguably compulsory) first step to take with a new dataset. It's also one that you'll likely find yourself returning to over and over again as part of any data science or data analysis workflow.
R
dim(df)
Python
# import pandas as pd
df.shape
R
colnames(df)
Python
# import pandas as pd
df.columns
R
head(df, 10)
tail(df, 10)
Python
# import pandas as pd
df.head(10)
df.tail(10)
R
# library(dplyr)
sample_n(df, 10)
Python
# import pandas as pd
df.sample(10)
R
# library(dplyr)
glimpse(df)
Python
# import pandas as pd
df.info()
R
summary(df$col)
Python
# import pandas as pd
df['col'].describe()
R
# library(ggplot2)
ggplot(df) +
geom_histogram(aes(x = col))
Python
# import pandas as pd
# import matplotlib.pyplot as plt
df['col'].plot(kind = 'hist')
plt.show()
R
# library(ggplot2)
ggplot(df) +
geom_point(aes(x = col1, y = col2), alpha = 0.6)
Python
# import pandas as pd
# import matplotlib.pyplot as plt
df.plot(kind = 'scatter', x = 'col1', y = 'col2', alpha = 0.6)
plt.show()
R
summary(df$col)
Python
# import pandas as pd
print(df['col'].value_counts(dropna = False))
R
# library(ggplot2)
# library(forcats)
ggplot(df) +
geom_bar(aes(x = fct_infreq(col)))
Python
# import pandas as pd
# import matplotlib.pyplot as plt
df['col'].value_counts(dropna = False).plot(kind = 'bar')
plt.show()
R
# library(dplyr)
count(df, col1, col2)
Python
# import pandas as pd
df.groupby(['col1', 'col2']).size()
R
# library(ggplot2)
ggplot(df) +
geom_boxplot(aes(x = cat, y = con))
Python
# import pandas as pd
# import matplotlib.pyplot as plt
df.boxplot(column = 'con', by = 'cat')
plt.show()
This section is all about wholesale changes you may need to apply to your dataset. In doing so, remember Hadley Wickham's principles of tidy data, illustrated with a small example after the list below:
- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
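As a small, made-up illustration of the first two principles: a table with one column per year is untidy, because year is really a variable hiding in the column headers; melting it gives one variable per column and one observation per row.
# import pandas as pd
# untidy: the year variable is spread across the column headers
untidy = pd.DataFrame({'country': ['A', 'B'], '2019': [1, 2], '2020': [3, 4]})
# tidy: country, year and value each form their own column
tidy = untidy.melt(id_vars = 'country', var_name = 'year', value_name = 'value')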
R
# library(dplyr)
df <- rename(df, newname = oldname)
Python
# import pandas as pd
df = df.rename({'oldname': 'newname'}, axis = 1)
R
df$col <- as.numeric(df$col)
Python
# import pandas as pd
df['col'] = pd.to_numeric(df['col'])
R
df$col <- as.factor(df$col)
Python
# import pandas as pd
df['col'] = df['col'].astype('category')
R
df$col <- factor(df$col,
levels = c('level1',
'level2'),
ordered = TRUE)
Python
# import pandas as pd
df['col'] = df['col'].astype('category')
df['col'] = df['col'].cat.reorder_categories(['level1', 'level2'], ordered = True)
R
df$col <- as.character(df$col)
Python
# import pandas as pd
df['col'] = df['col'].astype('str')
R
# library(lubridate)
df$col <- ymd(df$col)
Choose the function to match the order of year (y), month (m) and day (d) in your data, e.g. dmy(); for date-times, append the order of hours (h), minutes (m) and seconds (s) after an underscore, e.g. ymd_hms().
Python
# import pandas as pd
df['col'] = pd.to_datetime(df['col'])
This function parses a range of formats; otherwise you can explicitly specify the format with the format argument.
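For example, if the column held strings like '25/12/2020' (a made-up format for illustration), the format could be given explicitly:
# import pandas as pd
df['col'] = pd.to_datetime(df['col'], format = '%d/%m/%Y')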
R
# library(dplyr)
df <- select(df, -col)
Python
# import pandas as pd
df = df.drop(columns = 'col')
R
# library(dplyr)
df <- filter(df, condition)
Specify the condition by referencing the column name and using Boolean operators or functions such as is.na(). For example: col == 0 or !is.na(col). You can also wrap col in a function such as nchar(). Use & (and) and | (or) to apply multiple conditions.
Python
# import pandas as pd
df = df.query('condition')
Specify the condition by referencing the column name and using Boolean operators or column methods such as .notna(). For example: 'col == 0' or 'col.notna()' (method calls may need engine = 'python'). You can also call methods such as col.str.len(). Use & (and) and | (or) to apply multiple conditions.
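A minimal sketch, with made-up column names and conditions:
# import pandas as pd
# keep rows where col1 equals 0 or col2 exceeds 5
df = df.query('col1 == 0 | col2 > 5')
# method calls usually need the python engine
df = df.query('col2.notna()', engine = 'python')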
R
df <- na.omit(df)
Python
# import pandas as pd
df = df.dropna()
R
# library(dplyr)
df <- distinct(df)
Python
# import pandas as pd
df = df.drop_duplicates()
R
# library(tidyr)
df <- gather(df, key = 'key', value = 'value', col2:col3)
Version 1.0.0 of the tidyr package has supplanted gather() with pivot_longer():
# library(tidyr)
df <- pivot_longer(df, col2:col3, names_to = 'key', values_to = 'value')
Python
# import pandas as pd
df = pd.melt(df, id_vars = ['col1'], value_vars = ['col2', 'col3'], var_name = 'key', value_name = 'value')
R
# library(tidyr)
df <- spread(df, key, value)
Version 1.0.0 of the tidyr package has supplanted spread() with pivot_wider():
# library(tidyr)
df <- pivot_wider(df, names_from = key, values_from = value)
Python
# import pandas as pd
df = df.pivot_table(index = ['col1'], columns = 'key', values = 'value')
df = df.reset_index()
R
# library(dplyr)
df <- arrange(df, col)
For descending, specify desc(col) in place of col.
Python
# import pandas as pd
df = df.sort_values(by = 'col')
For descending, specify an additional argument ascending = False.
R
# library(dplyr)
df <- bind_rows(df, df2)
Python
# import pandas as pd
df = pd.concat([df, df2], sort = False)
R
# library(dplyr)
df <- left_join(df, df2, by = 'col')
Consider also inner_join().
Python
# import pandas as pd
df = df.merge(df2, on = 'col', how = 'left')
Consider also how = 'inner'.
This last section blends granular data cleaning with feature engineering, covering common string manipulation tasks and functions often used to derive new values (including when working with dates and times).
R
df$col <- paste0(df$col1, 'separator', df$col2)
Python
# import pandas as pd
df['col'] = df['col1'] + 'separator' + df['col2']
R
# library(tidyr)
df <- separate(df, col, c('col1', 'col2'), 'separator')
Python
# import pandas as pd
df[['col1', 'col2']] = df['col'].str.split('separator', expand = True)
R
# library(stringr)
df$col <- str_squish(df$col)
Python
# import pandas as pd
df['col'] = df['col'].str.strip()
R
# library(stringr)
df$col <- str_to_upper(df$col)
df$col <- str_to_lower(df$col)
df$col <- str_to_title(df$col)
Python
# import pandas as pd
df['col'] = df['col'].str.upper()
df['col'] = df['col'].str.lower()
df['col'] = df['col'].str.title()
R
# library(stringr)
df$col <- str_replace_all(df$col, 'a', 'b')
Python
# import pandas as pd
df['col'] = df['col'].str.replace('a', 'b')
R
df$col[df$col == ''] <- NA
Python
# import numpy as np
df['col'] = df['col'].replace('', np.nan)
R
df$col[is.na(df$col)] <- replacement
Python
# import pandas as pd
df['col'] = df['col'].fillna(replacement)
R
# library(dplyr)
df <- df %>%
mutate(col = if_else(condition, replacement, col))
Consider case_when() for multiple condition/replacement pairs.
Python
# import pandas as pd
df.loc[condition, 'col'] = replacement
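For multiple condition/replacement pairs in pandas, one option is numpy's select(); a minimal sketch with made-up conditions and labels:
# import numpy as np
# import pandas as pd
# assign a label based on the first matching condition
conditions = [df['col'] < 0, df['col'] == 0]
choices = ['negative', 'zero']
df['label'] = np.select(conditions, choices, default = 'positive')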
R
# library(lubridate)
df$month <- month(df$date, label = TRUE)
Alternatives include year(), day(), wday(), hour() and minute().
Python
# import pandas as pd
df['month'] = df['date'].dt.month
Alternatives include .dt.year, .dt.day and .dt.weekday.
R
# library(lubridate)
df$month <- floor_date(df$date, unit = 'month')
Consider also round_date() and ceiling_date().
Python
# import pandas as pd
df['month'] = df['date'].dt.to_period('M').dt.to_timestamp()
Consider .dt.floor(), .dt.round() or .dt.ceil() for more granular time units.
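For example, to truncate timestamps to the start of each day (the new column name here is just illustrative):
# import pandas as pd
df['day'] = df['date'].dt.floor('D')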
R
df$col <- difftime(df$time2, df$time1, units = 'hours')
Alternative units include 'mins', 'days' and 'weeks'.
Python
# import pandas as pd
df['col'] = df['time2'] - df['time1']
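The subtraction above gives a timedelta; to express the difference in hours, as the R call does, one option is to divide by a fixed Timedelta:
# import pandas as pd
df['col'] = (df['time2'] - df['time1']) / pd.Timedelta(hours = 1)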
R
# library(dplyr)
df <- df %>%
group_by(cat) %>%
summarise(mean = mean(con, na.rm = TRUE))
This calculates the mean value for each group specified by the categorical variable. Alternatives include min(), median(), sd() and max().
Python
# import pandas as pd
df = df.groupby('cat', as_index = False)['con'].mean()
This calculates the mean value for each group specified by the categorical variable. Alternatives include min(), median(), std() and max().
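To compute several summary statistics at once, pandas' agg() accepts a list of function names (a small sketch using the same column names):
# import pandas as pd
df = df.groupby('cat')['con'].agg(['mean', 'min', 'max']).reset_index()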
In no particular order:
- DataCamp, especially Cleaning Data in R, Cleaning Data in Python, Working with Dates and Times in R and Working with Dates and Times in Python
- RStudio cheatsheets, especially Data transformation with dplyr, String manipulation with stringr and Dates and times with lubridate
- pandas documentation, especially this page
- conormm's R to Python data wrangling basics
- Stack Overflow... too many pages to mention