
@saljfoskett
Created October 23, 2019 09:31
Data cleaning snippets in R and Python

Data cleaning - equivalent code in R and Python

As an R user venturing into Python-land, I often find myself googling 'Python equivalent of X' or 'X in Python' for useful R functions. So when I had to create a programming cheat sheet as part of a course in Data Science Practice, it was a great opportunity to start collecting these functions in one place.

Comments are welcome! What doesn't work? Is there a better way? What is missing?

Inspect and visualise data

On 11 June 2014, Hilary Mason sent a question out into the Twittersphere:

Data people: what is the very first thing you do when you get your hands on a new dataset?

Answers came in thick and fast. For example, some tweeters emphasised the importance of figuring out how the dataset was populated in the first place. This is super critical. Understanding the context in which the dataset was created will help to determine how reliable and representative it is. This information might also help to explain anomalies you find later on. Other tweeters raised the necessity of backing up a copy of the dataset. (Definitely back up the raw dataset if you haven't already!)

The majority of responses, however, referenced the need to inspect and visualise the data at a very early stage. This makes intuitive sense -- regardless of the task at hand, you need to know what you're working with. So this section is all about inspecting and visualising data. It's presented as the first section in this guide because it's an ideal (arguably compulsory) first step to take with a new dataset. It's also one that you'll likely find yourself returning to over and over again as part of any data science or data analysis workflow.

Overview of data

Dataset dimensions

R

dim(df)

Python

# import pandas as pd
df.shape

Column names

R

colnames(df)

Python

# import pandas as pd
df.columns

Top and bottom 10 rows

R

head(df, 10)
tail(df, 10)

Python

# import pandas as pd
df.head(10)
df.tail(10)

Random 10 rows

R

# library(dplyr)
sample_n(df, 10)

Python

# import pandas as pd
df.sample(10)

Column types

R

# library(dplyr)
glimpse(df)

Python

# import pandas as pd
df.info()

Single continuous variable

Summary statistics

R

summary(df$col)

Python

# import pandas as pd
df['col'].describe()

Histogram

R

# library(ggplot2)
ggplot(df) +
  geom_histogram(aes(x = col))

Python

# import pandas as pd
# import matplotlib.pyplot as plt
df['col'].plot(kind = 'hist')
plt.show()

Two continuous variables - scatterplot

R

# library(ggplot2)
ggplot(df) +
  geom_point(aes(x = col1, y = col2), alpha = 0.6)

Python

# import pandas as pd
# import matplotlib.pyplot as plt
df.plot(kind = 'scatter', x = 'col1', y = 'col2', alpha = 0.6)
plt.show()

Single categorical variable

Count of values per category

R

table(df$col, useNA = 'ifany')

Python

# import pandas as pd
df['col'].value_counts(dropna = False)

Bar chart

R

# library(ggplot2)
# library(forcats)
ggplot(df) +
  geom_bar(aes(x = fct_infreq(col)))

Python

# import pandas as pd
# import matplotlib.pyplot as plt
df['col'].value_counts(dropna = False).plot(kind = 'bar')
plt.show()

Two categorical variables - count of values per category combination

R

# library(dplyr)
count(df, col1, col2)

Python

# import pandas as pd
df.groupby(['col1', 'col2']).size()
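pandas can also lay the counts out as a contingency table, with one row per category of the first variable and one column per category of the second. A small sketch with made-up data:

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'col1': ['a', 'a', 'b'], 'col2': ['x', 'y', 'x']})

# Rows are col1 categories, columns are col2 categories, cells are counts
table = pd.crosstab(df['col1'], df['col2'])
```

Unobserved combinations appear as zeros rather than being dropped.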

Continuous variable by categorical variable - boxplot

R

# library(ggplot2)
ggplot(df) +
  geom_boxplot(aes(x = cat, y = con))

Python

# import pandas as pd
# import matplotlib.pyplot as plt
df.boxplot(column = 'con', by = 'cat')
plt.show()

Change data frame structure

This section is all about wholesale changes you may need to apply to your dataset. In doing so, remember Hadley Wickham's principles of tidy data:

  • Each variable forms a column.
  • Each observation forms a row.
  • Each type of observational unit forms a table.

Change column names

R

# library(dplyr)
df <- rename(df, newname = oldname)

Python

# import pandas as pd
df = df.rename({'oldname': 'newname'}, axis = 1)

Change column types

To continuous variable

R

df$col <- as.numeric(df$col)

Python

# import pandas as pd
df['col'] = pd.to_numeric(df['col'])

To unordered categorical variable

R

df$col <- as.factor(df$col)

Python

# import pandas as pd
df['col'] = df['col'].astype('category')

To ordered categorical variable

R

df$col <- factor(df$col,
                 levels = c('level1',
                            'level2'),
                 ordered = TRUE)

Python

# import pandas as pd
df['col'] = df['col'].astype('category')
df['col'] = df['col'].cat.reorder_categories(['level1', 'level2'], ordered = True)

To string values

R

df$col <- as.character(df$col)

Python

# import pandas as pd
df['col'] = df['col'].astype('str')

To datetime

R

# library(lubridate)
df$col <- ymd(df$col)

Choose the function to match the order of years (y), months (m) and days (d) in your data, e.g. dmy() or mdy(); for datetimes, append hours (h), minutes (m) and seconds (s) after an underscore, e.g. ymd_hms().

Python

# import pandas as pd
df['col'] = pd.to_datetime(df['col'])

This function parses a range of formats; otherwise you can explicitly specify the format with the format argument.
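For example, a column of day-first date strings can be parsed unambiguously by passing the format explicitly (the data frame and column name here are made up):

```python
import pandas as pd

# Hypothetical example data: day-first date strings
df = pd.DataFrame({'col': ['23/10/2019', '01/01/2020']})

# An explicit format avoids ambiguity between day-first and month-first dates
df['col'] = pd.to_datetime(df['col'], format = '%d/%m/%Y')
```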

Remove data

Delete column

R

# library(dplyr)
df <- select(df, -col)

Python

# import pandas as pd
df = df.drop(columns = 'col')

Delete rows where condition is not met

R

# library(dplyr)
df <- filter(df, condition)

Specify the condition by referencing the column name and using Boolean operators or functions such as is.na(). For example: col == 0 or !is.na(col). You can also wrap col in a function such as length(). Use & (and) and | (or) to apply multiple conditions.

Python

# import pandas as pd
df = df.query('condition')

Specify the condition as a string, referencing the column name and using Boolean operators or methods such as .isna() and .notna(). For example: 'col == 0' or 'col.notna()'. You can also call methods such as col.str.len(). Use & (and) and | (or) to combine multiple conditions.
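For example, combining two conditions in one query string (a sketch with a made-up data frame and columns):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'col': [0, 1, 2, 3], 'other': ['a', 'b', 'c', 'd']})

# Keep rows where col is positive AND other is not 'd'
df = df.query('col > 0 & other != "d"')
```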

Delete rows with any missing values

R

df <- na.omit(df)

Python

# import pandas as pd
df = df.dropna()

Delete duplicate rows

R

# library(dplyr)
df <- distinct(df)

Python

# import pandas as pd
df = df.drop_duplicates()

Reshape or reorder data

Lengthen data frame

R

# library(tidyr)
df <- gather(df, key = 'key', value = 'value', col2:col3)

Version 1.0.0 of the tidyr package has supplanted gather() with pivot_longer():

# library(tidyr)
df <- pivot_longer(df, col2:col3, names_to = 'key', values_to = 'value')

Python

# import pandas as pd
df = pd.melt(df, id_vars = ['col1'], value_vars = ['col2', 'col3'], var_name = 'key', value_name = 'value')

Unlike gather(), pd.melt() drops columns not listed in id_vars or value_vars, so name the identifier columns explicitly.

Widen data frame

R

# library(tidyr)
df <- spread(df, key, value)

Version 1.0.0 of the tidyr package has supplanted spread() with pivot_wider():

# library(tidyr)
df <- pivot_wider(df, names_from = key, values_from = value)

Python

# import pandas as pd
df = df.pivot_table(index = ['col1'], columns = 'key', values = 'value')
df = df.reset_index()

Sort, arrange, order

R

# library(dplyr)
df <- arrange(df, col)

For descending, specify desc(col) in place of col.

Python

# import pandas as pd
df = df.sort_values(by = 'col')

For descending, specify an additional argument ascending = False.
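For instance (with a made-up data frame; pass lists to by and ascending to sort on several columns at once):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'col': [2, 1, 3], 'col2': ['b', 'a', 'c']})

# Sort by col, largest first
df = df.sort_values(by = 'col', ascending = False)
```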

Add data

Add rows

R

# library(dplyr)
df <- bind_rows(df, df2)

Python

# import pandas as pd
df = pd.concat([df, df2], sort = False)

Newer versions of pandas have removed DataFrame.append() in favour of pd.concat().

Add columns

R

# library(dplyr)
df <- left_join(df, df2, by = 'col')

Consider also inner_join().

Python

# import pandas as pd
df = df.merge(df2, on = 'col', how = 'left')

Consider also how = 'inner'.
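The difference between the two join types shows up in the row count (illustrative data frames; 'c' has no match in df2):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'col': ['a', 'b', 'c'], 'x': [1, 2, 3]})
df2 = pd.DataFrame({'col': ['a', 'b'], 'y': [10, 20]})

# A left join keeps every row of df, filling unmatched rows with NaN
left = df.merge(df2, on = 'col', how = 'left')

# An inner join keeps only rows with a match in both frames
inner = df.merge(df2, on = 'col', how = 'inner')
```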

Edit and derive data values

This last section blends granular data cleaning with feature engineering, covering some common string-manipulation tasks and functions used to derive new values (including when working with dates and times).

Manipulate strings

Concatenate multiple columns into one column

R

df$col <- paste0(df$col1, 'separator', df$col2)

Python

df['col'] = df['col1'] + 'separator' + df['col2']

Split one column into multiple columns, based on character

R

# library(tidyr)
df <- separate(df, col, c('col1', 'col2'), 'separator')

Python

# import pandas as pd
df[['col1', 'col2']] = df['col'].str.split('separator', expand = True)

Trim whitespace

R

# library(stringr)
df$col <- str_trim(df$col)

str_trim() removes leading and trailing whitespace, matching Python's .str.strip(); str_squish() also collapses repeated whitespace inside the string.

Python

# import pandas as pd
df['col'] = df['col'].str.strip()

Convert to upper, lower or title case

R

# library(stringr)
df$col <- str_to_upper(df$col)
df$col <- str_to_lower(df$col)
df$col <- str_to_title(df$col)

Python

# import pandas as pd
df['col'] = df['col'].str.upper()
df['col'] = df['col'].str.lower()
df['col'] = df['col'].str.title()

Find and replace (a with b)

R

# library(stringr)
df$col <- str_replace_all(df$col, 'a', 'b')

Python

# import pandas as pd
df['col'] = df['col'].str.replace('a', 'b')

Replace empty strings with NA or NaN

R

df$col[df$col == ''] <- NA

Python

# import numpy as np
df['col'] = df['col'].replace('', np.nan)

Conditional replacement

Replace NAs or NaNs

R

df$col[is.na(df$col)] <- replacement

Python

# import pandas as pd
df['col'] = df['col'].fillna(replacement)

Replace column value when condition met

R

# library(dplyr)
df <- df %>%
  mutate(col = if_else(condition, replacement, col))

Consider case_when() for multiple condition/replacement pairs.

Python

# import pandas as pd
df.loc[condition, 'col'] = replacement
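For multiple condition/replacement pairs, numpy's select() gives a case_when()-style approach in pandas workflows (a sketch with hypothetical conditions and labels):

```python
import numpy as np
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'col': [-5, 0, 7]})

# Conditions are checked in order; the first match wins, like case_when()
conditions = [df['col'] < 0, df['col'] == 0]
choices = ['negative', 'zero']
df['label'] = np.select(conditions, choices, default = 'positive')
```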

Extract dates and times

Extract month (irrespective of year)

R

# library(lubridate)
df$month <- month(df$date, label = TRUE)

Alternatives include year(), day(), wday(), hour() and minute().

Python

# import pandas as pd
df['month'] = df['date'].dt.month

Alternatives include .dt.year, .dt.day, .dt.weekday, .dt.hour and .dt.minute.

Round down to the nearest month (and year)

R

# library(lubridate)
df$month <- floor_date(df$date, unit = 'month')

Consider also round_date() and ceiling_date().

Python

# import pandas as pd
df['month'] = df['date'].dt.to_period('M').dt.to_timestamp()

Consider .dt.floor(), .dt.round() or .dt.ceil() for more granular time units.

Calculate the difference between two datetimes

R

df$col <- difftime(df$time1, df$time2, units = 'hours')

Alternative units include 'mins', 'days' and 'weeks'.

Python

# import pandas as pd
df['col'] = df['time1'] - df['time2']
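The result is a Timedelta; to match R's units = 'hours', divide by a one-hour Timedelta to get a plain number (illustrative data):

```python
import pandas as pd

# Hypothetical example data: two datetime columns three hours apart
df = pd.DataFrame({
    'time1': pd.to_datetime(['2019-10-23 12:00']),
    'time2': pd.to_datetime(['2019-10-23 09:00']),
})

# Dividing a Timedelta by one hour yields a numeric number of hours
df['hours'] = (df['time1'] - df['time2']) / pd.Timedelta(hours = 1)
```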

Calculate aggregates

R

# library(dplyr)
df <- df %>%
  group_by(cat) %>%
  summarise(mean = mean(con, na.rm = TRUE))

This calculates the mean value for each group specified by the categorical variable. Alternatives include min(), median(), sd() and max().

Python

# import pandas as pd
df = df.groupby('cat', as_index = False)['con'].mean()

This calculates the mean value for each group specified by the categorical variable. Alternatives include min(), median(), std() and max().
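Several aggregates can be computed in one call with agg() (a sketch; the data frame and column names are made up):

```python
import pandas as pd

# Hypothetical example data
df = pd.DataFrame({'cat': ['a', 'a', 'b'], 'con': [1.0, 3.0, 5.0]})

# One row per group, one column per aggregate
out = df.groupby('cat')['con'].agg(['mean', 'min', 'max']).reset_index()
```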
