It is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose problems in your data, to dealing with missing values and outliers.
Led by Daniel Chen, Data Science Consultant at Lander Analytics
You've gotten your hands on a brand new dataset and are itching to start exploring it. How can you be sure your dataset is clean? You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows.
- Data almost never comes in clean; diagnose your data for problems first
- Common data problems -- missing data, duplicate rows, unexpected data types/values, outliers
- Inspect the data with `df.head()`, `df.shape`, `df.columns`, and `df.info()`
- Frequency counts with `df.continent.value_counts(dropna=False)`
- Summary statistics with `df.describe()`
- Bar plots and histograms

```python
import matplotlib.pyplot as plt

df.population.plot(kind='hist')
plt.show()

df.boxplot(column='population', by='continent')
plt.show()
```
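As a runnable sketch, the inspection calls above can be tried on a small made-up DataFrame (the `continent`/`population` columns are illustrative, not the course's dataset):

```python
import pandas as pd

# Small made-up DataFrame; column names and values are illustrative only
df = pd.DataFrame({
    'continent': ['Asia', 'Africa', 'Asia', None],
    'population': [1400, 1200, 125, 330],
})

print(df.head())             # first few rows
print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names
df.info()                    # dtypes and non-null counts

# Frequency counts, keeping missing values visible
counts = df['continent'].value_counts(dropna=False)

# Summary statistics for the numeric columns
stats = df.describe()
```

`dropna=False` is what surfaces the missing `continent` entry as its own count, which is exactly the kind of problem this diagnosis step is meant to catch.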
Learn about the principles of tidy data, and more importantly, why you should care about them and how they make data analysis more efficient.
- “Tidy Data” paper by Hadley Wickham, PhD
- Formalize the way we describe the shape of data
- Gives us a goal when formatting our data
- “Standard way to organize data values within a dataset”
- Principles of tidy data
- Columns represent separate variables
- Rows represent individual observations
- Observational units form tables
- Converting to tidy data, the data problem we are trying to fix:
- Columns containing values, instead of variables
- Solution: `pd.melt()`

```python
pd.melt(frame=df, id_vars='name',
        value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')
```
```
  name    treatment  result
Daniel  treatment a     NaN
  John  treatment a      12
  Jane  treatment a      24
Daniel  treatment b      42
  John  treatment b      31
  Jane  treatment b      27
```
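The melt call can be reproduced end to end by typing in the small wide-form table (the missing entry is modeled as `None`):

```python
import pandas as pd

# The wide-form table from the notes, with the missing entry as None
df = pd.DataFrame({
    'name': ['Daniel', 'John', 'Jane'],
    'treatment a': [None, 12, 24],
    'treatment b': [42, 31, 27],
})

# Each treatment column becomes rows of (name, treatment, result)
tidy = pd.melt(frame=df, id_vars='name',
               value_vars=['treatment a', 'treatment b'],
               var_name='treatment', value_name='result')
print(tidy)
```

`var_name` and `value_name` control the headers of the two new columns; without them pandas defaults to `variable` and `value`.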
- Pivoting is the opposite of melting
- In melting, we turned columns into rows
- In pivoting, we turn unique values into separate columns: from an analysis-friendly shape to a reporting-friendly shape
- Pivoting also fixes data that violates the tidy principles, e.g. multiple variables stored in the same column
```python
weather_tidy = weather.pivot(index='date',
                             columns='element',
                             values='value')
print(weather_tidy)
```

```
element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  14.4
```
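A self-contained version of that pivot, with the long-form weather table typed in directly (values taken from the printed output above):

```python
import pandas as pd

# Long-form weather data: the 'element' column mixes two variables
weather = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'value': [27.8, 14.5, 27.3, 14.4],
})

# Each unique 'element' value becomes its own column
weather_tidy = weather.pivot(index='date',
                             columns='element',
                             values='value')
```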
- Using `pivot()` when you have duplicate entries results in an error
- `pivot_table()` has a parameter that specifies how to deal with duplicate values
- Can aggregate the duplicate values by taking their average with the argument `aggfunc=np.mean`
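A sketch of `pivot_table()` resolving duplicates, using a made-up table with two `tmax` readings on the same date (`pivot()` would raise a `ValueError` here):

```python
import numpy as np
import pandas as pd

# Two tmax readings for 2010-01-30: pivot() would fail on this
weather = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-01-30'],
    'element': ['tmax', 'tmax', 'tmin'],
    'value': [27.8, 28.2, 14.5],
})

# aggfunc=np.mean averages the duplicates: (27.8 + 28.2) / 2 = 28.0
weather_tidy = weather.pivot_table(index='date',
                                   columns='element',
                                   values='value',
                                   aggfunc=np.mean)
```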
- Beyond `melt()` and `pivot()` -- data transformation
Your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. But it's important to be able to run your analysis on a single dataset. Learn how to combine datasets or clean each dataset separately so you can combine them later for analysis.
- Data may not always come in 1 huge file
- Concatenation with `pd.concat([df1, df2, ...])`
- Pass `ignore_index=True` to build a new index in the resulting DataFrame
- Concatenating many files with `glob()`
- Globbing is pattern matching for file names
```python
import glob
import pandas as pd

csv_files = glob.glob('*.csv')
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
```
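The effect of `ignore_index=True` can be seen on two tiny in-memory frames, no files needed:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Without ignore_index the row labels 0, 1 would repeat;
# ignore_index=True builds a fresh 0..n-1 index instead
combined = pd.concat([df1, df2], ignore_index=True)
```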
- Combining data with `pd.merge()`
- Similar to joining tables in SQL
- Combine disparate datasets based on common columns
- Types of merges -- One-to-one, Many-to-one/One-to-many, Many-to-many
- All use the same function
- Only difference is the DataFrames you are merging
```python
pd.merge(left=state_populations, right=state_codes,
         left_on='state', right_on='name')
```
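A runnable sketch of that merge with two small made-up tables; the key column is named `state` on the left and `name` on the right, which is why `left_on`/`right_on` are needed instead of a shared `on` column:

```python
import pandas as pd

# Hypothetical tables: the join key is named differently in each
state_populations = pd.DataFrame({
    'state': ['California', 'Texas'],
    'population': [39_500_000, 29_100_000],
})
state_codes = pd.DataFrame({
    'name': ['California', 'Texas'],
    'code': ['CA', 'TX'],
})

# One-to-one merge on the common key values
merged = pd.merge(left=state_populations, right=state_codes,
                  left_on='state', right_on='name')
```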
Learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable.
- Converting data types
- Categorical data
- Can make the DataFrame smaller in memory
- Can make the data usable by other Python libraries for analysis
- Cleaning bad data
```python
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
df.dtypes
```
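The conversions run end to end on a small made-up DataFrame; the `'-'` entry stands in for a bad reading, and `errors='coerce'` turns it into `NaN` rather than raising:

```python
import pandas as pd

# Illustrative data: '-' marks a value that cannot be parsed as a number
df = pd.DataFrame({
    'treatment a': ['12', '24', '-'],
    'treatment b': [42, 31, 27],
    'sex': ['m', 'f', 'f'],
})

df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
# errors='coerce' converts unparseable strings to NaN instead of raising
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
print(df.dtypes)
```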
- Regular expressions to clean strings
- Much of data cleaning involves string manipulation
- Have to do string manipulation to make datasets consistent with one another
- `re` library for regular expressions
```python
import re

pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')
bool(result)
```
- Functions to clean data, e.g. `df.apply(lambda column: column.mean(), axis=0)`
```python
def diff_money(row, pattern):
    # ...
    ...

df_subset['diff'] = df_subset.apply(diff_money,
                                    axis=1, pattern=pattern)
```
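An illustrative, runnable version of this pattern; the column names and the body of `diff_money` below are assumptions for the sketch, not the course's exact code. A row-wise function parses two dollar-amount strings and returns their difference:

```python
import re
import pandas as pd

# Hypothetical pattern capturing the numeric part of a dollar amount
pattern = re.compile(r'\$(\d+\.\d{2})')

def diff_money(row, pattern):
    # Parse both amounts from the row, then subtract (sketch only)
    icost = float(pattern.match(row['Initial Cost']).group(1))
    fee = float(pattern.match(row['Total Est. Fee']).group(1))
    return icost - fee

# Hypothetical subset with string-encoded dollar amounts
df_subset = pd.DataFrame({
    'Initial Cost': ['$100.00', '$50.50'],
    'Total Est. Fee': ['$10.00', '$0.50'],
})

# axis=1 applies diff_money to each row; pattern is forwarded as a kwarg
df_subset['diff'] = df_subset.apply(diff_money, axis=1, pattern=pattern)
```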
- Duplicate data can be dropped with `df.drop_duplicates()`
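A minimal sketch with invented data: rows that are exact duplicates across every column are removed, and the first occurrence is kept.

```python
import pandas as pd

# The second 'Jane' row duplicates the first across all columns
df = pd.DataFrame({'name': ['Jane', 'Jane', 'John'],
                   'score': [24, 24, 31]})

deduped = df.drop_duplicates()
```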
- Missing data -- leave as-is, drop them, fill missing value
- Fill missing values with a summary statistic, e.g. the mean
```python
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
```
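The three fill strategies, runnable on a made-up tips-style table (column names follow the snippet above; the values are invented):

```python
import numpy as np
import pandas as pd

# Invented data with one NaN per column
tips_nan = pd.DataFrame({
    'total_bill': [16.99, np.nan, 21.01],
    'tip': [1.01, 1.66, np.nan],
    'sex': ['Female', np.nan, 'Male'],
    'size': [2, 3, np.nan],
})

# Categorical column: fill with a placeholder label
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
# Numeric columns: fill with a constant
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
# Fill with a summary statistic computed from the column itself
mean_value = tips_nan['tip'].mean()
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
```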
- Assert statements to programmatically check the data

```python
google_0 = google.fillna(value=0)
assert google_0.Close.notnull().all()
```
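A self-contained version of the check (the `Close` prices are invented): `.notnull().all()` reduces the column to a single boolean, so the assert fails loudly if any missing value survived the fill.

```python
import numpy as np
import pandas as pd

# Invented stock prices with one missing entry
google = pd.DataFrame({'Close': [742.0, np.nan, 738.5]})

google_0 = google.fillna(value=0)

# Passes silently because every NaN was replaced by 0
assert google_0.Close.notnull().all()
```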
Apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation.
- Dataset consists of life expectancy by country and year