Summary of "Cleaning Data in Python" course on Datacamp

It is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose problems in your data, to dealing with missing values and outliers.

Led by Daniel Chen, Data Science Consultant at Lander Analytics

Exploring your data

You've gotten your hands on a brand new dataset and are itching to start exploring it. How can you be sure your dataset is clean? You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows.

  • Data almost never comes in clean; diagnose your data for problems
  • Common data problems -- missing data, duplicate rows, unexpected data types/values, outliers
  • Inspect the data with df.head(), df.shape, df.columns, and df.info() -- consolidated in the sketch after the plots below
  • Frequency counts df.continent.value_counts(dropna=False)
  • Summary statistics df.describe()
  • Bar plots and histograms
import matplotlib.pyplot as plt

df.population.plot(kind='hist')
plt.show()

df.boxplot(column='population', by='continent')
plt.show()
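A consolidated sketch of the inspection calls above, assuming a DataFrame with continent and population columns like the one used in the plots (the file name is hypothetical):

import pandas as pd

df = pd.read_csv('gapminder.csv')

# Dimensions, column names, dtypes, and non-null counts
print(df.shape)
print(df.columns)
df.info()

# Frequency counts, keeping NaN visible to spot missing data
print(df['continent'].value_counts(dropna=False))

# Summary statistics to flag outliers
print(df.describe())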

Tidying data for analysis

Learn about the principles of tidy data, and more importantly, why you should care about them and how they make data analysis more efficient.

  • “Tidy Data” paper by Hadley Wickham, PhD
    • Formalize the way we describe the shape of data
    • Gives us a goal when formatting our data
    • “Standard way to organize data values within a dataset”
  • Principles of tidy data
    • Columns represent separate variables
    • Rows represent individual observations
    • Each type of observational unit forms a table
  • Converting to tidy data, the data problem we are trying to fix:
    • Columns containing values, instead of variables
    • Solution: pd.melt()
pd.melt(frame=df, id_vars='name', 
        value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')

     name    treatment  result
0  Daniel  treatment a     NaN
1    John  treatment a    12.0
2    Jane  treatment a    24.0
3  Daniel  treatment b    42.0
4    John  treatment b    31.0
5    Jane  treatment b    27.0
  • Pivoting is the opposite of melting
    • In melting, we turned columns into rows
    • Pivoting turns unique values into separate columns: from an analysis-friendly shape back to a reporting-friendly shape
    • Also used when the data violates tidy principles, e.g. multiple variables stored in the same column (the element column below holds both tmax and tmin)
weather_tidy = weather.pivot(index='date', 
                             columns='element',
                             values='value')
print(weather_tidy)

element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  14.4
  • Using pivot() when you have duplicate entries results in an error
  • pivot_table() has a parameter that specifies how to deal with duplicate values
    • Can aggregate the duplicate values by taking their average with argument aggfunc=np.mean (see the sketch after this list)
  • Beyond melt() and pivot() -- data transformation
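A small sketch of pivot_table() on duplicate entries; the weather rows are invented for illustration:

import numpy as np
import pandas as pd

# Two tmax readings for the same date -- pivot() would raise here
weather = pd.DataFrame({
    'date':    ['2010-01-30', '2010-01-30', '2010-01-30'],
    'element': ['tmax', 'tmax', 'tmin'],
    'value':   [27.8, 28.2, 14.5],
})

# pivot_table() aggregates the duplicates instead of raising
weather_tidy = weather.pivot_table(index='date',
                                   columns='element',
                                   values='value',
                                   aggfunc=np.mean)
print(weather_tidy)   # tmax is the mean of 27.8 and 28.2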

Combining data for analysis

Your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. But it's important to be able to run your analysis on a single dataset. Learn how to combine datasets or clean each dataset separately so you can combine them later for analysis.

  • Data may not always come in 1 huge file
  • Concatenation with pd.concat([df1, df2, ...])
  • Pass ignore_index=True to build a new index in the resulting DataFrame
  • Concatenating many files with glob()
    • Globbing is pattern matching for file names
import glob
import pandas as pd

# Collect all matching file names, then load and stack them
csv_files = glob.glob('*.csv')
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
  • Combining data with pd.merge()
    • Similar to joining tables in SQL
    • Combine disparate datasets based on common columns
  • Types of merges -- One-to-one, Many-to-one/One-to-many, Many-to-many
    • All use the same function
    • Only difference is the DataFrames you are merging
pd.merge(left=state_populations, right=state_codes, 
         on=None, left_on='state', right_on='name')
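A minimal sketch of that merge with invented rows; the column names state/name/ANSI are assumptions consistent with the call above:

import pandas as pd

state_populations = pd.DataFrame({
    'state': ['California', 'Texas'],
    'population': [39250017, 27862596],
})
state_codes = pd.DataFrame({
    'name': ['California', 'Texas'],
    'ANSI': ['CA', 'TX'],
})

# Keys live under different column names, hence left_on/right_on
merged = pd.merge(left=state_populations, right=state_codes,
                  left_on='state', right_on='name')
print(merged)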

Cleaning data for analysis

Learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable.

  • Converting data types
  • Categorical data
    • Can make the DataFrame smaller in memory
    • Can be utilized by other Python libraries for analysis
  • Cleaning bad data
# Convert column types: string, category, numeric
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')

# errors='coerce' turns values that cannot be parsed into NaN
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')

# Verify the conversions
df.dtypes
  • Regular expressions to clean strings
    • Much of data cleaning involves string manipulation
    • Have to do string manipulation to make datasets consistent with one another
    • re library for regular expressions
import re

# Raw string so the backslashes reach the regex engine
pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')
bool(result)   # True if the string starts with a dollar amount
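For pulling every match out of a longer string, re.findall() is the usual companion; a minimal sketch (the example sentence is invented):

import re

# Find all runs of one or more digits
matches = re.findall(r'\d+', 'the recipe calls for 10 strawberries and 1 banana')
print(matches)   # ['10', '1']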
  • Functions to clean data with df.apply(), e.g. df.apply(lambda column: column.mean(), axis=0)
def diff_money(row, pattern):
    # Extract the two dollar amounts from the row with `pattern`
    # and return their difference (body elided in the course notes)
    ...

# axis=1 applies the function to each row; the extra keyword
# argument is forwarded to diff_money
df_subset['diff'] = df_subset.apply(diff_money,
                                    axis=1, pattern=pattern)
  • Duplicate data can be treated with df.drop_duplicates()
  • Missing data -- leave as-is, drop them, fill missing value
  • Fill missing values with a summary statistic such as the mean
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)

# Compute the mean of the column, then fill the holes with it
mean_value = tips_nan['tip'].mean()
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
  • Assert statements
google_0 = google.fillna(value=0)
assert google_0.Close.notnull().all()
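The same guard can cover every column at once; a short sketch, assuming the same google_0 DataFrame:

# The first .all() reduces each column to one boolean,
# the second reduces those booleans to a single value
assert google_0.notnull().all().all()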

Case study

Apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation.

  • Dataset consists of life expectancy by country and year
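A minimal sketch of the tidying step, assuming the Gapminder file is wide (one column per year, plus a country column named 'Life expectancy'); the file name and column names are assumptions:

import pandas as pd

gapminder = pd.read_csv('gapminder.csv')

# Melt the per-year columns into tidy (country, year, value) rows
gapminder_melt = pd.melt(frame=gapminder, id_vars='Life expectancy',
                         var_name='year', value_name='life_expectancy')

# Years arrive as strings after the melt; coerce and sanity-check
gapminder_melt['year'] = pd.to_numeric(gapminder_melt['year'])
assert gapminder_melt['year'].notnull().all()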