It is commonly said that data scientists spend 80% of their time cleaning and manipulating data, and only 20% of their time actually analyzing it. This course will equip you with all the skills you need to clean your data in Python, from learning how to diagnose problems in your data, to dealing with missing values and outliers.
Led by Daniel Chen, Data Science Consultant at Lander Analytics
You've gotten your hands on a brand new dataset and are itching to start exploring it. How can you be sure your dataset is clean? You'll learn how to explore your data with an eye for diagnosing issues such as outliers, missing values, and duplicate rows.
- Data almost never comes in clean; diagnose your data for problems first
- Common data problems -- missing data, duplicate rows, unexpected data types/values, outliers
- Inspect the data with `df.head()`, `df.shape`, `df.columns`, and `df.info()`
- Frequency counts with `df.continent.value_counts(dropna=False)`
- Summary statistics with `df.describe()`
- Bar plots and histograms

```python
import matplotlib.pyplot as plt

df.population.plot(kind='hist')
plt.show()

df.boxplot(column='population', by='continent')
plt.show()
```
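As a runnable sketch, the inspection calls above can be tried on a small made-up DataFrame (the `continent`/`population` columns are illustrative, not the course's dataset):

```python
import pandas as pd

# Small made-up DataFrame; column names and values are illustrative only
df = pd.DataFrame({
    'continent': ['Asia', 'Africa', 'Asia', None],
    'population': [1400, 1200, 125, 330],
})

print(df.head())             # first few rows
print(df.shape)              # (rows, columns)
print(df.columns.tolist())   # column names
df.info()                    # dtypes and non-null counts

# Frequency counts, keeping missing values visible
counts = df['continent'].value_counts(dropna=False)

# Summary statistics for the numeric columns
stats = df.describe()
```

`dropna=False` is what surfaces the missing `continent` entry as its own count, which is exactly the kind of problem this diagnosis step is meant to catch.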
Learn about the principles of tidy data, and more importantly, why you should care about them and how they make data analysis more efficient.
- “Tidy Data” paper by Hadley Wickham, PhD
- Formalize the way we describe the shape of data
- Gives us a goal when formatting our data
- “Standard way to organize data values within a dataset”
- Principles of tidy data
- Columns represent separate variables
- Rows represent individual observations
- Observational units form tables
- Converting to tidy data, the data problem we are trying to fix:
- Columns containing values, instead of variables
- Solution: `pd.melt()`

```python
pd.melt(frame=df, id_vars='name',
        value_vars=['treatment a', 'treatment b'],
        var_name='treatment', value_name='result')
```
```
  name    treatment  result
Daniel  treatment a     NaN
  John  treatment a      12
  Jane  treatment a      24
Daniel  treatment b      42
  John  treatment b      31
  Jane  treatment b      27
```
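The melt call can be reproduced end to end by typing in the small wide-form table (the missing entry is modeled as `None`):

```python
import pandas as pd

# The wide-form table from the notes, with the missing entry as None
df = pd.DataFrame({
    'name': ['Daniel', 'John', 'Jane'],
    'treatment a': [None, 12, 24],
    'treatment b': [42, 31, 27],
})

# Each treatment column becomes rows of (name, treatment, result)
tidy = pd.melt(frame=df, id_vars='name',
               value_vars=['treatment a', 'treatment b'],
               var_name='treatment', value_name='result')
print(tidy)
```

`var_name` and `value_name` control the headers of the two new columns; without them pandas defaults to `variable` and `value`.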
- Pivoting is the opposite of melting
- In melting, we turned columns into rows
- In pivoting, we turn unique values into separate columns: from an analysis-friendly shape to a reporting-friendly shape
- Pivoting also fixes data that violates the tidy principles, e.g. multiple variables stored in the same column
```python
weather_tidy = weather.pivot(index='date',
                             columns='element',
                             values='value')
print(weather_tidy)
```

```
element     tmax  tmin
date
2010-01-30  27.8  14.5
2010-02-02  27.3  14.4
```
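A self-contained version of that pivot, with the long-form weather table typed in directly (values taken from the printed output above):

```python
import pandas as pd

# Long-form weather data: the 'element' column mixes two variables
weather = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-02-02', '2010-02-02'],
    'element': ['tmax', 'tmin', 'tmax', 'tmin'],
    'value': [27.8, 14.5, 27.3, 14.4],
})

# Each unique 'element' value becomes its own column
weather_tidy = weather.pivot(index='date',
                             columns='element',
                             values='value')
```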
- Using `pivot()` when you have duplicate entries results in an error
- `pivot_table()` has a parameter that specifies how to deal with duplicate values
- Can aggregate the duplicate values by taking their average with the argument `aggfunc=np.mean`
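A sketch of `pivot_table()` resolving duplicates, using a made-up table with two `tmax` readings on the same date (`pivot()` would raise a `ValueError` here):

```python
import numpy as np
import pandas as pd

# Two tmax readings for 2010-01-30: pivot() would fail on this
weather = pd.DataFrame({
    'date': ['2010-01-30', '2010-01-30', '2010-01-30'],
    'element': ['tmax', 'tmax', 'tmin'],
    'value': [27.8, 28.2, 14.5],
})

# aggfunc=np.mean averages the duplicates: (27.8 + 28.2) / 2 = 28.0
weather_tidy = weather.pivot_table(index='date',
                                   columns='element',
                                   values='value',
                                   aggfunc=np.mean)
```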
- Beyond `melt()` and `pivot()` -- data transformation
Your data may not always come in one monolithic file or table for you to load. A large dataset may be broken into separate datasets to facilitate easier storage and sharing. But it's important to be able to run your analysis on a single dataset. Learn how to combine datasets or clean each dataset separately so you can combine them later for analysis.
- Data may not always come in 1 huge file
- Concatenation with `pd.concat([df1, df2, ...])`
- Pass `ignore_index=True` to build a new index in the resulting DataFrame
- Concatenating many files with `glob()`
- Globbing is pattern matching for file names
```python
import glob
import pandas as pd

csv_files = glob.glob('*.csv')
list_data = []
for filename in csv_files:
    data = pd.read_csv(filename)
    list_data.append(data)
pd.concat(list_data)
```
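The effect of `ignore_index=True` can be seen on two tiny in-memory frames, no files needed:

```python
import pandas as pd

df1 = pd.DataFrame({'a': [1, 2]})
df2 = pd.DataFrame({'a': [3, 4]})

# Without ignore_index the row labels 0, 1 would repeat;
# ignore_index=True builds a fresh 0..n-1 index instead
combined = pd.concat([df1, df2], ignore_index=True)
```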
- Combining data with `pd.merge()`
- Similar to joining tables in SQL
- Combine disparate datasets based on common columns
- Types of merges -- One-to-one, Many-to-one/One-to-many, Many-to-many
- All use the same function
- Only difference is the DataFrames you are merging
```python
pd.merge(left=state_populations, right=state_codes,
         left_on='state', right_on='name')
```
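A runnable sketch of that merge with two small made-up tables; the key column is named `state` on the left and `name` on the right, which is why `left_on`/`right_on` are needed instead of a shared `on` column:

```python
import pandas as pd

# Hypothetical tables: the join key is named differently in each
state_populations = pd.DataFrame({
    'state': ['California', 'Texas'],
    'population': [39_500_000, 29_100_000],
})
state_codes = pd.DataFrame({
    'name': ['California', 'Texas'],
    'code': ['CA', 'TX'],
})

# One-to-one merge on the common key values
merged = pd.merge(left=state_populations, right=state_codes,
                  left_on='state', right_on='name')
```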
Learn about string manipulation and pattern matching to deal with unstructured data, and then explore techniques to deal with missing or duplicate data. You'll also learn the valuable skill of programmatically checking your data for consistency, which will give you confidence that your code is running correctly and that the results of your analysis are reliable.
- Converting data types
- Categorical data
- Can make the DataFrame smaller in memory
- Can make the data usable by other Python libraries for analysis
- Cleaning bad data
```python
df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
df.dtypes
```
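The conversions run end to end on a small made-up DataFrame; the `'-'` entry stands in for a bad reading, and `errors='coerce'` turns it into `NaN` rather than raising:

```python
import pandas as pd

# Illustrative data: '-' marks a value that cannot be parsed as a number
df = pd.DataFrame({
    'treatment a': ['12', '24', '-'],
    'treatment b': [42, 31, 27],
    'sex': ['m', 'f', 'f'],
})

df['treatment b'] = df['treatment b'].astype(str)
df['sex'] = df['sex'].astype('category')
# errors='coerce' converts unparseable strings to NaN instead of raising
df['treatment a'] = pd.to_numeric(df['treatment a'], errors='coerce')
print(df.dtypes)
```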
- Regular expressions to clean strings
- Much of data cleaning involves string manipulation
- Have to do string manipulation to make datasets consistent with one another
- `re` library for regular expressions
```python
import re

pattern = re.compile(r'\$\d*\.\d{2}')
result = pattern.match('$17.89')
bool(result)
```
- Functions to clean data, e.g. `df.apply(lambda column: column.mean(), axis=0)`
```python
def diff_money(row, pattern):
    # ...
    ...

df_subset['diff'] = df_subset.apply(diff_money,
                                    axis=1, pattern=pattern)
```
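An illustrative, runnable version of this pattern; the column names and the body of `diff_money` below are assumptions for the sketch, not the course's exact code. A row-wise function parses two dollar-amount strings and returns their difference:

```python
import re
import pandas as pd

# Hypothetical pattern capturing the numeric part of a dollar amount
pattern = re.compile(r'\$(\d+\.\d{2})')

def diff_money(row, pattern):
    # Parse both amounts from the row, then subtract (sketch only)
    icost = float(pattern.match(row['Initial Cost']).group(1))
    fee = float(pattern.match(row['Total Est. Fee']).group(1))
    return icost - fee

# Hypothetical subset with string-encoded dollar amounts
df_subset = pd.DataFrame({
    'Initial Cost': ['$100.00', '$50.50'],
    'Total Est. Fee': ['$10.00', '$0.50'],
})

# axis=1 applies diff_money to each row; pattern is forwarded as a kwarg
df_subset['diff'] = df_subset.apply(diff_money, axis=1, pattern=pattern)
```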
- Duplicate data can be dropped with `df.drop_duplicates()`
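A minimal sketch with invented data: rows that are exact duplicates across every column are removed, and the first occurrence is kept.

```python
import pandas as pd

# The second 'Jane' row duplicates the first across all columns
df = pd.DataFrame({'name': ['Jane', 'Jane', 'John'],
                   'score': [24, 24, 31]})

deduped = df.drop_duplicates()
```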
- Missing data -- leave as-is, drop them, fill missing value
- Fill missing values with a summary statistic, e.g. the mean
```python
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
```
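The three fill strategies, runnable on a made-up tips-style table (column names follow the snippet above; the values are invented):

```python
import numpy as np
import pandas as pd

# Invented data with one NaN per column
tips_nan = pd.DataFrame({
    'total_bill': [16.99, np.nan, 21.01],
    'tip': [1.01, 1.66, np.nan],
    'sex': ['Female', np.nan, 'Male'],
    'size': [2, 3, np.nan],
})

# Categorical column: fill with a placeholder label
tips_nan['sex'] = tips_nan['sex'].fillna('missing')
# Numeric columns: fill with a constant
tips_nan[['total_bill', 'size']] = tips_nan[['total_bill', 'size']].fillna(0)
# Fill with a summary statistic computed from the column itself
mean_value = tips_nan['tip'].mean()
tips_nan['tip'] = tips_nan['tip'].fillna(mean_value)
```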
- Assert statements to programmatically check the data

```python
google_0 = google.fillna(value=0)
assert google_0.Close.notnull().all()
```
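A self-contained version of the check (the `Close` prices are invented): `.notnull().all()` reduces the column to a single boolean, so the assert fails loudly if any missing value survived the fill.

```python
import numpy as np
import pandas as pd

# Invented stock prices with one missing entry
google = pd.DataFrame({'Close': [742.0, np.nan, 738.5]})

google_0 = google.fillna(value=0)

# Passes silently because every NaN was replaced by 0
assert google_0.Close.notnull().all()
```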
Apply all of the data cleaning techniques you've learned in this course toward tidying a real-world, messy dataset obtained from the Gapminder Foundation.
- Dataset consists of life expectancy by country and year