Data cleaning

What is data cleaning?

Data cleaning is simply the process of preparing data for analysis by modifying it, adding to it, or removing parts of it. This process is also commonly referred to as data preprocessing.

Why do we need to clean our data?

At times, data needs to be cleaned/preprocessed before we can extract useful information from it. Most real-life data has many inconsistencies, such as missing values, non-informative features and so on; as such, there's a constant need to clean our data before using it so we can get the best from it.

The machine learning process works with numbers only. If we intend to use data that has text, datetime features and other non-numeric features, we need to find a way to represent them with numbers without losing the information they carry.

Say for example:

A simple linear regression model maps an independent variable to a dependent variable using a linear function such as: $y = wX + b$

Where:

w is the weight, which could be any number,

and b is the bias, which is also a number.

If you're given a set of data, [small, medium, big], you find that it's impossible to compute:

$y = w \times \text{small} + b$.

But if you encode small as 1, medium as 2 and big as 3, thereby transforming your data to [1, 2, 3], you discover that you're not only able to compute $y = w \times 1 + b$, but the information encoded in the categories is still retained. Cases like this and more are our business in data cleaning and preprocessing; a minimal sketch of the idea follows below.
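
To make that concrete, here's a minimal sketch of the encoding step; the weight and bias values are arbitrary and purely for illustration:

sizes = ['small', 'medium', 'big']

# encode the categories as numbers while keeping their order
encoding = {'small': 1, 'medium': 2, 'big': 3}
encoded = [encoding[s] for s in sizes]     # [1, 2, 3]

w, b = 0.5, 2.0                            # hypothetical weight and bias
y = [w * x + b for x in encoded]           # y = wX + b is now computable
print(y)                                   # [2.5, 3.0, 3.5]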

What are the steps in data cleaning?

Data cleaning is just the collective name for a series of actions we perform on our data in the process of getting it ready for analysis.

Some of the steps in data cleaning are:

  • Handling missing values
  • Encoding categorical features
  • Outliers detection
  • Transformations, etc.

Handling Missing Values.

One of the most common problems encountered while preprocessing data is that of missing values. They are very common and could be a result of:

  1. Omission by the person filling out the data, either intentionally or unintentionally, or because the field simply doesn't apply to them.
  2. Omission by the person who entered the data into the computer.

A data scientist, if not careful, could draw a wrong inference from data that contains missing values, and that is why we need to study this phenomenon and learn to tackle it effectively.

Missing values generally fall into three categories, which are:

  1. Missing completely at random (MCAR): A value is missing completely at random if we have no information, reason or anything whatsoever that can help compute it.
  2. Missing at random (MAR): A value is missing at random if we have information, reason or anything whatsoever (especially from other known values) that can help compute it.
  3. Not missing at random (NMAR): The value of the variable that's missing is related to the reason it's missing.

How do we detect missing values in Python?

Before we handle missing values, it's only logical that we learn how to detect them and decide how to handle our data based on the volume of missing values, how much data we have and so on. One approach I enjoy and use often is to set a threshold for a column to decide if it's repairable or beyond repair, lol (50-60% does it for me).

Below, you'll see a function that implements the idea we discussed in the previous paragraph. Yaay!!

def missing_removal(df, thresh, confirm=None):
    # collect the columns whose fraction of missing values exceeds the threshold
    holder = {}
    for col in df.columns:
        rate = df[col].isnull().sum() / df.shape[0]
        if rate > thresh:
            holder[col] = rate

    if confirm:
        # drop every column above the threshold
        df.drop(columns=[i for i in holder], inplace=True)
        return df
    else:
        print(f'Number of columns that have NaN values above the threshold specified: {len(holder)}')
        return holder

Quick note:

If confirm is set to True, all columns whose missing-value percentage is higher than the threshold are dropped, and if confirm is set to None or False, the function returns a dictionary of the missing-value percentages for the columns above the threshold. Try it out on your data. It's so cool.
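
For instance, a quick usage sketch (the file name and the 0.6 threshold here are purely illustrative):

import pandas as pd

df = pd.read_csv('data.csv')               # hypothetical dataset

# inspect which columns are beyond repair at a 60% threshold
report = missing_removal(df, thresh=0.6)

# drop those columns once you're sure
df = missing_removal(df, thresh=0.6, confirm=True)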

Now that we're able to detect the missing values in our data, and have dropped the unredeemable columns and kept the redeemable ones, let's go ahead and discuss how to actually fix them. Yasss!!!

There are two ways to go about this:

  1. Statistical imputation:

    This is a method that has proven to be effective over time. All you do is simply fill the missing data in a column with the mean, median or mode of that column. This works, guys, trust me.

    Scikit-learn provides a class called SimpleImputer to handle our missing values this way.

    Below is a short snippet that will help you understand statistical imputation better.

    import pandas as pd
    from sklearn.impute import SimpleImputer

    # store the columns of the dataframe
    cols = df.columns

    # instantiate the SimpleImputer class
    # depending on how you want to impute, you can use most_frequent, mean, median or constant
    imputer = SimpleImputer(strategy='most_frequent')
    df = imputer.fit_transform(df)

    # fit_transform returns an array, so convert back to a dataframe
    df = pd.DataFrame(df, columns=cols)
  2. Multiple Imputation by Chained Equations (MICE):

In this method, a column's missing values are modelled as a function of the other columns in the data. The process is repeated until the difference between the previously imputed values and the current ones is small enough to fall below a given tolerance.

Scikit-learn provides a class called IterativeImputer to handle our missing values this way.

Below is a short snippet that will hopefully solidify this concept for you.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, needed to unlock the experimental IterativeImputer
from sklearn.impute import IterativeImputer

# store the columns of the dataframe
cols = df.columns

# instantiate the IterativeImputer class
imputer = IterativeImputer(max_iter=3)
df = imputer.fit_transform(df)

# convert back to dataframe
df = pd.DataFrame(df, columns=cols)

Hopefully, all we've discussed so far is enough for you to handle missing values effectively.

Encoding Categorical Features.

What are categorical features?

Well my friends, categorical features are features that take on discrete values only. They do not take on continuous values such as 3.45, 2.67 and so on. A categorical feature could have values like big, medium and small; 1-5 as a ranking; Yes and No; 1 and 0; yellow, red, blue, etc. They basically represent categories, e.g. age groups, countries, colors, gender and so on.

Oftentimes they come in text format; other times they come in number format (many people fail to identify them in the latter case).

How to identify categorical features:

This shouldn't be a big deal: most times, categorical features come in text format, and that's all you need to identify them.

What if it's a form of ranking and the values actually come in number format? What to do is check the number of unique values in a column and compare it to the number of rows in the column. Say, for example, a column has 2000 rows and just 5 or 10 unique values; you most likely don't need a sage to tell you that the column is categorical. There isn't a rule to this, it's merely intuition; you could be right or wrong. So just be free, try out stuff. A quick sketch of this check is shown below.
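
Here's a minimal sketch of that check, assuming a pandas DataFrame df; the cut-off of 10 unique values is just an illustrative choice:

def likely_categorical(df, max_unique=10):
    # a column with only a handful of unique values is most likely categorical
    candidates = {}
    for col in df.columns:
        n_unique = df[col].nunique()
        if n_unique <= max_unique:
            candidates[col] = n_unique
    return candidates

# e.g. likely_categorical(df) might return {'size': 3, 'gender': 2}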

Classes of encoding:

Given a categorical feature, as we've examined above, we're faced with the problem of converting the unique categories in each feature to numbers whilst not losing the information encoded in them. There are various methods of encoding categorical features, chosen based on some observable properties of the feature.

There are also two classes of categorical features:

  1. Ordinal categorical features: There is an inherent order or relationship between the categories of this kind of feature, e.g. size (small, medium, big), age groups and so on.
  2. Nominal categorical features: There's no sensible order between the categories of the feature, e.g. countries, cities, names, etc.

The two classes above are approached differently. Some of the methods I use for nominal categorical features are:

  • One-hot encoding

  • Frequency/ count encoding

  • Target Mean encoding

  • Ordered Integer Encoding

  • Binary encoding

  • Leave One Out Encoding

  • Weight of evidence encoding

For ordinal categorical features, we only use:

  • Label encoding or ordinal encoding

Now we'll work through some of these methods one after the other. We'll also implement them in Python where possible, using a Python library called category_encoders.

You can install the library with pip install category_encoders.

One-hot encoding

One-hot encoding is one of the most effective ways of encoding nominal categorical features. This method creates a new binary column for each category in the feature. Ideally, we drop one of these columns to avoid collinearity between them; therefore, a feature with K unique categories will yield K-1 extra columns in the data.

A downside to this method is that it expands the feature space when a feature has many unique categories or there are many categorical features in the data.

(Illustration: one-hot encoding of a categorical column into binary columns.)

The illustration above explains the concept of one-hot encoding. Encoding features in this manner eliminates any form of hierarchy.

An implementation of this method in python is described below:

import pandas as pd

# cols is the list of categorical columns to encode
data = pd.get_dummies(data, columns=cols, drop_first=True)
# check https://pandas.pydata.org/docs/reference/api/pandas.get_dummies.html for more info

Frequency/ count encoding

This method is equally very effective. It introduces hierarchy to nominal categorical features based on how often each category occurs in the feature (column). Frequency and count encoding are very similar; the difference is that count encoding can take any value, while frequency encoding is normalized to the range 0 to 1.

(Illustration: frequency encoding of a categorical column.)

An implementation of this method in python is described below:

# Frequency encoding
# cols is the list of columns we wish to encode
# df is the DataFrame
def freq_enc(df, cols):
    for col in cols:
        # map each category to its relative frequency in the column
        df[col] = df[col].map((df[col].value_counts() / len(df)).round(4))
    return df

# Count encoding
def count_enc(df, cols):
    for col in cols:
        # map each category to the number of times it occurs in the column
        df[col] = df[col].map(df[col].value_counts())
    return df

Target Mean Encoding

The idea behind this method is actually very brilliant. What makes it unique is that it uses the target column in its computation, something you find very rarely in machine learning practice generally. Each category in the categorical feature is replaced by the mean of the target column for that category only.

The method is an excellent one, but it could lead to over-fitting if it encodes too much information about the target; hence, before using this method, we need to make sure the categorical feature isn't too highly correlated with the target column.

The test data are encoded with stored values from the train data through mapping.

(Illustration: target mean encoding of a categorical column.)

An implementation of this method in python is described below:

import numpy as np

# target mean encoding
# cols is the list of columns to encode
# target_col is the name of the target column (str)
def target_mean_enc(df, cols, target_col):
    mean_holder = {}
    for col in cols:
        col_mean = {}
        cat = list(df[col].unique())
        for i in cat:
            # mean of the target over the rows belonging to this category
            data = df[df[col] == i]
            mean = np.mean(data[target_col])
            col_mean[i] = mean
        mean_holder[col] = col_mean
    return mean_holder

The function above returns a dictionary containing the per-category target means for each encoded column. The dictionary is then mapped onto the data. See below:

(Screenshot: mapping the returned dictionary onto the data.)
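
Since the screenshot isn't reproduced here, a minimal sketch of that mapping step (the column names 'size' and 'target' are purely illustrative):

mean_holder = target_mean_enc(df, cols=['size'], target_col='target')

# replace each category with the target mean stored for it
for col, mapping in mean_holder.items():
    df[col] = df[col].map(mapping)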

Ordered Integer Encoding

This method is very similar to target mean encoding, except that it goes a step further and ranks the categories based on the magnitude of their target mean.

(Illustration: categories ranked by the magnitude of their target mean.)

After implementing the ordered integer encoding, the dataframe looks like this:

(Illustration: the dataframe after ordered integer encoding.)

An implementation of this method in python is described below:

# ordered integer encoding
# cols is the list of columns to encode
# target_col is the name of the target column (str)
def ordered_integer_encoder(data, cols, target_col):
    mean_holder = {}
    for col in cols:
        # rank the categories by their target mean, in ascending order
        labels = list(enumerate(data.groupby([col])[target_col].mean().sort_values().index))
        col_mean = {value: order for order, value in labels}
        mean_holder[col] = col_mean
    return mean_holder
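
As with target mean encoding, the returned dictionary is then mapped onto the data; a quick sketch with illustrative column names:

rank_holder = ordered_integer_encoder(df, cols=['size'], target_col='target')

# replace each category with its rank
for col, mapping in rank_holder.items():
    df[col] = df[col].map(mapping)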

Leave One Out Encoding

This method is also very similar to target mean encoding, except that for each row, the target mean of its category is calculated whilst leaving that particular row out. Plain target mean encoding is still used on the test data though.


An implementation of this method in python is described below:

from category_encoders import LeaveOneOutEncoder

# leave one out encoding is supervised, so it needs the target when fitting;
# the target column name 'target' here is just an assumption
loo = LeaveOneOutEncoder(cols=['STATUS'])
train = loo.fit_transform(train_data, train_data['target'])
test = loo.transform(test_data)

Binary encoding

Binary encoding works uniquely; it is almost similar to one-hot encoding in the sense that it also creates extra features. Firstly, it assigns ranks to the unique categories based on the order in which they occur in the data (fret not, this rank has no meaning). The ranks are then converted to the base-two number system (binary), and the individual binary digits are then split into separate columns.

(Illustration: binary encoding of a categorical column.)

A demo of this method using the category_encoders library is shown below:

from category_encoders import BinaryEncoder

# encode the 'STATUS' column into its binary representation
binary = BinaryEncoder(cols=['STATUS'])
binary.fit(data)
train = binary.transform(train_data)
test = binary.transform(test_data)

Weight of evidence encoding

This is a method that has been used in credit risk analysis for up to seven decades now. It is commonly used to transform features for a logistic regression task, as it helps to reveal correlations that may not be visible to us otherwise. This method can only be used in a binary classification task.

It simply transforms categorical columns by applying $\ln\left(\frac{p(\text{good})}{p(\text{bad})}\right)$ to each category in the column,

where:

$p(\text{good})$ is one class of the target column, say $p(1)$,

and $p(\text{bad})$ is the second class, which would then be $p(0)$.

(Illustration: weight of evidence encoding.)
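
Before the library version, here is a rough manual sketch of one common formulation of the formula above for a binary target (the column names 'STATUS' and 'target' are just assumptions):

import numpy as np

def woe_map(df, col, target_col):
    # share of the good (target == 1) rows that fall in each category
    good = df[df[target_col] == 1][col].value_counts() / (df[target_col] == 1).sum()
    # share of the bad (target == 0) rows that fall in each category
    bad = df[df[target_col] == 0][col].value_counts() / (df[target_col] == 0).sum()
    return np.log(good / bad)

# df['STATUS'] = df['STATUS'].map(woe_map(df, 'STATUS', 'target'))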

An implementation of this method using category_encoders is described below:

from category_encoders import WOEEncoder

# the weight of evidence encoder is supervised, so the binary target is needed when fitting;
# the target column name 'target' here is just an assumption
woe = WOEEncoder(cols=['STATUS'])
woe.fit(train_data, train_data['target'])
train = woe.transform(train_data)
test = woe.transform(test_data)

Label encoding or Ordinal encoding

This method is used for encoding ordinal categorical features; we simply assign numbers to each category based on the magnitude we can infer from them.

(Illustration: ordinal encoding of an ordered categorical column.)

An implementation of this method in python is described below:

df['size']= df['size'].map({'small':1, 'medium':2, 'big':3})
df

Conclusion

For more on the very useful category_encoders library, check out their documentation.

I hope that this article was able to give you a strong insight into the concept of categorical encoding and how to fill missing values.

Which should you use?

Data cleaning is an iterative process, so try out as many of these methods as you like and stick with the one that works best for your data.
