@vijayanandrp
Last active December 7, 2017 12:35
Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-2/

Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another with the intent of making it more appropriate and valuable for a variety of downstream purposes, such as analytics. A data wrangler is a person who performs these transformation operations. — Wikipedia

Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. — Stanford

Example 1

0. Requirement

I was given a data problem where I had to build a model that auto-cleans database values without manual work. This was the first practical ML solution I delivered to a client.

1. Analysis

#!/usr/bin/env python3.5
# encoding: utf-8

import random
import csv
from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier

gender_file = 'gender.csv'
training_percent = 0.8

Analyse the dataset before processing. I was given a column of actual values and their corresponding corrected values. I planned to reuse the same approach as the name gender prediction in my previous project: Github - Name Gender Prediction

import pandas as pd
gender_df = pd.read_csv(gender_file, header=None, usecols=[1,2])
gender_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)
gender_df.shape
(137, 2)
gender_df.describe()
actual correction
count 137 137
unique 137 3
top ???? ??? ??????? Other/Prefer Not To Answer
freq 1 73
gender_df.head()
actual correction
0 Female Female
1 F Female
2 Female 1 Female
3 Female male Female
4 Female1 Female
gender_df.tail()
actual correction
132 Wole nie podawac Other/Prefer Not To Answer
133 Επιλέξτε Other/Prefer Not To Answer
134 선택 Other/Prefer Not To Answer
135 选择 Other/Prefer Not To Answer
136 選擇 Other/Prefer Not To Answer
gender_df.sample(10)
actual correction
53 Mezczyzna Male
20 Ms Female
125 test1 Other/Prefer Not To Answer
94 Karlkyns Other/Prefer Not To Answer
74 ????? Other/Prefer Not To Answer
33 Female
99 Nespecificat Other/Prefer Not To Answer
75 ??????? Other/Prefer Not To Answer
113 Prefer Not To Answer Other/Prefer Not To Answer
61 Άνδρας Male
gender_df['correction'].unique()
array(['Female', 'Male', 'Other/Prefer Not To Answer'], dtype=object)
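The describe() output above already hints at class imbalance: the top class accounts for 73 of the 137 rows. A quick sketch of the majority-class baseline any classifier must beat (the Female/Male split below is a hypothetical stand-in; only the 137 total and the 73-row majority come from the output above):

```python
from collections import Counter

# Hypothetical stand-in for the correction column: 137 rows, majority class 73
# (matching describe() above); the 40/24 Female/Male split is assumed.
labels = (['Other/Prefer Not To Answer'] * 73 + ['Female'] * 40 + ['Male'] * 24)

counts = Counter(labels)
print(counts.most_common())

# Majority-class baseline: predict the top class for every row.
baseline = counts.most_common(1)[0][1] / len(labels)
print(round(baseline, 3))  # 0.533
```

So an accuracy of 0.75 on the test set, as reported below, is a genuine improvement over always predicting the majority class.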

2. Solution

Making feature matrix X
def feature_extraction(_data):
    """ This function is used to extract features in a given data value"""
    _data = _data.lower()
    f_1, f_2, f_3, f_4, l_1, l_2, l_3, l_4 = None, None, None, None, None, None, None, None
    
    # extracting first and last 4 characters
    if len(_data) >= 4:
        f_4 = _data[:4]
        l_4 = _data[-4:]
    # extracting first and last 3 characters
    if len(_data) >= 3:
        f_3 = _data[:3]
        l_3 = _data[-3:]
    # extracting first and last 2 characters
    if len(_data) >= 2:
        f_2 = _data[:2]
        l_2 = _data[-2:]
    # extracting first and last 1 character
    if len(_data) >= 1:
        f_1 = _data[:1]
        l_1 = _data[-1:]
    
    feature = {
        'f_1': f_1,
        'f_2': f_2,
        'l_1': l_1,
        'l_2': l_2,
        'f_3': f_3,
        'f_4': f_4,
        'l_3': l_3,
        'l_4': l_4
    }

    return feature
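The eight features are just the prefixes and suffixes of length 1 to 4. A compact loop-based equivalent (a sketch, not the code used above) makes the shape of the feature dict easy to see:

```python
def extract_features(value):
    """Prefix (f_n) and suffix (l_n) features for n = 1..4; None when value is shorter than n."""
    value = value.lower()
    features = {}
    for n in range(1, 5):
        features['f_%d' % n] = value[:n] if len(value) >= n else None
        features['l_%d' % n] = value[-n:] if len(value) >= n else None
    return features

print(extract_features('Male'))  # all eight features populated
print(extract_features('F'))     # only f_1 and l_1 set; the rest are None
```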
Loading dataset
dataset = []
with open(gender_file, newline='\n') as fp:
    input_data = csv.reader(fp, delimiter=',')
    for row in input_data:
        dataset.append((row[1:]))
feature_sets = list(dataset)  # copy the (actual, correction) pairs before shuffling
random.shuffle(feature_sets)
Creating feature matrix X and response vector y
feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]
Visualizing Feature Matrix X
feature_val = [val[0]  for val in feature_sets]
feature_df = pd.DataFrame(feature_val)
feature_df.shape
(137, 8)
feature_df.sample(10)
f_1 f_2 f_3 f_4 l_1 l_2 l_3 l_4
122 v ve vel velg g lg elg velg
46 m ma mal male 1 1 e 1 le 1
49 b be bez bez a ra ora vora
100 e er err erre k ék nék lnék
59 w wa wan wani a ta ita nita
124 ž že žen žens ý ský nský
91 f fe fem femm e me mme emme
42 ? ?? None None ? ?? None None
34 ? ?? ?? ?? ? ? ?? ??? ????
119 m mr None None r mr None None
Train Test Split
cut_point = int(len(feature_sets) * training_percent)
train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]
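With 137 rows and training_percent = 0.8, the cut point lands at int(137 * 0.8) = 109, giving a 109/28 train/test split. A minimal sketch of the split (seeding the shuffle is my own addition for reproducibility; the original run is unseeded):

```python
import random

rows = list(range(137))   # stand-in for the 137 (features, label) pairs
random.seed(42)           # hypothetical fixed seed so the split is repeatable
random.shuffle(rows)

cut_point = int(len(rows) * 0.8)
train_set, test_set = rows[:cut_point], rows[cut_point:]
print(len(train_set), len(test_set))  # 109 28
```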
NaiveBayes Classifier
nb_classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy of NaiveBayesClassifier: {} ".format(classify.accuracy(nb_classifier, test_set)))
Accuracy of NaiveBayesClassifier: 0.75 
print(nb_classifier.show_most_informative_features(10))
Most Informative Features
                     f_1 = 'm'              Male : Other/ =     20.9 : 1.0
                     f_1 = 'f'            Female : Male   =      6.5 : 1.0
                     l_1 = 'a'            Female : Other/ =      6.3 : 1.0
                     f_1 = 'k'            Female : Other/ =      3.9 : 1.0
                     f_1 = 'p'            Other/ : Male   =      3.5 : 1.0
                     f_2 = 'kv'           Female : Other/ =      3.3 : 1.0
                     l_1 = 'k'              Male : Other/ =      2.8 : 1.0
                     l_1 = '1'              Male : Other/ =      2.8 : 1.0
                     l_1 = 'i'              Male : Other/ =      2.8 : 1.0
                     f_1 = 'n'            Other/ : Female =      2.8 : 1.0
None
Maxent Classifier
max_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.495
             2          -0.55492        0.954
             3          -0.39256        0.991
             4          -0.30684        1.000
             5          -0.25292        1.000
             6          -0.21564        1.000
             7          -0.18823        1.000
             8          -0.16719        1.000
             9          -0.15051        1.000
            10          -0.13694        1.000
            11          -0.12568        1.000
            ...
            94          -0.01688        1.000
            95          -0.01671        1.000
            96          -0.01654        1.000
            97          -0.01637        1.000
            98          -0.01621        1.000
            99          -0.01605        1.000
         Final          -0.01590        1.000
print("Accuracy of MaxentClassifier: {} ".format(classify.accuracy(max_classifier, test_set)))
Accuracy of MaxentClassifier: 0.75 
print(max_classifier.show_most_informative_features(10))
  -4.701 f_1=='m' and label is 'Other/Prefer Not To Answer'
   3.138 l_1=='f' and label is 'Female'
   3.132 l_1=='男' and label is 'Male'
   3.132 f_1=='男' and label is 'Male'
  -2.761 f_1=='m' and label is 'Female'
   2.704 l_1=='女' and label is 'Female'
   2.704 f_1=='女' and label is 'Female'
   2.640 l_2=='nő' and label is 'Female'
   2.640 l_1=='ő' and label is 'Female'
   2.640 f_2=='nő' and label is 'Female'
None
Decision Tree Classifier
decision_classifier = DecisionTreeClassifier.train(train_set)
print("Accuracy of DecisionTreeClassifier: {} ".format(classify.accuracy(decision_classifier, test_set)))
Accuracy of DecisionTreeClassifier: 0.6428571428571429 

3. Evaluation

print('Enter q (or) quit to end this test module')
while 1:
    data = input('\nEnter data for testing: ')
    if data.lower() == 'q' or data.lower() == 'quit':
        print('End')
        break

    if not len(data):
        continue

    features = feature_extraction(data)
    print(features)
    prediction = [nb_classifier.classify(features),
                  max_classifier.classify(features),
                  decision_classifier.classify(features)]

    print('NaiveBayes Classifier     : ', prediction[0])
    print('Maxent Classifier         : ', prediction[1])
    print('Decision Tree Classifier  : ', prediction[2])
    print('-'*75)
    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))
Enter q (or) quit to end this test module

Enter data for testing: M
{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'm', 'f_2': None, 'f_1': 'm', 'l_3': None}
NaiveBayes Classifier     :  Male
Maxent Classifier         :  Male
Decision Tree Classifier  :  Male
---------------------------------------------------------------------------
(Best of 3) =               Male

Enter data for testing: F
{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'f', 'f_2': None, 'f_1': 'f', 'l_3': None}
NaiveBayes Classifier     :  Female
Maxent Classifier         :  Female
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Female

Enter data for testing: female
{'f_3': 'fem', 'l_4': 'male', 'f_4': 'fema', 'l_2': 'le', 'l_1': 'e', 'f_2': 'fe', 'f_1': 'f', 'l_3': 'ale'}
NaiveBayes Classifier     :  Female
Maxent Classifier         :  Female
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Female

Enter data for testing: MMMMM
{'f_3': 'mmm', 'l_4': 'mmmm', 'f_4': 'mmmm', 'l_2': 'mm', 'l_1': 'm', 'f_2': 'mm', 'f_1': 'm', 'l_3': 'mmm'}
NaiveBayes Classifier     :  Male
Maxent Classifier         :  Male
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Male

Enter data for testing: FFFFF
{'f_3': 'fff', 'l_4': 'ffff', 'f_4': 'ffff', 'l_2': 'ff', 'l_1': 'f', 'f_2': 'ff', 'f_1': 'f', 'l_3': 'fff'}
NaiveBayes Classifier     :  Female
Maxent Classifier         :  Female
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Female
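The best-of-3 line above relies on max(set(prediction), key=prediction.count). An equivalent sketch using collections.Counter, which also makes the tie behavior explicit (most_common orders equal counts by first encounter in Python 3.7+):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; on a tie, the first label seen wins."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['Male', 'Male', 'Female']))   # Male
print(majority_vote(['Male', 'Female', 'Other']))  # Male (3-way tie -> first seen)
```

This matters for the MMMMM example above, where the three classifiers disagree 2-to-1 and the majority vote decides.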