Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-1/

Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations. (Wikipedia)

Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. (Stanford Data Wrangler)

Example 1

0. Requirement

I was given a data problem where I had to write a model to auto-clean database values without manual work. This was the first practical ML solution I delivered to a client.

1. Analysis

#!/usr/bin/env python3.5
# encoding: utf-8

import random
import csv
from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier

age_file = 'age.csv'
training_percent = 0.8

Analysing the dataset before processing: I was given a column of actual values and their corresponding corrected values. I planned to reuse the approach from name gender prediction in my previous project (GitHub - Name Gender Prediction).

import pandas as pd
age_df = pd.read_csv(age_file, header=None, usecols=[1,2])
age_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)
age_df.shape
(480, 2)
age_df.describe()
actual correction
count 480 480
unique 480 14
top 34 old 30 to 34
freq 1 51
age_df.head()
actual correction
0 18 to 20 18 to 20
1 18 18 to 20
2 18 - 20 18 to 20
3 18 - 21 18 to 20
4 18 - 22 18 to 20
age_df.tail()
actual correction
475 ?? ?? ?
476 ?? ??? ???? ?
477 ???? ??? ???? ?
478 SHL Bureau 7 ?
479 SMT6 ?
age_df.sample(10)
actual correction
453 17 Under 18
95 26yrs 25 to 29
174 35 35 to 39
457 18岁以下 Under 18
7 18 years old 18 to 20
147 32 30 to 34
370 59 yrs 55 to 59
331 53 years 50 to 54
285 47 YRS 45 to 49
282 46yrs 45 to 49
age_df['correction'].unique()
array(['18 to 20', '21 to 24', '25 to 29', '30 to 34', '35 to 39',
       '40 to 44', '45 to 49', '50 to 54', '55 to 59', '60 to 64', '65+',
       'Declined to Respond', 'Under 18', '?'], dtype=object)
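
The describe() output above already shows the 14 target buckets are unbalanced (the top class alone holds 51 of 480 rows); the full class distribution, using the same age_df, is one call away:

age_df['correction'].value_counts()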

2. Solution

Making feature matrix X
def feature_extraction(_data):
    """ This function is used to extract features in a given data value"""
    # Find the digits in the given string Example - data='18-20' digits = '1820'
    digits = str(''.join(c for c in _data if c.isdigit()))
    # calculate the length of the string
    len_digits = len(digits)
    # splitting digits into two-digit values, example - digits = '1820' ages = [18, 20]
    ages = [int(digits[i:i + 2]) for i in range(0, len_digits, 2)]
    # checking for special characters in the given data
    special_character = '.+-<>?'
    spl_char = ''.join([c for c in list(special_character) if c in _data])
    # handling decimal age data
    if len_digits == 3:
        spl_char = '.'
        age = "".join([str(ages[0]), '.', str(ages[1])])
        # normalizing
        age = int(float(age) - 0.5)
        ages = [age]
    # Finding the maximum, minimum, average age values
    max_age = 0
    min_age = 0
    mean_age = 0
    if len(ages):
        max_age = max(ages)
        min_age = min(ages)
    if len(ages) == 2:
        mean_age = int((max_age + min_age) / 2)
    else:
        mean_age = max_age
    # specially added for 18 years cases
    only_18 = 0
    is_y = 0
    if ages == [18]:
        only_18 = 1
        if 'y' in _data or 'Y' in _data:
            is_y = 1
    under_18 = 0
    if 1 < max_age < 18:
        under_18 = 1
    above_65 = 0
    if mean_age >= 65:
        above_65 = 1
    # flag whether usable digits were found in the given string.
    # Example - data='18-20' digits_found=1, data='????' digits_found=-1
    digits_found = 1
    if len_digits == 1:
        # a single digit is too ambiguous to be an age; keep the flag but zero the age features
        max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = 0, 0, 0, 0, 0, 0, 0
    elif len_digits == 0:
        digits_found, max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = -1, -1, -1, -1, -1, -1, -1, -1
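    # note: min_age is computed above but never added to the feature dict below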
     
    feature = {
        'ages': tuple(ages),
        'len(ages)': len(ages),
        'spl_chr': spl_char,
        'is_digit': digits_found,
        'max_age': max_age,
        'mean_age': mean_age,
        'only_18': only_18,
        'is_y': is_y,
        'above_65': above_65,
        'under_18': under_18
    }

    return feature
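As a quick sanity check, this is what the extractor yields for one of the raw strings from the sample above (worked out by hand from the function logic, not captured from a live session):

feature_extraction('18 - 20')
{'ages': (18, 20), 'len(ages)': 2, 'spl_chr': '-', 'is_digit': 1, 'max_age': 20, 'mean_age': 19, 'only_18': 0, 'is_y': 0, 'above_65': 0, 'under_18': 0}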
Loading dataset
dataset = []
with open(age_file, newline='\n') as fp:
    input_data = csv.reader(fp, delimiter=',')
    for row in input_data:
        dataset.append((row[1:]))
feature_sets = [(actual, correction) for (actual, correction) in dataset]
random.shuffle(feature_sets)
Creating feature matrix X and response vector y
feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]
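As an aside, the CSV is re-read here with the csv module even though pandas already holds it in memory; a minimal equivalent built straight from the age_df loaded earlier would be:

feature_sets = [(feature_extraction(actual), correction)
                for actual, correction in zip(age_df['actual'], age_df['correction'])]
random.shuffle(feature_sets)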
Visualizing Feature Matrix X
feature_val = [val[0]  for val in feature_sets]
feature_df = pd.DataFrame(feature_val)
feature_df.shape
(480, 10)
feature_df.sample(10)
above_65 ages is_digit is_y len(ages) max_age mean_age only_18 spl_chr under_18
183 1 (65,) 1 0 1 65 65 0 0
74 0 (50, 55) 1 0 2 55 52 0 - 0
238 0 (55,) 1 0 1 55 55 0 0
451 0 (20,) 1 0 1 20 20 0 0
157 0 (55,) 1 0 1 55 55 0 0
342 0 (49,) 1 0 1 49 49 0 0
122 0 (34,) 1 0 1 34 34 0 0
99 1 (66,) 1 0 1 66 66 0 0
232 0 (27,) 1 0 1 27 27 0 0
85 0 (28,) 1 0 1 28 28 0 0
Train Test Split
cut_point = int(len(feature_sets) * training_percent)
train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]
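One caveat with a plain cut like this: a rare label (for example '?' or 'Declined to Respond') can land almost entirely in one split. A quick check, using the train_set and test_set just created:

from collections import Counter
print(Counter(label for _, label in train_set))
print(Counter(label for _, label in test_set))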
NaiveBayes Classifier
nb_classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy of NaiveBayesClassifier: {} ".format(classify.accuracy(nb_classifier, test_set)))
Accuracy of NaiveBayesClassifier: 0.9583333333333334 
print(nb_classifier.show_most_informative_features(10))
Most Informative Features
                above_65 = 0              25 to  : 65+    =     25.1 : 1.0
                 max_age = 65                65+ : 60 to  =      9.1 : 1.0
                 max_age = 39             35 to  : 30 to  =      6.9 : 1.0
                 max_age = 59             55 to  : 50 to  =      6.5 : 1.0
               len(ages) = 2              60 to  : Under  =      4.3 : 1.0
                 only_18 = 0              25 to  : Under  =      4.0 : 1.0
                 max_age = 21             21 to  : 18 to  =      3.2 : 1.0
                    ages = (18,)          Under  : 18 to  =      3.0 : 1.0
                 max_age = 18             Under  : 18 to  =      3.0 : 1.0
                mean_age = 18             Under  : 18 to  =      3.0 : 1.0
None

(show_most_informative_features prints its table and returns None, so the wrapping print adds the trailing None seen here and again after the Maxent features below.)
Maxent Classifier
max_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.63906        0.055
             2          -1.68058        0.927
             3          -1.25474        0.961
             4          -0.98505        0.977
             5          -0.80323        0.977
             6          -0.67437        0.977
             7          -0.57929        0.977
            22          -0.18180        0.995
            23          -0.17391        0.995
            24          -0.16670        0.995
            25          -0.16008        0.995
            26          -0.15398        0.995
            94          -0.04679        0.995
            95          -0.04637        0.995
            96          -0.04597        0.995
            97          -0.04557        0.995
            98          -0.04518        0.995
            99          -0.04480        0.995
         Final          -0.04442        0.995
print("Accuracy of MaxentClassifier: {} ".format(classify.accuracy(max_classifier, test_set)))
Accuracy of MaxentClassifier: 0.9895833333333334 
print(max_classifier.show_most_informative_features(10))
  -7.058 above_65==0 and label is '65+'
   5.341 spl_chr=='?' and label is '?'
   5.170 is_y==1 and label is '18 to 20'
   4.263 ages==(6,) and label is '?'
   4.263 max_age==0 and label is '?'
   4.263 mean_age==0 and label is '?'
   4.263 ages==(7,) and label is '?'
   4.022 ages==(30, 39) and label is '30 to 34'
   3.913 ages==(50, 59) and label is '50 to 54'
   3.768 ages==(18, 21) and label is '18 to 20'
None
Decision Tree Classifier
decision_classifier = DecisionTreeClassifier.train(train_set)
print("Accuracy of DecisionTreeClassifier: {} ".format(classify.accuracy(decision_classifier, test_set)))
Accuracy of DecisionTreeClassifier: 0.9270833333333334 

On this 80/20 split, Maxent comes out ahead (~0.99), followed by Naive Bayes (~0.96) and the decision tree (~0.93).

3. Evaluation

print('Enter q (or) quit to end this test module')
while True:
    data = input('\nEnter data for testing: ')
    if data.lower() == 'q' or data.lower() == 'quit':
        print('End')
        break

    if not len(data):
        continue

    features = feature_extraction(data)
    print(features)
    prediction = [nb_classifier.classify(features),
                  max_classifier.classify(features),
                  decision_classifier.classify(features)]

    print('NaiveBayes Classifier     : ', prediction[0])
    print('Maxent Classifier         : ', prediction[1])
    print('Decision Tree Classifier  : ', prediction[2])
    print('-'*75)
    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))
Enter q (or) quit to end this test module

Enter data for testing: 18
{'under_18': 0, 'spl_chr': '', 'ages': (18,), 'is_digit': 1, 'is_y': 0, 'only_18': 1, 'max_age': 18, 'len(ages)': 1, 'mean_age': 18, 'above_65': 0}
NaiveBayes Classifier     :  Under 18
Maxent Classifier         :  Under 18
Decision Tree Classifier  :  Under 18
---------------------------------------------------------------------------
(Best of 3) =               Under 18

Enter data for testing: 78
{'under_18': 0, 'spl_chr': '', 'ages': (78,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 78, 'len(ages)': 1, 'mean_age': 78, 'above_65': 1}
NaiveBayes Classifier     :  65+
Maxent Classifier         :  65+
Decision Tree Classifier  :  65+
---------------------------------------------------------------------------
(Best of 3) =               65+

Enter data for testing: 34
{'under_18': 0, 'spl_chr': '', 'ages': (34,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 34, 'len(ages)': 1, 'mean_age': 34, 'above_65': 0}
NaiveBayes Classifier     :  30 to 34
Maxent Classifier         :  30 to 34
Decision Tree Classifier  :  30 to 34
---------------------------------------------------------------------------
(Best of 3) =               30 to 34

Enter data for testing: 39
{'under_18': 0, 'spl_chr': '', 'ages': (39,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 39, 'len(ages)': 1, 'mean_age': 39, 'above_65': 0}
NaiveBayes Classifier     :  35 to 39
Maxent Classifier         :  35 to 39
Decision Tree Classifier  :  35 to 39
---------------------------------------------------------------------------
(Best of 3) =               35 to 39

Enter data for testing: 55
{'under_18': 0, 'spl_chr': '', 'ages': (55,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 55, 'len(ages)': 1, 'mean_age': 55, 'above_65': 0}
NaiveBayes Classifier     :  55 to 59
Maxent Classifier         :  55 to 59
Decision Tree Classifier  :  55 to 59
---------------------------------------------------------------------------
(Best of 3) =               55 to 59

Enter data for testing: 1
{'under_18': 0, 'spl_chr': '', 'ages': (1,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 0, 'len(ages)': 1, 'mean_age': 0, 'above_65': 0}
NaiveBayes Classifier     :  55 to 59
Maxent Classifier         :  ?
Decision Tree Classifier  :  65+
---------------------------------------------------------------------------
(Best of 3) =               65+
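
Note the last run: all three classifiers disagree, so every label has count 1 and max(set(prediction), key=prediction.count) simply returns whichever label set iteration yields first; the '65+' verdict is an arbitrary tie-break, not a real majority. A minimal illustration:

prediction = ['55 to 59', '?', '65+']
max(set(prediction), key=prediction.count)   # arbitrary pick on a three-way tie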