Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-1/

Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations. (Wikipedia)

Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. (Stanford Data Wrangler)

Example 1

0. Requirement

I was given a data problem where I had to write a model to auto-clean database values without manual work. This was the first practical ML solution I delivered to a client.

1. Analysis

#!/usr/bin/env python3.5
# encoding: utf-8

import random
import csv
from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier

age_file = 'age.csv'
training_percent = 0.8

Analysing the dataset before processing: I was given a column of actual values and their corresponding corrected values. I planned to reuse the approach from name gender prediction in my previous project (GitHub - Name Gender Prediction).

import pandas as pd
age_df = pd.read_csv(age_file, header=None, usecols=[1,2])
age_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)
age_df.shape
(480, 2)
age_df.describe()
actual correction
count 480 480
unique 480 14
top 34 old 30 to 34
freq 1 51
age_df.head()
actual correction
0 18 to 20 18 to 20
1 18 18 to 20
2 18 - 20 18 to 20
3 18 - 21 18 to 20
4 18 - 22 18 to 20
age_df.tail()
actual correction
475 ?? ?? ?
476 ?? ??? ???? ?
477 ???? ??? ???? ?
478 SHL Bureau 7 ?
479 SMT6 ?
age_df.sample(10)
actual correction
453 17 Under 18
95 26yrs 25 to 29
174 35 35 to 39
457 18岁以下 Under 18
7 18 years old 18 to 20
147 32 30 to 34
370 59 yrs 55 to 59
331 53 years 50 to 54
285 47 YRS 45 to 49
282 46yrs 45 to 49
age_df['correction'].unique()
array(['18 to 20', '21 to 24', '25 to 29', '30 to 34', '35 to 39',
       '40 to 44', '45 to 49', '50 to 54', '55 to 59', '60 to 64', '65+',
       'Declined to Respond', 'Under 18', '?'], dtype=object)
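
The describe() output above already shows the 14 target buckets are unbalanced (the top class alone holds 51 of 480 rows); the full class distribution, using the same age_df, is one call away:

age_df['correction'].value_counts()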

2. Solution

Making feature matrix X
def feature_extraction(_data):
    """ This function is used to extract features in a given data value"""
    # Find the digits in the given string Example - data='18-20' digits = '1820'
    digits = str(''.join(c for c in _data if c.isdigit()))
    # calculate the length of the string
    len_digits = len(digits)
    # splitting digits into two-digit values, example - digits = '1820' ages = [18, 20]
    ages = [int(digits[i:i + 2]) for i in range(0, len_digits, 2)]
    # checking for special characters in the given data
    special_character = '.+-<>?'
    spl_char = ''.join([c for c in list(special_character) if c in _data])
    # handling decimal age data
    if len_digits == 3:
        spl_char = '.'
        age = "".join([str(ages[0]), '.', str(ages[1])])
        # normalizing
        age = int(float(age) - 0.5)
        ages = [age]
    # Finding the maximum, minimum, average age values
    max_age = 0
    min_age = 0
    mean_age = 0
    if len(ages):
        max_age = max(ages)
        min_age = min(ages)
    if len(ages) == 2:
        mean_age = int((max_age + min_age) / 2)
    else:
        mean_age = max_age
    # specially added for 18 years cases
    only_18 = 0
    is_y = 0
    if ages == [18]:
        only_18 = 1
        if 'y' in _data or 'Y' in _data:
            is_y = 1
    under_18 = 0
    if 1 < max_age < 18:
        under_18 = 1
    above_65 = 0
    if mean_age >= 65:
        above_65 = 1
    # flag whether usable digits were found in the given string.
    # Example - data='18-20' digits_found=1, data='????' digits_found=-1
    digits_found = 1
    if len_digits == 1:
        # a single digit is too ambiguous to be an age; keep the flag but zero the age features
        max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = 0, 0, 0, 0, 0, 0, 0
    elif len_digits == 0:
        digits_found, max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = -1, -1, -1, -1, -1, -1, -1, -1
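    # note: min_age is computed above but never added to the feature dict below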
     
    feature = {
        'ages': tuple(ages),
        'len(ages)': len(ages),
        'spl_chr': spl_char,
        'is_digit': digits_found,
        'max_age': max_age,
        'mean_age': mean_age,
        'only_18': only_18,
        'is_y': is_y,
        'above_65': above_65,
        'under_18': under_18
    }

    return feature
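As a quick sanity check, this is what the extractor yields for one of the raw strings from the sample above (worked out by hand from the function logic, not captured from a live session):

feature_extraction('18 - 20')
{'ages': (18, 20), 'len(ages)': 2, 'spl_chr': '-', 'is_digit': 1, 'max_age': 20, 'mean_age': 19, 'only_18': 0, 'is_y': 0, 'above_65': 0, 'under_18': 0}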
Loading dataset
dataset = []
with open(age_file, newline='\n') as fp:
    input_data = csv.reader(fp, delimiter=',')
    for row in input_data:
        dataset.append((row[1:]))
feature_sets = [(actual, correction) for (actual, correction) in dataset]
random.shuffle(feature_sets)
Creating feature matrix X and response vector y
feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]
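As an aside, the CSV is re-read here with the csv module even though pandas already holds it in memory; a minimal equivalent built straight from the age_df loaded earlier would be:

feature_sets = [(feature_extraction(actual), correction)
                for actual, correction in zip(age_df['actual'], age_df['correction'])]
random.shuffle(feature_sets)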
Visualizing Feature Matrix X
feature_val = [val[0]  for val in feature_sets]
feature_df = pd.DataFrame(feature_val)
feature_df.shape
(480, 10)
feature_df.sample(10)
above_65 ages is_digit is_y len(ages) max_age mean_age only_18 spl_chr under_18
183 1 (65,) 1 0 1 65 65 0 0
74 0 (50, 55) 1 0 2 55 52 0 - 0
238 0 (55,) 1 0 1 55 55 0 0
451 0 (20,) 1 0 1 20 20 0 0
157 0 (55,) 1 0 1 55 55 0 0
342 0 (49,) 1 0 1 49 49 0 0
122 0 (34,) 1 0 1 34 34 0 0
99 1 (66,) 1 0 1 66 66 0 0
232 0 (27,) 1 0 1 27 27 0 0
85 0 (28,) 1 0 1 28 28 0 0
Train Test Split
cut_point = int(len(feature_sets) * training_percent)
train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]
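One caveat with a plain cut like this: a rare label (for example '?' or 'Declined to Respond') can land almost entirely in one split. A quick check, using the train_set and test_set just created:

from collections import Counter
print(Counter(label for _, label in train_set))
print(Counter(label for _, label in test_set))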
NaiveBayes Classifier
nb_classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy of NaiveBayesClassifier: {} ".format(classify.accuracy(nb_classifier, test_set)))
Accuracy of NaiveBayesClassifier: 0.9583333333333334 
print(nb_classifier.show_most_informative_features(10))
Most Informative Features
                above_65 = 0              25 to  : 65+    =     25.1 : 1.0
                 max_age = 65                65+ : 60 to  =      9.1 : 1.0
                 max_age = 39             35 to  : 30 to  =      6.9 : 1.0
                 max_age = 59             55 to  : 50 to  =      6.5 : 1.0
               len(ages) = 2              60 to  : Under  =      4.3 : 1.0
                 only_18 = 0              25 to  : Under  =      4.0 : 1.0
                 max_age = 21             21 to  : 18 to  =      3.2 : 1.0
                    ages = (18,)          Under  : 18 to  =      3.0 : 1.0
                 max_age = 18             Under  : 18 to  =      3.0 : 1.0
                mean_age = 18             Under  : 18 to  =      3.0 : 1.0
None

(show_most_informative_features prints its table and returns None, so the wrapping print adds the trailing None seen here and again after the Maxent features below.)
Maxent Classifier
max_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -2.63906        0.055
             2          -1.68058        0.927
             3          -1.25474        0.961
             4          -0.98505        0.977
             5          -0.80323        0.977
             6          -0.67437        0.977
             7          -0.57929        0.977
            22          -0.18180        0.995
            23          -0.17391        0.995
            24          -0.16670        0.995
            25          -0.16008        0.995
            26          -0.15398        0.995
            94          -0.04679        0.995
            95          -0.04637        0.995
            96          -0.04597        0.995
            97          -0.04557        0.995
            98          -0.04518        0.995
            99          -0.04480        0.995
         Final          -0.04442        0.995
print("Accuracy of MaxentClassifier: {} ".format(classify.accuracy(max_classifier, test_set)))
Accuracy of MaxentClassifier: 0.9895833333333334 
print(max_classifier.show_most_informative_features(10))
  -7.058 above_65==0 and label is '65+'
   5.341 spl_chr=='?' and label is '?'
   5.170 is_y==1 and label is '18 to 20'
   4.263 ages==(6,) and label is '?'
   4.263 max_age==0 and label is '?'
   4.263 mean_age==0 and label is '?'
   4.263 ages==(7,) and label is '?'
   4.022 ages==(30, 39) and label is '30 to 34'
   3.913 ages==(50, 59) and label is '50 to 54'
   3.768 ages==(18, 21) and label is '18 to 20'
None
Decision Tree Classifier
decision_classifier = DecisionTreeClassifier.train(train_set)
print("Accuracy of DecisionTreeClassifier: {} ".format(classify.accuracy(decision_classifier, test_set)))
Accuracy of DecisionTreeClassifier: 0.9270833333333334 

On this 80/20 split, Maxent comes out ahead (~0.99), followed by Naive Bayes (~0.96) and the decision tree (~0.93).

3. Evaluation

print('Enter q (or) quit to end this test module')
while True:
    data = input('\nEnter data for testing: ')
    if data.lower() == 'q' or data.lower() == 'quit':
        print('End')
        break

    if not len(data):
        continue

    features = feature_extraction(data)
    print(features)
    prediction = [nb_classifier.classify(features),
                  max_classifier.classify(features),
                  decision_classifier.classify(features)]

    print('NaiveBayes Classifier     : ', prediction[0])
    print('Maxent Classifier         : ', prediction[1])
    print('Decision Tree Classifier  : ', prediction[2])
    print('-'*75)
    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))
Enter q (or) quit to end this test module

Enter data for testing: 18
{'under_18': 0, 'spl_chr': '', 'ages': (18,), 'is_digit': 1, 'is_y': 0, 'only_18': 1, 'max_age': 18, 'len(ages)': 1, 'mean_age': 18, 'above_65': 0}
NaiveBayes Classifier     :  Under 18
Maxent Classifier         :  Under 18
Decision Tree Classifier  :  Under 18
---------------------------------------------------------------------------
(Best of 3) =               Under 18

Enter data for testing: 78
{'under_18': 0, 'spl_chr': '', 'ages': (78,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 78, 'len(ages)': 1, 'mean_age': 78, 'above_65': 1}
NaiveBayes Classifier     :  65+
Maxent Classifier         :  65+
Decision Tree Classifier  :  65+
---------------------------------------------------------------------------
(Best of 3) =               65+

Enter data for testing: 34
{'under_18': 0, 'spl_chr': '', 'ages': (34,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 34, 'len(ages)': 1, 'mean_age': 34, 'above_65': 0}
NaiveBayes Classifier     :  30 to 34
Maxent Classifier         :  30 to 34
Decision Tree Classifier  :  30 to 34
---------------------------------------------------------------------------
(Best of 3) =               30 to 34

Enter data for testing: 39
{'under_18': 0, 'spl_chr': '', 'ages': (39,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 39, 'len(ages)': 1, 'mean_age': 39, 'above_65': 0}
NaiveBayes Classifier     :  35 to 39
Maxent Classifier         :  35 to 39
Decision Tree Classifier  :  35 to 39
---------------------------------------------------------------------------
(Best of 3) =               35 to 39

Enter data for testing: 55
{'under_18': 0, 'spl_chr': '', 'ages': (55,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 55, 'len(ages)': 1, 'mean_age': 55, 'above_65': 0}
NaiveBayes Classifier     :  55 to 59
Maxent Classifier         :  55 to 59
Decision Tree Classifier  :  55 to 59
---------------------------------------------------------------------------
(Best of 3) =               55 to 59

Enter data for testing: 1
{'under_18': 0, 'spl_chr': '', 'ages': (1,), 'is_digit': 1, 'is_y': 0, 'only_18': 0, 'max_age': 0, 'len(ages)': 1, 'mean_age': 0, 'above_65': 0}
NaiveBayes Classifier     :  55 to 59
Maxent Classifier         :  ?
Decision Tree Classifier  :  65+
---------------------------------------------------------------------------
(Best of 3) =               65+
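
Note the last run: all three classifiers disagree, so every label has count 1 and max(set(prediction), key=prediction.count) simply returns whichever label set iteration yields first; the '65+' verdict is an arbitrary tie-break, not a real majority. A minimal illustration:

prediction = ['55 to 59', '?', '65+']
max(set(prediction), key=prediction.count)   # arbitrary pick on a three-way tie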