@vijayanandrp
Last active December 7, 2017 12:35
Data Wrangling tool with simple example - https://informationcorners.com/ml-002-data-wrangling-2/

Data wrangling

Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" form into another with the intent of making it more appropriate and valuable for a variety of downstream purposes, such as analytics. A data wrangler is a person who performs these transformation operations. — Wikipedia

Wrangler is an interactive tool for data cleaning and transformation. Spend less time formatting and more time analyzing your data. — Stanford

Example 1

0. Requirement

I was given a data problem where I had to build a model that auto-cleans database values without manual work. This was the first practical ML solution I delivered to a client.

1. Analysis

#!/usr/bin/env python3.5
# encoding: utf-8

import random
import csv
from nltk import classify, NaiveBayesClassifier, MaxentClassifier, DecisionTreeClassifier

gender_file = 'gender.csv'
training_percent = 0.8

Analyse the dataset before processing. I was given a column of actual values and their corresponding corrected values. I planned to reuse the same approach as the name gender prediction in my previous project: Github - Name Gender Prediction

import pandas as pd
gender_df = pd.read_csv(gender_file, header=None, usecols=[1,2])
gender_df.rename(columns={1:'actual', 2:'correction'}, inplace=True)
gender_df.shape
(137, 2)
gender_df.describe()
actual correction
count 137 137
unique 137 3
top ???? ??? ??????? Other/Prefer Not To Answer
freq 1 73
gender_df.head()
actual correction
0 Female Female
1 F Female
2 Female 1 Female
3 Female male Female
4 Female1 Female
gender_df.tail()
actual correction
132 Wole nie podawac Other/Prefer Not To Answer
133 Επιλέξτε Other/Prefer Not To Answer
134 선택 Other/Prefer Not To Answer
135 选择 Other/Prefer Not To Answer
136 選擇 Other/Prefer Not To Answer
gender_df.sample(10)
actual correction
53 Mezczyzna Male
20 Ms Female
125 test1 Other/Prefer Not To Answer
94 Karlkyns Other/Prefer Not To Answer
74 ????? Other/Prefer Not To Answer
33 Female
99 Nespecificat Other/Prefer Not To Answer
75 ??????? Other/Prefer Not To Answer
113 Prefer Not To Answer Other/Prefer Not To Answer
61 Άνδρας Male
gender_df['correction'].unique()
array(['Female', 'Male', 'Other/Prefer Not To Answer'], dtype=object)
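The describe() output above already hints at class imbalance: the top class accounts for 73 of the 137 rows. A quick sketch of the majority-class baseline any classifier must beat (the Female/Male split below is a hypothetical stand-in; only the 137 total and the 73-row majority come from the output above):

```python
from collections import Counter

# Hypothetical stand-in for the correction column: 137 rows, majority class 73
# (matching describe() above); the 40/24 Female/Male split is assumed.
labels = (['Other/Prefer Not To Answer'] * 73 + ['Female'] * 40 + ['Male'] * 24)

counts = Counter(labels)
print(counts.most_common())

# Majority-class baseline: predict the top class for every row.
baseline = counts.most_common(1)[0][1] / len(labels)
print(round(baseline, 3))  # 0.533
```

So an accuracy of 0.75 on the test set, as reported below, is a genuine improvement over always predicting the majority class.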

2. Solution

Making feature matrix X
def feature_extraction(_data):
    """ This function is used to extract features in a given data value"""
    _data = _data.lower()
    f_1, f_2, f_3, f_4, l_1, l_2, l_3, l_4 = None, None, None, None, None, None, None, None
    
    # extracting first and last 4 characters
    if len(_data) >= 4:
        f_4 = _data[:4]
        l_4 = _data[-4:]
    # extracting first and last 3 characters
    if len(_data) >= 3:
        f_3 = _data[:3]
        l_3 = _data[-3:]
    # extracting first and last 2 characters
    if len(_data) >= 2:
        f_2 = _data[:2]
        l_2 = _data[-2:]
    # extracting first and last 1 character
    if len(_data) >= 1:
        f_1 = _data[:1]
        l_1 = _data[-1:]
    
    feature = {
        'f_1': f_1,
        'f_2': f_2,
        'l_1': l_1,
        'l_2': l_2,
        'f_3': f_3,
        'f_4': f_4,
        'l_3': l_3,
        'l_4': l_4
    }

    return feature
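The eight features are just the prefixes and suffixes of length 1 to 4. A compact loop-based equivalent (a sketch, not the code used above) makes the shape of the feature dict easy to see:

```python
def extract_features(value):
    """Prefix (f_n) and suffix (l_n) features for n = 1..4; None when value is shorter than n."""
    value = value.lower()
    features = {}
    for n in range(1, 5):
        features['f_%d' % n] = value[:n] if len(value) >= n else None
        features['l_%d' % n] = value[-n:] if len(value) >= n else None
    return features

print(extract_features('Male'))  # all eight features populated
print(extract_features('F'))     # only f_1 and l_1 set; the rest are None
```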
Loading dataset
dataset = []
with open(gender_file, newline='\n') as fp:
    input_data = csv.reader(fp, delimiter=',')
    for row in input_data:
        dataset.append((row[1:]))
feature_sets = list(dataset)  # copy the (actual, correction) pairs before shuffling
random.shuffle(feature_sets)
Creating feature matrix X and response vector y
feature_sets = [(feature_extraction(source), corrected) for (source, corrected) in feature_sets]
Visualizing Feature Matrix X
feature_val = [val[0]  for val in feature_sets]
feature_df = pd.DataFrame(feature_val)
feature_df.shape
(137, 8)
feature_df.sample(10)
f_1 f_2 f_3 f_4 l_1 l_2 l_3 l_4
122 v ve vel velg g lg elg velg
46 m ma mal male 1 1 e 1 le 1
49 b be bez bez a ra ora vora
100 e er err erre k ék nék lnék
59 w wa wan wani a ta ita nita
124 ž že žen žens ý ský nský
91 f fe fem femm e me mme emme
42 ? ?? None None ? ?? None None
34 ? ?? ?? ?? ? ? ?? ??? ????
119 m mr None None r mr None None
Train Test Split
cut_point = int(len(feature_sets) * training_percent)
train_set, test_set = feature_sets[:cut_point], feature_sets[cut_point:]
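With 137 rows and training_percent = 0.8, the cut point lands at int(137 * 0.8) = 109, giving a 109/28 train/test split. A minimal sketch of the split (seeding the shuffle is my own addition for reproducibility; the original run is unseeded):

```python
import random

rows = list(range(137))   # stand-in for the 137 (features, label) pairs
random.seed(42)           # hypothetical fixed seed so the split is repeatable
random.shuffle(rows)

cut_point = int(len(rows) * 0.8)
train_set, test_set = rows[:cut_point], rows[cut_point:]
print(len(train_set), len(test_set))  # 109 28
```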
NaiveBayes Classifier
nb_classifier = NaiveBayesClassifier.train(train_set)
print("Accuracy of NaiveBayesClassifier: {} ".format(classify.accuracy(nb_classifier, test_set)))
Accuracy of NaiveBayesClassifier: 0.75 
print(nb_classifier.show_most_informative_features(10))
Most Informative Features
                     f_1 = 'm'              Male : Other/ =     20.9 : 1.0
                     f_1 = 'f'            Female : Male   =      6.5 : 1.0
                     l_1 = 'a'            Female : Other/ =      6.3 : 1.0
                     f_1 = 'k'            Female : Other/ =      3.9 : 1.0
                     f_1 = 'p'            Other/ : Male   =      3.5 : 1.0
                     f_2 = 'kv'           Female : Other/ =      3.3 : 1.0
                     l_1 = 'k'              Male : Other/ =      2.8 : 1.0
                     l_1 = '1'              Male : Other/ =      2.8 : 1.0
                     l_1 = 'i'              Male : Other/ =      2.8 : 1.0
                     f_1 = 'n'            Other/ : Female =      2.8 : 1.0
None
Maxent Classifier
max_classifier = MaxentClassifier.train(train_set)
  ==> Training (100 iterations)

      Iteration    Log Likelihood    Accuracy
      ---------------------------------------
             1          -1.09861        0.495
             2          -0.55492        0.954
             3          -0.39256        0.991
             4          -0.30684        1.000
             5          -0.25292        1.000
             6          -0.21564        1.000
             7          -0.18823        1.000
             8          -0.16719        1.000
             9          -0.15051        1.000
            10          -0.13694        1.000
            11          -0.12568        1.000
            ...
            94          -0.01688        1.000
            95          -0.01671        1.000
            96          -0.01654        1.000
            97          -0.01637        1.000
            98          -0.01621        1.000
            99          -0.01605        1.000
         Final          -0.01590        1.000
print("Accuracy of MaxentClassifier: {} ".format(classify.accuracy(max_classifier, test_set)))
Accuracy of MaxentClassifier: 0.75 
print(max_classifier.show_most_informative_features(10))
  -4.701 f_1=='m' and label is 'Other/Prefer Not To Answer'
   3.138 l_1=='f' and label is 'Female'
   3.132 l_1=='男' and label is 'Male'
   3.132 f_1=='男' and label is 'Male'
  -2.761 f_1=='m' and label is 'Female'
   2.704 l_1=='女' and label is 'Female'
   2.704 f_1=='女' and label is 'Female'
   2.640 l_2=='nő' and label is 'Female'
   2.640 l_1=='ő' and label is 'Female'
   2.640 f_2=='nő' and label is 'Female'
None
Decision Tree Classifier
decision_classifier = DecisionTreeClassifier.train(train_set)
print("Accuracy of DecisionTreeClassifier: {} ".format(classify.accuracy(decision_classifier, test_set)))
Accuracy of DecisionTreeClassifier: 0.6428571428571429 

3. Evaluation

print('Enter q (or) quit to end this test module')
while 1:
    data = input('\nEnter data for testing: ')
    if data.lower() == 'q' or data.lower() == 'quit':
        print('End')
        break

    if not len(data):
        continue

    features = feature_extraction(data)
    print(features)
    prediction = [nb_classifier.classify(features),
                  max_classifier.classify(features),
                  decision_classifier.classify(features)]

    print('NaiveBayes Classifier     : ', prediction[0])
    print('Maxent Classifier         : ', prediction[1])
    print('Decision Tree Classifier  : ', prediction[2])
    print('-'*75)
    print('(Best of 3) =              ', max(set(prediction), key=prediction.count))
Enter q (or) quit to end this test module

Enter data for testing: M
{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'm', 'f_2': None, 'f_1': 'm', 'l_3': None}
NaiveBayes Classifier     :  Male
Maxent Classifier         :  Male
Decision Tree Classifier  :  Male
---------------------------------------------------------------------------
(Best of 3) =               Male

Enter data for testing: F
{'f_3': None, 'l_4': None, 'f_4': None, 'l_2': None, 'l_1': 'f', 'f_2': None, 'f_1': 'f', 'l_3': None}
NaiveBayes Classifier     :  Female
Maxent Classifier         :  Female
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Female

Enter data for testing: female
{'f_3': 'fem', 'l_4': 'male', 'f_4': 'fema', 'l_2': 'le', 'l_1': 'e', 'f_2': 'fe', 'f_1': 'f', 'l_3': 'ale'}
NaiveBayes Classifier     :  Female
Maxent Classifier         :  Female
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Female

Enter data for testing: MMMMM
{'f_3': 'mmm', 'l_4': 'mmmm', 'f_4': 'mmmm', 'l_2': 'mm', 'l_1': 'm', 'f_2': 'mm', 'f_1': 'm', 'l_3': 'mmm'}
NaiveBayes Classifier     :  Male
Maxent Classifier         :  Male
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Male

Enter data for testing: FFFFF
{'f_3': 'fff', 'l_4': 'ffff', 'f_4': 'ffff', 'l_2': 'ff', 'l_1': 'f', 'f_2': 'ff', 'f_1': 'f', 'l_3': 'fff'}
NaiveBayes Classifier     :  Female
Maxent Classifier         :  Female
Decision Tree Classifier  :  Female
---------------------------------------------------------------------------
(Best of 3) =               Female
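The best-of-3 line above relies on max(set(prediction), key=prediction.count). An equivalent sketch using collections.Counter, which also makes the tie behavior explicit (most_common orders equal counts by first encounter in Python 3.7+):

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most frequent label; on a tie, the first label seen wins."""
    return Counter(predictions).most_common(1)[0][0]

print(majority_vote(['Male', 'Male', 'Female']))   # Male
print(majority_vote(['Male', 'Female', 'Other']))  # Male (3-way tie -> first seen)
```

This matters for the MMMMM example above, where the three classifiers disagree 2-to-1 and the majority vote decides.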