Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one "raw" data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics. A data wrangler is a person who performs these transformation operations. (Source: Wikipedia)
Wrangler is an interactive tool for data cleaning and transformation.
Spend less time formatting and more time analyzing your data. (Source: Stanford)
Example - 1
0 - Requirement
I was given a data problem: write a model that cleans database values automatically, with no manual work. This was the first practical ML solution I delivered to a client.
Analysing the dataset before processing. I was given a column of actual values and their corresponding corrected values. I planned to reuse the approach from the name-gender prediction in my previous project, Github - Name Gender Prediction. The distinct corrected values, which are the labels to predict, are:
array(['18 to 20', '21 to 24', '25 to 29', '30 to 34', '35 to 39',
'40 to 44', '45 to 49', '50 to 54', '55 to 59', '60 to 64', '65+',
'Declined to Respond', 'Under 18', '?'], dtype=object)
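Before extracting features it helps to see which raw spellings fall under each corrected label. The following is a minimal sketch using only the standard library; the `pairs` list and the helper name `group_by_label` are hypothetical and are not taken from the real dataset.

```python
from collections import defaultdict

# Hypothetical (raw value, corrected label) pairs, for illustration only.
pairs = [
    ("18-20", "18 to 20"),
    ("18 to 20 yrs", "18 to 20"),
    ("65 +", "65+"),
    ("????", "?"),
]


def group_by_label(rows):
    """Map each corrected label to the list of raw spellings seen for it."""
    groups = defaultdict(list)
    for raw, label in rows:
        groups[label].append(raw)
    return dict(groups)


groups = group_by_label(pairs)
labels = sorted(groups)  # distinct labels, comparable to the array above
print(labels)
```

Grouping this way makes it easy to spot which patterns (ranges, trailing `+`, question marks) the features need to capture.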
2. Solution
Making feature matrix X
def feature_extraction(_data):
    """ This function is used to extract features in a given data value"""
    # Find the digits in the given string Example - data='18-20' digits = '1820'
    digits = str(''.join(c for c in _data if c.isdigit()))
    # calculate the length of the string
    len_digits = len(digits)
    # splitting digits in to values example - digits = '1820' ages = [18, 20]
    ages = [int(digits[i:i+2]) for i in range(0, len_digits, 2)]
    # checking for special character in the given data
    special_character = '.+-<>?'
    spl_char = ''.join([c for c in list(special_character) if c in _data])
    # handling decimal age data
    if len_digits == 3:
        spl_char = '.'
        age = "".join([str(ages[0]), '.', str(ages[1])])
        # normalizing
        age = int(float(age) - 0.5)
        ages = [age]
    # Finding the maximum, minimum, average age values
    max_age = 0
    min_age = 0
    mean_age = 0
    if len(ages):
        max_age = max(ages)
        min_age = min(ages)
        if len(ages) == 2:
            mean_age = int((max_age + min_age) / 2)
        else:
            mean_age = max_age
    # specially added for 18 years cases
    only_18 = 0
    is_y = 0
    if ages == [18]:
        only_18 = 1
        if 'y' in _data or 'Y' in _data:
            is_y = 1
    under_18 = 0
    if 1 < max_age < 18:
        under_18 = 1
    above_65 = 0
    if mean_age >= 65:
        above_65 = 1
    # verifying whether digit is found in the given string or not.
    # Example - data='18-20' digits_found=True data='????' digits_found=False
    digits_found = 1
    if len_digits == 1:
        digits_found = 1
        max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = 0, 0, 0, 0, 0, 0, 0
    elif len_digits == 0:
        digits_found, max_age, min_age, mean_age, only_18, is_y, above_65, under_18 = -1, -1, -1, -1, -1, -1, -1, -1
    feature = {
        'ages': tuple(ages),
        'len(ages)': len(ages),
        'spl_chr': spl_char,
        'is_digit': digits_found,
        'max_age': max_age,
        'mean_age': mean_age,
        'only_18': only_18,
        'is_y': is_y,
        'above_65': above_65,
        'under_18': under_18
    }
    return feature
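The digit-pair split and the decimal-age handling are the least obvious steps of the function above, so here they are in isolation. This is a small sketch; `split_ages` is a hypothetical helper name introduced only for illustration, and it mirrors just those two steps.

```python
def split_ages(raw):
    """Return the ages found in `raw`, reading the digits two at a time."""
    digits = ''.join(c for c in raw if c.isdigit())
    ages = [int(digits[i:i + 2]) for i in range(0, len(digits), 2)]
    if len(digits) == 3:
        # Three digits are treated as a decimal age:
        # '18.5' -> digits '185' -> 18.5 -> int(18.5 - 0.5) -> 18
        ages = [int(float(f"{ages[0]}.{ages[1]}") - 0.5)]
    return ages


print(split_ages('18-20'))  # [18, 20]
print(split_ages('18.5'))   # [18]
print(split_ages('65+'))    # [65]
print(split_ages('????'))   # []
```

The rule assumes ages have two digits. A three-digit value such as '100' would be read as the decimal 10.0 and normalized to 9, so inputs of that kind would need separate handling.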
-7.058 above_65==0 and label is '65+'
5.341 spl_chr=='?' and label is '?'
5.170 is_y==1 and label is '18 to 20'
4.263 ages==(6,) and label is '?'
4.263 max_age==0 and label is '?'
4.263 mean_age==0 and label is '?'
4.263 ages==(7,) and label is '?'
4.022 ages==(30, 39) and label is '30 to 34'
3.913 ages==(50, 59) and label is '50 to 54'
3.768 ages==(18, 21) and label is '18 to 20'
None
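The listing above has the shape of a log-linear (maximum entropy) classifier's most informative features, where each weight belongs to a (feature value, label) pair. Assuming that reading, which the source does not state, a label's raw score is the sum of the weights whose feature value matches the input. The sketch below uses a few weights copied from the listing; the function names and the example feature dictionary are hypothetical.

```python
# A subset of the weights listed above, keyed by (feature name, value, label).
weights = {
    ("above_65", 0, "65+"): -7.058,
    ("spl_chr", "?", "?"): 5.341,
    ("is_y", 1, "18 to 20"): 5.170,
    ("max_age", 0, "?"): 4.263,
    ("mean_age", 0, "?"): 4.263,
}


def score(features, label):
    """Sum the weights of every (name, value, label) triple that fires."""
    return sum(
        w for (name, value, lab), w in weights.items()
        if lab == label and features.get(name) == value
    )


def best_label(features, labels):
    """Pick the label with the highest summed weight."""
    return max(labels, key=lambda lab: score(features, lab))


# Hypothetical features for a single-digit input such as '6?'.
features = {"spl_chr": "?", "max_age": 0, "mean_age": 0, "above_65": 0, "is_y": 0}
labels = ["18 to 20", "65+", "?"]
print(best_label(features, labels))  # ?
```

Read this way, the large negative weight on `above_65==0` for '65+' means the model strongly rules out '65+' whenever the age is below 65. A full model would hold many more weights and would normalize the scores into probabilities.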