@zacstewart
Created July 21, 2012 22:00
EMI Music Data Science Hackathon data cleaning scripts
#!/usr/bin/env sh
# Recodes the 13 answers in the WORKING column as integers 0-12
# and the 5 answers in the MUSIC column as integers 0-4,
# normalizes LIST_OWN and LIST_BACK into integer values,
# and drops rows containing variations of "16+ hours"
# (loses 2.623031698% of the data).
INFILE=$1
sed -e "
s/\"Employed 30+ hours a week\"/0/g;
s/\"Employed 8-29 hours per week\"/1/g;
s/\"Employed part-time less than 8 hours per week\"/2/g;
s/\"Full-time housewife \/ househusband\"/3/g;
s/\"Full-time student\"/4/g;
s/\"In unpaid employment (e.g. voluntary work)\"/5/g;
s/\"Other\"/6/g;
s/\"Part-time student\"/7/g;
s/\"Prefer not to state\"/8/g;
s/\"Retired from full-time employment (30+ hours per week)\"/9/g;
s/\"Retired from self-employment\"/10/g;
s/\"Self-employed\"/11/g;
s/\"Temporarily unemployed\"/12/g;
s/\"Music has no particular interest for me\"/0/g;
s/\"Music is no longer as important as it used to be to me\"/1/g;
s/\"I like music but it does not feature heavily in my life\"/2/g;
s/\"Music is important to me but not necessarily more important than other hobbies or interests\"/3/g;
s/\"Music is important to me but not necessarily more important\"/3/g;
s/\"Music means a lot to me and is a passion of mine\"/4/g;
s/\"\([0-9]*\) [a-zA-Z]*\"/\1/g;
s/\"Less than an hour\"/0/g
" "$INFILE" | grep -Ev '"More than 16 hours"|"16\+ hours"'
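The substitutions above can be spot-checked on a couple of hand-written rows before running the script on the full data set. The sketch below is only an illustration: `clean_sample` is a hypothetical helper, it copies just four of the rules (written in single quotes, so the inner double quotes need no escaping, and with `[0-9][0-9]*` so at least one digit is required), and the two input rows are made up rather than taken from the EMI data.

```sh
#!/usr/bin/env sh
# Hypothetical helper: applies a small subset of the gist's rules to stdin.
clean_sample() {
  sed -e '
s/"Full-time student"/4/g
s/"Music means a lot to me and is a passion of mine"/4/g
s/"\([0-9][0-9]*\) [a-zA-Z]*"/\1/g
s/"Less than an hour"/0/g
' | grep -Ev '"More than 16 hours"|"16\+ hours"'
}

# Made-up rows (not from the EMI data). The second row contains "16+ hours",
# so grep -v drops it entirely.
printf '%s\n' \
  '"Full-time student","Music means a lot to me and is a passion of mine","2 hours","Less than an hour"' \
  '"Full-time student","Music means a lot to me and is a passion of mine","16+ hours","1 hour"' \
  | clean_sample
# prints: 4,4,2,0
```

Only the first row survives, with every quoted answer replaced by its integer code. The same check can be pointed at the real script by writing the sample rows to a file and passing that file as the first argument.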