@ravishchawla
Last active March 20, 2020 16:59
quora_data_cleaning
| Data Cleaning Procedure | Coverage of Vocabulary | Coverage of Dataset |
| --- | --- | --- |
| Raw Data (all records) | 0.18 | 0.71 |
| Raw Data (on 10% sample) | 0.08 | 0.71 |
| Lower Casing all words (on 10% sample) | 0.10 | 0.87 |
| Removing and Replacing Non-Alpha Numeric Characters (on 10% sample) | 0.11 | 0.98 |
| Replacing Contractions with Full words (on 10% sample) | 0.11 | 0.98 |
| All methods (all records) | 0.27 | 0.98 |
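
The table does not say how the two coverage figures were computed. The sketch below shows one common way to measure them, under these assumptions: "coverage of vocabulary" is the share of distinct words found in a reference word list (for example, the vocabulary of a pretrained embedding), and "coverage of dataset" is the share of all word occurrences found in that list. The names `clean`, `coverage`, and `CONTRACTIONS` are hypothetical, the contraction map is only illustrative, and contractions are expanded before punctuation is stripped so that the apostrophes are still present to match on.

```python
import re
from collections import Counter

# Hypothetical contraction map; the gist does not list which contractions it expanded.
CONTRACTIONS = {
    "what's": "what is",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
}


def clean(text):
    """Lower-case, expand contractions, then replace non-alphanumeric characters with spaces."""
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    return re.sub(r"[^a-z0-9\s]", " ", text)


def coverage(texts, known_words):
    """Return (vocabulary coverage, dataset coverage) of whitespace-split words.

    Vocabulary coverage counts each distinct word once; dataset coverage
    weights each word by how often it occurs in the texts.
    """
    counts = Counter(word for text in texts for word in text.split())
    if not counts:
        return 0.0, 0.0
    covered = [word for word in counts if word in known_words]
    vocab_coverage = len(covered) / len(counts)
    dataset_coverage = sum(counts[word] for word in covered) / sum(counts.values())
    return vocab_coverage, dataset_coverage
```

Calling `coverage` on the raw texts and again on `[clean(t) for t in texts]` gives a before-and-after comparison of the kind the table reports. Dataset coverage generally sits well above vocabulary coverage because frequent words are the ones most likely to be in the reference list, while rare misspellings and punctuation-fused tokens inflate the count of distinct words.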