-
What are you trying to solve?
-
Are you asking the right question?
-
Where is this data coming from?
-
Is this data recent? When, where, and by whom was it collected?
-
Has this data been changed? How, by whom, and when?
-
Are there any strings?
- are the newlines formatted for Windows (CRLF), Mac, or Linux (LF)?
- should it contain any non-ASCII characters?
- how should non-ASCII characters be handled?
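The string questions above can be answered by inspecting the raw bytes before any decoding. The following is a minimal sketch using only the standard library; the function name `inspect_text` and the shape of its report are assumptions made for illustration.

```python
def inspect_text(raw: bytes) -> dict:
    """Count newline styles and non-ASCII bytes in raw file content."""
    # Windows-style line endings.
    crlf = raw.count(b"\r\n")
    # Lone LF (Linux, modern macOS): all LF minus those inside a CRLF pair.
    lf = raw.count(b"\n") - crlf
    # Lone CR (classic Mac OS): all CR minus those inside a CRLF pair.
    cr = raw.count(b"\r") - crlf
    # Any byte above 0x7F is outside the ASCII range.
    non_ascii = sum(1 for byte in raw if byte > 0x7F)
    return {"crlf": crlf, "lf": lf, "cr": cr, "non_ascii_bytes": non_ascii}


# Usage (hypothetical file name):
# with open("data.csv", "rb") as f:
#     print(inspect_text(f.read()))
```

A file that reports more than one newline style, or non-ASCII bytes where none are expected, is worth a closer look before cleaning starts.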
-
What is in each of the columns?
- what is the data?
- what are the units?
- what are the upper and lower bounds?
- what should the average be?
- what value represents null?
- how should missing values be handled?
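Once the expected bounds and null markers for a column are known, they can be checked mechanically. This is a rough sketch for one numeric column, assuming the values arrive as strings; the `NULL_TOKENS` set, the function name `profile_column`, and the bounds in the usage comment are illustrative assumptions, not values from any particular dataset.

```python
from statistics import mean

# Assumption: the markers this dataset uses for "no value".
NULL_TOKENS = {"", "NA", "N/A", "NULL", "null"}


def profile_column(values, lower, upper, null_tokens=NULL_TOKENS):
    """Summarise one numeric column: missing count, outliers, and mean."""
    missing = 0
    numbers = []
    out_of_range = []
    for value in values:
        text = value.strip()
        if text in null_tokens:
            missing += 1
            continue
        # float() raises ValueError on non-numeric text, which surfaces bad cells.
        number = float(text)
        numbers.append(number)
        if not (lower <= number <= upper):
            out_of_range.append(number)
    return {
        "missing": missing,
        "out_of_range": out_of_range,
        # The mean includes out-of-range values so that they stay visible.
        "mean": mean(numbers) if numbers else None,
    }


# Usage (hypothetical column of percentages):
# profile_column(["10", "NA", "20", "", "999"], lower=0, upper=100)
```

Comparing the reported mean with the value you expect is a quick way to spot unit mix-ups, such as metres recorded as centimetres.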
-
Manually scan the data for errors
-
Keep a backup of the raw data
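One way to keep a trustworthy backup is to copy the raw file once and verify the copy with a checksum. The sketch below uses only the standard library; `backup_raw` and `sha256_of` are hypothetical helper names, and refusing to overwrite an existing backup is a design assumption.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_raw(source: Path, backup_dir: Path) -> Path:
    """Copy source into backup_dir and verify the copy byte for byte."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    if target.exists():
        # Never overwrite an existing backup of the raw data.
        raise FileExistsError(f"backup already exists: {target}")
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"checksum mismatch for {target}")
    return target
```

Recording the digest next to the backup also helps answer later whether the data has been changed.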
-
Clean the data in its native format, where possible
-
Convert to other formats with a known, shared, versioned conversion tool
-
Track what changes have been made
-
Automate changes, to later build a data pipeline
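Tracking and automating changes can start as a list of small cleaning functions that are run in order and logged. This is a minimal sketch, assuming the data is a list of rows of strings; the step functions and `run_pipeline` are illustrative names, not part of any library.

```python
def strip_whitespace(rows):
    """Trim leading and trailing whitespace from every cell."""
    return [[cell.strip() for cell in row] for row in rows]


def drop_empty_rows(rows):
    """Remove rows in which every cell is an empty string."""
    return [row for row in rows if any(cell for cell in row)]


# Assumption: the order matters, since stripping can turn a row empty.
PIPELINE = [strip_whitespace, drop_empty_rows]


def run_pipeline(rows, steps=PIPELINE):
    """Apply each step in order and log the row count before and after."""
    log = []
    for step in steps:
        before = len(rows)
        rows = step(rows)
        log.append((step.__name__, before, len(rows)))
    return rows, log


# Usage:
# cleaned, log = run_pipeline([[" a ", "1"], ["", " "], ["b", "2 "]])
```

Because each change is a named function, the log doubles as a record of what was done, and the same list can be rerun when new data arrives.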
-
Get feedback on the data daily, or at least weekly
Last active
September 16, 2016 15:31