@apolloclark
Last active September 16, 2016 15:31
  • What are you trying to solve?

  • Are you asking the right question?

  • Where is this data coming from?

  • Is this data recent? When was it collected, where, and by whom?

  • Has this data been changed? How, by whom, and when?

  • Are there any strings?

    • are the newlines formatted for Windows, Mac, or Linux?
    • should they contain any non-ASCII characters?
    • how should non-ASCII characters be handled?
  • What is in each of the columns?

    • what is the data?
    • what are the units?
    • what are the upper and lower bounds?
    • what should the average be?
    • what value represents null?
    • how should missing values be handled?
  • Manually scan the data for errors

  • Keep a backup of the raw data

  • Preferably, do data cleaning in the native format

  • Convert to other formats with a known, shared, versioned conversion tool

  • Track what changes have been made

  • Automate changes, so they can later be built into a data pipeline

  • Get feedback on the data daily, or at least weekly
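
The string questions in the checklist (newline style, non-ASCII content) can be checked on the raw bytes before any decoding happens. The following is a minimal sketch, not a definitive tool; the function names `newline_styles` and `non_ascii_lines` and the sample bytes are made up for illustration.

```python
def newline_styles(raw: bytes) -> dict:
    """Count line endings by style in raw, undecoded file content."""
    crlf = raw.count(b"\r\n")            # Windows
    return {
        "CRLF": crlf,
        "LF": raw.count(b"\n") - crlf,   # Linux and modern Mac
        "CR": raw.count(b"\r") - crlf,   # classic Mac
    }


def non_ascii_lines(raw: bytes) -> list:
    """Return 1-based numbers of lines that contain any byte above 0x7F."""
    return [
        number
        for number, line in enumerate(raw.splitlines(), start=1)
        if any(byte > 0x7F for byte in line)
    ]


# Hypothetical sample: mixed line endings and one UTF-8 encoded "é".
sample = b"a,b\r\nc,d\ncaf\xc3\xa9,e\n"
print(newline_styles(sample))
print(non_ascii_lines(sample))
```

Working on bytes rather than decoded text avoids silently normalizing newlines or failing on an unexpected encoding, which keeps the check independent of the answer to "how should non-ASCII characters be handled?".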

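The per-column questions in the checklist (bounds, average, null value, missing values) lend themselves to a small automated profile that can be compared against what the column is expected to contain. This is a sketch under stated assumptions: the data is CSV with a header row, the column is numeric, and the set of null markers in `NULL_MARKERS` is a guess that should be replaced with whatever the data source actually uses. The function name `profile_column` and the sample data are hypothetical.

```python
import csv
import io
from statistics import mean

# Assumption: these strings mean "no value" in the source data.
NULL_MARKERS = {"", "NA", "NULL", "(null)"}


def profile_column(rows, column):
    """Summarize one numeric column from an iterable of dict rows."""
    values, invalid, missing = [], [], 0
    for row in rows:
        cell = (row.get(column) or "").strip()
        if cell in NULL_MARKERS:
            missing += 1
            continue
        try:
            values.append(float(cell))
        except ValueError:
            invalid.append(cell)  # keep for manual review, do not guess
    return {
        "count": len(values),
        "missing": missing,
        "invalid": invalid,
        "min": min(values) if values else None,
        "max": max(values) if values else None,
        "mean": mean(values) if values else None,
    }


# Hypothetical sample with a blank cell, an "NA" marker, and a bad value.
sample = "id,temp_c\n1,20.5\n2,\n3,NA\n4,22.5\n5,abc\n"
report = profile_column(csv.DictReader(io.StringIO(sample)), "temp_c")
print(report)
```

Unparseable cells are reported rather than dropped or coerced, so the decision about how to handle them stays with a person and can be recorded as a tracked change.
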