@adamamyl
Created January 29, 2013 18:05
Adam brain-farts on "rules" for data
"it" means datasource(s)
1. is it accessible without registration or academic credentials? if not, what are the requirements, and how long does getting access take?
2. what are the licensing arrangements/terms?
3. is this a restricted dataset? does it need to be stored/processed in line with ISO/IEC 27001 [ http://en.wikipedia.org/wiki/ISO/IEC_27001 ] or other special rules?
4. is re-use permitted? even commercial? (and subsidiary)
5. where's the documentation/explanatory notes?
6. who collected it, why, and where from? has it been aggregated, anonymized or (pseudo)randomized?
7. privacy assessment
8. what format(s) is it provided in?
9. are colors used to represent things where a Boolean value would do?
10. is presentation handled in the datasource?
11. is it machine readable?
12. (spreadsheets) if there are multiple sheets, is each a separate file, or are they multi-sheet workbooks? does the column formatting/layout stay consistent as new things are added? are things cumulative?
13. if it's a word processing document with tabulated data, does that extract nicely to TSV?
14. PDFs are bad for machine reading, even though PDF is an open standard
15. if it deals with names, is it idiotic and the designer a moron who's not read [ http://www.kalzumeus.com/2010/06/17/falsehoods-programmers-believe-about-names/ ]?
16. linked data. how? data dictionary and taxonomies. where. why. license.
17. geography, how? (lat/lon | easting/northing)
18. ancillary data sources/referenced things, where? what licenses do they have? note countries list etc
19. is the countries list up to date; which list is it? (check Serbia, Macedonia, Ukraine, *slavi*, Korea(s), Vatican, San Marino). is Ilford in Essex or the London Borough of Redbridge (1965)? note there is not a freely available open-source list of countries
20. encoding: which? UTF-8 is 'good'. are letters accented properly? what about dodgy escaping? presence of ‘&’ (e.g. a leftover HTML entity such as &amp;) is usually a bad sign
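A rough sketch of how the encoding checks in item 20 might be automated, in Python. The function name `encoding_report`, the entity regex, and the mojibake markers are all assumptions made for illustration, not part of the notes above. The mojibake test is only a heuristic: it will also flag legitimate text containing "Ã" (e.g. "SÃO PAULO"), so treat a hit as a prompt to look, not as proof.

```python
import re

# Hypothetical helper for illustration; names and heuristics are assumptions.
# Matches named, decimal and hex HTML entities such as &amp; &#233; &#xE9;
ENTITY_RE = re.compile(r"&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);")

# Sequences that commonly appear when UTF-8 bytes were wrongly decoded
# as Latin-1 / Windows-1252 and then saved again.
MOJIBAKE_MARKERS = ("Ã", "â€")


def encoding_report(raw: bytes) -> dict:
    """Return a small report on the byte content of a data file."""
    try:
        text = raw.decode("utf-8")
        valid_utf8 = True
    except UnicodeDecodeError:
        # Latin-1 maps every byte to a character, so this never fails;
        # it lets the remaining checks run on non-UTF-8 input.
        text = raw.decode("latin-1")
        valid_utf8 = False
    return {
        "valid_utf8": valid_utf8,
        "html_entities": ENTITY_RE.findall(text),
        "mojibake": any(marker in text for marker in MOJIBAKE_MARKERS),
    }


if __name__ == "__main__":
    print(encoding_report(b"Marks &amp; Spencer"))
    print(encoding_report("café".encode("latin-1")))
```

Running it over the raw bytes of each file, before any parsing, keeps the check independent of whatever the CSV or spreadsheet reader decides to guess about the encoding.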