@boblannon
Last active August 29, 2015 14:11
Outline of a presentation on why and how to build a shared open data commons, plus some pitfalls to avoid.

The importance of an open, shared data commons

The work of making data public is hard

  • Policy decisions
  • Privacy concerns
  • Budgeting/resources

The path from "public" to "open", though, presents its own challenges

Often "public" means "viewable in at least one way by the public"

  • Only available as a gigantic scanned pdf

    • I mean, come on
  • Only available through a web search interface

    • e.g. Illinois lobbying, SOPR personal transaction reports
    • Useful if:
      • You know what you're looking for
      • You're reasonably sure it's in there
      • You're not looking for, say, all the data
  • Data only available through an API

    • e.g. New York Legislative Data API, openFDA API
    • APIs are fantastic for interfacing applications and programs
    • APIs are really terrible for populating databases
      • They're slooooow
      • When conducting research, you almost always need more context than a bunch of individual data points
      • Worse, many APIs bake choices about which observations/attributes a user will care about into their very design. If they haven't anticipated your need, then submit a ticket and sit tight.
  • Data is available in bulk, in machine-readable format

    • We did it!
    • Wasn't that easy?
      • …actually, it probably WAS. Way easier than:
        • Building search engines
        • Designing APIs
        • Responding to user feedback and giving technical support
      • In terms of technical development cost, if it wasn't WAY cheaper, you were conned.
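The "APIs are slow for populating databases" point above can be made concrete with back-of-the-envelope arithmetic. The page size and rate limit below are invented but typical of public-data APIs; the takeaway is that paging through a whole dataset takes hours where a bulk file takes minutes:

```python
# Rough sketch: time to page through an entire dataset via a rate-limited API.
# All numbers are illustrative assumptions, not measurements from any real API.
def api_fetch_hours(total_records, page_size=100, requests_per_second=5):
    """Hours needed to page through a dataset at a polite request rate."""
    requests_needed = -(-total_records // page_size)  # ceiling division
    return requests_needed / requests_per_second / 3600

# A modest 2-million-row dataset, fetched 100 records at a time:
print(f"{api_fetch_hours(2_000_000):.1f} hours")  # vs. minutes for one bulk download
```

And this ignores retries, rate-limit backoff, and the re-fetching you'll do whenever records change upstream.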

So now it's all out and useable: what do we need to break down silos and make data mashup-able?

  • Identifiers

    • e.g. for providers, agencies, resources, contractors
    • Key questions:
      • Are there any?
      • Can we trust them to be unique?
      • Can we trust them to be stable? (i.e., might they change some day?)
      • Do they work across datasets?
    • If there are no useful identifiers for key entities, everything you might do with this data becomes an order of magnitude more expensive
      • Robots are getting better, but the plain fact is: you need to hire manual labor to fix this
      • If at all possible, solve this before opening your data (it's possible to do while respecting privacy and maintaining security. I promise.)
  • Controlled vocabulary

    • One demerit for every field that says "other, please specify"

    • Statistical models depend on having factors:

      • Limited number of choices for the value of some attribute/field
      • e.g., what is your favorite color?
        • If the choices are {blue, green, red, yellow, orange, purple, pink}, we can start making models!
        • If the choices are {blue, green, red, yellow, orange, purple, pink, other - specify}, I guarantee you will lose 25% of your data to unusable and maddening answers like:
          • "fuschia"
          • "not sure, sorry!"
          • "red"
          • and, of course, everyone's favorite: " "
    • Importantly, if your data comes from various sources, do those sources share controlled vocabularies?

      • If you're not sure, and can't imagine the database admins of each source having regularly scheduled bowling nights, then the answer is almost certainly "no."
      • The way we label things matters
        • Across domains, words mean totally different things (e.g., "at risk" in a healthcare context vs. an education context)
        • Even within domains!
          • Merck manual vs WHO ICD
          • ICD 9 vs ICD 10
          • A published, recognized ontology vs something someone made up with Wikipedia
    • What is the solution? Tower of babel? Rosetta stone? Despotic dictionaries?

      • Maybe, but probably not
      • We don't need to get Orwellian, either
      • The best case is when each source cites some external, citable, authoritative vocabulary
      • But at the very least: document! You might know what you mean by "minor rash," "high blood pressure," or "upper arm"; we can only guess!
  • Strings attached

    • This is the really tough one, and dangerous pitfalls abound
    • To solve any of the above problems, organizations may have had to resort to practical solutions, like:
      • Conforming to a data schema that has licensing or copyright restrictions
      • Referring to a controlled vocabulary (ontology, glossary, etc.) that imposes restrictions on its use
      • Hiring (or contracting) curators or data quality firms
        • Organizations with specialized models that focus specifically on cleaning, coding and reconciling data
        • These services are invaluable, but the issue of intellectual property must be taken seriously, early on.
          • Are the coding/classification methods under a license that prevents commercial use, or use without attribution?
          • Is the product of classification (the actual codes assigned) under restriction?
    • Even when the data is already public and open, you can get into trouble fast once you start solving problems of data quality and interchangeability
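The identifier and vocabulary problems above are easy to demonstrate. This sketch uses invented records and field names; the point is only that every unmatched ID and every out-of-vocabulary value turns into a manual-labor line item before the datasets can be joined:

```python
# Invented example data: contracts disclosed by one source, a vendor registry
# from another, and a controlled vocabulary for the "status" field.
contracts = [
    {"vendor_id": "V001", "status": "active", "amount": 5000},
    {"vendor_id": "V002", "status": "ACTIVE", "amount": 1200},   # same meaning, different label
    {"vendor_id": "Acme Co.", "status": "other", "amount": 800}, # free-text name, not an ID
]
vendors = {"V001": "Acme Co.", "V002": "Beta LLC"}   # the authoritative ID list
status_vocab = {"active", "expired", "disputed"}     # the controlled vocabulary

# Rows whose identifier doesn't resolve against the registry:
unmatched_ids = [c for c in contracts if c["vendor_id"] not in vendors]
# Rows whose status value falls outside the shared vocabulary:
bad_statuses = [c for c in contracts if c["status"] not in status_vocab]

# Each row in either list needs a human to reconcile it.
print(len(unmatched_ids), len(bad_statuses))
```

Note that even "ACTIVE" fails the vocabulary check: labels that merely *look* alike still need a documented mapping before anyone can trust a join.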

Procuring technology

What does a good technology provider look like?

  • Develops and makes available open source software
  • Understands and respects the fact that the data they're handling is public domain data
  • The provider cannot, under any circumstances, act as a gatekeeper to the original itemized transactions as disclosed
  • The provider does not use proprietary identifiers or industry classifications that would require any license other than a public domain license on the data.

If the provider doesn't hit literally every one of these marks, you'll regret it. They will entrench themselves and extract endless fees for what could and should be a public service.

Always remember that, in a perfect world with endless funds, the government would be able to budget its own staff to do this. We can grudgingly accept that this might not be politically feasible everywhere, but that doesn't mean we have to accept vendor lock-in for a system that is one of the few ways we have to protect our democracy.

Early on: be wary of rent-seekers

  • "Dungeon Masters"

    • Firms or individuals hired to create custom software solutions for opening data
    • Incentivized to keep their code closed-source and under restrictive license
    • Almost always more expensive than engaging with an open-source community
    • No guarantee that the tradeoff will be more high-quality software
    • After initial investment, stuck with continuing costs associated with maintenance, support, security updates, etc. (and what happens if they fold in the meantime?!)
  • "Oracles"

    • Expert researchers or annotators hired to enrich, normalize and standardize data
    • To be clear: this level of expertise is CRUCIAL
      • You can't simply throw unskilled crowdsourcing at a task that requires deep domain knowledge
      • In many cases, these experts turn out to have pure motives
    • Nevertheless, hiring a small number of them full time is an (often painful) bottleneck to publishing quality data. There are more efficient ways of engaging experts.
  • "Wizards"

    • The latest and greatest artificial intelligence breakthrough that's going to solve all of your problems at a third of the cost and before dinnertime
    • Increasingly common!
    • Technology is amazing, machine learning is often astounding, but no black box solution is going to solve all (or even most) of your problems.
    • Wizardry is useful (I use it myself!), but it's just a tool.
      • Always ask: is it the right tool for this job?
      • When weighing the next Watson's value:
        • DON'T compare it with the previous Watson.
        • DO compare it with the costs involved in good old-fashioned manual labor
    • Remember: it's not machines vs humans
      • Artificial intelligence and machine learning don't need to replace human data work.
      • In my opinion, that's a silly goal anyway!
      • The best examples of its use involve a tight interweaving of human expertise and mechanical efficiency
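The "right comparison" for a Wizard can be made concrete with rough arithmetic. Every rate, fee, and fraction below is a made-up assumption for illustration; the point is that the baseline is manual labor, not last year's model, and that in a human-in-the-loop workflow people still review every machine label:

```python
# Hypothetical cost framing: fully manual coding vs. model-assisted coding
# with human review. All rates and fractions are invented assumptions.
def manual_cost(records, seconds_per_record=60, hourly_rate=25):
    """Cost of a human coding every record from scratch."""
    return records * seconds_per_record / 3600 * hourly_rate

def assisted_cost(records, auto_fraction=0.7, review_seconds=10,
                  seconds_per_record=60, hourly_rate=25, model_fee=2000):
    """Model labels everything; humans review every label and fully
    recode the hard cases the model can't handle."""
    review = records * review_seconds / 3600 * hourly_rate
    hard = records * (1 - auto_fraction) * seconds_per_record / 3600 * hourly_rate
    return model_fee + review + hard

print(round(manual_cost(100_000)))    # fully manual baseline
print(round(assisted_cost(100_000)))  # human-in-the-loop with a model
```

Under these (invented) numbers the Wizard helps, but only because humans stay in the loop; change the review time or the hard-case fraction and the answer can flip, which is exactly why "is it the right tool for this job?" has to be asked each time.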