@boblannon
Last active August 29, 2015 14:11
Outline of a presentation on why and how to build a shared open data commons, plus some pitfalls to avoid.

The importance of an open, shared data commons

The work of making data public is hard

  • Policy decisions
  • Privacy concerns
  • Budgeting/resources

The path from "public" to "open", though, presents its own challenges

Often "public" means "viewable in at least one way by the public"

  • Only available as a gigantic scanned pdf

    • I mean, come on
  • Only available through a web search interface

    • e.g. Illinois lobbying, SOPR personal transaction reports
    • Useful if:
      • You know what you're looking for
      • You're reasonably sure it's in there
      • You're not looking for, say, all the data
  • Data only available through an API

    • e.g. New York Legislative Data API, openFDA API
    • APIs are fantastic for interfacing applications and programs
    • APIs are really terrible for populating databases
      • They're slooooow
      • When conducting research, you almost always need more context than a bunch of individual data points
      • Worse, many APIs bake choices about which observations/attributes a user will care about into their very design. If they haven't anticipated your need, then submit a ticket and sit tight.
  • Data is available in bulk, in machine-readable format

    • We did it!
    • Wasn't that easy?
      • …actually, it probably WAS. Way easier than:
        • Building search engines
        • Designing APIs
        • Responding to user feedback and giving technical support
      • In terms of technical development cost, if it wasn't WAY cheaper, you were conned.
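The "APIs are slow for populating databases" point above can be made concrete with back-of-the-envelope arithmetic. The page size and rate limit below are invented but typical of public-data APIs; the takeaway is that paging through a whole dataset takes hours where a bulk file takes minutes:

```python
# Rough sketch: time to page through an entire dataset via a rate-limited API.
# All numbers are illustrative assumptions, not measurements from any real API.
def api_fetch_hours(total_records, page_size=100, requests_per_second=5):
    """Hours needed to page through a dataset at a polite request rate."""
    requests_needed = -(-total_records // page_size)  # ceiling division
    return requests_needed / requests_per_second / 3600

# A modest 2-million-row dataset, fetched 100 records at a time:
print(f"{api_fetch_hours(2_000_000):.1f} hours")  # vs. minutes for one bulk download
```

And this ignores retries, rate-limit backoff, and the re-fetching you'll do whenever records change upstream.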

So now it's all out and useable: what do we need to break down silos and make data mashup-able?

  • Identifiers

    • e.g. for providers, agencies, resources, contractors
    • Key questions:
      • Are there any?
      • Can we trust them to be unique?
      • Can we trust them to be stable? (i.e., might they change some day?)
      • Do they work across datasets?
    • If there are no useful identifiers for key entities, everything you might do with this data becomes an order of magnitude more expensive
      • Robots are getting better, but the plain fact is: you need to hire manual labor to fix this
      • If at all possible, solve this before opening your data (it's possible to do while respecting privacy and maintaining security. I promise.)
  • Controlled vocabulary

    • One demerit for every field that says "other, please specify"

    • Statistical models depend on having factors:

      • Limited number of choices for the value of some attribute/field
      • e.g., what is your favorite color?
        • If the choices are {blue, green, red, yellow, orange, purple, pink}, we can start making models!
        • If the choices are {blue, green, red, yellow, orange, purple, pink, other - specify}, I guarantee you will lose 25% of your data to unusable and maddening answers like:
          • "fuschia"
          • "not sure, sorry!"
          • "red"
          • and, of course, everyone's favorite: " "
    • Importantly, if your data comes from various sources, do those sources share controlled vocabularies?

      • If you're not sure, and can't imagine the database admins of each source having regularly scheduled bowling nights, then the answer is almost certainly "no."
      • The way we label things matters
        • Across domains, words mean totally different things (e.g., "at risk" in a healthcare context vs. an education context)
        • Even within domains!
          • Merck manual vs WHO ICD
          • ICD 9 vs ICD 10
          • A published, recognized ontology vs something someone made up with Wikipedia
    • What is the solution? Tower of babel? Rosetta stone? Despotic dictionaries?

      • Maybe, but probably not
      • We don't need to get Orwellian, either
      • The best case is when each source cites some external, citable, authoritative vocabulary
      • But at the very least: document! You might know what you mean by "minor rash," "high blood pressure," or "upper arm"; we can only guess!
  • Strings attached

    • This is the really tough one, and dangerous pitfalls abound
    • To solve any of the above problems, organizations may have had to resort to practical solutions, like:
      • Conforming to a data schema that has licensing or copyright restrictions
      • Referring to a controlled vocabulary (ontology, glossary, etc.) that imposes restrictions on its use
      • Hiring (or contracting) curators or data quality firms
        • Organizations with specialized models that focus specifically on cleaning, coding and reconciling data
        • These services are invaluable, but the issue of intellectual property must be taken seriously, early on.
          • Are the coding/classification methods under a license that prevents commercial use, or use without attribution?
          • Is the product of classification (the actual codes assigned) under restriction?
    • Even when the data is already public and open, you can get into trouble fast once you start solving problems of data quality and interchangeability
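The identifier and vocabulary problems above are easy to demonstrate. This sketch uses invented records and field names; the point is only that every unmatched ID and every out-of-vocabulary value turns into a manual-labor line item before the datasets can be joined:

```python
# Invented example data: contracts disclosed by one source, a vendor registry
# from another, and a controlled vocabulary for the "status" field.
contracts = [
    {"vendor_id": "V001", "status": "active", "amount": 5000},
    {"vendor_id": "V002", "status": "ACTIVE", "amount": 1200},   # same meaning, different label
    {"vendor_id": "Acme Co.", "status": "other", "amount": 800}, # free-text name, not an ID
]
vendors = {"V001": "Acme Co.", "V002": "Beta LLC"}   # the authoritative ID list
status_vocab = {"active", "expired", "disputed"}     # the controlled vocabulary

# Rows whose identifier doesn't resolve against the registry:
unmatched_ids = [c for c in contracts if c["vendor_id"] not in vendors]
# Rows whose status value falls outside the shared vocabulary:
bad_statuses = [c for c in contracts if c["status"] not in status_vocab]

# Each row in either list needs a human to reconcile it.
print(len(unmatched_ids), len(bad_statuses))
```

Note that even "ACTIVE" fails the vocabulary check: labels that merely *look* alike still need a documented mapping before anyone can trust a join.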

Procuring technology

What does a good technology provider look like?

  • Develops and makes available open source software
  • Understands and respects the fact that the data they're handling is public domain data
  • The provider cannot, under any circumstances, act as a gatekeeper to the original itemized transactions as disclosed
  • The provider does not use proprietary identifiers or industry classifications that would require any license other than a public domain license on the data.

If the provider doesn't hit literally every one of these marks, you'll regret it. They will entrench themselves and extract endless fees for what could and should be a public service.

Always remember that, in a perfect world with endless funds, the government would be able to budget its own staff to do this. We can grudgingly accept that this might not be politically feasible everywhere, but that doesn't mean we have to accept vendor lock-in for a system that is one of the few ways we have to protect our democracy.

Early on: be wary of rent-seekers

  • "Dungeon Masters"

    • Firms or individuals hired to create custom software solutions for opening data
    • Incentivized to keep their code closed-source and under restrictive license
    • Almost always more expensive than engaging with an open-source community
    • No guarantee that the tradeoff will be more high-quality software
    • After initial investment, stuck with continuing costs associated with maintenance, support, security updates, etc. (and what happens if they fold in the meantime?!)
  • "Oracles"

    • Expert researchers or annotators hired to enrich, normalize and standardize data
    • To be clear: this level of expertise is CRUCIAL
      • You can't simply throw unskilled crowdsourcing at a task that requires deep domain knowledge
      • In many cases, these experts turn out to have pure motives
    • Nevertheless, hiring a small number of them full time is an (often painful) bottleneck to publishing quality data. There are more efficient ways of engaging experts.
  • "Wizards"

    • The latest and greatest artificial intelligence breakthrough that's going to solve all of your problems at a third of the cost and before dinnertime
    • Increasingly common!
    • Technology is amazing, machine learning is often astounding, but no black box solution is going to solve all (or even most) of your problems.
    • Wizardry is useful (I use it myself!), but it's just a tool.
      • Always ask: is it the right tool for this job?
      • When weighing the next Watson's value:
        • DON'T compare it with the previous Watson.
        • DO compare it with the costs involved in good old-fashioned manual labor
    • Remember: it's not machines vs humans
      • Artificial intelligence and machine learning don't need to replace human data work.
      • In my opinion, that's a silly goal anyway!
      • The best examples of its use involve a tight interweaving of human expertise and mechanical efficiency
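The "right comparison" for a Wizard can be made concrete with rough arithmetic. Every rate, fee, and fraction below is a made-up assumption for illustration; the point is that the baseline is manual labor, not last year's model, and that in a human-in-the-loop workflow people still review every machine label:

```python
# Hypothetical cost framing: fully manual coding vs. model-assisted coding
# with human review. All rates and fractions are invented assumptions.
def manual_cost(records, seconds_per_record=60, hourly_rate=25):
    """Cost of a human coding every record from scratch."""
    return records * seconds_per_record / 3600 * hourly_rate

def assisted_cost(records, auto_fraction=0.7, review_seconds=10,
                  seconds_per_record=60, hourly_rate=25, model_fee=2000):
    """Model labels everything; humans review every label and fully
    recode the hard cases the model can't handle."""
    review = records * review_seconds / 3600 * hourly_rate
    hard = records * (1 - auto_fraction) * seconds_per_record / 3600 * hourly_rate
    return model_fee + review + hard

print(round(manual_cost(100_000)))    # fully manual baseline
print(round(assisted_cost(100_000)))  # human-in-the-loop with a model
```

Under these (invented) numbers the Wizard helps, but only because humans stay in the loop; change the review time or the hard-case fraction and the answer can flip, which is exactly why "is it the right tool for this job?" has to be asked each time.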