@chriswhong
Created July 1, 2016 20:08
Idea for git-powered distributed dataset management

The Problem:

If you follow the open data scene, you'll often hear about how the "feedback loop" for making corrections, comments, or asking questions about datasets is fuzzy, disjointed, or nonexistent. If I know for a fact that something in a government dataset is wrong, how do I get that record fixed? Do I call 311? Will the operator even know what I'm talking about if I say I want to correct a single record in a public dataset? There's Dat. There's storing your data as a CSV on GitHub. These approaches work, but they are very much developer-centric (pull requests and diffs are hard to wrap your head around if you spend your day analyzing data in Excel or desktop GIS). The fact of the matter is that most of the people managing datasets in government organizations are not DBAs, data scientists, or programmers.

Idea:

It's basically git for data plus a simple UI for exploration, management, and editing. Users would have to sign in with GitHub SSO to edit in the UI, and behind the scenes they would actually be opening pull requests against individual files.

GithubDB:

Imagine a dataset sliced into a single file for each row, organized into a nice logical hierarchy. I have an immediate use for this workflow with spatial data, so in this case each file is a GeoJSON feature. There's no database; GitHub is the database.
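
To make that concrete, here's a minimal sketch (in TypeScript) of what a per-row file and its path might look like. The directory names, the id convention, and the `agency` property are illustrative assumptions, not part of the original proposal.

```typescript
// Hypothetical layout: one GeoJSON Feature per file, keyed by a stable row id, e.g.
//   data/parks/central-park.geojson
//   data/parks/prospect-park.geojson

// Minimal shape of what each row file would contain.
interface RowFeature {
  type: "Feature";
  id: string;                           // stable identifier, also used in the file path
  properties: Record<string, unknown>;  // the row's attributes
  geometry: { type: string; coordinates: unknown } | null;
}

// Example row file contents (data/parks/central-park.geojson):
const exampleRow: RowFeature = {
  type: "Feature",
  id: "central-park",
  properties: { name: "Central Park", borough: "Manhattan", agency: "DPR" },
  geometry: { type: "Point", coordinates: [-73.9654, 40.7829] },
};
```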

SimpleUI:

On the UI front, each row gets its own landing page... you can make a simple call to the GitHub API from JavaScript and render all of the contents of that row, including a map, and even show edits made to that row by checking the commit history. You can add all the logic for validating edits here, and turn each row of data into a simple web form that anyone can modify. You can also add row-specific discussion (with something like Disqus or whatever, for commentary that needs to follow the row of data but doesn't necessarily belong in the git/publishing workflow). When the user makes changes and hits submit, a pull request is created behind the scenes with their name on it! The data manager can review and comment on the PR, and eventually merge it if everything looks good. The dialogue happens in the open, and the transactions are all logged for the world to review. The devs and techies can just use GitHub for their workflow, the less technical can use the simple UI, everybody wins, and there is no confusion about who validates a change and where it came from.
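
As a rough illustration of that flow, here's a hedged TypeScript sketch against the GitHub REST API (contents, commits, refs, and pulls endpoints). The owner, repo, default branch, and helper names are placeholders, and the token is assumed to come from the GitHub SSO step described above; treat it as a sketch of the idea, not a finished implementation.

```typescript
// Placeholders for the dataset repo; the token comes from the SSO flow.
const API = "https://api.github.com";
const OWNER = "example-org";
const REPO = "example-dataset";

async function gh(path: string, token: string, init: RequestInit = {}) {
  const res = await fetch(`${API}${path}`, {
    ...init,
    headers: {
      Accept: "application/vnd.github+json",
      Authorization: `Bearer ${token}`,
      ...(init.headers ?? {}),
    },
  });
  if (!res.ok) throw new Error(`${res.status} ${res.statusText}`);
  return res.json();
}

// Render a row: fetch the file's contents and the commit history for that path.
async function loadRow(rowPath: string, token: string) {
  const file = await gh(`/repos/${OWNER}/${REPO}/contents/${rowPath}`, token);
  const feature = JSON.parse(atob(file.content)); // contents API returns base64
  const history = await gh(
    `/repos/${OWNER}/${REPO}/commits?path=${encodeURIComponent(rowPath)}`,
    token
  );
  return { feature, history, fileSha: file.sha };
}

// "Submit" from the web form: commit the edit to a new branch, then open a PR.
// Assumes the default branch is named "main".
async function submitEdit(rowPath: string, fileSha: string, edited: object, user: string, token: string) {
  const branch = `edit-${user}-${Date.now()}`;
  const main = await gh(`/repos/${OWNER}/${REPO}/git/ref/heads/main`, token);
  await gh(`/repos/${OWNER}/${REPO}/git/refs`, token, {
    method: "POST",
    body: JSON.stringify({ ref: `refs/heads/${branch}`, sha: main.object.sha }),
  });
  await gh(`/repos/${OWNER}/${REPO}/contents/${rowPath}`, token, {
    method: "PUT",
    body: JSON.stringify({
      message: `Edit ${rowPath} via the simple UI`,
      content: btoa(JSON.stringify(edited, null, 2)),
      sha: fileSha, // sha of the existing file, required for updates
      branch,
    }),
  });
  return gh(`/repos/${OWNER}/${REPO}/pulls`, token, {
    method: "POST",
    body: JSON.stringify({ title: `Edit ${rowPath}`, head: branch, base: "main" }),
  });
}
```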

Build:

The dataset repo can be "built" into the file-based data we are used to seeing on agency websites and open data portals. A script scoops up all of the files and turns them into a CSV, shapefile, or GeoJSON FeatureCollection on a nightly basis, for publishing elsewhere, etc.
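
For the GeoJSON case, the nightly build could be as simple as something like this (Node + TypeScript); the `data/` and `dist/` paths are assumptions.

```typescript
// Walk the per-row files and emit a single GeoJSON FeatureCollection.
import { promises as fs } from "fs";
import * as path from "path";

async function collectFeatures(dir: string): Promise<object[]> {
  const entries = await fs.readdir(dir, { withFileTypes: true });
  const features: object[] = [];
  for (const entry of entries) {
    const full = path.join(dir, entry.name);
    if (entry.isDirectory()) {
      features.push(...(await collectFeatures(full))); // recurse into subfolders
    } else if (entry.name.endsWith(".geojson")) {
      features.push(JSON.parse(await fs.readFile(full, "utf8")));
    }
  }
  return features;
}

async function build() {
  const features = await collectFeatures("data");
  const collection = { type: "FeatureCollection", features };
  await fs.mkdir("dist", { recursive: true });
  await fs.writeFile("dist/dataset.geojson", JSON.stringify(collection));
  console.log(`Built ${features.length} features`);
}

build();
```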

More to the UI: Slicing out subsets of the data for validation by others. In our use case, we maintain data that needs to be updated by various agencies. This usually means sending a form to someone in that agency asking them to fill in everything from scratch whether we already have it or not... or sending them a spreadsheet and telling them to make corrections and send it back (if you don't think this is an AWFUL workflow, stop reading; this idea is not for you). A human then has to curate all of the changes, and finally we get a "version" of the dataset that can be released. Repeat 6 months later, etc.

I imagine using the UI to cordon off slices of the larger dataset for curation by the "authorized editor" in the other agency. Basically, if the row has your agency tag, it's on your list, and you'll get a nice curated view of your slice of the pie. We can add a little dashboard to help that agency moderator understand what has changed since they last logged in, and who changed it. They can download just their slice, and they can make changes just like anyone else can... again, when they submit, it's just a big pull request. Just because an edit came from an agency and not a citizen doesn't mean we give it special treatment; it's subject to the same workflow as everyone else.
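
For illustration, the agency-slice and "what changed since last login" views might boil down to filters like these (TypeScript; the `agency` property and the row summary shape are assumptions carried over from the earlier sketches).

```typescript
// Summary of a row as the dashboard would see it: the file path, the feature's
// properties, and the most recent commit touching that file.
interface RowSummary {
  path: string;
  feature: { properties: Record<string, unknown> };
  lastCommit: { date: string; author: string };
}

// Rows belonging to a given agency's slice.
function agencySlice(rows: RowSummary[], agencyTag: string): RowSummary[] {
  return rows.filter((r) => r.feature.properties.agency === agencyTag);
}

// Rows in the slice that changed since the moderator last logged in, and who changed them.
function changedSince(rows: RowSummary[], lastLogin: Date) {
  return rows
    .filter((r) => new Date(r.lastCommit.date) > lastLogin)
    .map((r) => ({
      path: r.path,
      changedBy: r.lastCommit.author,
      changedAt: r.lastCommit.date,
    }));
}
```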

If you are reading this, please tell me your thoughts. Blog post coming soon.

@themightychris

@chriswhong @technickle @datapolitain one of the things that really got me excited about this idea is some of the new possibilities it opens up in terms of ETL and moving data back and forth between external systems of record.

One of the really cool aspects of git's commit DAG is that it enables everyone to have their own version of the truth simultaneously in the same repository. So imagine you implement your ETL process as a git remote helper, which can provide push and fetch implementations for custom remote protocols. At the minimum, you implement fetch so that you can pull the complete remote state of the external data source, converting it to a git tree and then committing it with a timestamp and metadata against the tip of the remote-tracking branch for that external data source. That remote-tracking branch can just be continuously and belligerently updated to track the remote state, and you can compare it with and merge it into other branches to mark reconciliations. Then, if you're able to implement push, you can record a merge commit into the remote-tracking branch (which doesn't actually exist in a remote repo but is synthesized by the remote helper).
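
To make the fetch half of that concrete: a real remote helper would speak git's remote-helper protocol over stdin/stdout, but a much-simplified sketch of the same idea (snapshot the external source into a tree and commit it onto a remote-tracking ref via git plumbing) might look like this in TypeScript. `fetchExternalRows` and the ref name are placeholders.

```typescript
// Simplified "fetch": snapshot an external source into a git tree and commit it
// onto a remote-tracking ref using git plumbing, by shelling out to git.
import { execSync } from "child_process";

function git(args: string, input?: string): string {
  return execSync(`git ${args}`, { input, encoding: "utf8" }).trim();
}

// Placeholder extractor for the external system of record (flat file names only,
// since this sketch builds a single-level tree with `git mktree`).
async function fetchExternalRows(): Promise<Map<string, string>> {
  return new Map([
    ["example-row.geojson", JSON.stringify({ type: "Feature", id: "example-row", properties: {}, geometry: null })],
  ]);
}

async function snapshotExternalSource(trackingRef = "refs/remotes/external/source"): Promise<string> {
  const rows = await fetchExternalRows();

  // One blob per row file, sorted by path so mktree gets entries in tree order.
  const entries = [...rows.entries()].sort(([a], [b]) => (a < b ? -1 : a > b ? 1 : 0));
  const treeLines = entries.map(([filePath, contents]) => {
    const blobSha = git("hash-object -w --stdin", contents);
    return `100644 blob ${blobSha}\t${filePath}`;
  });
  const treeSha = git("mktree", treeLines.join("\n") + "\n");

  // Commit against the tip of the tracking ref (if it exists) with a timestamped message.
  let parentArg = "";
  try {
    parentArg = `-p ${git(`rev-parse --verify --quiet ${trackingRef}`)}`;
  } catch {
    // first snapshot: no parent commit yet
  }
  const commitSha = git(`commit-tree ${parentArg} -m "External snapshot ${new Date().toISOString()}" ${treeSha}`);
  git(`update-ref ${trackingRef} ${commitSha}`);
  return commitSha;
}

snapshotExternalSource().then((sha) => console.log(`Snapshot committed: ${sha}`));
```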

@themightychris

Just came across this, git-inspired but not git-based: https://github.com/attic-labs/noms
