
@chriswhong
Created July 1, 2016 20:08
Idea for git-powered distributed dataset management

The Problem:

If you follow the open data scene, you'll often hear about how the "feedback loop" for making corrections, comments, or asking questions about datasets is either fuzzy, disjointed, or nonexistent. If I know for a fact that something in a government dataset is wrong, how do I get that record fixed? Do I call 311? Will the operator even know what I am talking about if I say I want to make a correction to a single record in a public dataset? There's DAT. There's storing your data as a CSV in GitHub. These approaches work, but are very much developer-centric (pull requests and diffs are hard to wrap your head around if you spend your day analyzing data in Excel or desktop GIS). The fact of the matter is that most of the people managing datasets in government organizations are not DBAs, data scientists, or programmers.

Idea:

It's basically git for data plus a simple UI for exploration, management, and editing. Users would have to use GitHub SSO to edit in the UI, and behind the scenes they would actually be doing pull requests on individual files.

GithubDB:

Imagine a dataset sliced into a single file for each row, and organized into a nice logical hierarchy. I have an immediate use for this workflow for spatial data, so in this case, each file is a GeoJSON Feature. There's no database; GitHub is the database.
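To make "one file per row" concrete, here's a rough sketch of what a single row could look like on disk, at a hypothetical path like `data/pizzashops/pizzashop-0042.geojson` (the path scheme and property names are made up for illustration):

```json
{
  "type": "Feature",
  "id": "pizzashop-0042",
  "properties": {
    "name": "Sal's Famous",
    "borough": "Brooklyn",
    "agency": "DOHMH"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [-73.958, 40.683]
  }
}
```

The directory hierarchy itself becomes the index: browsing the repo is browsing the dataset.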

SimpleUI:

On the UI front, each row gets its own landing page... you can make a simple call to the GitHub API from javascript and render all of the contents of that row, including a map, and even show edits made to that row by checking the commit history. You can add all the logic for validating edits here, and turn each row of data into a simple web form that anyone can modify. You can also add row-specific discussion (with something like Disqus or whatever, for commentary that needs to follow the row of data but doesn't necessarily belong in the git/publishing workflow). When the user makes changes and hits submit, a pull request is created behind the scenes with their name on it! The data manager can review and comment on the PR, and eventually merge it if everything looks good. The dialogue is done in the open, and the transactions are all logged for the world to review. The devs and techies can just use GitHub for their workflow, the less technical can use the simple UI, everybody wins, and there is no confusion about who validates a change and where it came from.
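As a sketch of the read side of that UI (the repo name and file path below are hypothetical), the landing page could pull the row and its edit history straight from the GitHub REST API:

```js
// A sketch of the read side of the UI; repo and file path are hypothetical.
const owner = 'chriswhong';
const repo = 'nyc-pizzashops';
const rowPath = 'rows/pizzashop-0042.geojson';

// Fetch the row itself (the GitHub contents API returns base64-encoded content)
async function getRow() {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/contents/${rowPath}`);
  const file = await res.json();
  return JSON.parse(atob(file.content)); // the GeoJSON Feature for this row
}

// Fetch the edit history for just this row by asking for commits that touch its path
async function getRowHistory() {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/commits?path=${rowPath}`);
  return res.json(); // array of commits, newest first
}
```

The write side would go through the same API: when the user hits submit, the app could create a branch, commit the edited file, and open the pull request on their behalf.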

Build:

The dataset repo can be "built" into the file-based data we are used to seeing on agency websites and open data portals. A script will scoop up all of the files and turn them into a CSV, shapefile, or GeoJSON FeatureCollection on a nightly basis, for publishing elsewhere, etc.
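A minimal sketch of that build step, assuming each row lives as an individual Feature file under a hypothetical `rows/` directory (CSV and shapefile outputs would need extra tooling on top of this):

```js
// A minimal sketch of the nightly build: reassemble one-file-per-row into a FeatureCollection.
const fs = require('fs');
const path = require('path');

const rowsDir = path.join(__dirname, 'rows'); // hypothetical location of the row files

// Read every row file and parse it back into a Feature object
const features = fs.readdirSync(rowsDir)
  .filter((name) => name.endsWith('.geojson'))
  .map((name) => JSON.parse(fs.readFileSync(path.join(rowsDir, name), 'utf8')));

// Bundle the rows back into a single FeatureCollection for publishing
const featureCollection = { type: 'FeatureCollection', features };

fs.mkdirSync(path.join(__dirname, 'build'), { recursive: true });
fs.writeFileSync(path.join(__dirname, 'build', 'dataset.geojson'), JSON.stringify(featureCollection));
```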

More to the UI: slicing out subsets of the data for validation by others. In our use case, we maintain data that needs to be updated by various different agencies. This usually means sending a form to someone in that agency asking them to fill in everything from scratch whether we already have it or not... or sending them a spreadsheet and telling them to make corrections and send it back (if you don't think this is an AWFUL workflow, stop reading, this idea is not for you). A human then has to curate all of the changes, and finally we get a "version" of the dataset that can be released. Repeat 6 months later, etc.

I imagine using the UI to cordon off slices of the larger dataset for curation by the "authorized editor" in the other agency. Basically, if the row has your agency tag, it's on your list, and you'll get a nice curated view of your slice of the pie. We can add a little dashboard to help that agency moderator understand what has changed since they last logged in, and who changed it. They can download just their slice, and they can make changes just like anyone else can... again, when they submit, it's just a big pull request. Just because an edit came from an agency and not a citizen doesn't mean we give it special treatment. It's subject to the same workflow as everyone else.
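Carving out a slice could be as simple as filtering the built dataset on the agency tag (the `agency` property used below is hypothetical):

```js
// A sketch of one agency's slice of the built dataset.
const fs = require('fs');

const dataset = JSON.parse(fs.readFileSync('build/dataset.geojson', 'utf8'));

function sliceForAgency(agency) {
  return {
    type: 'FeatureCollection',
    features: dataset.features.filter((f) => f.properties.agency === agency),
  };
}

// e.g. the DOT moderator's dashboard would only ever load sliceForAgency('DOT')
```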

If you are reading this, please tell me your thoughts. Blog post coming soon.

@talos

talos commented Jul 1, 2016

Interesting... "one file per row" solves some problems with diffs, file size, and noise in general, but definitely raises some issues with # of rows that can be stored. Even if you do deep nesting, you'll run into some pretty insoluble slowdowns above a few hundred thousand rows using git as transport and the file system as a DB (which is what everything resolves to at the moment you pull from the git repo & reassemble the db).

You definitely should take a look at the architecture of Who's on First, which bears some major similarities -- they store each geometry as a single piece of GeoJSON, then have big tabular metadata tables that tie all those pieces of GeoJSON together. Most editing could be done on the GeoJSON as it is. Definitely worth talking to Mapzen. What you're looking for may be able to borrow large parts of their architecture under-the-hood, plus some editing features (and removing the focus on geo, as your pipeline is not limited to geodata).

@technickle

jkan.io kind of does this (one file per row) along with a UI editor, though it's not particularly geographic. It also relies heavily on jekyll processing to transform the file collection into a composite json file. I'm a really, really big fan of the architecture. It's stunningly elegant and flexible.

@chriswhong

I should note that we actually turned on JKAN this week and the "githubDB" architecture is what inspired this line of thinking. @timwis

@datapolitan

I think this is a great idea. I forked this and added links, both to references @chriswhong made in his draft and those of @technickle, @talos, and @auremoser, as well as some minor text changes. Feel free to merge those in.

Additionally, +1 to @technickle on the system of record concern. When this and the official data store diverge, which is to be believed? I think this is probably a superior means of auditing changes to these systems but understand that might not be apparent to the lawyers who'd be involved. I think bulk uploading commits (making changes to persistent, recurring problems with the data) might be a challenge worth looking at and solving. I could see this as a good intermediary to implementing data stores in government that do this natively (baking in a flexible data update and management interface that can be exposed programmatically to users). I wonder whether this is a healthy antidote to or downward spiral into the memory hole problem in government. This makes me think about how Twitter is changing the nature of public feedback, replacing the direct closed connection with an open transparent means of engagement.

I look forward to where this goes and am happy to help in any way I can.

@chriswhong

chriswhong commented Jul 3, 2016

DANGIT - DAta Nudged into GIT

Here's a first run at a builder script:
https://github.com/chriswhong/dangit/
Dataset repos should have a dangit.json in their root that includes the type of build to do. For now, the only type is geojson, where it expects rows to be Features and the build to be a FeatureCollection.
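For reference, a minimal dangit.json might look like the following; the exact key name is a guess, so check the dangit repo for the real schema:

```json
{
  "type": "geojson"
}
```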

I made a sample dataset repo called nyc-pizzashops. Its repo is here, and its built GeoJSON FeatureCollection is here.

If you are so inclined, please add rows of data to nyc-pizzashops, or edit existing rows, build, and PR! Open issues or add TODOs in the dangit code. Once we have a nice collection of edits to track, we can think about the UI to show edit history.

@chriswhong

@technickle @datapolitan I share your concern about getting edits back into a system of record, but I do think it ends up just being an ETL challenge like any other. As far as "officialness" goes, I think the "system of record" and the "dataset most people trust and use" aren't always the same thing, and if a maintainer is not doing their job, the community has the option to fork the repo and go down a different path, just like OSS. I think it is on par with pulling OSM data edits into official city spatial databases. If a high-quality source of truth exists outside of government, government should put the hooks in place to tap into it and use it wherever possible. I could see a dangit repo existing as a community-driven source of updates, and some ETL to check for changes in the dangit repo and merge them into the database of record.

@mheadd

mheadd commented Jul 5, 2016

For parts of the data that are stewarded by other agencies, I wonder if you could leverage Git submodules for this. The various agencies that maintain their slice of the data would only see their data in their UI and they would all get rolled up into one master version (which gets maintained in the system of record) via submodules...

Not sure if this would work, but it's what I immediately thought of when reading your comments about slicing off bits of the data for different agencies to review and update.
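A sketch of how that could be wired up with plain git (the agency repo names are hypothetical):

```sh
# Each agency maintains its slice in its own repo; the master repo pulls
# them in as submodules.
git submodule add https://github.com/dot-agency/street-rows.git slices/dot
git submodule add https://github.com/parks-agency/park-rows.git slices/parks

# Roll the latest state of every agency slice up into the master version
git submodule update --remote --merge
```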

@davidread

I like it - your tool concept is certainly a step forward.

Of course, git is a bit of a shaky foundation, with its file-based system having limitations of scale & performance, compared to any database. But one day there will be something better. And what you've done makes a lot of sense - to make use of the versioning, PRs and github hooks.

There is also a need for officials in different parts of government to be able to file PRs so they can share a dataset.

@ejaxon

ejaxon commented Jul 7, 2016

I think your idea of implementing dataset corrections as pull requests is outstanding, though I'm also hesitant to use git as a foundation long-term.

I've just started playing with the idea of using GraphQL as the basis for a standard connector interface for civic data (going beyond individual datasets here). It has a number of characteristics that make it interesting as a building block. It allows clients to determine the shape of the data they want, allows inclusion of nested data queries without additional round-trips, has both read and write capabilities, and, particularly usefully, automatically builds in query validation and the ability to introspect - to ask a server what queries it supports - which makes for the possibility of building really nice tooling.
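For a feel of what that buys you, a client could ask for exactly the shape of data it needs, nested records included, in a single round trip (the schema here is entirely hypothetical):

```graphql
{
  dataset(id: "nyc-pizzashops") {
    name
    rows(first: 10, updatedSince: "2016-06-01") {
      id
      properties
      history {
        author
        committedAt
      }
    }
  }
}
```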

The dream would be to get to the point where internal government systems are required to use something like this as standard glue between SORs (especially internally); then we have the connection back to the SOR. My feeling is that over the long term we need to rethink internal systems so that making data available (whether for internal consumption across silos or externally) becomes a simple, standard operation rather than a project, and we need to do it in a way that allows applications developed by different groups and companies to be integrated in standardized ways without a lot of cost or complexity.

There are a few things missing from GraphQL alone which would be needed to make it work. Obviously, layers for authentication, authorization, and more sophisticated data validation need to be added. I also think it's important to be able to chain GraphQL servers - I think this could be part of the basis for addressing the issues with @mheadd's microservices idea. The idea you're proposing here makes it obvious that we need support for pull request semantics for mutations.

@timwis

timwis commented Jul 25, 2016

This is awesome! @themightychris recently proposed a really similar idea but using columns as the files instead of rows (there's probably way more to the idea and I've butchered it).

Also reminds me of an idea I posted about prose a couple months ago. I think there's definitely something here!

@themightychris

@chriswhong @technickle @datapolitan one of the things that really got me excited about this idea is some of the new possibilities it opens up in terms of ETL and moving data back and forth between external systems of record.

One of the really cool aspects of git's commit DAG is that it enables everyone to have their own version of the truth simultaneously in the same repository. So imagine you implement your ETL process as a git remote helper, which can provide push and fetch implementations for custom remote protocols. At a minimum, you implement fetch so that you can pull the complete remote state of the external data source, convert it to a git tree, and then commit it with timestamp+metadata against the tip of the remote tracking branch for that external data source. That remote tracking branch can just be continuously and belligerently updated to track the remote state, and you can compare it with and merge it into other branches to mark reconciliations. Then, if you're able to implement push, you can record a merge commit onto the remote tracking branch (which doesn't actually exist in a remote repo but is synthesized by the remote helper).
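To make that concrete, here's roughly how the day-to-day flow could look with a hypothetical git-remote-agencydb helper on the PATH; git dispatches to it whenever a remote URL uses that scheme:

```sh
# Register the external system of record as a remote; the agencydb:: scheme,
# helper, and connection string are all hypothetical.
git remote add agencydb agencydb::postgres://records.example.gov/facilities

# fetch: the helper snapshots the external DB into a git tree and commits it
# onto the remote tracking branch
git fetch agencydb

# Compare and reconcile the external state with the curated branch
git diff main agencydb/main
git merge agencydb/main

# push (if implemented): the helper translates the merged tree back into
# writes against the external system
git push agencydb main
```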

@themightychris

Just came across this, git-inspired but not git-based: https://github.com/attic-labs/noms
