
@chriswhong
Created July 1, 2016 20:08
Idea for git-powered distributed dataset management

The Problem:

If you follow the open data scene, you'll often hear about how the "feedback loop" for making corrections, comments, or asking questions about datasets is either fuzzy, disjointed, or nonexistent. If I know for a fact that something in a government dataset is wrong, how do I get that record fixed? Do I call 311? Will the operator even know what I am talking about if I say I want to make a correction to a single record in a public dataset? There's DAT. There's storing your data as a CSV in GitHub. These approaches work, but are very much developer-centric (pull requests and diffs are hard to wrap your head around if you spend your day analyzing data in Excel or desktop GIS). The fact of the matter is that most of the people managing datasets in government organizations are not DBAs, data scientists, or programmers.

Idea:

It's basically git for data plus a simple UI for exploration, management, and editing. Users would have to use GitHub SSO to edit in the UI, and behind the scenes they would actually be doing pull requests on individual files.

GithubDB:

Imagine a dataset sliced into a single file for each row, and organized into a nice logical hierarchy. I have an immediate use for this workflow for spatial data, so in this case, each file is a GeoJSON Feature. There's no database; GitHub is the database.
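To make "one file per row" concrete, here's a rough sketch of what a single row could look like on disk, at a hypothetical path like `data/pizzashops/pizzashop-0042.geojson` (the path scheme and property names are made up for illustration):

```json
{
  "type": "Feature",
  "id": "pizzashop-0042",
  "properties": {
    "name": "Sal's Famous",
    "borough": "Brooklyn",
    "agency": "DOHMH"
  },
  "geometry": {
    "type": "Point",
    "coordinates": [-73.958, 40.683]
  }
}
```

The directory hierarchy itself becomes the index: browsing the repo is browsing the dataset.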

SimpleUI:

On the UI front, each row gets its own landing page... you can make a simple call to the GitHub API from javascript and render all of the contents of that row, including a map, and even show edits made to that row by checking the commit history. You can add all the logic for validating edits here, and turn each row of data into a simple web form that anyone can modify. You can also add row-specific discussion (with something like Disqus or whatever, for commentary that needs to follow the row of data but doesn't necessarily belong in the git/publishing workflow). When the user makes changes and hits submit, a pull request is created behind the scenes with their name on it! The data manager can review and comment on the PR, and eventually merge it if everything looks good. The dialogue is done in the open, and the transactions are all logged for the world to review. The devs and techies can just use GitHub for their workflow, the less technical can use the simple UI, everybody wins, and there is no confusion about who validates a change and where it came from.
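As a sketch of the read side of that UI (the repo name and file path below are hypothetical), the landing page could pull the row and its edit history straight from the GitHub REST API:

```js
// A sketch of the read side of the UI; repo and file path are hypothetical.
const owner = 'chriswhong';
const repo = 'nyc-pizzashops';
const rowPath = 'rows/pizzashop-0042.geojson';

// Fetch the row itself (the GitHub contents API returns base64-encoded content)
async function getRow() {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/contents/${rowPath}`);
  const file = await res.json();
  return JSON.parse(atob(file.content)); // the GeoJSON Feature for this row
}

// Fetch the edit history for just this row by asking for commits that touch its path
async function getRowHistory() {
  const res = await fetch(`https://api.github.com/repos/${owner}/${repo}/commits?path=${rowPath}`);
  return res.json(); // array of commits, newest first
}
```

The write side would go through the same API: when the user hits submit, the app could create a branch, commit the edited file, and open the pull request on their behalf.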

Build:

The dataset repo can be "built" into the file-based data we are used to seeing on agency websites and open data portals. A script will scoop up all of the files and turn them into a CSV, shapefile, or GeoJSON FeatureCollection on a nightly basis, for publishing elsewhere, etc.
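A minimal sketch of that build step, assuming each row lives as an individual Feature file under a hypothetical `rows/` directory (CSV and shapefile outputs would need extra tooling on top of this):

```js
// A minimal sketch of the nightly build: reassemble one-file-per-row into a FeatureCollection.
const fs = require('fs');
const path = require('path');

const rowsDir = path.join(__dirname, 'rows'); // hypothetical location of the row files

// Read every row file and parse it back into a Feature object
const features = fs.readdirSync(rowsDir)
  .filter((name) => name.endsWith('.geojson'))
  .map((name) => JSON.parse(fs.readFileSync(path.join(rowsDir, name), 'utf8')));

// Bundle the rows back into a single FeatureCollection for publishing
const featureCollection = { type: 'FeatureCollection', features };

fs.mkdirSync(path.join(__dirname, 'build'), { recursive: true });
fs.writeFileSync(path.join(__dirname, 'build', 'dataset.geojson'), JSON.stringify(featureCollection));
```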

More to the UI: slicing out subsets of the data for validation by others. In our use case, we maintain data that needs to be updated by various different agencies. This usually means sending a form to someone in that agency asking them to fill in everything from scratch whether we already have it or not... or sending them a spreadsheet and telling them to make corrections and send it back (if you don't think this is an AWFUL workflow, stop reading, this idea is not for you). A human then has to curate all of the changes, and finally we get a "version" of the dataset that can be released. Repeat 6 months later, etc.

I imagine using the UI to cordon off slices of the larger dataset for curation by the "authorized editor" in the other agency. Basically, if the row has your agency tag, it's on your list, and you'll get a nice curated view of your slice of the pie. We can add a little dashboard to help that agency moderator understand what has changed since they last logged in, and who changed it. They can download just their slice, and they can make changes just like anyone else can... again, when they submit, it's just a big pull request. Just because an edit came from an agency and not a citizen doesn't mean we give it special treatment. It's subject to the same workflow as everyone else.
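Carving out a slice could be as simple as filtering the built dataset on the agency tag (the `agency` property used below is hypothetical):

```js
// A sketch of one agency's slice of the built dataset.
const fs = require('fs');

const dataset = JSON.parse(fs.readFileSync('build/dataset.geojson', 'utf8'));

function sliceForAgency(agency) {
  return {
    type: 'FeatureCollection',
    features: dataset.features.filter((f) => f.properties.agency === agency),
  };
}

// e.g. the DOT moderator's dashboard would only ever load sliceForAgency('DOT')
```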

If you are reading this, please tell me your thoughts. Blog post coming soon.

@talos

talos commented Jul 1, 2016

Interesting... "one file per row" solves some problems with diffs, file size, and noise in general, but definitely raises some issues with # of rows that can be stored. Even if you do deep nesting, you'll run into some pretty insoluble slowdowns above a few hundred thousand rows using git as transport and the file system as a DB (which is what everything resolves to at the moment you pull from the git repo & reassemble the db).

You definitely should take a look at the architecture of Who's on First, which bears some major similarities -- they store each geometry as a single piece of GeoJSON, then have big tabular metadata tables that tie all those pieces of GeoJSON together. Most editing could be done on the GeoJSON as it is. Definitely worth talking to Mapzen. What you're looking for may be able to borrow large parts of their architecture under-the-hood, plus some editing features (and removing the focus on geo, as your pipeline is not limited to geodata).

@technickle

jkan.io kind of does this (one file per row) along with a UI editor, though it's not particularly geographic. It also relies heavily on jekyll processing to transform the file collection into a composite json file. I'm a really, really big fan of the architecture. It's stunningly elegant and flexible.

@chriswhong

I should note that we actually turned on JKAN this week and the "githubDB" architecture is what inspired this line of thinking. @timwis

@datapolitan

I think this is a great idea. I forked this and added links, both to references @chriswhong made in his draft and those of @technickle, @talos, and @auremoser, as well as some minor text changes. Feel free to merge those in.

Additionally, +1 to @technickle on the system of record concern. When this and the official data store diverge, which is to be believed? I think this is probably a superior means of auditing changes to these systems but understand that might not be apparent to the lawyers who'd be involved. I think bulk uploading commits (making changes to persistent, recurring problems with the data) might be a challenge worth looking at and solving. I could see this as a good intermediary to implementing data stores in government that do this natively (baking in a flexible data update and management interface that can be exposed programmatically to users). I wonder whether this is a healthy antidote to or downward spiral into the memory hole problem in government. This makes me think about how Twitter is changing the nature of public feedback, replacing the direct closed connection with an open transparent means of engagement.

I look forward to where this goes and am happy to help in any way I can.

@chriswhong

chriswhong commented Jul 3, 2016

DANGIT - DAta Nudged into GIT

Here's a first run at a builder script:
https://github.com/chriswhong/dangit/
Dataset repos should have a dangit.json in their root that includes the type of build to do. For now, the only type is geojson, where it expects rows to be Features and the build to be a FeatureCollection.
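For reference, a minimal dangit.json might look like the following; the exact key name is a guess, so check the dangit repo for the real schema:

```json
{
  "type": "geojson"
}
```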

I made a sample dataset repo called nyc-pizzashops. Its repo is here, and its built GeoJSON FeatureCollection is here.

If you are so inclined, please add rows of data to nyc-pizzashops, or edit existing rows, build, and PR! Open issues or add TODOs in the dangit code. Once we have a nice collection of edits to track, we can think about the UI to show edit history.

@chriswhong

@technickle @datapolitan I share your concern about getting edits back into a system of record, but I do think it ends up just being an ETL challenge like any other. As far as "officialness" goes, I think the "system of record" and the "dataset most people trust and use" aren't always the same thing, and if a maintainer is not doing their job, the community has the option to fork the repo and go down a different path, just like OSS. I think it is on par with pulling OSM data edits into official city spatial databases. If a high-quality source of truth exists outside of government, government should put the hooks in place to tap into it and use it wherever possible. I could see a dangit repo existing as a community-driven source of updates, and some ETL to check for changes in the dangit repo and merge them into the database of record.

@mheadd

mheadd commented Jul 5, 2016

For parts of the data that are stewarded by other agencies, I wonder if you could leverage Git submodules for this. The various agencies that maintain their slice of the data would only see their data in their UI and they would all get rolled up into one master version (which gets maintained in the system of record) via submodules...

Not sure if this would work, but it's what I immediately thought of when reading your comments about slicing off bits of the data for different agencies to review and update.
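A sketch of how that could be wired up with plain git (the agency repo names are hypothetical):

```sh
# Each agency maintains its slice in its own repo; the master repo pulls
# them in as submodules.
git submodule add https://github.com/dot-agency/street-rows.git slices/dot
git submodule add https://github.com/parks-agency/park-rows.git slices/parks

# Roll the latest state of every agency slice up into the master version
git submodule update --remote --merge
```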

@davidread

I like it - your tool concept is certainly a step forward.

Of course, git is a bit of a shaky foundation, with its file-based system having limitations of scale & performance, compared to any database. But one day there will be something better. And what you've done makes a lot of sense - to make use of the versioning, PRs and github hooks.

There is also a need for officials in different parts of government to be able to file PRs so they can share a dataset.

@ejaxon

ejaxon commented Jul 7, 2016

I think your idea of implementing dataset corrections as pull requests is outstanding, though I'm also hesitant to use git as a foundation long-term.

I've just started playing with the idea of using GraphQL as the basis for a standard connector interface for civic data (going beyond individual datasets here). It has a number of characteristics that make it interesting as a building block. It allows clients to determine the shape of the data they want, allows inclusion of nested data queries without additional round-trips, has both read and write capabilities, and, particularly usefully, automatically builds in query validation and the ability to introspect - to ask a server what queries it supports - which makes for the possibility of building really nice tooling.
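For a feel of what that buys you, a client could ask for exactly the shape of data it needs, nested records included, in a single round trip (the schema here is entirely hypothetical):

```graphql
{
  dataset(id: "nyc-pizzashops") {
    name
    rows(first: 10, updatedSince: "2016-06-01") {
      id
      properties
      history {
        author
        committedAt
      }
    }
  }
}
```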

The dream would be to get to the point where internal government systems are required to use something like this as standard glue between SORs (especially internally); then we have the connection back to the SOR. My feeling is that over the long term we need to rethink internal systems so that making data available (whether for internal consumption across silos or externally) becomes a simple, standard operation rather than a project, and we need to do it in a way that allows applications developed by different groups and companies to be integrated in standardized ways without a lot of cost or complexity.

There are a few things missing from GraphQL alone which would be needed to make it work. Obviously, layers for authentication, authorization, and more sophisticated data validation need to be added. I also think it's important to be able to chain GraphQL servers - I think this could be part of the basis for addressing the issues with @mheadd's microservices idea. The idea you're proposing here makes it obvious that we need support for pull request semantics for mutations.

@timwis

timwis commented Jul 25, 2016

This is awesome! @themightychris recently proposed a really similar idea but using columns as the files instead of rows (there's probably way more to the idea and I've butchered it).

Also reminds me of an idea I posted about prose a couple months ago. I think there's definitely something here!

@themightychris

@chriswhong @technickle @datapolitan one of the things that really got me excited about this idea is some of the new possibilities it opens up in terms of ETL and moving data back and forth between external systems of record.

One of the really cool aspects of git's commit DAG is that it enables everyone to have their own version of the truth simultaneously in the same repository. So imagine you implement your ETL process as a git remote helper, which can provide push and fetch implementations for custom remote protocols. At a minimum, you implement fetch so that you can pull the complete remote state of the external data source, convert it to a git tree, and then commit it with timestamp+metadata against the tip of the remote tracking branch for that external data source. That remote tracking branch can just be continuously and belligerently updated to track the remote state, and you can compare it with and merge it into other branches to mark reconciliations. Then, if you're able to implement push, you can record a merge commit onto the remote tracking branch (which doesn't actually exist in a remote repo but is synthesized by the remote helper).
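To make that concrete, here's roughly how the day-to-day flow could look with a hypothetical git-remote-agencydb helper on the PATH; git dispatches to it whenever a remote URL uses that scheme:

```sh
# Register the external system of record as a remote; the agencydb:: scheme,
# helper, and connection string are all hypothetical.
git remote add agencydb agencydb::postgres://records.example.gov/facilities

# fetch: the helper snapshots the external DB into a git tree and commits it
# onto the remote tracking branch
git fetch agencydb

# Compare and reconcile the external state with the curated branch
git diff main agencydb/main
git merge agencydb/main

# push (if implemented): the helper translates the merged tree back into
# writes against the external system
git push agencydb main
```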

@themightychris

Just came across this, git-inspired but not git-based: https://github.com/attic-labs/noms
