@luigi
Created July 15, 2009 17:22

Proposal for the National Data Catalog

Prepared by David James and Luigi Montanez

Overview

The National Data Catalog aims to be a complete catalog of all data sets and APIs that are either put out by the government or are derived from the government. Scoped to all government levels (federal, state, and local), and all branches (executive, legislative, judicial), NDC will be the one-stop shop for developers, researchers, and investigative journalists interested in government data.

NDC will tap into the social benefits of having users come together around common interests. More than just a catalog, it will be a place for community-supported documentation about government data.

What problem are we solving?

There is no central repository of government data. No single government entity will be able to build such a catalog, as the separation of powers between branches and among agencies will hamper such an effort. Although the name "data.gov" might imply broad governmental coverage, it focuses only on federal, executive branch data.

This is a huge inconvenience for developers, researchers, and investigative journalists trying to find government data. They often resort to Google or rely on institutional knowledge of what data is available. When the data is finally found, documentation can be poor or non-existent, further delaying any productive use of it.

Is this aligned with Sunlight's mission?

Yes. This project will make it dramatically easier for citizens to find data produced by the government and data about the government. It also will provide community support tools for working with the data. This can empower citizens to build their own applications and mash-ups on top of existing, previously hard-to-find government data sources.

Will this get media attention?

Yes. The site makes for an interesting story because it can be directly contrasted with Data.gov. It will garner attention from journalists and researchers who want to use it as a tool and cite it in their articles.

What other services like this exist?

What will we learn from the project?

  • Identify what entities are publishing what kind of data
  • Compare and contrast quality of data put out by those entities
  • Build new connections with government entities that are putting out data
  • Gather business intelligence on what exactly Sunlight stakeholders are most interested in
  • Identify overlaps among data sources
  • Gain experience in categorization and creating ontologies

What are the measurable project goals?

The primary goal is coverage: How much of the government data out there is in our catalog? We should strive for 100% coverage. To get to that point, we need to define a process for attaining complete coverage.
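One way to make the coverage goal measurable is as a ratio of cataloged sources to known sources. The sketch below assumes a hypothetical inventory that records, per entity, how many data sources are known to exist and how many are already cataloged. The entity names and counts are illustrative only, not real figures.

```ruby
# Hypothetical inventory: for each entity, how many data sources we know
# exist and how many are already in the catalog. The numbers are made up.
INVENTORY = {
  "U.S. Congress"  => { known: 40, cataloged: 30 },
  "State of Idaho" => { known: 10, cataloged: 2 }
}

# Coverage as a percentage, rounded to one decimal place.
def coverage_percent(known, cataloged)
  return 0.0 if known.zero?
  (100.0 * cataloged / known).round(1)
end

# Overall coverage across every entity in the inventory.
def overall_coverage(inventory)
  known = inventory.values.sum { |entry| entry[:known] }
  cataloged = inventory.values.sum { |entry| entry[:cataloged] }
  coverage_percent(known, cataloged)
end
```

The hard part is the denominator: the number of known sources can only come from the curation process, which is why that process needs to be defined first.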

General to all projects, targets need to be set for:

  • Press hits
  • Page views
  • User totals
  • Contributed knowledge (via the discussion/comment feature)

Is this something we are good at?

Yes. The Sunlight Labs API and Data Commons are similar in spirit to this project. We know a lot of the people who put out these data sources. At the Labs, we've already worked with many of these data sources. We'd be our own users of this product.

Who are the customers?

  1. Developers
  2. Investigative Journalists/Researchers
  3. Government Employees

Sunlight can also use this project internally to determine the community energy behind data sources.

Deliverables

For version 1.0, to launch at the end of Q3 2009, we will deliver a data catalog with the following features:

  • Searching
  • Browsing
  • Commenting
  • Data Previews and Basic Visualization
  • Categorization
  • Annotation and Tagging
  • Collection and Display of Usage Examples
  • Watch Lists
  • Submissions and Flagging for Bad Data
  • Recommendations for Similar Data Sources

Generally speaking, these features fall into one or more of the following buckets:

  • Categorization of similar data sources
  • Annotation of individual data sources
  • Social sharing of knowledge about data sources

Includes data sources from and about:

  • U.S. Congress
  • U.S. Executive Branch
  • U.S. Judicial Branch
  • All 50 States
  • Major U.S. cities (DC, NYC, San Francisco, Chicago, LA, Seattle)

Timeline and Resource Allocation

Project is slated for Q3 2009. The resources available are:

  • Two developers, full-time (David and Luigi)
  • Designer, shared time (Ali)
  • Management (Clay)
  • Server setup (Tim)

Potential resources:

  • Intern, for research and data collection/entry (defined as curation)

Tentative Timeline:

  • Week of July 6 - Sunlight Staff Meeting on the Project
  • Last week of July - Alpha
    • Basic catalog and search available
    • Cover Data.gov and several Congressional sources
    • Design wrapper complete
    • Basic curation tools available
  • First week of September - Beta
    • User/community tools available (discussions, watch lists, submissions)
    • Cover entirety of federal government, several states, and several cities
    • U/I complete
    • Command line interface available
  • End of September - Launch
    • Cover all known state and local governments
    • Any additional features gleaned from alpha/beta feedback

Personas

Personas are a useful way to think about the stakeholders for our project. We have included a few below.

Bill, civic hacker

Bill got his start by parsing FEC data, and he's played with THOMAS as well. He's getting bored of hacking on Congress, so he wants to get some obscure data from the Executive branch's agencies, and maybe try some things on the state and local level. As a hacker with ADHD, Bill has little patience, so he wants to find datasets fast, wants to be sure that the data is what he expects, and wants to be able to get it on his computer with minimal hassle.

Maude, investigative journalist

Maude works for a non-profit news organization. She has some pretty good skills with Excel and Access, but doesn't consider herself to be a developer. Maude's focus is on the influence of money in politics, at all levels of government. She wants to see what kind of data is out there, keep tabs on useful data sets that can support her work, and ultimately use that data to produce groundbreaking stories.

Sam, employee for a state's CIO office

Sam wants to see his state's government open up and become more transparent. As an employee of his state's CIO, he's been pushing to revamp the websites, participate in social media, and publish data sets. He's thinking of running his own contest like Apps for Democracy and Apps for America. Sam wants to make sure that his hard work is reaching the right people.

Joanna, staffer at the Sunlight Labs

Joanna curates the data catalog. As such, she wants an easy way to enter in new data sources and mark them up appropriately. Also, she handles the processing of submissions from government agencies and non-profits who publish their own data sources. She also does light content management, such as setting "Featured Data Sources" for viewing on the homepage and updating copy from time to time.

Features

Feature stories are written in a Cucumber-friendly format. Each story is only a high-level description of a feature and corresponds to several screens of the actual application. When development begins, step-by-step details (called Scenarios) will be written.

Browse and Search the Catalog

Feature: Searching for a particular data source

In order to find U.S. demographic data on race/ethnicity to build a mashup
As Bill, civic hacker
I want to type "race ethnicity" in a search field and get back a list of matching data sets
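As a rough illustration of this search story, here is a minimal sketch in Ruby. It assumes a hypothetical in-memory catalog of hashes with `title` and `tags` keys. The real application would query whichever data store is chosen, and the sample records are invented.

```ruby
# Hypothetical in-memory catalog; the real data store is still undecided.
CATALOG = [
  { title: "Census race and ethnicity tables", tags: ["Race", "Ethnicity", "Demographics"] },
  { title: "Sunlight Labs API", tags: ["Congress"] }
]

# Return the data sources whose title or tags contain every query term.
# Matching is a case-insensitive substring match, so "race" also matches "trace".
def search(catalog, query)
  terms = query.downcase.split
  catalog.select do |source|
    haystack = ([source[:title]] + source[:tags]).join(" ").downcase
    terms.all? { |term| haystack.include?(term) }
  end
end
```

Calling `search(CATALOG, "race ethnicity")` returns only the Census entry. A production search would likely use a full-text index instead of substring matching.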

Feature: Browsing for data sources

In order to get an overview on what kind of data the State of Idaho publishes
As Bill, civic hacker
I want to click on a few links, and eventually see a list of matching data sets

Feature: Gauging the utility of a data source

In order to gauge whether a particular data set will be useful  
As Maude, investigative journalist  
I want to be able to see a preview of the data set, some usage examples, comments from others, and general information like source entity, file size, file format, and URL. 

Feature: Adding data source to watch list

In order to remember a particular data set for later use
As Maude, investigative journalist
I want to be able to choose "Watch" on a data source when viewing its page

Feature: Viewing updates from watch list

In order to view updates on my watched data sources
As Bill, civic hacker 
I want to see a feed on my home page with recent discussion comments and any updates to the annotations of the data sources I currently watch
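The watch list feed can be sketched as a filter over recent activity. The `Event` struct, its fields, and the default limit below are assumptions made for illustration. In the real application these records would come from discussion comments and annotation updates.

```ruby
# Hypothetical activity record: which data source changed, what kind of
# change it was ("comment" or "annotation"), and when (ISO 8601 string).
Event = Struct.new(:source_id, :kind, :at)

# Return events for the watched data sources, newest first.
def watch_feed(events, watched_ids, limit = 20)
  events
    .select { |event| watched_ids.include?(event.source_id) }
    .sort_by(&:at)
    .reverse
    .first(limit)
end
```

ISO 8601 timestamps sort correctly as plain strings, which keeps the sketch free of date parsing.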

Contribute Knowledge

Feature: Leaving a comment about a data source

In order to let others know that the Census data is a pain to parse
As Bill, civic hacker
I want to leave some detailed notes with some tips and tricks I learned along the way when working with the data

Feature: Report a data source as bad

In order to report that Texas's Longhorn Population Data Set is no longer available at the current URL
As Sam, state employee and good samaritan
I want to be able to flag the data set, and leave a comment describing the problem

Feature: Submit a data source

In order to submit a new labor statistics data set
As Sam, state employee
I want to be able to fill in a form, giving details about the data source and leaving some additional comments that I'd like the curator to be able to read

Update the Catalog

Feature: Add new data source

In order to add a newly discovered data set on Alaska's oil production
As Joanna, catalog curator
I want to enter the data source into the system, categorizing and annotating it, and ensuring that the data preview is correct and that the page looks good

Feature: Process submission of data source

In order to process an incoming submission of a new labor statistics data set
As Joanna, catalog curator
I want to be able to view the submission, do my own research based on the information provided, and enter the data source into the system if it meets some established criteria

Feature: Process data source that has been flagged

In order to process a data source that someone flagged as being bad
As Joanna, catalog curator
I want to be able to view the flagged entry and comment from the user, send a message to that user if I have any more questions, and resolve the situation by either updating the data source, or removing it

Feature: Remove data source

In order to remove a data source that is no longer valid
As Joanna, catalog curator
I want to be able to unpublish a data source from the catalog, but still be able to view it and bring it back to the catalog at a later time if needed
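This story implies that removal is reversible rather than a hard delete. A minimal sketch of that idea follows, using a plain Ruby class as a stand-in for whatever model the application ends up with. The class and method names are hypothetical.

```ruby
# Hypothetical DataSource with a reversible "published" flag, so removal
# hides the record from the catalog without deleting it.
class DataSource
  attr_reader :title

  def initialize(title)
    @title = title
    @published = true
  end

  def published?
    @published
  end

  def unpublish!
    @published = false
  end

  def republish!
    @published = true
  end
end

# The public catalog lists only published sources; curators still hold
# the full collection and can republish any entry.
def public_listing(sources)
  sources.select(&:published?)
end
```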

Feature: Mark data source as "Not Machine Readable"

In order to process a PDF data source that is not machine parsable
As Joanna, catalog curator
I want to mark it as "Not Machine Readable", and be able to mark it as a candidate for OCR or for TransparencyCorps

Technical Architecture

Architectures considered:

  • RADAR
  • Standard Rails approach (combined web application and RESTful API)

RADAR:

[Diagram: RADAR architecture]

Standard Rails approach:

[Diagram: traditional Rails architecture]

Choice of Database

We have several options:

  • Relational
    • MySQL
    • PostgreSQL
  • Document-based
    • CouchDB
    • MongoDB
  • RDF / triple-based

Some diagrams of what those solutions may look like:

Relational:

[Diagram: relational data model (Data Model.graffle)]

Document-based:

[Diagram: document data model (Data Model.graffle)]

RDF / triple-based:

[Diagram: triples data model (Data Model.graffle)]

Despite recent interest in document-based and triple-based data storage, the underlying relationships in the data model are quite similar across the three styles. We are currently leaning toward a hybrid of relational (via ActiveRecord) and document data storage (CouchDB/MongoDB). Specifically, the "DataSource" object is a prime candidate to fit into the document-based data store, while the supporting objects (Users, Categories) could be quickly created with existing Rails plugins, all of which require ActiveRecord. This plays to the strengths of each technology.

Here are two examples of what the DataSource object might look like as a JSON document. The first describes an API, and the second describes a data set culled from Data.gov:

{
  "data_source" : {
    "title" : "Sunlight Labs API",
    "source_type" : "API",
    "entity" : {
      "name" : "Sunlight Labs",
      "url" : "http://sunlightlabs.com",
      "type" : "Non-Profit Organization",
      "description" : "Sunlight Labs is part of the Sunlight Foundation, a non-profit that..."
    },
    "api_description" : "RESTful XML/JSON",
    "access_restrictions" : "Requires API Key",
    "documentation_url" : "http://services.sunlightlabs.com/api/",
    "category_id" : 15, // points to a Category ActiveRecord object that acts_as_tree
    "description" : "The Sunlight Labs API provides methods for obtaining basic information on Members of Congress, legislator IDs used by various websites, and lookups between places and the politicians that represent them. The primary purpose of the API is to facilitate mashups involving politicians and the various other APIs that are out there.",
    "tags" : ["Congress", "Members of Congress", "U.S. House", "U.S. Senate"],
    "comments" : {
      "comment3903" : {
        "user_id" : 323, // points to a User ActiveRecord object
        "body" : "This API is awesome! Works great!",
        "rating" : 5,
        "status" : "green",
        "published_at" : "2009-01-01 00:00:00",
        "created_at" : "2009-01-01 00:00:00",
        "updated_at" : "2009-01-01 00:00:00"
      },
      "comment3955" : {
        "user_id" : 666,
        "body" : "This API sucks...\n\nI can do better.",
        "rating" : 2,
        "published_at" : "2009-01-01 00:00:00",
        "created_at" : "2009-01-01 00:00:00",
        "updated_at" : "2009-01-01 00:00:00"
      }
    },
    "user_id" : 2, // points to a User ActiveRecord object (using authlogic, clearance, or restful_authentication)
    "published_at" : "2009-01-01 00:00:00",
    "created_at" : "2009-01-01 00:00:00",
    "updated_at" : "2009-01-01 00:00:00"
  }
}

{
  "data_source" : {
    "title": "2005 Toxics Release Inventory data for Puerto Rico",
    "source_type": "File",
    "entity" : {
      "name" : "U.S. Environmental Protection Agency",
      "type" : "Federal Agency",
      "level" : "Federal",
      "branch" : "Executive",
      "url" : "http://epa.gov"
    },
    "category_id" : 22,
    "date_released" : "2007-03-22",
    "date_updated" : "2008-11-05",
    "time_period" : "2005",
    "frequency" : "Annual",
    "description" : "The Toxics Release Inventory (TRI) is a publicly available EPA database that contains information on toxic chemical releases and waste management activities reported annually by certain industries as well as federal facilities.",
    "found_from_url" : "http://www.data.gov/details/177",
    "program_url" : "http://www.epa.gov/tri/",
    "description_url" : "http://www.epa.gov/tri/tridata/tri05/data/index.htm",
    "documentation_url" : "http://www.epa.gov/tri/tridata/tri05/data/States_Doc_2005_v05.pdf",
    "granularity" : "Longitude/Latitude",
    "geographic_coverage" : "Puerto Rico",
    "tags" : ["TRI", "TRI Data", "TRI Reporting", "Toxic", "Toxic Release"],
    "user_id" : 2,
    "published_at" : "2009-01-01 00:00:00",
    "created_at" : "2009-01-01 00:00:00",
    "updated_at" : "2009-01-01 00:00:00"
  }
}
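To illustrate the hybrid approach, the sketch below parses a trimmed DataSource document and resolves its `user_id` and `category_id` against the relational side. Plain hashes stand in for the ActiveRecord tables, and the user name "joanna" and category name "Environment" are hypothetical values chosen for the example. The embedded document omits the `//` annotations, since strict JSON parsers reject comments.

```ruby
require "json"

# Stand-ins for the relational side (ActiveRecord tables in the proposal).
# The names mapped to these IDs are hypothetical.
USERS = { 2 => "joanna" }
CATEGORIES = { 22 => "Environment" }

# A trimmed, strictly valid JSON version of a DataSource document.
DOCUMENT = <<~JSON
  {
    "data_source": {
      "title": "2005 Toxics Release Inventory data for Puerto Rico",
      "source_type": "File",
      "category_id": 22,
      "tags": ["TRI", "Toxic"],
      "user_id": 2
    }
  }
JSON

# Parse the document and attach the relational records that its
# foreign keys point to.
def resolve(document_json, users, categories)
  source = JSON.parse(document_json).fetch("data_source")
  source.merge(
    "user" => users[source["user_id"]],
    "category" => categories[source["category_id"]]
  )
end
```

The document store keeps the flexible, source-specific fields, while the IDs let existing Rails plugins manage users and categories.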

Command Line Interface

Offering a console-driven interface will send a strong signal that we are engaging the developer community.

Because its feature set will largely mirror the features of the web app, the CLI will be worked on after the initial web app is created.
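As a sketch of what the CLI's argument handling might look like, the example below uses OptionParser from Ruby's standard library. The `ndc` command name, the `search` subcommand, and the `--limit` and `--tag` flags are all assumptions made for illustration, not a planned interface.

```ruby
require "optparse"

# Parse arguments for a hypothetical `ndc` command.
def parse_cli(argv)
  options = { limit: 10 }
  parser = OptionParser.new do |opts|
    opts.banner = "Usage: ndc search [options] TERMS"
    opts.on("-l", "--limit N", Integer, "Maximum number of results") do |n|
      options[:limit] = n
    end
    opts.on("-t", "--tag TAG", "Only show sources with this tag") do |tag|
      options[:tag] = tag
    end
  end
  remaining = parser.parse(argv) # the non-option arguments, in order
  options.merge(command: remaining.shift, terms: remaining)
end
```

For example, `parse_cli(%w[search --limit 5 race ethnicity])` yields the command `"search"`, a limit of 5, and the terms `["race", "ethnicity"]`.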

Development Practices

We plan to use outside-in development, using Cucumber, RSpec, Machinist, Mocha, Webrat, and other similar tools.

Next Steps

  • Define goals and appropriate metrics.
  • Map out SEO strategy.
  • Define "data source". Our tentative definition is: data sets and APIs that are either put out by the government or are derived from the government. We intend to use a broad definition of government that includes quasi-government agencies such as the Tennessee Valley Authority and Fannie Mae.
  • Discuss real-world dirtiness of data: overlapping data, out-of-date, bad formats.
  • Determine a process for curation of data sources. Allocate resources as needed, possibly including intern support.