# Data Provenance working paper
## Motivation

Data provenance is metadata that keeps track of where a piece of data comes from
and what operations have been performed on it. This metadata provides a way to assess
authenticity, establish trust, and reproduce results. Provenance is particularly important
when the data users, data producers, and data wranglers are different groups of people.
## Overall idea

We need to keep track of two types of objects: data and programs. Each dataset
may have interesting properties (metadata) that can be extracted. The goal is to keep track of
the provenance of a dataset, i.e. to be able to easily identify the original sources of the data
and which operations were performed on it.
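To make the two object types concrete, here is a minimal sketch in Python; all class and field names are hypothetical, chosen for illustration rather than taken from any existing system:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A particular (immutable) version of a dataset."""
    name: str
    content: bytes
    metadata: dict = field(default_factory=dict, compare=False)

    @property
    def checksum(self) -> str:
        # A content hash doubles as a stable identifier for this version,
        # which makes the immutability assumption checkable.
        return hashlib.sha256(self.content).hexdigest()


@dataclass(frozen=True)
class Program:
    """A version-controlled program, identified by repository and commit."""
    repo_url: str
    commit: str


@dataclass(frozen=True)
class Derivation:
    """One provenance step: a program applied to inputs, yielding an output."""
    program: Program
    inputs: tuple  # checksums of the input Datasets
    output: str    # checksum of the derived Dataset
```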
## Short term goal

Simplifying assumptions:

* Operations on data all occur via source code (i.e. no manual operations), and the programs are version controlled.
* All data and programs are publicly available.
* The data can be represented in tabular format, perhaps even as a single CSV file.
* A central system manages the provenance.
* A particular version of a dataset is immutable.

Test case: [National Map](http://nationalmap.nicta.com.au/)

We would like to record how a particular derived dataset was obtained, i.e. the original sources
and the workflow used to arrive at the new dataset. Note that multiple datasets can be merged,
and different programs can be applied to a particular dataset. Furthermore, the same program version
can be applied to different datasets, resulting in different derived datasets.
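Under these assumptions, recording such a derivation might look like the following, continuing the hypothetical sketch above; the repository URL, commit hash, and `run` helper are all invented for illustration:

```python
# `run` stands in for actually executing a program; here it just
# concatenates the inputs so the example is self-contained and runnable.
def run(program: Program, *inputs: Dataset) -> bytes:
    return b"".join(d.content for d in inputs)


roads = Dataset("roads", b"id,length\n1,4.2\n")
rail = Dataset("rail", b"id,length\n2,9.1\n")
merge = Program("https://example.org/wrangling.git", commit="3f2a9c1")

# Two source datasets merged by one program version.
merged = Dataset("transport", run(merge, roads, rail))
record = Derivation(merge, (roads.checksum, rail.checksum), merged.checksum)

# The same program version applied to a different input yields a different
# derived dataset, with its own provenance record.
rail_only = Dataset("rail-only", run(merge, rail))
record2 = Derivation(merge, (rail.checksum,), rail_only.checksum)
```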
One important feature for the end user would be an interactive visualisation of the provenance.
This includes the graph corresponding to the workflow used to arrive at a particular dataset,
the metadata (e.g. summary statistics) corresponding to each dataset along the way, and
an indication of the quality and complexity of the programs used in the transformations.
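As one possible starting point for that visualisation, the provenance graph from the sketch above could be exported to Graphviz DOT; this is purely illustrative, not a committed design:

```python
def to_dot(datasets, derivations):
    """Emit a Graphviz DOT description of the provenance graph.

    `datasets` maps checksum -> Dataset and `derivations` is a list of
    Derivation records, as in the sketch above.
    """
    lines = ["digraph provenance {"]
    for checksum, ds in datasets.items():
        # Datasets become boxes, labelled with their name.
        lines.append(f'  "{checksum[:8]}" [label="{ds.name}", shape=box];')
    for i, d in enumerate(derivations):
        # Each derivation becomes an ellipse, labelled with the program commit.
        prog = f"prog_{i}"
        lines.append(f'  "{prog}" [label="{d.program.commit[:7]}", shape=ellipse];')
        for inp in d.inputs:
            lines.append(f'  "{inp[:8]}" -> "{prog}";')
        lines.append(f'  "{prog}" -> "{d.output[:8]}";')
    lines.append("}")
    return "\n".join(lines)


datasets = {d.checksum: d for d in (roads, rail, merged, rail_only)}
print(to_dot(datasets, [record, record2]))
```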
## Minor extensions

* Support non-program operations, such as [LOOM](http://www.revelytix.com/?q=content/loom) or [BURRITO](http://pgbovine.net/projects/pubs/guo_burrito_tapp_2012.pdf)
* Log in to view private data
* Allow user ratings of data and programs
## Longer term ideas

* Distributed trust management
* Derivation of ratings for new data from the ratings of its data sources and programs
* Automatic verification of provenance claims