# Data Provenance working paper
## Motivation

Data provenance is metadata that keeps track of where a piece of data comes from
and what operations have been performed on it. This metadata provides a way to assess
authenticity, establish trust, and reproduce results. Provenance is particularly important
when the data users, data producers, and data wranglers are different groups of people.
## Overall idea

We need to keep track of two types of objects: data and programs. Each dataset
may have interesting properties (metadata) that can be extracted. The goal is to keep track of
the provenance of a dataset, i.e. to be able to easily identify the original sources of the data
and which operations were performed on it.
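To make the two object types concrete, here is a minimal sketch in Python; all class and field names are hypothetical, chosen for illustration rather than taken from any existing system:

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    """A particular (immutable) version of a dataset."""
    name: str
    content: bytes
    metadata: dict = field(default_factory=dict, compare=False)

    @property
    def checksum(self) -> str:
        # A content hash doubles as a stable identifier for this version,
        # which makes the immutability assumption checkable.
        return hashlib.sha256(self.content).hexdigest()


@dataclass(frozen=True)
class Program:
    """A version-controlled program, identified by repository and commit."""
    repo_url: str
    commit: str


@dataclass(frozen=True)
class Derivation:
    """One provenance step: a program applied to inputs, yielding an output."""
    program: Program
    inputs: tuple  # checksums of the input Datasets
    output: str    # checksum of the derived Dataset
```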
## Short term goal

Simplifying assumptions:

* Operations on data all occur via source code (i.e. no manual operations), and the programs are version controlled.
* All data and programs are publicly available.
* The data can be represented in tabular format, perhaps even as a single CSV file.
* A central system manages the provenance.
* A particular version of a dataset is immutable.

Test case: [National Map](http://nationalmap.nicta.com.au/)

We would like to record how a particular derived dataset was obtained, i.e. the original sources
and the workflow used to arrive at the new dataset. Note that multiple datasets can be merged,
and different programs can be applied to a particular dataset. Furthermore, the same program version
can be applied to different datasets, resulting in different derived datasets.
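Under these assumptions, recording such a derivation might look like the following, continuing the hypothetical sketch above; the repository URL, commit hash, and `run` helper are all invented for illustration:

```python
# `run` stands in for actually executing a program; here it just
# concatenates the inputs so the example is self-contained and runnable.
def run(program: Program, *inputs: Dataset) -> bytes:
    return b"".join(d.content for d in inputs)


roads = Dataset("roads", b"id,length\n1,4.2\n")
rail = Dataset("rail", b"id,length\n2,9.1\n")
merge = Program("https://example.org/wrangling.git", commit="3f2a9c1")

# Two source datasets merged by one program version.
merged = Dataset("transport", run(merge, roads, rail))
record = Derivation(merge, (roads.checksum, rail.checksum), merged.checksum)

# The same program version applied to a different input yields a different
# derived dataset, with its own provenance record.
rail_only = Dataset("rail-only", run(merge, rail))
record2 = Derivation(merge, (rail.checksum,), rail_only.checksum)
```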
One important feature for the end user would be an interactive visualisation of the provenance.
This includes the graph corresponding to the workflow used to arrive at a particular dataset,
the metadata (e.g. summary statistics) corresponding to each dataset along the way, and
an indication of the quality and complexity of the programs used in the transformations.
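As one possible starting point for that visualisation, the provenance graph from the sketch above could be exported to Graphviz DOT; this is purely illustrative, not a committed design:

```python
def to_dot(datasets, derivations):
    """Emit a Graphviz DOT description of the provenance graph.

    `datasets` maps checksum -> Dataset and `derivations` is a list of
    Derivation records, as in the sketch above.
    """
    lines = ["digraph provenance {"]
    for checksum, ds in datasets.items():
        # Datasets become boxes, labelled with their name.
        lines.append(f'  "{checksum[:8]}" [label="{ds.name}", shape=box];')
    for i, d in enumerate(derivations):
        # Each derivation becomes an ellipse, labelled with the program commit.
        prog = f"prog_{i}"
        lines.append(f'  "{prog}" [label="{d.program.commit[:7]}", shape=ellipse];')
        for inp in d.inputs:
            lines.append(f'  "{inp[:8]}" -> "{prog}";')
        lines.append(f'  "{prog}" -> "{d.output[:8]}";')
    lines.append("}")
    return "\n".join(lines)


datasets = {d.checksum: d for d in (roads, rail, merged, rail_only)}
print(to_dot(datasets, [record, record2]))
```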
## Minor extensions

* Support non-program operations, such as [LOOM](http://www.revelytix.com/?q=content/loom) or [BURRITO](http://pgbovine.net/projects/pubs/guo_burrito_tapp_2012.pdf)
* Log in to view private data
* Allow user ratings of data and programs
## Longer term ideas

* Distributed trust management
* Derivation of ratings for new data from the ratings of its data sources and programs
* Automatic verification of provenance claims