GIST of the ERAV data model

I've tried to extract the 'gist' of this interesting paper about managing crowd-sourced geospatial data. I think it has direct applicability to a data system for geologic field observations.

From S. Andrew Sheppard, Andrea Wiggins, and Loren Terveen, 2014, Capturing Quality: Retaining Provenance for Curated Volunteer Monitoring Data; accessed 2014-03-01 at http://wq.io/media/papers/provenance_cscw14.pdf

from the abstract:

a general outline of the workflow tasks common to field-based data collection, and a novel data model for preserving provenance metadata that allows for ongoing data exchange between disparate technical systems and participant skill levels.

Some key points (my take):

[observers] head out into the field to collect data. This step is referred to as the event in our proposed data model. As noted above, this is “the intersection of a person, a bird, a time, and a place” for eBird. Similarly, a River Watch sampling event is the combination of a sampling team, a time, and a predetermined site along a stream. [strangely like O&M...]
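To make that concrete for myself, here's a minimal sketch (mine, not the paper's) of an event identified by a natural key rather than a database-assigned ID; the field names are made up:

```python
from dataclasses import dataclass
from datetime import date

@dataclass(frozen=True)
class Event:
    """A sampling event: the intersection of a team, a time, and a site.

    The three fields together form the natural key: two records with the
    same site, date, and team describe the same real-world event.
    """
    site_id: str      # e.g. a predetermined site along a stream
    sample_date: date
    team: str

# Two independently entered rows that refer to the same field event
a = Event("RW-017", date(2014, 3, 1), "team-6")
b = Event("RW-017", date(2014, 3, 1), "team-6")
assert a == b   # equal by natural key; no centrally assigned ID needed
```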

In practice, the conversion of data from field notes into digital form often does not happen instantly. Data entry is sometimes substantially delayed because participants consider it an unpleasant or undesirable task; [sounds like me and my field notes when I used to work on mapping projects]

It might seem that replacing paper-based data entry with a mobile app would streamline the process and facilitate instant validation and provenance tracking, but there are some notable barriers to consider. Some contributors will always be more comfortable manually recording data in the field, or hard copy record retention may be required for quality assurance or legal purposes, making mobile entry an unwanted extra step. Not all contributors own smartphones or other technologies such as GPS devices. In some projects, the primary contributor group is older adults, the demographic with lowest smartphone adoption [24]. Under a variety of circumstances, bulk upload can be the most feasible way to entice volunteers to contribute larger volumes of data. [sounds like a bunch of field geologists to me...]

most of the review to identify and remove outliers still happens in Excel, and reviewers generally prefer to wait to import anything until after they have fully reviewed the data they receive. As a result, potentially valuable information about the review process is lost, as is the ability to easily reverse a decision to discard an outlier when reversal is warranted. [this is the story of our life in the NGDS data project]

we suggest that rather than building a custom platform that supports every conceivable data review operation, it may be more valuable to build a system that is robust against multiple imports and exports to and from external formats.
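One way to read that suggestion: treat imports as append-only, so a re-import can never clobber what's already in the system. A rough sketch of my interpretation (not code from the paper; field names invented):

```python
from collections import defaultdict

# Append-only store: every import adds records, nothing is updated in place,
# so repeated exports and re-imports can never destroy earlier information.
records_by_event = defaultdict(list)   # natural key -> list of records

def import_rows(rows, source):
    """Each row is a dict holding the event key fields plus measured values."""
    for row in rows:
        key = (row["site"], row["date"])   # assumed natural key fields
        values = {k: v for k, v in row.items() if k not in ("site", "date")}
        records_by_event[key].append({"source": source, "values": values})

# Importing a revised copy of the same spreadsheet adds a second record for
# the event; a later merge step decides which value wins, but both survive.
import_rows([{"site": "RW-017", "date": "2014-03-01", "ph": 7.2}], "excel-v1")
import_rows([{"site": "RW-017", "date": "2014-03-01", "ph": 7.4}], "excel-v2")
assert len(records_by_event[("RW-017", "2014-03-01")]) == 2
```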

In particular, the good old spreadsheet is how many scientific project contributors prefer to work with data [20]. Normalized data models and sophisticated apps are not necessarily seen as useful, even if there is a demonstrable overall benefit. Volunteers are rarely eager to learn new software, and data management tasks are a hurdle to participation for many individuals. Generally, our internal data model should account for the needs of contributors who are unconcerned about internal data models and just want to participate in science.

An ideal data model would track changes to data and task definitions, allowing accurate analysis of historical data. Importantly, the model must handle the data import task robustly and repeatedly, by matching incoming records to data already in the database.
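The matching step might look something like this pass keyed on the event's natural key (again my own sketch, with hypothetical field names):

```python
def match_incoming(db, incoming):
    """Split incoming rows into revisions of known events and brand-new events.

    `db` maps each event's natural key to its current flat values. This
    sketches only the matching step, not a full merge.
    """
    revisions, new_events = [], []
    for row in incoming:
        key = (row["site"], row["date"])
        if key in db:
            # Same real-world event seen again: note which values changed
            changed = {k: v for k, v in row.items() if db[key].get(k) != v}
            revisions.append((key, changed))
        else:
            new_events.append((key, row))
    return revisions, new_events

db = {("RW-017", "2014-03-01"):
      {"site": "RW-017", "date": "2014-03-01", "ph": 7.2}}
incoming = [
    {"site": "RW-017", "date": "2014-03-01", "ph": 7.4},   # revised value
    {"site": "RW-018", "date": "2014-03-01", "ph": 6.9},   # unseen event
]
revisions, new_events = match_incoming(db, incoming)
assert revisions == [(("RW-017", "2014-03-01"), {"ph": 7.4})]
assert new_events[0][0] == ("RW-018", "2014-03-01")
```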

we argue that the flexibility and provenance capabilities ERAV provides are valuable enough to merit the additional complexity for many, if not most, volunteer monitoring projects. In addition, we have released the source code for a generic implementation of ERAV (http://wq.io/vera) in an effort to mitigate this complexity and provide a bootstrapping platform for new projects interested in getting started with this approach.
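As I read it, the core of ERAV is that one entity (event) can carry multiple records, each with its own attribute-values and source, and the "current" view is a prioritized merge that never discards the underlying records. A toy version of that idea (not the vera API; names are illustrative):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Entity:          # the real-world event, identified by its natural key
    site: str
    date: str

@dataclass
class Record:          # one report about an entity, from one source/import
    entity: Entity
    source: str        # volunteer sheet, bulk upload, reviewer edit, ...
    values: dict = field(default_factory=dict)   # attribute -> value

def merged_view(records, priority):
    """Collapse several records for one entity into a single flat row.

    `priority` orders sources from least to most authoritative; a later
    (higher-priority) record overrides earlier values, but the individual
    records are all retained, so every value traces back to its source.
    """
    out = {}
    for rec in sorted(records, key=lambda r: priority.index(r.source)):
        out.update(rec.values)
    return out

e = Entity("RW-017", "2014-03-01")
recs = [
    Record(e, "volunteer", {"ph": 7.2, "temp_c": 11.0}),
    Record(e, "reviewer",  {"ph": 7.4}),    # reviewer corrected the pH
]
print(merged_view(recs, ["volunteer", "reviewer"]))
# -> {'ph': 7.4, 'temp_c': 11.0}: the correction wins, temp_c survives
```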

summary

ERAV is most likely to be useful in cases where:

  1. Structured data is being exchanged and revised between multiple parties or data management platforms,
  2. The selected (or de facto) exchange format does not include complete provenance information, and
  3. The entities being described (i.e. events) can be uniquely identified with a stable natural key that does not need to be centrally assigned. [like fuzzy geolocation and geologic time?]
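On that last point, for geologic field observations a stable key might be derived locally from a snapped location plus a geologic time term. A toy sketch (entirely my speculation; the paper doesn't prescribe any key scheme):

```python
def observation_key(lat, lon, stage, precision=2):
    """Derive a stable natural key from a fuzzy location plus geologic time.

    Rounding coordinates to `precision` decimal places (~1 km of latitude
    at 2 places) makes repeat recordings of the same outcrop collide onto
    one key, and a geologic stage name stands in for numeric time. All
    names here are hypothetical.
    """
    return (round(lat, precision), round(lon, precision), stage.lower())

# Two field records of the same outcrop, GPS fixes a few meters apart
assert observation_key(34.2213, -111.0042, "Permian") == \
       observation_key(34.2198, -111.0049, "Permian")
```

Rounding is a blunt instrument (two nearby fixes can straddle a grid-cell boundary and get different keys), but it shows how a key could be minted in the field without a central ID authority.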