@kbraak
Last active August 29, 2015 14:18

Specification: DwC Term updates

This document describes the process for bringing GBIF.org and related infrastructure up to date with the latest version of the DwC terms.

Summary of changes needed

The changes that need to occur are:

  1. New definitions of the occurrence, taxon, and event cores must be created
  2. The cores and extensions need to be made available on a versioned URL
  3. The IPT (2.3) needs to accommodate versioned extensions, and provide data migration utilities for terms that have changed
  4. The Darwin Core Archive Validator needs to validate extensions against their latest version
  5. The GBIF.org data store needs to include the new terms. Certain terms, such as organismQuantityType and sampleSizeUnit, have to be added as interpreted and typed properties in the Occurrence.java class. Non-interpreted terms will be available immediately in Occurrence.java without any change. In order to persist the new terms, however, the HBase table and Hive tables will need to be adjusted, and the data migrations coordinated.
  6. Interpretation of incoming data needs to populate the GBIF.org occurrence store correctly with the new terms, and gracefully handle data using old terms
  7. Data stored in GBIF.org using old terms needs to be migrated
  8. DwC-A produced by GBIF.org should use the new terms and reference the newly versioned core and extension files

Deprecated terms

The following summarises the terms that were included in either the occurrence or taxon core definitions, but have since been deprecated:

dc:source

Should be removed from the occurrence and taxon cores. Replaced by dc:references. The IPT should migrate old mappings to the new term, and offer only the new term when performing a mapping to new cores. Data in GBIF.org using dc:source (2M records, 15 datasets) should be migrated to use dc:references, and any incoming data using dc:source or dc:references should be used to populate dc:references. Occurrence.java needs no change (all DC terms are included).

dc:rights

Should be removed from the occurrence and taxon cores, since this term has been deprecated in favor of dc:license. The IPT should nullify values using this term, removing the mapping. GBIF.org needs to ignore dc:rights on incoming data. Data currently populated in dc:rights (21M records, 236 datasets) should be set to NULL. Occurrence.java needs no change (all DC terms are included). Note that the GBIF Licensing Policy will need updating to reflect our handling of dc:rights.

dwc:individualID

Should be removed from the occurrence core. Replaced by dwc:organismID. The IPT should migrate old mappings to the new term, and offer only the new term when performing a mapping to the new core. Data in GBIF.org using dwc:individualID (8M records, 89 datasets) should be migrated to use dwc:organismID, and any incoming data using dwc:individualID or dwc:organismID should be used to populate dwc:organismID. In Occurrence.java, individualID can be removed and organismID added.

dwc:occurrenceDetails

Should be removed from the occurrence core. Replaced by dc:references. The IPT should migrate old mappings to the new term, and offer only the new term when performing a mapping to the new core. Data in GBIF.org using dwc:occurrenceDetails (1.3M records, 31 datasets) should be migrated to use dc:references, and any incoming data using dwc:occurrenceDetails or dc:references should be used to populate dc:references. In Occurrence.java, occurrenceDetails can be removed.
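The mapping migration described for these four deprecated terms can be sketched as a lookup table plus a rewrite of each column-to-term mapping. This is only an illustration in Python: the function name, the dictionary representation of a mapping, and the rule for resolving conflicts (a column already mapped to the replacement term wins, otherwise the first migrated column wins) are assumptions, not IPT behaviour.

```python
# Hypothetical sketch: names and data shapes are illustrative, not IPT or GBIF API.

# Deprecated term -> replacement term, as listed above. None means the term
# has no counterpart and its mapping is simply removed (the dc:rights case).
DEPRECATED_TERMS = {
    "dc:source": "dc:references",
    "dc:rights": None,
    "dwc:individualID": "dwc:organismID",
    "dwc:occurrenceDetails": "dc:references",
}


def migrate_mapping(mapping):
    """Return a new {source column: term} mapping that uses only current terms."""
    migrated = {}
    used_terms = set()

    # Pass 1: keep every column that is already mapped to a current term.
    for column, term in mapping.items():
        if term not in DEPRECATED_TERMS:
            migrated[column] = term
            used_terms.add(term)

    # Pass 2: move deprecated terms to their replacement. Assumption: if the
    # replacement term is already mapped, the deprecated mapping is dropped.
    for column, term in mapping.items():
        if term in DEPRECATED_TERMS:
            replacement = DEPRECATED_TERMS[term]
            if replacement is not None and replacement not in used_terms:
                migrated[column] = replacement
                used_terms.add(replacement)

    return migrated
```

Two passes are used so that an explicit mapping to dc:references is never overwritten by a migrated dc:source or dwc:occurrenceDetails column; a real migration might instead ask the publisher to resolve such conflicts.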

New terms

The following summarises the terms that have been newly ratified, and need to be added to either the occurrence, taxon, or event core definitions:

dc:license

The IPT should offer mapping of the new term. The IPT should allow publishers to choose whether they want the license value to be auto-populated from the URI of the license applied to the dataset as a whole. GBIF.org needs to ignore dc:license on incoming data. When the registry supports a license field, GBIF.org will auto-populate this field on each record with the URI of the license applied to the dataset. Occurrence.java does not need changing, nor do the HBase table and Hive tables, since all DC terms exist already.
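The handling of dc:license on incoming data can be illustrated with a short Python sketch. It assumes a record is a plain dictionary keyed by prefixed term name and that the dataset-level license URI is passed in by the caller; both the function name and that representation are hypothetical.

```python
# Hypothetical sketch: not the GBIF.org interpretation code.

def apply_dataset_license(record, dataset_license_uri):
    """Return a copy of the record whose dc:license comes from the dataset only."""
    interpreted = dict(record)
    # Any dc:license value supplied on the incoming record is ignored.
    interpreted.pop("dc:license", None)
    # Auto-populate from the dataset-level license when one is known.
    if dataset_license_uri:
        interpreted["dc:license"] = dataset_license_uri
    return interpreted


# Example only: the CC0 URI stands in for whatever license the dataset carries.
incoming = {"dwc:occurrenceID": "o1", "dc:license": "free text from publisher"}
result = apply_dataset_license(
    incoming, "http://creativecommons.org/publicdomain/zero/1.0/"
)
```

The function returns a copy rather than modifying the incoming record, so the verbatim values stay available to the caller.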

dwc:organismQuantity

The IPT should offer mapping of the new term. This term is essential for event-based sample data, and needs to be included in the new Occurrence core (not the Event core). GBIF.org needs to index dwc:organismQuantity from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:organismQuantityType

The IPT should offer mapping of the new term. This term is essential for event-based sample data, and needs to be included in the new Occurrence core (not the Event core). GBIF.org needs to index dwc:organismQuantityType from incoming occurrence records. Occurrence.java needs changing (since this will be an interpreted and typed property), and the HBase table and Hive tables will need to be adjusted.

dwc:organismID

The IPT should offer mapping of the new term. This term is NOT essential for event-based sample data, so it only needs to be included in the new Occurrence core. GBIF.org needs to index dwc:organismID from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:organismName

The IPT should offer mapping of the new term. This term is NOT essential for event-based sample data, so it only needs to be included in the new Occurrence core. GBIF.org needs to index dwc:organismName from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:organismScope

The IPT should offer mapping of the new term. This term is NOT essential for event-based sample data, so it only needs to be included in the new Occurrence core. GBIF.org needs to index dwc:organismScope from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:associatedOrganisms

The IPT should offer mapping of the new term. This term is NOT essential for event-based sample data, so it only needs to be included in the new Occurrence core. GBIF.org needs to index dwc:associatedOrganisms from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:organismRemarks

The IPT should offer mapping of the new term. This term is NOT essential for event-based sample data, so it only needs to be included in the new Occurrence core. GBIF.org needs to index dwc:organismRemarks from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:parentEventID

The IPT should offer mapping of the new term. This term is essential for event-based sample data, and needs to be included in both the new Occurrence and Event cores. GBIF.org needs to index dwc:parentEventID from incoming occurrence records. GBIF.org needs to allow searching occurrence records by both dwc:eventID and dwc:parentEventID, so that all occurrence records related to a single Event can be filtered. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.
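As a rough illustration of the search requirement, the Python sketch below selects every record that refers to an event either directly (dwc:eventID) or through its parent (dwc:parentEventID). The function name and the in-memory list of dictionaries are assumptions made for the example; the real search would run against the GBIF.org occurrence index.

```python
# Hypothetical sketch: an in-memory stand-in for an occurrence search filter.

def occurrences_for_event(records, event_id):
    """Return records whose dwc:eventID or dwc:parentEventID equals event_id."""
    if not event_id:
        # An empty filter value would otherwise match records lacking both terms.
        return []
    return [
        record
        for record in records
        if event_id in (record.get("dwc:eventID"), record.get("dwc:parentEventID"))
    ]


# Made-up sample data: one transect with a nested plot, plus an unrelated plot.
sample = [
    {"dwc:occurrenceID": "o1", "dwc:eventID": "transect-1"},
    {"dwc:occurrenceID": "o2", "dwc:eventID": "plot-1a", "dwc:parentEventID": "transect-1"},
    {"dwc:occurrenceID": "o3", "dwc:eventID": "plot-2a", "dwc:parentEventID": "transect-2"},
]
matches = occurrences_for_event(sample, "transect-1")
```

This only follows one level of nesting. Collecting occurrences from deeper event hierarchies would need the parent chain to be resolved first.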

dwc:sampleSizeValue

The IPT should offer mapping of the new term. This term is essential for event-based sample data, and needs to be included in both the new Occurrence and Event cores. GBIF.org needs to index dwc:sampleSizeValue from incoming occurrence records. Occurrence.java does not need changing, but the HBase table and Hive tables will need to be adjusted.

dwc:sampleSizeUnit

The IPT should offer mapping of the new term. This term is essential for event-based sample data, and needs to be included in both the new Occurrence and Event cores. GBIF.org needs to index dwc:sampleSizeUnit from incoming occurrence records. Occurrence.java needs changing (since this will be an interpreted and typed property), and the HBase table and Hive tables will need to be adjusted.

New core definitions

New event, occurrence, and taxon core definitions have been released in the sandbox. These definitions do not include any deprecated terms (e.g. dc:source).

These definitions are made available on versioned URLs that include the date they were released (e.g. http://rs.gbif.org/sandbox/core/dwc_event_2015-04-14.xml). A versioning policy has subsequently been published with instructions on how to create new versions of extensions and vocabularies.
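The date-in-URL convention can be illustrated with a small Python sketch that builds and parses such URLs. The file-name pattern (`<name>_<YYYY-MM-DD>.xml`) is inferred from the single example above, and the function names are hypothetical.

```python
import re
from datetime import date

# Assumed pattern, inferred from the example URL: <name>_<YYYY-MM-DD>.xml
VERSIONED_FILE = re.compile(r"(?P<name>[^/]+)_(?P<issued>\d{4}-\d{2}-\d{2})\.xml$")


def versioned_url(base_url, name, issued):
    """Build a versioned definition URL from a base URL, a name and a release date."""
    return f"{base_url.rstrip('/')}/{name}_{issued.isoformat()}.xml"


def parse_versioned_url(url):
    """Return (name, release date) for a versioned URL, or None if it has no date."""
    match = VERSIONED_FILE.search(url)
    if match is None:
        return None
    # date.fromisoformat raises ValueError for impossible dates such as 2015-13-40.
    return match.group("name"), date.fromisoformat(match.group("issued"))


parsed = parse_versioned_url("http://rs.gbif.org/sandbox/core/dwc_event_2015-04-14.xml")
```

Keeping the release date as a `date` value rather than a string makes it straightforward to compare two versions of the same definition.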

rs.gbif.org changes

  1. The extension and vocabulary definitions should be extended to add a version attribute that takes the date. See Jira 2736.
  2. Extend the extensions.json and thesauri.json files under http://rs.gbif.org/ indicating the latest versions per rowType, including information about URL, title, issued date, etc. See Jira 2737.
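A client of such an index might select the latest version per rowType roughly as follows. The structure of extensions.json is not specified here, so the entry keys used below (`rowType`, `url`, `issued`) are assumptions, and the older entry in the sample data is made up purely for the example.

```python
# Hypothetical sketch: the index entry keys are assumed, not the published format.

def latest_per_row_type(entries):
    """Return {rowType: entry} keeping only the most recently issued entry."""
    latest = {}
    for entry in entries:
        row_type = entry["rowType"]
        current = latest.get(row_type)
        # ISO 8601 dates (YYYY-MM-DD) sort correctly as plain strings.
        if current is None or entry["issued"] > current["issued"]:
            latest[row_type] = entry
    return latest


index = [
    {   # made-up older version, for illustration only
        "rowType": "http://rs.tdwg.org/dwc/terms/Event",
        "url": "http://rs.gbif.org/sandbox/core/dwc_event_2015-01-01.xml",
        "issued": "2015-01-01",
    },
    {
        "rowType": "http://rs.tdwg.org/dwc/terms/Event",
        "url": "http://rs.gbif.org/sandbox/core/dwc_event_2015-04-14.xml",
        "issued": "2015-04-14",
    },
]
latest = latest_per_row_type(index)
```

If the index only ever lists one entry per rowType, the function simply returns that entry.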

IPT 2.3 specifications/changes

The IPT 2.3 needs to accommodate versioned extensions, and provide data migration utilities for terms that have changed. To do so, the following specifications are required:

  1. A versioning policy for extensions and vocabularies must be defined. See Issue 1156.
  2. Only one extension/version per rowType can be installed at a time (already the case).
  3. The IPT will look up the latest versions of cores, extensions, and vocabularies by downloading the index files from rs.gbif.org. See Issue 1157.
  4. IPT admins are responsible for migrating to new (extension) versions. The IPT admin can migrate to a new version at the click of a button, which will automate: a) downloading the new version, b) migrating all extension mappings to use the new version, c) uninstalling the old version, and d) installing the new version. Extension migration transitions each deprecated term listed above to its new counterpart. If no new counterpart exists, the term mapping is simply removed. See Issue 1158.
  5. Instead of offering multiple extension/versions per rowType, the IPT will make it easier to do mappings by hiding redundant classes. A redundant class is defined as a class present in the extension that is already present in the core. For example, the classes dwc:Event, dwc:Location and dwc:GeologicalContext are redundant in the Occurrence extension when it is used as an extension to the Event core, which already contains these classes of terms. See Issue 1159.
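The redundant-class rule in the last item amounts to a set difference between the classes of the extension and those of the core. A minimal Python sketch, assuming classes are identified by their prefixed names and using illustrative class lists rather than the real definitions:

```python
# Hypothetical sketch: not the IPT implementation.

def visible_classes(extension_classes, core_classes):
    """Return the extension's classes, in order, minus those already in the core."""
    core = set(core_classes)
    return [cls for cls in extension_classes if cls not in core]


# Illustrative class lists: Occurrence used as an extension to the Event core.
occurrence_extension = [
    "dwc:Occurrence",
    "dwc:Event",
    "dwc:Location",
    "dwc:GeologicalContext",
    "dwc:Identification",
]
event_core = ["dwc:Event", "dwc:Location", "dwc:GeologicalContext"]
shown = visible_classes(occurrence_extension, event_core)
```

The original order of the extension's classes is preserved, so the mapping page would list the remaining classes as before.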