@cnrdh
Last active October 30, 2020 13:36
Data management plan

Standardising Akvaplan-niva's zooplankton data as Darwin Core

Data management plan

Author: Conrad Helgeland License: Public domain Date: 2020-10-30

1. Introduction

Akvaplan-niva has an open data policy and plans to publish its primary biodiversity data on the web, using the Darwin Core vocabulary.

In response to GBIF Norway's data mobilization 2020, Akvaplan-niva intends to contribute zooplankton occurrences from several projects, all linked to sampling events.

2. Deliverables

For traceability and reproducibility, we will publish the following on https://github.com/akvaplan-niva for each contributed dataset:

  • Input data in untouched form
  • Output data in UTF-8 encoded Darwin Core text files
  • Source code used for data processing, quality control, and publishing
  • Rejected data, log files, and quality control reports

3. Data

Source data (input)

The input data is diverse:

Most newer input data is stored in EcoTaxa, a web tool for taxonomic classification of images. Data is exported as tab-separated text files where each line corresponds to a Darwin Core occurrence record. A large portion (~50–75 %) of the objects detected in EcoTaxa are not classified as organisms, but as artefacts, detritus, or feces.
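A minimal sketch of how organisms could be separated from non-organisms in an EcoTaxa export. It assumes the export has a single header row and that organisms are the rows whose object_annotation_hierarchy starts with "living", as in the example record in this plan. The function name split_living is hypothetical, and the real classification labels for artefacts, detritus, and feces should be checked against the actual exports.

```python
import csv
import io


def split_living(tsv_text):
    """Split EcoTaxa rows into (organisms, rejected) by hierarchy root.

    Assumption: organisms have a hierarchy whose first element is "living".
    Everything else, including unclassified rows, goes to the rejected list
    so that it can be published alongside the output data.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    organisms, rejected = [], []
    for row in reader:
        hierarchy = row.get("object_annotation_hierarchy") or ""
        root = hierarchy.split(">")[0].strip()
        if root == "living":
            organisms.append(row)
        else:
            rejected.append(row)
    return organisms, rejected
```

Keeping the rejected rows rather than dropping them fits the deliverable of publishing rejected data together with the output.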

Older data derives from the ZooProcess software, also used for zooplankton image analysis. Data is exported in a special text format (.pid file) that contains metadata headers followed by regular CSV after a [Data] line.
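The layout described above (metadata headers, then CSV after a [Data] line) could be read roughly as follows. This is a sketch under assumptions: metadata lines are taken to have the form key=value, lines without "=" are ignored, and the CSV delimiter is a parameter because it has not been verified here (pass ";" if the files use semicolons). The function name parse_pid is hypothetical.

```python
import csv
import io


def parse_pid(text, delimiter=","):
    """Return (metadata, rows) from the text content of a .pid file."""
    lines = text.splitlines()

    # Find the [Data] marker that separates the headers from the CSV part.
    marker = None
    for index, line in enumerate(lines):
        if line.strip().lower() == "[data]":
            marker = index
            break
    if marker is None:
        raise ValueError("no [Data] line found")

    # Assumption: metadata headers are key=value pairs.
    metadata = {}
    for line in lines[:marker]:
        if "=" in line:
            key, _, value = line.partition("=")
            metadata[key.strip()] = value.strip()

    # Everything after the marker is parsed as CSV with a header row.
    data_text = "\n".join(lines[marker + 1:])
    rows = list(csv.DictReader(io.StringIO(data_text), delimiter=delimiter))
    return metadata, rows
```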

A third type of input data is Excel documents in Akvaplan-niva's internal storage.

Darwin Core (output)

Data modeling is based on GBIF's requirements for occurrence data and sampling events and informed by GBIF's best practices for sampling event data.

All records will have universally unique identifiers (UUIDs) as occurrenceID and eventID.
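One possible way to produce such identifiers is sketched below. The plan does not say which UUID version will be used. This sketch uses name-based UUIDs (version 5) so that re-running the processing on the same input yields the same identifiers; random UUIDs (uuid.uuid4) would be an equally valid choice. The namespace and the function name stable_uuid are hypothetical.

```python
import uuid

# Hypothetical namespace: any fixed UUID works, as long as it never changes.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "https://github.com/akvaplan-niva")


def stable_uuid(kind, source_id):
    """Derive a repeatable UUID from a record kind and its source identifier.

    kind is e.g. "occurrence" or "event", so an object and a sample that
    happen to share an identifier still get different UUIDs.
    """
    return str(uuid.uuid5(NAMESPACE, f"{kind}:{source_id}"))


occurrence_id = stable_uuid("occurrence", "n1_12m_dive_autumn_2017_large_tot_1_2")
event_id = stable_uuid("event", "n1_12m_dive_autumn_2017_large")
```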

Organism occurrence quantification will vary depending on the input data, but we will strive to deliver primary non-aggregated data when available.

For net-based plankton sampling, a common challenge is to estimate the volume of the sampled water. As a consequence, sampleSizeValue is left empty when the volume is unknown or cannot be calculated.
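The rule for sampleSizeValue could be sketched as below. The example record in this plan has sample_tot_vol set to "99999"; treating that value as a placeholder for a missing volume is an assumption that must be confirmed per dataset. The function name sample_size_value and the missing_codes parameter are hypothetical.

```python
import math


def sample_size_value(raw_volume, missing_codes=(99999.0,)):
    """Return the volume as text, or "" when it is unknown or unusable."""
    text = (raw_volume or "").strip()
    if not text:
        return ""
    try:
        volume = float(text)
    except ValueError:
        return ""  # not a number
    if not math.isfinite(volume) or volume <= 0:
        return ""  # NaN, infinite, zero, or negative
    if volume in missing_codes:
        return ""  # assumed placeholder for a missing volume
    return text
```

Returning the original text rather than the parsed float avoids changing the precision of the source value.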

For taxonomy, we will validate all scientificNames against GBIF's REST API v1, with the World Register of Marine Species as the preferred dataset (example: Calanus).
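A name lookup of this kind might look like the sketch below, which queries the species endpoint of the GBIF API filtered by a dataset key. The dataset key shown for the World Register of Marine Species is an assumption and should be verified in the GBIF dataset registry before use. The function names are hypothetical, and the plan does not state which endpoint the actual pipeline uses.

```python
import json
import urllib.parse
import urllib.request

GBIF_API = "https://api.gbif.org/v1"
# Assumed GBIF dataset key for WoRMS; verify before relying on it.
WORMS_DATASET_KEY = "2d59e5db-57ad-41ff-97d6-11f5fb264527"


def name_lookup_url(scientific_name, dataset_key=WORMS_DATASET_KEY):
    """Build the URL that lists name usages for a name within one dataset."""
    query = urllib.parse.urlencode(
        {"name": scientific_name, "datasetKey": dataset_key}
    )
    return f"{GBIF_API}/species?{query}"


def lookup_name(scientific_name, dataset_key=WORMS_DATASET_KEY):
    """Fetch matching name usages; requires network access."""
    url = name_lookup_url(scientific_name, dataset_key)
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response).get("results", [])
```

A name for which lookup_name returns an empty list would be a candidate for the rejected data and the quality control report.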

Example of a single EcoTaxa input record (one detected object), shown as JSON:

{ "object_id": "n1_12m_dive_autumn_2017_large_tot_1_2",
"object_lat": "78.3754166666667",
"object_lon": "-14.78215",
"object_date": "11111111",
"object_time": "111100",
"object_link": "",
"object_depth_min": "12.0",
"object_depth_max": "12.0",
"object_annotation_status": "validated",
"object_annotation_person_name": "Colm O'Leary",
"object_annotation_person_email": "olearycolm1@gmail.com",
"object_annotation_date": "20180514",
"object_annotation_time": "094812",
"object_annotation_category": "Copepoda",
"object_annotation_hierarchy":
"living>Eukaryota>Opisthokonta>Holozoa>Metazoa>Arthropoda>Crustacea>Maxillopoda>Copepoda",
"object_lat_end": "",
"object_lon_end": "",
"object_area": "948.0",
"object_mean": "194.75",
"object_stddev": "41.118",
"object_mode": "242.0",
"object_min": "114.0",
"object_max": "250.0",
"object_x": "21.54",
"object_y": "23.61",
"object_xm": "21.59",
"object_ym": "24.11",
"object_perim.": "168.17",
"object_bx": "4356.0",
"object_by": "675.0",
"object_width": "56.0",
"object_height": "42.0",
"object_major": "45.3",
"object_minor": "26.6",
"object_angle": "144.1",
"object_circ.": "0.421",
"object_feret": "58.5",
"object_intden": "184621.0",
"object_median": "206.0",
"object_skew": "-0.511",
"object_kurt": "-1.12",
"object_%area": "0.74",
"object_xstart": "4369.0",
"object_ystart": "675.0",
"object_area_exc": "941.0",
"object_fractal": "1.099",
"object_skelarea": "101.0",
"object_slope": "0.074",
"object_histcum1": "162.0",
"object_histcum2": "204.0",
"object_histcum3": "231.0",
"object_xmg5": "0.0",
"object_ymg5": "0.0",
"object_nb1": "1.0",
"object_nb2": "2.0",
"object_nb3": "0.0",
"object_compentropy": "0.0",
"object_compmean": "0.0",
"object_compslope": "0.0",
"object_compm1": "0.0",
"object_compm2": "0.0",
"object_compm3": "0.0",
"object_symetrieh": "3.244",
"object_symetriev": "3.392",
"object_symetriehc": "4.0",
"object_symetrievc": "4.0",
"object_convperim": "192.0",
"object_convarea": "1317.0",
"object_fcons": "2.755",
"object_thickr": "2.214",
"object_tag": "1.0",
"object_esd": "602.0",
"object_elongation": "1.66666666666667",
"object_range": "136.0",
"object_meanpos": "-0.684210526315789",
"object_centroids": "4.0",
"object_cv": "21.0526315789474",
"object_sr": "30.1470588235294",
"object_perimareaexc": "0.178533475026567",
"object_feretareaexc": "0.0626992561105207",
"object_perimferet": "2.84745762711864",
"object_perimmajor": "3.73333333333333",
"object_circex": "69.5585573418352",
"object_cdexc": "0.00425079702444208",
"sample_id": "n1_12m_dive_autumn_2017_large",
"sample_dataportal_descriptor": "",
"sample_scan_operator": "col",
"sample_ship": "rubber_boat",
"sample_program": "",
"sample_stationid": "n1",
"sample_bottomdepth": "12",
"sample_ctdrosettefilename": "",
"sample_other_ref": "",
"sample_tow_nb": "99999",
"sample_tow_type": "0",
"sample_net_type": "dive",
"sample_net_mesh": "100",
"sample_net_surf": "99999",
"sample_zmax": "12",
"sample_zmin": "12",
"sample_tot_vol": "99999",
"sample_comment": "no",
"sample_tot_vol_qc": "",
"sample_depth_qc": "",
"sample_sample_qc": "",
"sample_barcode": "",
"sample_duration": "",
"sample_ship_speed": "",
"sample_cable_length": "",
"sample_cable_angle": "",
"sample_cable_speed": "",
"sample_nb_jar": "",
"sample_open": "",
"process_id": "zooprocess_n1_12m_dive_autumn_2017_large",
"process_date": "20180306",
"process_time": "093100",
"process_img_software_version": "7.21_picheral_cnrs",
"process_img_resolution": "2400",
"process_img_od_grey": "",
"process_img_od_std": "0",
"process_img_background_img": "20180306_0853_background_large_manual.tif",
"process_particle_version": "6.15_2009/10/25_picheral_cnrs",
"process_particle_threshold": "243",
"process_particle_pixel_size_mm": "0.0106",
"process_particle_min_size_mm": "0.3",
"process_particle_max_size_mm": "100",
"process_particle_sep_mask": "unused",
"process_particle_bw_ratio": "",
"process_software": "zooprocess_pid_to_ecotaxa_7.24_2017/04/03",
"process_particle_pixel_size_µm": "",
"acq_id": "tot_n1_12m_dive_autumn_2017_large",
"acq_instrument": "zooscan",
"acq_min_mesh": "1000",
"acq_max_mesh": "999999",
"acq_sub_part": "1",
"acq_sub_method": "motoda",
"acq_hardware": "hydroptic_v3_window7",
"acq_software": "vuescan9.0.51",
"acq_author": "col",
"acq_imgtype": "zooscan",
"acq_scan_date": "20180306",
"acq_scan_time": "090700",
"acq_quality": "",
"acq_bitpixel": "3",
"acq_greyfrom": "2",
"acq_scan_resolution": "3",
"acq_rotation": "3",
"acq_miror": "1",
"acq_xsize": "29208",
"acq_ysize": "51072",
"acq_xoffset": "6172",
"acq_yoffset": "2686",
"acq_lut_color_balance": "manual",
"acq_lut_filter": "no",
"acq_lut_min": "654",
"acq_lut_max": "54645",
"acq_lut_odrange": "1.8",
"acq_lut_ratio": "1.15",
"acq_lut_16b_median": "47518.05"
}
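To illustrate how a record like the one above could map to Darwin Core, here is a partial sketch. The choice of target terms is an assumption for illustration and not the final data model; identifiers, event dates, and quantities are left out. The function names to_occurrence and iso_date are hypothetical.

```python
from datetime import datetime


def iso_date(yyyymmdd):
    """Convert "YYYYMMDD" to "YYYY-MM-DD", or return "" if it does not parse."""
    try:
        return datetime.strptime(yyyymmdd, "%Y%m%d").date().isoformat()
    except (TypeError, ValueError):
        return ""


def to_occurrence(record):
    """Map a few EcoTaxa fields to Darwin Core terms (assumed mapping)."""
    return {
        "decimalLatitude": record.get("object_lat", ""),
        "decimalLongitude": record.get("object_lon", ""),
        "minimumDepthInMeters": record.get("object_depth_min", ""),
        "maximumDepthInMeters": record.get("object_depth_max", ""),
        "scientificName": record.get("object_annotation_category", ""),
        "identifiedBy": record.get("object_annotation_person_name", ""),
        "dateIdentified": iso_date(record.get("object_annotation_date")),
        "identificationVerificationStatus": record.get(
            "object_annotation_status", ""
        ),
    }
```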