@jduckles
Created August 21, 2018 07:57
Abstract
This presentation will outline the untapped potential of Information and
Library Science (ILS) programs as an integral space for the long-term
training and support of biodiversity informatics work. It will also
outline the specific steps proposed at Indiana University,
Bloomington (IU), to provide long-term, systematized training of
students focused on information work within this broad domain.
As a discipline, ILS has long been concerned with the organization, description, and curation of, and access to, a wide variety of information and data sources. ILS curricula necessarily emphasize a broad range of information topics, given that many different kinds of institutions require these particular skillsets. Typical ILS curricula focus on topics such as knowledge organization, metadata, ontologies, database
design, scholarly communication, intellectual property, information
ethics, interface design, data analytics, online publishing, museum
studies, data curation, and collection management/administration. Given
this broad range of training, students graduating from ILS programs are
perfectly situated to support biodiversity informatics broadly
conceived, especially as it relates to the standardization and
normalization of data sources across geographically and temporally
distributed locations and sources within specific institutional
environments.
Yet, despite the overlaps between ILS departments, biodiversity
informatics, and museum environments, no ILS program has officially
taken steps to support this intersectional space. Using concrete
examples, this talk will show how the ILS program at IU is building on
top of already-existing capacities to more robustly support biodiversity
work. The proposed way forward is a tightly integrated approach to
biodiversity informatics that integrates theoretical experience and
technical training with hands-on internships in museum and biodiversity
environments. Through close partnerships with on-campus institutes, such
as the Indiana Geological & Water Survey and the Center for Biological
Research Collections, as well as larger, external institutions such as
the Smithsonian National Museum of Natural History, students will be
provided intense fieldwork experience in data management and
standards-driven work specific to the museum and biodiversity world. A
tiered approach to this training will be suggested, as this kind of training should proceed both at the professional level (for example, master's-level work) and at more advanced levels focused on research-driven activity (such as postdoctoral work).
Part of this new approach to biodiversity informatics training requires
the rearticulation of ILS courses, as well as the addition of new
courses that can provide domain-specific knowledge. This presentation,
then, will outline a proposed curriculum to support this kind of
collaborative training and work. A distributed training structure will
be suggested, utilizing expertise from across the globe. In addition, it
will show how a more project- and field work-centric approach to ILS
education can train students more quickly and deeply to enter this rapidly changing field.
Part of the difficulty with training biodiversity informatics
specialists is that building such programs from the ground up is often
costly and requires the creation of new workflows and practices. An
integrated approach, such as that proposed in this presentation,
however, will leverage the respective strengths of ILS programs and
museum environments in ways that are sustainable and resilient for the
long term. The goal here is for institutions to support each other in
ways that strengthen their core missions, as well as push the discipline
forward in systematic and unique ways.
Abstract
As rapid advances in sequencing technology illuminate more branches of the tree of life, the percentage of sequence records that are backed by voucher specimens has actually decreased (Trizna 2018b). The good news is that there are tools (Trizna 2017, NCBI 2005, Biocode LLC 2014) that enable well-databased museum vouchers to automatically validate and format specimen and collection metadata for high-quality sequence records. Another problem is that
there are millions of existing sequence records that are known to
contain either incorrect or incomplete specimen data. I will show an
end-to-end example of sequencing specimens from a museum, depositing
their sequence records in NCBI's (National Center for Biotechnology Information) GenBank database, and then providing updates to GenBank as the museum database revises identifications. I will also talk about linking records from specimen databases. Over one million records in the Global Biodiversity Information Facility (GBIF; Trizna 2018a) contain a value in the Darwin Core term "associatedSequences", and I will examine what is currently contained in these entries, and how best to format them to ensure a tight connection to sequence records.
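As an illustration of the kind of tight linkage discussed above, here is a minimal sketch (in Python, with invented identifiers and an invented accession number) of a Darwin Core occurrence record whose associatedSequences value is a resolvable sequence URI, plus a helper that splits a pipe-separated list of such links:

    # Hypothetical specimen record; the catalogue number and the accession
    # MK000001 are placeholders, not real data.
    occurrence = {
        "occurrenceID": "urn:catalog:EXAMPLE:Fishes:123456",
        "institutionCode": "EXAMPLE",
        "catalogNumber": "123456",
        "scientificName": "Gadus morhua",
        # A resolvable URI (or a pipe-separated list of URIs) pointing at the
        # sequence record keeps the specimen-sequence link machine-actionable.
        "associatedSequences": "https://www.ncbi.nlm.nih.gov/nuccore/MK000001",
    }

    def sequence_links(record):
        """Split a Darwin Core associatedSequences value into individual URIs."""
        value = record.get("associatedSequences", "")
        return [v.strip() for v in value.split("|") if v.strip()]

    print(sequence_links(occurrence))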
Abstract
SOCCOMAS is a ready-to-use Semantic Ontology-Controlled Content
Management System (http://escience.biowikifarm.net/wiki/SOCCOMAS). Each
web content management system (WCMS) run by SOCCOMAS is controlled by a
set of ontologies and an accompanying Java-based middleware with the
data housed in a Jena tuple store. The ontologies describe the behavior
of the WCMS, including all of its input forms, input controls, data
schemes and workflow processes (Fig. 1).
Data is organized into different types of data entries, which represent
collections of data referring to a particular material entity, for
instance an individual specimen. SOCCOMAS implements a suite of general
processes, which can be used to manage and organize all data entry
types. One category of processes manages the life-cycle of a data entry, including all the processes required for changing between the following possible entry states:
current draft version;
backup draft version;
recycle bin draft version;
deleted draft version;
current published version;
previously published version.
The processes also allow a user to create a revised draft based on the
current published version. Another category of processes automatically
tracks the overall provenance (i.e. creator, authors, creation and
publication date, contributors, relations between different versions,
etc.) for each particular data entry. Additionally, on a significantly
finer level of granularity, SOCCOMAS also tracks in a detailed
change-history log all changes made to a particular data record at the
level of individual input fields. All information (data, provenance
metadata, change-history metadata) is stored based on Resource
Description Framework (RDF) compliant data schemes into different named
graphs (i.e. a URI under which triple statements are stored in the tuple
store). All recorded information can be accessed through a SPARQL
endpoint. All data entries are Linked Open Data and thus provide access
to an HTML representation of the data for visualization in a web-browser
or as a machine-readable RDF file. The ontology-controlled design of
SOCCOMAS allows administrators to easily customize already existing
templates for input forms of data entries, define new templates for new
types of data entries, and define underlying RDF-compliant data schemes
and apply them to each relevant input field. SOCCOMAS provides an engine
for running and developing semantic WCMSs, in which only ontology editing, but no middleware or front-end programming, is required to adapt the WCMS to one's own specific requirements.
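To make the access route concrete, the following minimal Python sketch queries such a SPARQL endpoint for per-entry provenance; the endpoint URL is a placeholder and the Dublin Core properties stand in for whatever provenance scheme a given SOCCOMAS instance actually defines:

    from SPARQLWrapper import SPARQLWrapper, JSON

    # Hypothetical endpoint; a real SOCCOMAS instance exposes its own IRIs
    # for named graphs and provenance properties.
    sparql = SPARQLWrapper("http://example.org/soccomas/sparql")
    sparql.setReturnFormat(JSON)
    sparql.setQuery("""
    PREFIX dcterms: <http://purl.org/dc/terms/>
    SELECT ?graph ?entry ?creator ?created
    WHERE {
      GRAPH ?graph {              # each data entry lives in its own named graph
        ?entry dcterms:creator ?creator ;
               dcterms:created ?created .
      }
    }
    LIMIT 10
    """)

    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["entry"]["value"], row["creator"]["value"], row["created"]["value"])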
Abstract
Taxonomic names are ambiguous as identifiers of biodiversity data, as
they refer to a particular concept of a taxon in an expert's mind
(Kennedy et al. 2005). This ambiguity is particularly problematic when
attempting to reconcile taxonomic names from disparate sources with
clades on a phylogeny. Currently, such reconciliation requires expert
interpretation, which is necessarily subjective, difficult to reproduce,
and refractory to scaling. In contrast, phylogenetic clade definitions
are a well-developed method for unambiguously defining the semantics of
a clade concept in terms of shared evolutionary ancestry (Queiroz and
Gauthier 1990, Queiroz and Gauthier 1994), and these semantics allow
locating clades on any phylogeny. Although a few software tools have
been created for resolving clade definitions, including for definitions
expressed in the Mathematical Markup Language (e.g. Names on Nodes in
Keesey 2007) and as lists of GenBank accession numbers (e.g. mor in
Hibbett et al. 2005), these are application-specific representations
that do not provide formal definitions with well-defined semantics for
every component of a clade definition. Being able to create such
machine-interpretable definitions would allow computers to store,
compare, distribute and resolve semantically-rich clade definitions.
To this end, the Phyloreferencing project (http://phyloref.org,
Cellinese and Lapp 2015) is working on a specification for encoding
phylogenetic clade definitions as ontologies using the Web Ontology
Language (OWL in W3C OWL Working Group 2012). Our specification allows
the semantics of these definitions, which we call phyloreferences, to be
described in terms of shared ancestor and excluded lineage properties.
The aim of this effort is to allow any OWL-DL reasoner to resolve
phyloreferences on a phylogeny that has itself been translated into a
compatible OWL representation. We have developed a workflow that allows
us to curate phyloreferences from phylogenetic clade definitions
published in natural language, and to resolve the curated phyloreference
against the phylogeny upon which the definition was originally created,
allowing us to validate that the phyloreference reflects the authors'
original intent. We have started work on curating dozens of
phyloreferences from publications and the clade definition database
RegNum (http://phyloregnum.org), which will provide an online catalog of
all clade definitions that are part of the Phylonym Volume, to be
published together with the PhyloCode (https://www.ohio.edu/phylocode/).
We will comprehensively curate these definitions into a reusable and
fully computable ontology of phyloreferences.
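For readers unfamiliar with what such an encoding looks like, the following Python/rdflib sketch builds the general shape of a machine-interpretable clade definition, an OWL class restricted by "includes" and "excludes" properties over taxonomic units; the namespace, property and class IRIs are illustrative placeholders and not the Phyloreferencing specification itself:

    from rdflib import Graph, Namespace, BNode, Literal, RDF, RDFS, OWL

    EX = Namespace("http://example.org/phyloref/")   # placeholder namespace
    g = Graph()
    g.bind("ex", EX)
    g.bind("owl", OWL)

    # A clade definition of the rough form "the clade containing A and B but
    # not C", expressed as an OWL class with someValuesFrom restrictions on
    # two illustrative properties (ex:includes_TU / ex:excludes_TU).
    phyloref = EX.Alphaclade
    g.add((phyloref, RDF.type, OWL.Class))
    g.add((phyloref, RDFS.label, Literal("Alphaclade (hypothetical)")))

    def restriction(graph, prop, value):
        """Create an anonymous owl:Restriction: prop someValuesFrom value."""
        node = BNode()
        graph.add((node, RDF.type, OWL.Restriction))
        graph.add((node, OWL.onProperty, prop))
        graph.add((node, OWL.someValuesFrom, value))
        return node

    for spec in (restriction(g, EX.includes_TU, EX.TaxonA),
                 restriction(g, EX.includes_TU, EX.TaxonB),
                 restriction(g, EX.excludes_TU, EX.TaxonC)):
        g.add((phyloref, RDFS.subClassOf, spec))

    print(g.serialize(format="turtle"))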
In our presentation, we will provide an overview of phyloreferencing and
will describe the model and workflow we use to encode clade definitions
in OWL, based on concepts and terms taken from the Comparative Data
Analysis Ontology (Prosdocimi et al. 2009), Darwin-SW (Baskauf and Webb
2016) and Darwin Core (Wieczorek et al. 2012). We will demonstrate how
phyloreferences can be visualized, resolved and tested on the phylogeny
that they were originally described on, and how they resolve on one of
the largest synthetic phylogenies available, the Open Tree of Life
(Hinchliff et al. 2015). We will conclude with a discussion of the
problems we faced in referring to taxonomic units in phylogenies, which
is one of the key challenges in enabling better integration of
phylogenetic information into biodiversity analyses.
Abstract
Parasitism can be defined as an interaction between species in which one
of the interaction partners, the parasite, lives in or on the other, the
host. The parasite draws food from its host and harms it in the process.
According to some estimates, over 40% of all eukaryotes are parasites.
Nevertheless, it is difficult to determine computationally whether a particular taxon is a parasite, which makes it difficult to query large sets of taxa.
Here we test to what extent it is possible to use the Open Tree of Life (OTL), a synthesis of phylogenetic trees on a backbone taxonomy (resulting in unresolved nodes), to expand available information via phylogenetic trait prediction. We use the Global Biotic Interactions (GloBI) database to categorise 25,992 and 34,879 species as parasites and free-living, respectively, and predict states for ~2.3 million (97.34%) leaf nodes without state information.
We estimate the accuracy of our maximum parsimony-based predictions using cross-validation and simulation at roughly 60-80% overall, but varying strongly between clades. The cross-validation resulted in an accuracy of 98.17%, which is explained by the fact that the data are not uniformly distributed. We describe this variation across taxa as associated with the available state and topology information. We compare our results with several smaller-scale studies that used manual expert curation and conclude that the computationally inferred state changes largely agree with those in number and placement. In clades in which the available state information is biased (mostly towards parasites, e.g. in nematodes), phylogenetic prediction is bound to produce results that contradict conventional wisdom.
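As a toy illustration of the general approach (not the study's actual pipeline), the sketch below runs a Fitch-style parsimony pass over a tiny tree in which unscored tips start with the full state set {parasite, free-living} and then inherit a prediction from the reconstructed ancestral states:

    # Toy Fitch-parsimony sketch: tips with unknown state start as the full
    # state set and receive a prediction from the reconstructed ancestors.
    TREE = {                      # node -> children; leaves have no entry
        "root": ["n1", "n2"],
        "n1": ["tipA", "tipB"],
        "n2": ["tipC", "tipD"],
    }
    STATES = {"tipA": {"parasite"}, "tipB": {"parasite"},
              "tipC": {"free-living"}}          # tipD is unscored
    ALL = {"parasite", "free-living"}

    def downpass(node, sets):
        """Fitch downpass: intersection of child sets if non-empty, else union."""
        if node not in TREE:                    # leaf
            sets[node] = STATES.get(node, set(ALL))
            return sets[node]
        child_sets = [downpass(c, sets) for c in TREE[node]]
        inter = set.intersection(*child_sets)
        sets[node] = inter if inter else set.union(*child_sets)
        return sets[node]

    def predict(node, sets, parent_state=None, out=None):
        """Resolve each node to one state, preferring the parent's state."""
        out = {} if out is None else out
        state = parent_state if parent_state in sets[node] else sorted(sets[node])[0]
        out[node] = state
        for c in TREE.get(node, []):
            predict(c, sets, state, out)
        return out

    sets = {}
    downpass("root", sets)
    resolved = predict("root", sets)
    print("predicted state for tipD:", resolved["tipD"])   # "free-living" here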
This represents, to our knowledge, the first comprehensive computational
reconstruction of the emergence of parasitism in eukaryotes. We argue
that such an approach is necessary to allow further incorporation of
parasitism as an important trait in species interaction databases and in
individual studies on eukaryotes, e.g. in the microbiome.
Abstract
The Open Tree of Life project is a collaborative effort to synthesize, share and update a comprehensive tree of life (Fig. 1). We have completed a draft synthesis of a tree summarizing digitally available taxonomic and phylogenetic knowledge for all 2.6 million named species, available at tree.opentreeoflife.org (Hinchliff et al. 2015). This tree provides ready access to phylogenetic information that can link together biodiversity data on the basis of what we know about the relevant evolutionary history. Both the unified reference taxonomy (Rees and Cranston 2017) and the published phylogenetic statements underlying the tree (McTavish et al. 2015) are available and accessible online. Taxa in the phylogenies are mapped to the reference taxonomy, which aligns
Open Tree taxon identifiers to those from NCBI and GBIF, among several
other taxonomy resources. The synthesis tree is revised as new data
become available, and captures conflict and consensus across different
published phylogenetic estimates. This undertaking requires both
development of novel infrastructure and analysis tools, as well as
community engagement with the Open Tree of Life project. I will discuss
the challenges in and the progress towards achieving these goals.
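For orientation, the synthesis tree and taxonomy are also accessible programmatically; the Python sketch below resolves names to Open Tree Taxonomy ids via the public v3 API and requests the induced subtree, although the exact response fields should be checked against the current API documentation:

    import requests

    API = "https://api.opentreeoflife.org/v3"

    # Resolve taxon names to Open Tree Taxonomy (OTT) ids via the TNRS service.
    names = ["Gadus morhua", "Danio rerio", "Homo sapiens"]
    tnrs = requests.post(f"{API}/tnrs/match_names", json={"names": names}).json()
    ott_ids = [r["matches"][0]["taxon"]["ott_id"]
               for r in tnrs["results"] if r["matches"]]

    # Ask the synthesis tree for the subtree induced by those taxa (Newick).
    subtree = requests.post(f"{API}/tree_of_life/induced_subtree",
                            json={"ott_ids": ott_ids}).json()
    print(subtree.get("newick", subtree))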
Abstract
Connecting biodiversity data across databases is not as easy as one
might think. Different databases use different identifiers and
taxonomies and connecting these data often results in loss of
information and precision. Here we present some of the challenges we
faced with integrating multiple biodiversity data sets, including
specimen data from the scientific collections, during a hackathon hosted
by the Phenoscape project in December of 2017. The hackathon brought
together a diverse group of participants, including biologists and
software developers, to explore ways of using the computable phenotype
data in the Phenoscape Knowledgebase (KB) (Edmunds et al. 2015). The KB
contains ontology-annotated data that links evolutionary phenotypes from
the comparative literature to model organism phenotypes enabling, e.g.,
the retrieval of candidate genes for evolutionary phenotypes and the
generation of synthetic supermatrices of presence/absence characters.
During this hackathon, our team explored how to link phenotype data in
the KB to museum specimen data in iDigBio (Matsunaga et al. 2013) with
the hope of creating visualizations including world maps showing species
distributions with different character states and their phylogenetic
relationships. We visualized lineage relationships by querying the Open
Tree of Life (OT) (Hinchliff et al. 2015) website using data integrated
by another group at the hackathon that linked KB and OT taxonomic
identifiers.
Phenoscape uses terms from anatomy, quality, and taxonomy ontologies to
annotate characters and taxonomic information from the phylogenetic
literature along with specimen information. When populating the KB,
specimen identifiers such as occurrence identifiers, collector's number,
and catalog numbers were preserved if present in the literature. We
found that these identifiers, although standard in the biodiversity
domain, were mostly insufficient to uniquely identify the source
specimen in iDigBio. As an alternative, we instead mapped all the
occurrences of taxa using string matches of the genus and species from
Vertebrate Taxonomy Ontology identifiers. Without specimen identifiers
that are consistent across databases, we lost the ability to explore
spatial and temporal variation of characters within genera and were only
able to explore phenotypes and geographic distributions among genera. We
look forward to discussing these issues with the collections community
represented at this meeting by the Society for the Preservation of
Natural History Collections (SPNHC).
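To illustrate the string-based fallback described above, the Python sketch below queries the public iDigBio search API by genus and specific epithet; the taxon is an arbitrary example and the response field names should be verified against the iDigBio documentation:

    import json
    import requests

    # Query iDigBio's v2 search API for occurrence records of one example taxon.
    rq = {"genus": "ictalurus", "specificepithet": "punctatus"}
    resp = requests.get("https://search.idigbio.org/v2/search/records/",
                        params={"rq": json.dumps(rq), "limit": 10})
    resp.raise_for_status()

    for item in resp.json()["items"]:
        terms = item["indexTerms"]
        print(terms.get("catalognumber"), terms.get("geopoint"))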
We developed an R Shiny application that integrates characters and taxa
from Phenoscape with specimen records from iDigBio and phylogenies from
OT, to visualize phenotypic characters and taxon distributions in three
interactive panels. The app allows a user to visualize OT phylogenies
and place presence/absence character data on the tree. Specifically,
users can: select taxa or specific characters to visualize their
geographic distributions, navigate a phylogeny browser which displays
character and specimen data available for taxa under consideration, and
view a heatmap of characters available for character and taxon
combinations. Because of our challenges joining data, our distribution
map leaves users with the impression that all individuals in a genus
exhibit a character whereas the KB was populated with data describing
individuals. We hope that with improved data standards and their use by
more people, constructing applications like ours will become easier.
Abstract
There is a large amount of publicly available biodiversity data from many different data sources. When doing research, one ideally interacts with biodiversity data programmatically so that the work is reproducible. The entry point to biodiversity data records is largely through taxonomic names, or common names in some cases (e.g., birds). However, many researchers have a phylogeny-focused project, meaning taxonomic names are not the ideal interface to biodiversity data. Ideally, it would be simple to go programmatically from a phylogeny to biodiversity records through a phylogeny-based query.
I'll discuss a new project, 'phylodiv' (https://github.com/ropensci/phylodiv/), that attempts to facilitate phylogeny-based biodiversity data collection (see Fig. 1). The project takes the form of an R software package. The idea is to make the user interface take essentially two inputs: a phylogeny and a phylogeny-based question. Behind the scenes we'll do many things, including gathering taxonomic names and hierarchies for the taxa in the phylogeny, sending queries to GBIF (or other data sources), and mapping the results. The user will of course have control over the behind-the-scenes parts, but I imagine the majority use case will be to input a phylogeny and a question and expect an answer back.
We already have R tools to do nearly all parts of the work-flow shown above: there is a large number of phylogeny tools, 'taxize'/'taxizedb' can handle taxonomic name collection, 'rgbif' can handle interaction with GBIF, and there are many mapping options in R. There are a few areas that still need work, however.
First, there is not yet a clear way to do a phylogeny-based query. Ideally a user will be able to express a simple query like "taxon A vs. its sister group". That is simple to imagine, but implementing it in software is another matter.
Second, users would ideally like answers back - in this case a map of occurrences - relatively quickly, to be able to iterate on their research work-flow. The most likely solution will be to use GBIF's map tile service to visualize binned occurrence data, but we'll need to explore this in detail to make sure it works.
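phylodiv itself is an R package, but the underlying name-to-occurrence step it aims to automate can be sketched against the public GBIF web services; in the Python sketch below the taxa are arbitrary examples and stand in for tips of a user-supplied phylogeny:

    import requests

    GBIF = "https://api.gbif.org/v1"

    def occurrences_for(name, limit=20):
        """Match a tip name to a GBIF taxon key, then fetch georeferenced occurrences."""
        match = requests.get(f"{GBIF}/species/match", params={"name": name}).json()
        key = match.get("usageKey")
        if key is None:
            return []
        occ = requests.get(f"{GBIF}/occurrence/search",
                           params={"taxonKey": key, "hasCoordinate": "true",
                                   "limit": limit}).json()
        return [(r.get("decimalLatitude"), r.get("decimalLongitude"))
                for r in occ["results"]]

    # Compare two tips as a stand-in for "taxon A vs. its sister group".
    for taxon in ["Apis mellifera", "Bombus terrestris"]:
        print(taxon, len(occurrences_for(taxon)), "georeferenced records fetched")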
Abstract
Xper3 (Vignes Lebbe et al. 2016) is a collaborative knowledge base publishing platform that, since its launch in November 2013, has been adopted by over 2,000 users (Pinel et al. 2017). This is mainly due to its user-friendly interface and the simplicity of its data model. The data are stored in MySQL relational databases, but the exchange format uses the TDWG standard format SDD (Structured Descriptive Data; Hagedorn et al. 2005). However, each Xper3 knowledge base is a closed world that the author(s) may or may not share with the scientific community or the public by publishing content and/or identification keys (Kopfstein 2016). The explicit taxonomic, geographic and phenotypic limits of a knowledge base are not always well defined in the metadata fields.
Conversely, terminology vocabularies, such as the Phenotype and Trait Ontology (PATO) and the Plant Ontology (PO), and software to edit them, such as Protégé and Phenoscape, are essential in the semantic web, but difficult to handle for biologists without computer skills. These ontologies constitute open worlds, and are themselves expressed as RDF (Resource Description Framework) triples. Protégé offers visualisation and reasoning capabilities for these ontologies (Gennari et al. 2003, Musen 2015).
Our challenge is to combine the user friendliness of Xper3 with the
expressive power of OWL (Web Ontology Language), the W3C standard for
building ontologies. We therefore focused on analyzing the
representation of the same taxonomic contents under Xper3 and under
different models in OWL. After this critical analysis, we chose a
description model that allows automatic export of SDD to OWL and can be
easily enriched. We will present the results obtained and their
validation on two knowledge bases, one on parasitic crustaceans
(Sacculina) and the second on current ferns and fossils (Corvez and
Grand 2014). The evolution of the Xper3 platform and the perspectives
offered by this link with semantic web standards will be discussed.
Abstract
Human-induced climate change has already altered the conditions
to which species have adapted locally, and consequently, shifts of
occurrence areas have been previously reported (Chen et al. 2011).
Anticipating the results of climate change is urgent, and using these
results efficiently to guide decision-making can help to build
strategies to protect species from those changes. Therefore, our
objective is to propose the use of climate change impact assessments,
obtained through species distribution models (SDMs), to guide decision
making. The emphasis will be on data that could help determine the
potentially vulnerable species and the priority areas, which could act
as climate refuges, as well as wildlife corridors. SDMs are based on
species occurrence points, available mainly from biological collections
and observations (Franklin 2010). When these points are combined with geospatially explicit layers of abiotic or biotic data (e.g. temperature, precipitation, land use), which define the ecological requirements of the species under study, species distribution models can be generated. These models are projected in the form of maps indicating areas where the species can find the most suitable habitats and, therefore, where one is most likely to find them. To support public policy decisions, the generation of robust and reliable models is an important factor. A minimum of six occurrence points is a mandatory requirement, with non-overlapping areas as a filter criterion. Unfortunately, in Brazil, as well as in Latin America in general, this type of data is scarce.
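The pre-modelling filter described above can be pictured with a small sketch: thin the records to one per grid cell (a stand-in for the non-overlapping-area criterion) and require at least six remaining points; the cell size and coordinates below are arbitrary illustrations:

    # Thin occurrence points to one record per grid cell and enforce the
    # minimum of six points. The 0.05-degree cell size is only an example.
    MIN_POINTS = 6
    CELL = 0.05

    def thin(points, cell=CELL):
        """Keep one (lat, lon) point per cell of `cell` degrees."""
        seen, kept = set(), []
        for lat, lon in points:
            key = (round(lat / cell), round(lon / cell))
            if key not in seen:
                seen.add(key)
                kept.append((lat, lon))
        return kept

    occurrences = [(-23.55, -46.63), (-23.56, -46.64), (-22.90, -43.20),
                   (-19.92, -43.94), (-15.78, -47.93), (-12.97, -38.50),
                   (-3.12, -60.02)]
    thinned = thin(occurrences)
    print(len(thinned), "points after thinning;",
          "enough for a model" if len(thinned) >= MIN_POINTS else "too few points")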
Thus, with SDMs, four types of decision-making information regarding priority species and areas could be obtained (Fig. 1).
Size of potential occurrence areas: species that have a small area of occurrence are potentially vulnerable, since they are often endemic and live in restricted environmental conditions. In this case, any small change in environmental conditions can result in the extinction of the impacted species; such regions therefore need to be protected.
Difference between current and future area: species presenting the most
significant reduction in potential areas should be prioritized by
decision-makers. This measurement could be used as an indication of
vulnerability.
Even species with no predicted area reduction, or with a predicted increase, could be prioritized in management programs due to their role in the complex interaction networks underlying ecosystem services, for example as pollinators, seed dispersers or agents of disease control. These species could be more resilient to climate-driven changes in interaction networks, and are possibly better able to provide their services under extremely unfavorable climate scenarios.
Areas that maintain higher species diversity in future scenarios: their
protection could be prioritized in restoration and conservation
programs. Especially in cases involving multiple species, those areas
could be considered as climate refuges by decision-makers. Additionally,
for the reconstruction and use of SDMs published in peer-reviewed journals, it is necessary that all information about the models, their generation, ensemble methods, data cleaning and the applied data quality criteria be made available.
The availability of the four types of information mentioned above can support decision-making strategies aimed at protecting priority species and areas. In conclusion, SDMs provide essential information about the present and future impacts of projected climate change, and their derived data could be preserved using a standard controlled vocabulary.
Abstract
Can Essential Biodiversity Variables (EBVs) be developed to monitor
changes in species interactions? That was the difficult question asked
at the GLOBIS-B workshop in February 2017, in which more than 50 experts participated. EBVs can be defined as harmonized measurements that allow
us to inform policy about essential changes in biodiversity. They can be
seen as biological state variables from which more refined indicators
may be derived. They have been presented as a means to monitor global
biodiversity change and as a concept to drive the gathering, sharing,
and standardisation of data on our biota (Geijzendorffer et al. 2015,
Kissling et al. 2017, Pereira et al. 2013).
There are different classes of EBVs that characterize, for example, the
state of species populations, species traits and ecosystem structure and
function. It has also been proposed that there should be EBVs related to
species interactions. However, until now there has been little progress
formulating what these should be, even though species interactions are
central to ecology. Species interactions cover a wide range of important
processes, from mutualisms, such as pollination, to different forms of
heterotrophic nutrition, such as the predator-prey relationship. Indeed,
ecological interactions are critical to understand why an ecosystem is
more than the sum of its parts. Nevertheless, direct observation of
species interactions is often difficult and time-consuming work, which makes it hard to monitor them in the long term. For this reason, the
workshop focused on those species interactions that are feasible to
study and are most relevant to policy. To bring focus to our discussions
we concentrated on pollination, predation and microbial interactions.
Taking pollination as an example, there was recognition of the
importance of ecological networks and that network metrics may be a
sensitive indicator of change. Potential EBVs might be the number of
pairwise interactions between species or the modularity and interaction
diversity of the whole network. This requires standardised data
collection and reporting (e.g. standardization of measures of
interaction strength or minimum data specifications for ecological
networks) and sufficient data across time to regularly calculate these
metrics. Other simpler surrogates for pollination might also prove
useful, such as flower visitation rates or the proportion of fruit set.
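To make these candidate measurements concrete, the sketch below computes the number of pairwise links, Shannon interaction diversity and a modularity score on a toy weighted plant-pollinator network; both the network and the choice of a greedy modularity algorithm (via networkx) are illustrative only:

    import math
    import networkx as nx
    from networkx.algorithms import community

    # Toy weighted plant-pollinator network; weights stand in for visit counts.
    edges = [("Plant_A", "Bee_1", 10), ("Plant_A", "Bee_2", 3),
             ("Plant_B", "Bee_2", 7), ("Plant_B", "Fly_1", 2),
             ("Plant_C", "Fly_1", 5)]
    G = nx.Graph()
    G.add_weighted_edges_from(edges)

    n_pairwise = G.number_of_edges()                       # pairwise interactions
    total = sum(w for _, _, w in edges)
    shannon = -sum((w / total) * math.log(w / total) for _, _, w in edges)
    parts = community.greedy_modularity_communities(G, weight="weight")
    Q = community.modularity(G, parts, weight="weight")

    print(f"pairwise interactions: {n_pairwise}")
    print(f"interaction diversity (Shannon H'): {shannon:.3f}")
    print(f"modularity of greedy partition: {Q:.3f}")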
Finally, there was a recognition that we do not yet have enough tools to
monitor some important interactions. Many interactions, particularly among microbes, can currently only be inferred from the co-occurrence of taxa.
However, technology is rapidly developing and it is possible to foresee
a future where even these interactions can be monitored efficiently.
Species interactions are essential to understanding ecology, but they
are also difficult to monitor. Yet, delegates at the workshop left with
a positive outlook that it is valuable to develop standardisation and
harmonization of species interaction data to make them suitable for EBV
production.
Abstract
Understanding the role that species play in their environment is a fundamental goal of biodiversity research, bringing knowledge about ecosystem maintenance and the provision of ecosystem services. The different types of interaction that species establish with their partners regulate the functioning of ecosystems (McCann 2007). Interactions between plants and pollinators (Potts et al. 2016) and between plants and seed dispersers (Wang and Smith 2002) are examples of mutualism, crucial to the maintenance of the floristic composition and overall biodiversity in different biomes. They also illustrate well nature's contributions to people, supporting ecosystem services with key economic consequences, such as the pollination of agricultural crops (Klein et al. 2007) and seed dispersal in the natural or assisted restoration of degraded areas (Wunderle 1997).
Interactions are mediated by different functional traits (morphological
and/or behavioral characteristics of organisms that influence their
performance) (Ball et al. 2015). As the zoochorous transfer of pollen
grains and seeds usually involves contact, the success of pollination and seed dispersal depends to a large extent on the match in size and morphology between the flower or fruit and its respective pollinator or seed disperser. Because these traits have been selected over a long shared evolutionary history, it is feasible to rely on their predictive potential to determine whether a certain animal is able to transfer the pollen grains and/or seeds of specific plants in the landscape (Howe 2016).
Biodiversity is facing constant negative impacts, especially related to
climate and habitat changes. They are threatening the provision of
ecosystem services, jeopardizing the basic premise of sustainable
development, which is to guarantee resources for future generations. The
novel landscapes that result from these impacts will certainly be dependent on these ecosystem services, but will those services persist in the face of extinctions and invasive competitors? Ultimately, will these services be predictable from functional traits, in landscapes where shared evolutionary
history is reduced? Strategies that help our understanding of the
interactions and their role in the provision of services are urgent
(Corlett 2011). Given this context, our objective here is to present the
type of data that, if made available, could assist in determining the
role of species in terms of the interactions they engage in and the provision of ecosystem services. Moreover, we aim to elucidate how this role can be associated with functional traits.
The current work focuses on the following groups: plants, birds, bats
and bees (Fig. 1). Of particular interest are interactions involving:
pollination, which is carried out predominantly by bees, but also by
nectarivorous birds and bats; and
seed dispersal, mainly carried out by frugivorous birds and bats.
These interactions are mediated by key traits. In plants, common flower traits are aperture, color, odor strength and type, shape, orientation, size and symmetry, nectar guides, sexual organs, and rewards. Fruit or seed traits, such as fleshy nutritive tissue, chemical attractants and clinging structures, are also relevant for seed dispersal. In animals the most common traits are body size (for bees, the intertegular distance; for bats, forearm length; and for birds, weight), gape width for birds, and feeding habit (nectarivorous, frugivorous, omnivorous) for bats and birds. Providing standardized data on the traits involved in interactions between fauna and flora is important to fill knowledge gaps, which could help in decision-making processes aimed at conservation, restoration and management programs for protecting ecosystem services based on biodiversity.
Abstract
The Brazilian Plant-Pollinator Interactions Network*1 (REBIPP) aims to
develop scientific and teaching activities in plant-pollinator
interaction. The main goals of the network are to:
generate a diagnosis of plant-pollinator interactions in Brazil;
integrate knowledge in pollination of natural, agricultural, urban and
restored areas;
identify knowledge gaps;
support public policy guidelines aimed at the conservation of
biodiversity and ecosystem services for pollination and food production;
and encourage collaborative studies among REBIPP participants.
To achieve these goals the group has resumed and built on previous works
in data standard definition done under the auspices of the IABIN-PTN
(Etienne Américo et al. 2007) and FAO (Saraiva et al. 2010) projects
(Saraiva et al. 2017). The ultimate goal is to standardize the ways data
on plant-pollinator interactions are digitized, to facilitate data
sharing and aggregation. A database will be built with standardized data from Brazilian researchers who are members of the network, to be used by the national community and to allow data to be shared with data aggregators.
To achieve those goals, three task groups of specialists with similar interests and backgrounds (e.g. botanists, zoologists, pollination biologists) have been created. Each group is working on the definition of the terms to describe plants, pollinators and their interactions. The resulting glossary explains the meaning of each term, maps the suggested terms to Darwin Core (DwC) terms where possible, and follows the TDWG Standards Documentation Standard*2 for its definitions.
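As an illustration of this mapping exercise, one plausible way to express a single plant-pollinator interaction with existing Darwin Core ResourceRelationship terms is sketched below; the identifiers are invented and the REBIPP-specific terms, which are still under discussion, are deliberately not shown:

    # One hypothetical interaction expressed with Darwin Core
    # ResourceRelationship terms; all identifiers are invented.
    interaction = {
        "resourceRelationshipID": "urn:uuid:00000000-0000-0000-0000-000000000001",
        "resourceID": "urn:catalog:EXAMPLE:Bees:0001",           # pollinator occurrence
        "relatedResourceID": "urn:catalog:EXAMPLE:Plants:0042",  # plant occurrence
        "relationshipOfResource": "visits flowers of",
        "relationshipAccordingTo": "REBIPP example",
        "relationshipRemarks": "pollen of the plant observed on the bee's scopa",
    }

    for term, value in interaction.items():
        print(f"dwc:{term} = {value}")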
Reaching a consensus on terms and their meaning among members of each
group is challenging, since researchers have different views and
concerns about which data are important to be included into a standard.
That reflects the variety of research questions that underlie different
projects and the data they collect. Thus, we ended up having a long list
of terms, many of them useful only in very specialized research
protocols and experiments, sometimes rarely collected or measured.
Nevertheless we opted to maintain a very comprehensive set of terms, so
that a large number of researchers feel that the standard meets their
needs and that the databases based on it are a suitable place to store
their data, thus encouraging the adoption of the data standard.
An update of the work will soon be available at the REBIPP website and will be open for comments and contributions. This proposed data standard is also being discussed within the TDWG Biological Interaction Data Interest Group*3, with a view to proposing an international standard for species interaction data.
The importance of interaction data for guiding conservation practices
and ecosystem services provision management has led to the proposal of
defining Essential Biodiversity Variables (EBVs) related to biological
interactions. Essential Biodiversity Variables (Pereira et al. 2013) were developed to identify the key measurements required to monitor biodiversity change. EBVs act as an intermediate abstraction layer between primary observations (raw data) and indicators (Niemeijer 2002). Six EBV classes were defined in an initial stage: genetic composition, species populations, species traits, community composition, ecosystem function and ecosystem structure. Each EBV class defines a list of candidate EBVs for biodiversity change monitoring (Fig. 1).
Consequently, digitizing such data and making them available online are essential. Differences in sampling protocols may affect data scalability across space and time, hence imposing barriers to the full use of primary data and to EBV calculation (Henry et al. 2008). Thus, common protocols and methods should be adopted as the most straightforward approach to promote the integration of collected data and to allow the calculation of EBVs (Jürgens et al. 2011). Recently, a workshop was held by GLOBIS-B*4 (GLOBal Infrastructures for Supporting Biodiversity research) to discuss species interaction EBVs (February 26-28, Bari, Italy). Plant-pollinator interactions received a lot of attention and REBIPP's work was presented there. As an outcome, we expect to define specific EBVs for interactions, using plant-pollinator interactions as an example and considering pairwise interactions as well as interaction network-related variables.
The terms in the plant-pollinator data standard under discussion at REBIPP will provide information not only on EBVs related to interactions, but also on other EBV classes: species populations, species traits, community composition, ecosystem function and ecosystem structure. As noted above, some EBVs for specific ecosystem functions (e.g. pollination) lie beyond interaction network structures. The EBV 'Species interactions' (EBV class 'Community composition') should incorporate other aspects such as frequency (Vázquez et al. 2005), duration and empirical estimates of interaction strength (Berlow et al. 2004).
Overall, we think the proposed plant-pollinator interaction data standard currently being developed by REBIPP will contribute to data aggregation, fill many data gaps, and provide indicators for long-term monitoring, serving as an essential source of data for EBVs.
Abstract
The cTAKES package (using the ClearTK Natural Language Processing toolkit; Bethard et al. 2014, http://cleartk.github.io/cleartk/) has been
successfully used to automatically read clinical notes in the medical
field (Albright et al. 2013, Styler et al. 2014). It is used on a daily
basis to automatically process clinical notes and extract relevant
information by dozens of medical institutions. ClearEarth is a
collaborative project that brings together computational linguists and domain scientists to port Natural Language Processing (NLP) modules
trained on the same types of linguistic annotation to the fields of
geology, cryology, and ecology. The goal for ClearEarth in the ecology
domain is the extraction of ecologically relevant terms, including eco-phenotypic traits, from text, and the assignment of those traits to
taxa. Four annotators used Anafora (annotation software; https://github.com/weitechen/anafora) to mark seven entity types (biotic, aggregate, abiotic, locality, quality, unit, value) and six reciprocal property types (synonym of/has synonym, part of/has part, subtype/supertype) in 133 documents, primarily from the Encyclopedia of Life (EOL) and Wikipedia, according to project guidelines
(https://github.com/ClearEarthProject/AnnotationGuidelines).
Inter-annotator agreement ranged from 43% to 90%. Performance of
ClearEarth on identifying named entities in biology text overall was
good (precision: 85.56%; recall: 71.57%). The named entities with the
best performance were organisms and their parts/products (biotic
entities - precision: 72.09%; recall: 54.17%) and systems and
environments (aggregate entities - precision: 79.23%; recall: 75.34%).
Terms and their relationships extracted by ClearEarth can be embedded in
the new ecocore ontology after vetting
(http://www.obofoundry.org/ontology/ecocore.html). This project enables
use of advanced industry and research software within natural sciences
for downstream operations such as data discovery, assessment, and
analysis. In addition, ClearEarth uses the NLP results to generate
domain-specific ontologies and other semantic resources.
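For reference, the figures above follow the usual span-level definitions; the toy sketch below computes exact-match precision and recall over (start, end, label) entity spans and is not tied to the ClearEarth tooling:

    # Toy named-entity evaluation over (start, end, label) spans.
    gold = {(0, 12, "biotic"), (20, 27, "quality"), (35, 40, "unit")}
    predicted = {(0, 12, "biotic"), (20, 27, "abiotic"), (50, 55, "value")}

    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0

    print(f"precision = {precision:.2%}, recall = {recall:.2%}")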
Abstract
There are many ways to capture data from herbarium specimen labels. Here
we compare the results of in-house versus out-sourced data transcription, with the aim of evaluating the pros and cons of each approach and guiding future projects that want to do the same.
In 2014, Meise Botanic Garden (BR) embarked on a mass digitization project. We digitally imaged some 1.2 million herbarium specimens from our African and Belgian herbaria. The minimal data for a third of these images were transcribed in-house, while the remainder was
out-sourced to a commercial company. The minimal data comprised the
fields: specimen's herbarium location, barcode, filing name, family,
collector, collector number, country code and phytoregion (for the
Democratic Republic of Congo, Rwanda & Burundi). The out-sourced data
capture consisted of three types:
additional label information for central African specimens having
minimal data;
complete data for the remaining African specimens; and,
species filing name information for African and Belgian specimens without minimal data.
As part of the preparation for out-sourcing, a strict protocol had to be established as to the criteria for acceptable data quality levels.
Also, the creation of several lookup tables for data entry was necessary
to improve data quality. During the start-up phase all the data were
checked, feedback given, compromises made and the protocol amended.
After this phase, an agreed upon subsample was quality controlled. If
the error score exceeded the agreed level, the batch was returned for
retyping. The data had three quality control checks during the process,
by the data capturers, the contractor's project managers and ourselves.
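The batch acceptance test can be pictured as a simple check on the subsample; in the sketch below the records and field weighting are invented, and the 1% threshold echoes the figure quoted later in this abstract:

    # Count field-level errors in a QC subsample and return the batch for
    # retyping if the error rate exceeds the agreed threshold (here 1%).
    THRESHOLD = 0.01

    def error_rate(records, checked_fields):
        """Fraction of checked field values flagged as erroneous."""
        errors = sum(1 for rec in records for f in checked_fields
                     if rec["errors"].get(f))
        return errors / (len(records) * len(checked_fields))

    subsample = [
        {"barcode": "BR0000001", "errors": {}},
        {"barcode": "BR0000002", "errors": {"collector": True}},
        {"barcode": "BR0000003", "errors": {}},
    ]
    fields = ["filing name", "family", "collector", "country code"]
    rate = error_rate(subsample, fields)
    print(f"error rate {rate:.1%}:",
          "accept batch" if rate <= THRESHOLD else "return for retyping")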
Data quality was analysed and compared between the in-house and out-sourced modes of data capture. The error rates of our staff and of the external company were comparable. The types of error that occurred were often linked to the specific field in question. These errors include problems of interpretation, legibility, foreign languages, typographic errors, etc. A significant amount of data cleaning and post-capture processing was required prior to import into our database, despite the data being of good quality according to the protocol (error < 1%). By improving the workflow and field definitions, a notable improvement could be made in the "data cleaning" phase.
The initial motivation for capturing some data in-house was financial.
However, after analysis, this may not have been the most cost effective
approach. Many lessons have been learned from this first mass digitisation project that will be implemented in similar projects in the future.
Abstract
Recent developments in digitisation technologies and equipment have
enabled advances in the rate of natural history specimen digitisation.
However, Europe's natural history collection institutions are home to over one billion specimens, and currently only a small fraction of these have been digitally catalogued, with fewer still imaged. It is clear that
institutions still face huge challenges when digitising the vast number
of specimens in their collections.
I will present the results of two surveys that aimed to discover the
main successes and challenges facing institutions in their digitisation
programmes. The first survey was undertaken in 2014 within the SYNTHESYS
3 project; it gathered information from project partners on their current digitisation facilities, equipment and workflows, and provided some key recommendations based on these findings. The second survey was
completed more recently in 2017, through the Consortium of European
Taxonomic Facilities (CETAF) Digitisation Working Group. This survey
aimed to discover the successful protocols and implementation of
digitisation, and to identify the shortfalls in resources and protocols.
Results from both surveys will be fed into the future programme of the
CETAF Digitisation Working Group as well as forthcoming and proposed EU
projects, including Innovation and Consolidation for large-scale
Digitisation of natural heritage (ICEDIG).
Abstract
On herbarium sheets, data elements such as plant name, collection site,
collector, barcode and accession number are found mostly on labels glued
to the sheet. The data are thus visible on specimen images. With
continuously improving technologies for collection mass-digitisation, it has become ever easier to produce high-quality images of herbarium sheets, and in the last few years herbarium collections worldwide have started to digitize specimens on an industrial scale (Tegelberg et al. 2014). To use the label data contained in these massive numbers of
images, they have to be captured and databased. Currently, manual data
entry prevails and forms the principal cost and time limitation in the
digitization process. The StanDAP-Herb Project has developed a standard
process for (semi-) automatic detection of data on herbarium sheets.
This is a formal extensible workflow integrating a wide range of
automated specimen image analysis services, used to replace
time-consuming manual data input as far as possible. We have created
web services for OCR (Optical Character Recognition), for identifying regions of interest in specimen images, and for the context-sensitive extraction of information from text recognized by OCR. We implemented
the workflow as an extension of the OpenRefine platform (Verborgh and De
Wilde 2013).
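The project's services are web-based, but the core OCR step can be sketched locally with off-the-shelf tools; in the Python sketch below the file name, crop box and barcode pattern are illustrative assumptions, not part of the StanDAP-Herb workflow itself:

    import re

    from PIL import Image
    import pytesseract

    # Crop a label region of interest from a sheet image, run OCR on it, and
    # pull out a barcode-like token. File name, crop box and pattern are
    # placeholders for this sketch.
    sheet = Image.open("herbarium_sheet.jpg")
    label_region = sheet.crop((2000, 2800, 3000, 3400))   # (left, upper, right, lower)

    text = pytesseract.image_to_string(label_region)
    barcode = re.search(r"\bBR\d{7,}\b", text)

    print(text)
    print("barcode:", barcode.group(0) if barcode else "not found")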
Abstract
Globally there are a number of citizen science portals to support
digitisation of biodiversity collections. Digitisation not only involves
imaging of the specimen itself, but also includes the digital
transcription of label and ledger data, georeferencing and linking to
other digital resources. Making use of the skills and enthusiasm of
volunteers is potentially a good way to reduce the great backlog of
specimens to be digitised.
These citizen science portals engage the public and are liberating data
that would otherwise remain on paper. There is also considerable scope
for expansion into other countries and languages. Therefore, should we
continue to expand? Volunteers give their time for free, but the
creation and maintenance of the platform is not without costs. Given a
finite budget, what can you get for your money? How does the quality
compare with other methods? Is crowdsourcing of label transcription faster, better and cheaper than other forms of transcription?
We will summarize the use of volunteer transcription from our own
experience and the reports of other projects. We will make our
evaluation based on the costs, speed and quality of the systems and
reach conclusions on why you should or should not use this method.
Abstract
The Atlas of Living Costa Rica (http://www.crbio.cr/) is a biodiversity
data portal, based on the Atlas of Living Australia (ALA), which
provides integrated, free, and open access to data and information about
Costa Rican biodiversity in order to support science, education, and
conservation. It is managed by the Biodiversity Informatics Research
Center (CRBio) and the National Biodiversity Institute (INBio).
Currently, the Atlas of Living Costa Rica includes nearly 8 million
georeferenced species occurrence records, mediated by the Global
Biodiversity Information Facility (GBIF), which come from more than 900
databases and have been published by research centers in 36 countries.
Half of those records are published by Costa Rican institutions. In
addition, CRBio is making a special effort to enrich and share more than
5000 species pages, developed by INBio, about Costa Rican vertebrates,
arthropods, molluscs, nematodes, plants and fungi. These pages contain
information elements pertaining to, for instance, morphological
descriptions, distribution, habitat, conservation status, management,
nomenclature and multimedia. This effort is aligned with collaborations established by Costa Rica with other countries, such as Spain, Mexico, Colombia and Brazil, to standardize this type of information through
Plinian Core (https://github.com/PlinianCore), a set of vocabulary terms
that can be used to describe different aspects of biological species.
The Biodiversity Information Explorer (BIE) is one of the modules made
available by ALA which indexes taxonomic and species content and
provides a search interface for it. We will present how CRBio is
implementing BIE as part of the Atlas of Living Costa Rica in order to
share all the information elements contained in the Costa Rican species
pages.
Abstract
Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global
Biodiversity Information Facility (GBIF) node of Australia. They
developed an open and free platform for sharing and exploring
biodiversity data. All the modules are publicly available for reuse and
customization on their GitHub account
(https://github.com/AtlasOfLivingAustralia).
GBIF Benin, hosted at the University of Abomey-Calavi, has published
more than 338 000 occurrence records from 87 datasets and 2 checklists.
Through the GBIF Capacity Enhancement Support Programme
(https://www.gbif.org/programme/82219/capacity-enhancement-support-programme),
GBIF Benin, with the help of GBIF France, is in the process of deploying
the Beninese data portal using the GBIF France back-end architecture.
Benin is the first African country to implement this module of the
ALA infrastructure.
In this presentation, we will give an overview of the registry and
the occurrence search engine using the Beninese data portal. We will
begin with the administration interface and how to manage metadata, then
we will continue with the user interface of the registry and how you can
find Beninese occurrences through the hub.
Abstract
Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global
Biodiversity Information Facility (GBIF) node of Australia. In 2010,
they launched an open and free platform for sharing and exploring
biodiversity data. Thanks to this new infrastructure, they have been
able to drastically increase the number of occurrences published through
GBIF.org. In order to help other GBIF nodes or institutions, they
made all of their modules publicly available for reuse and customization
through GitHub (https://github.com/AtlasOfLivingAustralia).
Since 2013, the community of developers interested in ALA tools has organized, with the help of GBIF, 8 technical workshops around the world. These workshops have helped launch at least 13 data portals.
The last training session, funded through the GBIF Capacity Enhancement
Support Programme
(https://www.gbif.org/programme/82219/capacity-enhancement-support-programme),
was attended by 23 participants from 19 countries on 6 continents. Moreover, on the new GBIF website, a section has been dedicated to this programme (https://www.gbif.org/programme/82953/living-atlases), the official Living Atlases community website was launched in 2017 (https://living-atlases.gbif.org), and the technical documentation has been improved and translated into several languages. All of these
achievements would not have been possible without a huge effort from the
ALA developer community.
After a brief introduction to the Living Atlases community, we will present the work done by ALA to simplify the process of getting a living atlas up and running. We will also show how ALA developers have helped community members create their own versions by performing simple HTML/CSS customizations.
Abstract
Atlas of Living Australia (ALA) (https://www.ala.org.au/) is the Global
Biodiversity Information Facility (GBIF) node of Australia. Since 2010,
they have developed and improved a platform for sharing and exploring
biodiversity information. All the modules are publicly available for
reuse and customization on their GitHub account
(https://github.com/AtlasOfLivingAustralia).
The National Biodiversity Network, a registered charity, is the UK GBIF
node and has been sharing biodiversity data since 2000. They have published more than 79 million occurrences from 818 datasets. In 2016, they
launched the NBN Atlas Scotland (https://scotland.nbnatlas.org/) based
on the Atlas of Living Australia infrastructure. Since then, they
released the NBN Atlas (https://nbnatlas.org/), the NBN Atlas Wales
(https://wales.nbnatlas.org/) and soon the NBN Atlas Isle of Man. In
addition to the occurrence/species search engine and the metadata registry, they have put in place several tools that help users work with data published in the network: the spatial portal and the "explore your region" module. Both elements are based on Atlas of Living Australia
developments.
Because the Atlas of Living Australia platform is powerful and reusable, we want to show these two applications, which are used to perform geographical analyses. To do so, we will present the specific features of each component by giving examples of some of their functionalities.
Abstract
During the last few years, a large number of countries have deployed
national customized versions of The Atlas of Living Australia (ALA)
(https://www.ala.org.au/), which is a collaboratively developed, open
infrastructure for collecting and presenting biodiversity data
nationally and for sharing it globally through GBIF (https://gbif.org).
The increasing number of national nodes deploying this free and open
source software platform has built a worldwide community involving more than 17 countries that collaborate openly in a decentralized way
(https://living-atlases.gbif.org/), helping each other out by organizing
technical workshops and by developing and sharing new software modules
using GitHub.
One of these modules in the Living Atlases infrastructure is an R
package called ALA4R, originally created by Ben Raymond
(https://github.com/AtlasOfLivingAustralia/ALA4R). It provides the
research community with programmatic data access to many of the Living
Atlases data services using R.
This presentation will show how ALA4R can be used to access data from
different national Living Atlases nodes and how this R package can
enable research studies that utilize methods and practices for
reproducible workflows that are being increasingly established within
the research community
(https://www.britishecologicalsociety.org/wp-content/uploads/2017/12/guide-to-reproducible-code.pdf).
Abstract
Many, if not most, countries have several official or widely used
languages. And most, if not all, of these countries have herbaria.
Furthermore, specimens have been exchanged between herbaria from many
countries, so herbaria are often polylingual collections. It is
therefore useful to have label transcription systems that can attract
users proficient in a wide variety of languages. Belgium is a typical
polylingual country at the boundary between the Romance and Franconian
languages (French, Dutch & German). Yet, currently there are few
non-English transcription platforms for citizen science. This is why in
Belgium we built DoeDat, from the Digivol system of the Atlas of Living
Australia.
We will be demonstrating DoeDat and its multilingual features. We will
explain how we enter translations, both for the user interface and for
the dynamic parts of the website. We will share our experiences of
running a multilingual site and the challenges it brings. Translating
and running such a website requires skilled personnel and patience.
However, our experience has been positive and the number and quality of
our volunteer transcriptions has been rewarding. We look forward to the
further use of DoeDat to transcribe data in many other languages. There is no longer any reason to exclude willing volunteers, whatever language they speak.
Abstract
MapBio is a project initiated by the Chinese Academy of Sciences, which
aims at integrating species distribution data from different sources and
mapping the biodiversity of China to support biodiversity research and
biodiversity conservation decisions. Species distribution data may be
found in journal articles, books and different databases in various
formats, and most species distributions are described in free text.
MapBio is trying to build up a workflow for collecting this free text,
parsing it into standardized data and projecting distributions onto a
map for each species in China. A map module of MapBio has been designed and implemented, based on Web GIS, to visualize species distributions on a map
at different levels, e.g., occurrence points, county, province,
distribution range, protected area, waterbody, biogeographic realm.
Since the completeness of distribution data is very important for
assessing biodiversity, we developed a tool in MapBio for analysis of
the gaps in distribution data. Based on the species distribution data,
especially the occurrence data, MapBio provides an integrated modeling
tool that helps users build species niche models. MapBio is an open
access project: users can easily obtain data and services for
biodiversity research and conservation, and can also contribute their
own biodiversity data to MapBio.
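To illustrate the parsing step in such a workflow, the following minimal Python sketch (not MapBio's actual parser) turns a free-text distribution statement into standardized records using a toy gazetteer; the place names, codes and field names are illustrative assumptions:

```python
# A minimal sketch of parsing a free-text distribution statement into
# standardized province-level records via a simple gazetteer lookup.
import re

GAZETTEER = {  # hypothetical lookup from place name to an administrative code
    "Yunnan": "CN-YN",
    "Sichuan": "CN-SC",
    "Hainan": "CN-HI",
}

def parse_distribution(species, free_text):
    """Extract known place names from free text and emit standardized rows."""
    records = []
    for place, code in GAZETTEER.items():
        if re.search(rf"\b{re.escape(place)}\b", free_text):
            records.append({"scientificName": species,
                            "locality": place,
                            "adminUnitCode": code,
                            "level": "province"})
    return records

print(parse_distribution("Moschus berezovskii",
                         "Recorded from Yunnan and Sichuan provinces."))
```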
Abstract
For more than a decade, the biodiversity informatics community has
recognised the importance of stable resolvable identifiers to enable
unambiguous references to data objects and the associated concepts and
entities, including museum/herbarium specimens and, more broadly, all
records serving as evidence of species occurrence in time and space.
Early efforts built on the Darwin Core institutionCode, collectionCode
and catalogNumber terms, treated as a triple and expected uniquely to
identify a specimen. Following review of current technologies for
globally unique identifiers, TDWG adopted Life Science Identifiers
(LSIDs) (Pereira et al. 2009). Unfortunately, the key stakeholders in
the LSID consortium soon withdrew support for the technology, leaving
TDWG committed to a moribund standard. Subsequently, publishers of
biodiversity data have adopted a range of technologies to provide unique
identifiers, including (among others) HTTP Uniform Resource
Identifiers (URIs), Universally Unique Identifiers (UUIDs), Archival
Resource Keys (ARKs), and Handles. Each of these technologies has merit
but they do not provide consistent guarantees of persistence or
resolvability. More importantly, the heterogeneity of these solutions
hampers delivery of services that can treat all of these data objects as
part of a consistent linked-open-data domain.
The geoscience community has established the System for Earth Sample
Registration (SESAR) that enables collections to publish standard
metadata records for their samples and for each of these to be
associated with an International Geo Sample Number (IGSN
http://www.geosamples.org/igsnabout). IGSNs follow a standard format,
distribute responsibility for uniqueness between SESAR and the
publishing collections, and support resolution via HTTP URI or Handles.
Each IGSN resolves to a standard metadata page, roughly equivalent in
detail to a Darwin Core specimen record. The standardisation of
identifiers has allowed the community to secure support from some
journal publishers for promotion and use of IGSNs within articles.
The biodiversity informatics community encompasses a much larger number
of publishers and greater pre-existing variation in identifier formats.
Nevertheless, it would be possible to deliver a shared global identifier
scheme with the same features as IGSNs by building on the aggregation
services offered by the Global Biodiversity Information Facility (GBIF).
The GBIF data index includes normalised Darwin Core metadata for all
data records from registered data sources and could serve as a platform
for resolution of HTTP URIs and/or Handles for all specimens and for all
occurrence records. The most significant trade-off requiring
consideration would be between autonomy for collections and other
publishers in how they format identifiers within their own data and the
benefits that may arise from greater consistency and predictability in
the form of resolvable identifiers.
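As a hedged sketch of the resolution role GBIF's index could play, the following Python snippet fetches the normalised Darwin Core record for a single occurrence key from the public GBIF API; the shared identifier scheme itself is the proposal above, and only simple record retrieval is shown here:

```python
# A minimal sketch of resolving one GBIF occurrence key to its normalised
# Darwin Core metadata over HTTP.
import requests

def resolve_gbif_occurrence(gbif_key):
    """Return selected Darwin Core fields for one GBIF occurrence key."""
    url = f"https://api.gbif.org/v1/occurrence/{gbif_key}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    record = response.json()
    # The institutionCode/collectionCode/catalogNumber triple mentioned above
    # is carried alongside GBIF's own stable key.
    return {k: record.get(k) for k in
            ("key", "institutionCode", "collectionCode", "catalogNumber",
             "scientificName", "eventDate")}

if __name__ == "__main__":
    import sys
    print(resolve_gbif_occurrence(sys.argv[1]))  # pass a real GBIF occurrence key
```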
Abstract
A simple, permanent and reliable specimen identifier system is needed to
take the informatics of collections into a new era of interoperability.
A system of identifiers based on HTTP URI (Uniform Resource
Identifiers), endorsed by the Consortium of European Taxonomic
Facilities (CETAF), has now been rolled out to 14 member organisations
(Güntsch et al. 2017).
CETAF Identifiers have a Linked Open Data redirection mechanism for both
human- and machine-readable access and, if fully implemented, provide
Resource Description Framework (RDF)-encoded specimen data following
best practices continuously improved by members of the initiative. To
date, more than 20 million physical collection objects have been
equipped with CETAF Identifiers (Groom et al. 2017).
To facilitate the implementation of stable identifiers, simple
redirection scripts and guidelines for deciding on the local identifier
syntax have been compiled
(http://cetafidentifiers.biowikifarm.net/wiki/Main_Page). Furthermore,
a capable "CETAF Specimen URI Tester" (http://herbal.rbge.info/)
provides an easy-to-use service for testing whether the existing
identifiers are operational.
For the usability and potential of any identifier system associated with
evolving data objects, active links to the source information are
critically important. This is particularly true for natural history
collections facing the next wave of industrialised mass digitisation,
where specimens come online with only basic, but rapidly evolving label
data. Specimen identifier systems must therefore have components for
monitoring the availability and correct implementation of individual
data objects. Our next implementation steps will involve the development
of a \"Semantic Specimen Catalogue\", which has a list of all existing
specimen identifiers together with the latest RDF metadata snapshot. The
catalogue will be used for semantic inference across collections as well
as the basis for periodic testing of identifiers.
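The following minimal Python sketch illustrates the kind of content-negotiated access such a redirection mechanism enables: asking an HTTP URI for RDF and following the redirect. The example URI is a placeholder, and the exact media types supported vary between institutions.

```python
# A minimal sketch of requesting machine-readable (RDF) specimen data from a
# CETAF-style HTTP URI via content negotiation.
import requests

def fetch_specimen_rdf(specimen_uri):
    """Ask a specimen HTTP URI for RDF and follow its redirection mechanism."""
    response = requests.get(
        specimen_uri,
        headers={"Accept": "application/rdf+xml"},  # request a machine-readable form
        allow_redirects=True,
        timeout=30,
    )
    response.raise_for_status()
    return response.url, response.headers.get("Content-Type"), response.text

if __name__ == "__main__":
    # Placeholder URI; substitute a real CETAF specimen identifier.
    final_url, content_type, body = fetch_specimen_rdf("https://example.org/specimen/ABC123")
    print(final_url, content_type, len(body))
```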
Abstract
Life sciences research, and even more specifically biodiversity sciences
research, has yet to coalesce on a single system of identifiers for
specimens (physical samples collected for research) or even a single set
of standards for identifiers. Diverse identifier systems lead to
duplication and ambiguity, which in turn lead to challenges in finding
specimens, tracking and citing their usage, and linking them to data.
Other research disciplines provide experience that biodiversity sciences
could use to overcome these challenges. Earth sciences/geology may be
the most advanced discipline in this regard, thanks to the use of the
International GeoSample Number (IGSN) system, which was established to
provide globally unique identifiers for geological samples. The original
motivation of IGSN was to overcome duplication of sample numbers
reported in the scientific literature and to support the correlation of
observations on the same samples carried out by different laboratories
and reported in different publications. The IGSN system is managed
through a small set of 'allocating agents' who act on behalf of a
national agency or community, under the overall coordination of the IGSN
Organization - a volunteer group representing a mixture of research
institutions and agencies. Similar to widely-recognized Digital Object
Identifiers (DOIs), the primary requirement of an allocating agent is to
maintain the mapping from an IGSN to a web 'landing page'
corresponding to each sample. A standard (minimal) schema for describing
samples registered with IGSN has been developed, but individual IGSN
allocating agents will often supplement the base metadata with
additional information. Other efforts are working on cross-disciplinary
sample metadata schemas, but no single core standard has been agreed
upon yet. An important part of the development of the IGSN system has
been an engagement with scholarly publishers, with a goal of making each
mention of an IGSN within a report or paper be a hyperlink, and also for
links to other observations relating to the same sample to be
automatically highlighted by the publisher.
Abstract
Zooarchaeological specimens are the remains of animals, including
vertebrate and invertebrate taxa, recovered from, or in association
with, archaeological contexts of deposition or surrounding landscapes.
The physical scope of zooarchaeological specimens is diverse and
includes macro- and micro-zooarchaeological specimens composed of
archaeologically preserved bone, shell, exoskeletons, teeth, hair or
fur, scales, horns or antlers, as well as geochemical (e.g., isotopes)
and biochemical (e.g., ancient DNA) signatures derived from faunal
remains. Artifacts and objects created from animal remains, such as bone
pins, shell beads, and preserved animal hides, are also zooarchaeological
specimens. Here we present recent work to utilize identifiers for
archaeological samples in new data publishing routines, focusing on key
challenges. One critical challenge is that archaeological samples are
often composited into different units depending on managers of
collections and analysts. Thus, in some cases, when migrating datasets
for publication, identifiers can refer to different sets of units, even
within the same dataset. Another key challenge is assuring that
different repositories can share sample identifiers. We show how Open
Context, a site-based archaeology-focused repository that also manages
objects such as zooarchaeological material, and VertNet, a
specimen-oriented biodiversity repository, have collaborated to share
sample identifiers.
While this illustrates a success story of linking data across
repositories, we also discuss a complication: the identifiers propagated
from VertNet are "occurrence identifiers," not true sample identifiers,
and in Open Context they point to a similar record type called "Animal
Bone".
Abstract
The Ocean Biogeographic Information System (OBIS) began in 2000 as the
repository for data from the Census of Marine Life. Since that time,
OBIS has expanded its goals beyond simply hosting data to supporting
more aspects of marine conservation (Pooter et al. 2017). In order to
accomplish those goals, the OBIS secretariat in partnership with its
European node (EurOBIS) hosted at the Flanders Marine Institute (VLIZ,
Belgium), and the Intergovernmental Oceanographic Commission (IOC)
Committee on International Oceanographic Data and Information Exchange
(IODE, 23rd session, March 2015, Brugge) established a 2-year pilot
project to address a particularly problematic issue: environmental
data collected as part of marine biological research were being
disassociated from the biological data. OBIS-Event-Data is the solution
that was developed from that pilot project, which devised a method for
keeping environmental data together with the biological data (Pooter et
al. 2017).
OBIS is seeking early adopters of the new data standard OBIS-Event-Data
from among the marine biodiversity monitoring communities, to further
validate the data standard, and develop data products and scientific
applications to support the enhancement of Biological and Ecosystem
Essential Ocean Variables (EOVs) in the framework of the Global Ocean
Observing System (GOOS) and the Marine Biodiversity Observation Network
of the Group on Earth Observations (GEO BON MBON).
After the successful 2-year IODE pilot project OBIS-ENV-DATA, the IOC
established a new 2-year IODE pilot project OBIS-Event-Data for
Scientific Applications (2017-2019). The OBIS-Event-Data data standard,
building on Darwin Core, provides a technical solution for combined
biological and environmental data, and incorporates details about
sampling methods and effort, including event hierarchy. It also
implements standardization of parameters involved in biological,
environmental, and sampling details using an international standard
controlled vocabulary (British Oceanographic Data Centre Natural
Environment Research Council).
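As a hedged illustration of the structure this standard describes (values and identifiers below are invented, and the dictionaries only mimic the Darwin Core event core plus measurement-or-fact-style rows), a single sampling event, an occurrence and linked measurements might look like this:

```python
# A minimal sketch of keeping environmental and biological data together:
# an event, an occurrence nested under it, and measurement rows linked back
# by eventID / occurrenceID. All values are illustrative.
event = {
    "eventID": "cruise01:station05:tow02",
    "parentEventID": "cruise01:station05",      # event hierarchy
    "eventDate": "2018-05-12",
    "samplingProtocol": "bongo net tow",
    "sampleSizeValue": 50, "sampleSizeUnit": "metres",
}

occurrence = {
    "occurrenceID": "cruise01:station05:tow02:occ1",
    "eventID": event["eventID"],                 # links occurrence to event
    "scientificName": "Calanus finmarchicus",
    "basisOfRecord": "MaterialSample",
}

measurements = [
    # environmental measurement attached to the event
    {"eventID": event["eventID"], "occurrenceID": None,
     "measurementType": "water temperature", "measurementValue": 11.2,
     "measurementUnit": "degrees Celsius"},
    # biological measurement attached to the occurrence
    {"eventID": event["eventID"], "occurrenceID": occurrence["occurrenceID"],
     "measurementType": "abundance", "measurementValue": 340,
     "measurementUnit": "individuals per cubic metre"},
]

print(len(measurements), "measurement rows linked to", event["eventID"])
```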
A workshop organized by IODE/OBIS in April brought together major animal
tagging and tracking networks such as the Ocean Tracking Network (OTN),
the Animal Telemetry Network (ATN), the Integrated Marine Observing
System (IMOS), the European Tracking Network (ETN) and the Acoustic
Tracking Array Platform (ATAP) to test the OBIS-Event-Data standard
through the development of some data products and science applications.
Additionally, this workshop contributes to the further maturation of the
GOOS EOV on fish as well as the EOV on birds, mammals and turtles.
We will present the outcomes as well as any lessons learned from this
workshop on problems, solutions, and applications of using Darwin
Core/OBIS-Event-Data for bio-logging data.
Abstract
In recent years, bio-logging data, automatically gathered by sensors
deployed on animals, has become one of the fastest growing sources of
biodiversity data. This is largely due to the steadily declining mass,
size and costs of sensors, continuously opening new opportunities to
monitor new species. While 'tracking data' (data from spatially enabled
sensors such as GPS) was previously the most prominent, almost 70% of all
bio-logging data now consists of non-spatial data, such as physiological
measurements. In contrast to the biodiversity data
community, where standards to mobilize and exchange data are relatively
well established, the bio-logging community is still lacking standards
to transport data from sensors into repositories, or to mobilize data in
a standardized format from different repositories to enable cooperation
between users, shared software tools, data aggregation for
meta-analysis, or a consistent format for long-term archiving.
To set the stage for a discussion about standards for bio-logging data
to be developed or adapted, we present a mind map describing the
different pathways of bio-logging data during its life cycle, and the
opportunities for standardization within this cycle. As an example we
present the use of the Open Geospatial Consortium (OGC) 'SensorML' and
'Observations & Measurements' standards to transfer bio-logging data
from a sensor to a repository and ultimately to a user for subsequent
analysis. These standards provide machine-readable methods for
describing bio-logging sensors and the measurements they collect,
offering a standardized structure that can be customized by the
bio-logging community (e.g. with standardized vocabularies) to achieve
interoperability.
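A minimal sketch of a single bio-logging measurement expressed along the lines of the Observations & Measurements model is shown below; the field names follow the O&M concepts (phenomenon time, procedure, observed property, feature of interest, result), while the JSON layout, identifiers and values are illustrative assumptions rather than a normative encoding:

```python
# A minimal, assumption-laden sketch of one bio-logging observation in the
# spirit of OGC Observations & Measurements.
import json

observation = {
    "phenomenonTime": "2018-06-01T04:15:00Z",
    "procedure": "tag:acme-heart-rate-logger:serial-00042",   # hypothetical sensor id
    "observedProperty": "heart rate",
    "featureOfInterest": "animal:individual:GPS-collar-017",  # hypothetical animal id
    "result": {"value": 62, "uom": "beats per minute"},
}

print(json.dumps(observation, indent=2))
```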
Abstract
Usefully describing sensor deployments on animals is a major challenge
for advocates of data standards. Bio-logging studies also need to be
documented in a standard manner so that they can be discovered and their
relevance determined. For systems aggregating biodiversity occurrence
records, the use of the Darwin Core standard (Wieczorek et al. 2012) to
express species occurrences is nearly ubiquitous. Every bio-logging study
is, in effect, a set of species occurrences that yields high-quality
spatial and temporal data recorded by specialists.
There are many benefits to summarising each of these studies as a
single, flat-file record. Simple Darwin Core offers the ability to do
this by representing the multiple occurrences as a date range in
dwc:eventDate and a footprint polygon in dwc:footprintWKT for the
area covered by the track. By also describing the species uniformly,
setting dwc:basisOfRecord to MachineObservation, and using a controlled
vocabulary for the type of bio-logging data, systems could offer an
effective means of querying tracking data. It is important to look to
other data standards initiatives relevant to bio-logging to ensure common
usage of Darwin Core terms.
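Concretely, such a single flat summary record might look like the following minimal sketch; the Darwin Core term names are taken from the description above, while the species, identifiers and coordinates are invented for illustration:

```python
# A minimal sketch of a Simple Darwin Core summary record for one tracked
# individual. All values are illustrative.
track_summary = {
    "basisOfRecord": "MachineObservation",
    "scientificName": "Chelonia mydas",
    "organismID": "zoatrack:animal:1234",        # hypothetical platform identifier
    "eventDate": "2017-11-03/2018-02-18",        # ISO 8601 interval for the deployment
    "footprintWKT": "POLYGON ((153.0 -27.5, 153.6 -27.5, 153.6 -26.9, 153.0 -26.9, 153.0 -27.5))",
    "samplingProtocol": "satellite telemetry",   # candidate controlled-vocabulary slot
}

print(track_summary["scientificName"], track_summary["eventDate"])
```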
The Atlas of Living Australia is using an implementation of Simple
Darwin Core to represent data from the bio-logging platform ZoaTrack as
occurrence data to make it discoverable via location or species-based
searches. Other initiatives, for example Swedish LifeWatch, follow a
similar approach to represent data from the Wireless Remote Animal
Monitoring (WRAM) Scandinavian bio-logging infrastructure. With
endorsement from the community, the implementation could be useful as a
type of metadata catalogue record, opening it for usage in application
programming interface (API) development and thus enabling machine
interoperability between systems and users. In short, bio-logging
systems and practitioners would be able to easily discover relevant
studies by searching by location and/or species.
Abstract
With the continuous development of imaging technology, the amount of
insect 3D data is increasing, but research on data management is still
virtually non-existent. This paper will discuss the specifications and
standards relevant to the process of insect 3D data acquisition,
processing and analysis.
The collection of 3D data of insects includes specimen collection,
sample preparation, image scanning specifications and 3D model
specification. The specimen collection information uses existing
biodiversity information standards such as Darwin Core. However, the 3D
scanning process contains unique specifications for specimen
preparation, depending on the scanning equipment, to achieve the best
imaging results.
Data processing of 3D images includes 3D reconstruction, tagging
morphological structures (such as muscle and skeleton), and 3D model
building. There are different algorithms in the 3D reconstruction
process, but the processing results generally follow DICOM (Digital
Imaging and Communications in Medicine) standards. There is no available
standard for marking morphological structures, because this process is
currently executed by individual researchers who create operational
specifications according to their own needs. 3D models have specific
file specifications, such as the Wavefront object (OBJ) format
(https://en.wikipedia.org/wiki/Wavefront_.obj_file) and the 3ds Max format
(https://en.wikipedia.org/wiki/.3ds), which are widely used at present.
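As a small, hedged illustration of how simple the text-based OBJ format is to inspect (real 3D pipelines would use dedicated libraries; the file path below is a placeholder):

```python
# A minimal sketch of counting geometry elements in a Wavefront .obj file:
# vertices are lines beginning with "v " and faces with "f ".
def summarize_obj(path):
    """Return the number of vertices and faces in a Wavefront OBJ file."""
    vertices = faces = 0
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.startswith("v "):
                vertices += 1
            elif line.startswith("f "):
                faces += 1
    return vertices, faces

# Example (placeholder path):
# print(summarize_obj("specimen_head.obj"))
```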
There are only a few simple tools for analysis of three-dimensional data,
and there are no specific standards or specifications in Audubon Core
(https://terms.tdwg.org/wiki/Audubon_Core), the TDWG standard for
biodiversity-related multimedia.
There are very few 3D databases of animals at this time. Most insect
3D data are created by individual entomologists and are not even stored
in databases. Specifications for the management of insect 3D data need
to be established step-by-step. Based on our attempt to construct a
database of 3D insect data, we preliminarily discuss the necessary
specifications.
Abstract
iDigBio (Matsunaga et al. 2013) currently references over 22 million media
files, and stores approximately 120 terabytes of those media files
co-located with our compute infrastructure. Using these images for
scientific research is a logistical and technical challenge.
Transferring large numbers of images requires programming skill,
bandwidth, and storage space. While simple image transformations such as
resizing and generating histograms are approachable on desktops and
laptops, the neural networks commonly used for learning from images
require server-based graphical processing units (GPUs) to run
effectively.
Using the GUODA (Global Unified Open Data Access) infrastructure, we
have built a model pipeline for applying user-defined processing to any
subset of the images stored in iDigBio. This pipeline is run on servers
located in the Advanced Computing and Information Systems lab (ACIS)
alongside the iDigBio storage system. We use Apache Spark, the Hadoop
File System (HDFS), and Mesos to perform the processing. We have placed
a Jupyter notebook server in front of this architecture which provides
an easy environment with deep learning libraries for Python already
loaded for end users to write their own models. Users can access the
stored data and images and manipulate them according to their
requirements and make their work publicly available on GitHub.
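A hedged sketch of the kind of user-defined processing this architecture supports is shown below; the HDFS path and the per-image function are placeholders, and the production pipeline naturally involves more than this toy job:

```python
# A minimal sketch (not the production GUODA pipeline) of applying a
# user-defined function to a subset of images with Apache Spark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("idigbio-image-demo").getOrCreate()

def image_size(record):
    """Toy per-image processing: return filename and byte length."""
    path, data = record
    return path, len(data)

# binaryFiles yields (path, bytes) pairs for each file under the prefix.
images = spark.sparkContext.binaryFiles("hdfs:///path/to/image/subset/*.jpg")  # placeholder path
results = images.map(image_size).collect()

for path, nbytes in results[:10]:
    print(path, nbytes)

spark.stop()
```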
As an example of how this pipeline can be used in research, we applied a
neural network developed at the Smithsonian Institution to identify
herbarium sheets that were prepared with hazardous mercury-containing
solutions (Schuettpelz et al. 2017). The model was trained with
Smithsonian resources on their images and transferred to the GUODA
infrastructure hosted at ACIS, which also houses iDigBio. We then applied
this model to additional images in iDigBio, both to illustrate the
application of these techniques to broad image corpora and, potentially,
to notify other data publishers of contamination. We present
the results of this classification not as a verified research result,
but as an example of the collaborative and scalable workflows this
pipeline and infrastructure enable.
Abstract
Earth's ecosystems are threatened by anthropogenic change, yet
relatively little is known about biodiversity across broad spatial (i.e.
continent) and temporal (i.e. year-round) scales. There is a significant
gap at these scales in our understanding of species distribution and
abundance, which is the precursor to conservation (Hochachka et al.
2012). The cost and availability of experts to collect data does not
scale to broad spatial or temporal surveys. With recent advances in
artificial intelligence (AI) it is becoming possible to automate some of
this data collection and analysis (Joppa 2017). The Cornell Lab of
Ornithology is working to apply AI in three ways:
incorporating AI into the analysis of radar data to assess densities of
migratory birds at a continent-wide scale and across years;
utilizing new techniques in convolutional neural networks (CNNs) to
improve our ability to classify natural sounds by limiting background
noise;
applying our ability to train models to classify birds in images to
build a system that can analyze video streams.
Our approach to accomplishing this is through partnerships between our
non-profit organization, computer science faculty, and industry leaders.
By leveraging deep learning technologies and including an array of
stakeholders, we are able to process data that would take years to
analyze using traditional methods.
Methods.
We use 28 years of Next-Generation Radar (NEXRAD) imagery, which
contains birds aloft during nocturnal migration. Using CNNs we can
assess the density of birds captured on radar images to count the number
of individuals crossing the continental U.S. each spring and fall. For
acoustical analysis of birds vocalizing during nocturnal migration, we
are using recorders to monitor the calling activity of birds aloft and
CNNs to detect and classify bird vocalizations in noisy landscapes. We
gathered more than 6 million images from the eBird community, archived
them in the Macaulay Library at the Cornell Lab of Ornithology, and
crowdsourced millions of annotations to train models to classify more
than 5,000 species of birds in images. Now we are applying this approach
to video. These projects have used both supervised and unsupervised
learning techniques. With supervised learning and the use of elaborate
training datasets, we made tremendous headway in bird photo
identification. Unsupervised learning was used successfully to eliminate
rain from NEXRAD images, with little training data incorporated. We
expect advances in unsupervised learning will open new possibilities in
the future.
Conclusions.
The Cornell Lab pioneered the concept of autonomous recording units for
monitoring biodiversity two decades ago, but without AI to process the
data, discoveries were limited by human processing time. Today, we can
combine our findings using radar with acoustic monitoring and sightings
from citizen scientists for a more complete understanding of bird
populations. We now expect AI processes to be able to identify birds
with high confidence in the near future for images, audio recordings and
videos. Furthermore, while conventional approaches require using
separate neural nets that are combined in a separate process, we now
combine multi-modal sensor integration into a single CNN. There is no
longer a need for pre-processing of data for AI pattern recognition. Our
vision is to continue to apply these techniques to create a 'real-time
global bird monitoring network', with a combination of humans and
automated sensors. This network of sensors (or robots) will have an
ability comparable to a human's to detect, identify, and count birds,
gathering information systematically and in places where humans cannot
reach.
Abstract
Widespread technology usage has resulted in a deluge of data that is not
limited to scientific domains. For example, technology companies
accumulate vast amounts of data on their users to support their
applications and platforms. The participation of many domains in big
data collection, data analysis and visualization, and the need for fast
data exploration has provided a stellar market opportunity for high
quality data visualization software to emerge. In this talk, leading
industry visualization software (Tableau) will be used to explore a
biodiversity dataset (Carex spp. distribution and morphology). The
advantages and disadvantages of using Tableau for scientific exploration
will be discussed, as well as how to integrate data visualization tools
early into the data pipeline. Lastly, the potential for developing a
data visualization \"stack\" (i.e., a combination of software products
and programming languages) using available tools will be discussed, as
well as what the future might look like for scientists looking to
capitalize on the growth of industry tools.
Abstract
Phytoplankton form the basis of the marine food web and are an indicator
for the overall status of the marine ecosystem. Changes in this
community may impact a wide range of species (Capuzzo et al. 2018)
ranging from zooplankton and fish to seabirds and marine mammals.
Efficient monitoring of the phytoplankton community is therefore
essential (Edwards et al. 2002). Traditional monitoring techniques are
highly time intensive and involve taxonomists identifying and counting
numerous specimens under the light microscope. With the recent
development of automated sampling devices, image analysis technologies
and learning algorithms, the rate of counting and identification of
phytoplankton can be increased significantly (Thyssen et al. 2015). The
FlowCAM (Álvarez et al. 2013) is an imaging particle analysis system for
the identification and classification of phytoplankton. Within the
Belgian Lifewatch observatory, monthly phytoplankton samples are taken
at nine stations in the Belgian part of the North Sea. These samples are
run through the FlowCAM and each particle is photographed. Next, the
particles are identified based on their morphology (and fluorescence)
using state-of-the-art Convolutional Neural Networks (CNNs) for computer
vision. This procedure requires learning sets of expert-validated
images. The CNNs are specifically designed to take advantage of the
two-dimensional structure of these images by finding local patterns;
they are easier to train and have far fewer parameters than a fully
connected network with the same number of hidden units.
In this work we present our approach to the use of CNNs for the
identification and classification of phytoplankton, testing it on
several benchmarks and comparing with previous classification
techniques. The network architecture used is ResNet50 (He et al. 2016).
The framework is fully written in Python using the TensorFlow (Abadi et
al. 2016) module for Deep Learning.
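For readers unfamiliar with how such a trained classifier is applied to new FlowCAM particle images, the following minimal TensorFlow/Keras sketch loads an already-trained model and predicts a class; the model file, image size and class names are placeholders, and the actual pipeline must apply the same preprocessing used during training:

```python
# A minimal inference sketch, assuming a ResNet50-based classifier has already
# been trained on expert-validated particle images. All paths and labels are
# placeholders; input preprocessing must match the training setup.
import numpy as np
import tensorflow as tf

IMAGE_SIZE = (224, 224)
CLASS_NAMES = ["Chaetoceros", "Ditylum", "Noctiluca"]  # placeholder label set

model = tf.keras.models.load_model("phytoplankton_resnet50.h5")  # placeholder path

def classify_particle(image_path):
    """Return the predicted class name and probability for one particle image."""
    image = tf.keras.utils.load_img(image_path, target_size=IMAGE_SIZE)
    array = tf.keras.utils.img_to_array(image)[np.newaxis, ...]
    probabilities = model.predict(array, verbose=0)[0]
    best = int(np.argmax(probabilities))
    return CLASS_NAMES[best], float(probabilities[best])

# Example: print(classify_particle("particle_000123.png"))
```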
Deployment and exploitation of the current framework is supported by the
recently started European Union Horizon 2020 programme funded project
DEEP-Hybrid-Datacloud (Grant Agreement number 777435), which supports
the expensive training of the system needed to develop the application
and provides the necessary computational resources to the users.
Abstract
Over the next 5 years major advances in the development and application
of numerous technologies related to computing, mobile phones, artificial
intelligence (AI), and augmented reality (AR) will have a dramatic
impact in biodiversity monitoring and conservation. Over a 2-week period
several of us had the opportunity to meet with multiple technology
experts in Silicon Valley, California, USA to discuss trends in
technology innovation, and how they could be applied to conservation
science and ecology research. Here we briefly highlight some of the key
points of these meetings with respect to AI and Deep Learning.
Computing: Investment and rapid growth in AI and Deep Learning
technologies are transforming how machines can perceive the environment.
Much of this change is due to increased processing speeds of Graphics
Processing Units (GPUs), which is now a billion-dollar industry. Machine
learning applications, such as convolutional neural networks (CNNs), run
more efficiently on GPUs and are being applied to analyze visual imagery
and sounds in real time. Rapid advances in CNNs that use both supervised
and unsupervised learning to train the models are improving accuracy. A
Deep Learning approach in which the base layers of the model are built
upon datasets of known images and sounds (supervised learning) and later
layers rely on unclassified images or sounds (unsupervised learning)
dramatically improves the flexibility of CNNs in perceiving novel
stimuli. The potential to have autonomous sensors gathering
biodiversity data in the same way personal weather stations gather
atmospheric information is close at hand.
Mobile Phones: The phone is the most widely used information appliance
in the world. No device is on the near horizon to challenge this
platform, for several key reasons. First, network access is ubiquitous
in many parts of the world. Second, batteries are improving by about 20%
annually, allowing for more functionality. Third, app development is a
growing industry with significant investment in specializing apps for
machine-learning. While GPUs are already running on phones for video
streaming, there is much optimism that reduced or approximate Deep
Learning models will operate on phones. These models are already working
in the lab; the biggest hurdle is power consumption, so developing
energy-efficient applications and algorithms to run complicated AI
processes will be important. It is just a matter of time before industry
will have AI functionality on phones.
These rapid improvements in computing and mobile phone technologies have
huge implications for biodiversity monitoring, conservation science, and
understanding ecological systems. Computing: AI processing of video
imagery or acoustic streams creates the potential to deploy autonomous
sensors in the environment that will be able to detect and classify
organisms to species. Further, AI processing of Earth spectral imagery
has the potential to provide finer-grained classification of habitats,
which is essential in developing fine-scale models of species
distributions over broad spatial and temporal extents. Mobile Phones:
increased computing functionality and more efficient batteries will
allow applications to be developed that will improve an individual's
perception of the world. Already, the AI functionality of Merlin improves
a birder's ability to accurately identify a bird. Linking this
functionality to sensor devices like specialized glasses, binoculars, or
listening devices will help an individual detect and classify objects in
the environment.
In conclusion, computing technology is advancing at a rapid rate and
soon autonomous sensors placed strategically in the environment will
augment the species occurrence data gathered by humans. The mobile phone
in everyone's pocket should be thought of strategically, in how to
connect people to the environment and improve their ability to gather
meaningful biodiversity information.
Abstract
Reliable plant species identification from seeds is intrinsically
difficult due to the scarcity of features and because it requires
specialized expertise that is becoming increasingly rare, as the number
of field plant taxonomists is diminishing (Bacher 2012, Haas and Häuser
2005). On the other hand, seed identification is relevant in science
domains such as plant community ecology, archaeology, and
paleoclimatology. In addition, economic activities such as agriculture
require seed identification to assess the weed species contained in
"soil seed banks" (Colbach 2014), enabling targeted treatments before
they become a problem.
In this work, we explore and evaluate several approaches by using
different training image sets with various requisites and assessing
their performance with test datasets of different sources.
The core training dataset is provided by the Anthos project (Castroviejo
et al. 2017) as a subset of its image collection. It consists of nearly
1,000 images of seeds identified by experts.
As the identification algorithm, we will use state-of-the-art convolutional
neural networks for image classification (He et al. 2016). The framework
is fully written in Python using the TensorFlow (Abadi et al. 2016)
module for deep learning.
Abstract
Automated identification of plants and animals has improved considerably
in the last few years, in particular thanks to the recent advances in
deep learning. In order to evaluate the performance of automated plant
identification technologies in a sustainable and repeatable way, a
dedicated system-oriented benchmark was setup in 2011 in the context of
ImageCLEF (Goëau et al. 2011). Each year, since that time, several
research groups participated in this large collaborative evaluation by
benchmarking their image-based plant identification systems. In 2014,
the LifeCLEF research platform (Joly et al. 2014) was created in the
continuity of this effort so as to enlarge the evaluated challenges by
considering birds and fishes in addition to plants, and audio and video
contents in addition to images.
The 2017 edition of the LifeCLEF plant identification challenge (Joly
et al. 2017) is an important milestone towards automated plant
identification systems working at the scale of continental floras, with
10,000 plant species living mainly in Europe and North America
illustrated by a total of 1.1M images. Nowadays, such ambitious systems
are enabled thanks to the conjunction of the dazzling recent progress in
image classification with deep learning and several outstanding
international initiatives, aggregating the visual knowledge on plant
species coming from the main national botanical institutes. The
PlantCLEF plant challenge that we propose to present at this workshop
aimed at evaluating to what extent a large, noisy training dataset
collected through the web (and thus containing many labelling errors)
can compete with a smaller but trusted training dataset checked by
experts. To fairly compare both training strategies, the test dataset
was created from a third data source, the Pl@ntNet (Joly et al. 2015)
mobile application that collects millions of plant image queries all
over the world.
Due to the good results obtained at the 2017 edition of the LifeCLEF
plant identification challenge, the next big question is how far such
automated systems are from human expertise. Indeed, even the best
experts are sometimes confused and/or disagree with each other when
validating images of living organisms. A multimedia record actually
contains only partial information that is usually not sufficient to
determine the correct species with certainty. Quantifying this uncertainty
and comparing it to the performance of automated systems is of high
interest for both computer scientists and expert naturalists. This work
reports an experimental study following this idea in the plant domain.
In total, 9 deep-learning systems implemented by 3 different research
teams were evaluated with regard to 9 expert botanists of the French
flora. The main outcome of this work is that the performance of
state-of-the-art deep learning models is now close to the most advanced
human expertise. This shows that automated plant identification systems
are now mature enough for several routine tasks, and can offer very
promising tools for autonomous ecological surveillance systems.
Abstract
The fast and accurate identification of forest species is critical to
support their sustainable management, to combat illegal logging, and
ultimately to conserve them. Traditionally, the anatomical
identification of forest species is a manual process that requires a
human expert with a high level of knowledge to observe and differentiate
certain anatomical structures present in a wood sample (Wiedenhoeft
2011).
In recent years, deep learning techniques have drastically improved the
state of the art in many areas such as speech recognition, visual object
recognition, and image and music information retrieval, among others
(LeCun et al. 2015). In the context of the automatic identification of
plants, these techniques have recently been applied with great success
(Carranza-Rojas et al. 2017) and even mobile apps such as Pl@ntNet
have been developed to identify a species from images captured
on-the-fly (Joly et al. 2014). In contrast to conventional machine
learning techniques, deep learning techniques extract and learn by
themselves the relevant features from large datasets.
One of the main limitations for the application of deep learning
techniques to forest species identification is the lack of comprehensive
datasets for the training and testing of convolutional neural network
(CNN) models. For this work, we used a dataset developed at the Federal
University of Parana (UFPR) in Curitiba, Brazil, that comprises 2,939
uncompressed JPG images with a resolution of 3,264 x 2,448 pixels. It
includes 41 different forest species of the Brazilian flora that were
cataloged by the Laboratory of Wood Anatomy at UFPR (Paula Filho et al.
2014). Due to the lack of comprehensive datasets worldwide, this has
become a benchmark dataset in previous research (Paula Filho et al.
2014, Hafemann et al. 2014).
In this work, we propose and demonstrate the power of deep CNNs to
identify forest species based on macroscopic images. We use a
pre-trained model which is built from the ResNet50 model and uses
weights pre-trained on ImageNet. We apply fine-tuning by first
truncating the top layer (softmax layer) of the pre-trained network and
replacing it with a new softmax layer. Then we retrain the model
with the dataset of macroscopic images of species of the Brazilian flora
used by Hafemann et al. (2014) and Paula Filho et al. (2014).
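A minimal TensorFlow/Keras sketch of this fine-tuning procedure (dropping the original softmax layer and attaching a new one over the 41 species) is shown below; the directory layout, image size and training settings are illustrative assumptions, and input preprocessing matching ResNet50 is omitted for brevity:

```python
# A minimal fine-tuning sketch: ImageNet-pretrained ResNet50 backbone, new
# softmax layer for 41 forest species, first-pass training of the top layer
# only. Paths and hyperparameters are placeholders.
import tensorflow as tf

NUM_SPECIES = 41
IMAGE_SIZE = (224, 224)

# Pre-trained backbone without its original classification (softmax) layer.
backbone = tf.keras.applications.ResNet50(
    include_top=False, weights="imagenet", pooling="avg",
    input_shape=IMAGE_SIZE + (3,),
)
backbone.trainable = False   # first pass: train only the new top layer

model = tf.keras.Sequential([
    backbone,
    tf.keras.layers.Dense(NUM_SPECIES, activation="softmax"),  # new softmax layer
])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-3),
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# NOTE: ResNet50-style input preprocessing (tf.keras.applications.resnet50.
# preprocess_input) is omitted here for brevity but matters in practice.
train_ds = tf.keras.utils.image_dataset_from_directory(
    "ufpr_macroscopic_images/train",   # placeholder path, one folder per species
    image_size=IMAGE_SIZE, batch_size=32,
)
model.fit(train_ds, epochs=10)
```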
Using the proposed model we achieve a top-1 accuracy of 98%, which is
better than the 95.77% reported by Hafemann et al. (2014) using the same
dataset. In addition, our result is slightly better than the 97.77%
reported by Paula Filho et al. (2014), which was obtained by combining
several conventional computer vision techniques.
Abstract
Costa Rica is one of the countries with highest species biodiversity
density in the world. More than 2,000 tree species have already been
identified, many of which are used in the building, furniture, and
packaging industries (Grayum et al. 2003). This rich diversity makes the
correct identification of tree species very difficult. As a result, it
is common to see in the national market that species are commercialized
with mistaken identifications, which makes quality control particularly
challenging. In addition, because 90 timber tree species have been
classified as "threatened" in Costa Rica, correct identifications are
indispensable for law-enforcement.
The traditional system for tree species identification is based on macro
and microscopic evaluations of the anatomy of the wood. It entails
assessing anatomical features such as patterns of vessels, parenchymas,
and fibers. Typically, 7.7 x 10 cm wood cuts are used to
identify the tree species (Pan and Kudo 2011, Yusof et al. 2013).
However, assessing these features is extremely difficult for taxonomists
because properties of the wood can vary considerably due to
environmental conditions and intra-specific genetic variability.
Deep learning techniques have recently been used to identify plant
species (Carranza-Rojas et al. 2017a, Carranza-Rojas et al. 2017b) and
are potentially useful to detect subtle differences in patterns of
vessels, parenchyma, and other anatomical features of wood. However, it
is necessary to have a large collection of macroscopic photographs of
individuals from various parts of the country (Pan and Kudo 2011). As a
first step in the application of deep learning techniques, we have
defined a formal, standard protocol for collecting wood samples,
physically processing them, taking pictures, performing data
augmentation, and using metadata to provide the primary data necessary
for deep learning applications. Unlike traditional xylotheque sampling
methods that destroy trees or use wood from fallen trees, we propose a
method that extracts small size samples with sufficient quality for
anatomical characterization but does not affect the growth and survival
of the individual.
This study has been developed in three permanent forest plots in Costa
Rica, all of which are sites with historical growth data over the last
20 years. We have so far evaluated 40 species (10 individuals per
species) with diameters greater than 20 cm. From each individual, a
cylindrical sample of 12 mm diameter and 7.5 cm in length was extracted
with a cordless drill. Each sample is then cut into five 8 x 8 x 8 mm
cubes and further processed to result in curated xylotheque samples, a
dataset with all relevant metadata and original images, and a dataset
with images obtained by performing data augmentation on the original
images.
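As a hedged illustration of the data augmentation step in this protocol (the actual pipeline may use additional transformations such as crops or colour jitter; file paths below are placeholders), simple flipped and rotated variants of each macroscopic image can be generated as follows:

```python
# A minimal sketch of generating flipped/rotated variants of one wood image
# so the training dataset grows beyond the set of physical samples.
from PIL import Image, ImageOps

def augment(image_path, output_prefix):
    """Write simple flipped/rotated variants of one image to disk."""
    image = Image.open(image_path)
    variants = {
        "mirror": ImageOps.mirror(image),      # horizontal flip
        "flip": ImageOps.flip(image),          # vertical flip
        "rot90": image.rotate(90, expand=True),
        "rot180": image.rotate(180, expand=True),
    }
    for name, variant in variants.items():
        variant.save(f"{output_prefix}_{name}.jpg")
    return sorted(variants)

# Example (placeholder paths; the output directory must exist):
# augment("wood_sample_0001.jpg", "augmented/wood_sample_0001")
```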
Abstract
As a child, I loved exhibits at the museum. As an adult conservation
biologist, entering the back rooms of the museum to view the collections
is even more remarkable. I have begun to realise the scope of what might
be held in museum collections, and to consider what these specimens,
artefacts, taonga (treasure) might tell us. Using examples from my work
on insects, birds and kahukurii (dogskin cloaks), and analyses from
morphometrics to isotopes, I will show how sampling from museum
collections can add layers of richness and complexity to research, with
the added dimensions of space, time, and connection to communities.
Finally, I'll discuss some of the ethics and understandings that guide
my work with museum collections, and what it means to be part of
collaborative partnerships of discovery with museum curators and
communities.
Abstract
Natural history collections are essential for understanding the world's
biodiversity and drive research in taxonomy, systematics, ecology and
biosecurity. One of the biggest challenges faced is the decline in new
taxonomists and in public interest in collections-based research, which is
alarming considering that an estimated 70% of the world's species are
yet to be formally described.
Science communication combines public relations with the dissemination
of scientific knowledge and offers many benefits to promoting natural
history collections to a wide audience. For example, social media has
revolutionised the way collections and their staff communicate with the
public in real time, and can attract more visitors to collection
exhibits and new students interested in natural history. Although not
everyone is born a natural science communicator, institutions can
encourage and provide training for their staff to become engaging
spokespeople skilled in social media and public speaking, including
television, radio and/or print media. By embracing science
communication, natural history collections can influence their target
audiences in a positive and meaningful way, raise the profile of their
institution, encourage respect for biodiversity, promote their events
and research outputs, seek philanthropic donations, connect with other
researchers or industry leaders, and most importantly, inspire the next
generation of natural historians.
Abstract
Since 2010, the Canterbury region on the eastern coast of New Zealand's
South Island has experienced more than 14,000 earthquakes. This
presentation begins by considering the immediate impact of these seismic
events on Canterbury Museum; how were its buildings, its collections,
its team and its communities affected? Within the first weeks and
months, what processes were put in place to manage the collections and
to what extent was the Museum's team able to undertake work to ensure
the institution remained relevant during a national disaster? With a
distance of almost eight years since the first major earthquake, this
presentation reflects on some of the lessons learnt about the realities
of planning for, and responding to, disaster and the impact of a
continuing series of earthquakes on the concept of 'business as usual'.
Abstract
Taxonomic work is slow and time consuming. Alarm bells have rung for
years about the need to go faster, the need to attract and train new
taxonomic workers, and the need to convince other branches of science
that taxonomic work is vital. Morphological taxonomy is either being
overrun or augmented -- depending on your perspective -- by genomics,
artificial intelligence, new imaging methods and species-related data
from other branches of science.
Ecology is one such branch of science, where defining, documenting and
managing information about species traits has emerged as one of the most
significant problems in the discipline. Traits have been recorded for
aeons, but the resulting data has largely been insulated within cliques.
How do we integrate these data and make them available in a form that
will help to address significant issues about our environment? The
'speed bumps' on the route to a useful solution may be more social than
technical.
Cross-disciplinary collaboration is required to address the big
questions in biodiversity research today, and it will need to extend
beyond taxonomy and ecology to other disciplines, such as pharmacology
and material science. As Harry Truman said, and John LaSalle often
quoted, "It is amazing what you can accomplish if you do not care who
gets the credit".
We are challenged to understand and answer the key questions about the
world on which we all depend. What are the challenges and the
opportunities to accelerate biodiversity discovery and documentation?
Abstract
Standards set up by Biodiversity Information Standards (TDWG, formerly
the Taxonomic Databases Working Group), initially developed as a way to
share taxonomic data, greatly facilitated the establishment of the Global
Biodiversity Information Facility (GBIF) as the largest index to
digitally-accessible primary biodiversity information records (PBR) held
by many institutions around the world. The level of detail and coverage
of the body of standards that later became the Darwin Core terms enabled
increasingly precise retrieval of relevant records useful for increased
digitally-accessible knowledge (DAK) which, in turn, may have helped to
answer ecologically relevant questions.
After more than a decade of data accrual and release, an increasing
number of papers and reports are citing GBIF either as a source of data
or as a pointer to the original datasets. GBIF has curated a list of
over 5,000 citations that were examined for content and tagged with
additional keywords describing that content. The list now provides a
window on what users want to accomplish using such DAK.
We performed a preliminary word-frequency analysis of this literature,
which refers to GBIF as a resource, starting with the titles. Through a
standardization and mapping of terms, we examined how the
facility-enabled data seem to have been used by scientists and other
practitioners through time: what concepts/issues are pervasive, which
taxon groups are mostly addressed, and whether data concentrate around
specific geographical or biogeographical regions. We hoped to cast light
on which types of ecological problems the community believes are
amenable to study through the judicious use of this data commons and
found that, indeed, a few themes were distinctly more frequently
mentioned than others. Among those, generally-perceived issues such as
climate change and its effect on biodiversity at global and regional
scales seemed prevalent. The taxonomic groups were also unevenly
mentioned, with birds and plants being the most frequently named.
However, the entire list of potential subjects that might have used
GBIF-enabled data is now quite wide, showing that the availability of
well-structured data has spawned a widening spectrum of possible use
cases. Among them, some enjoy early and continuous presence (e.g.
species, biodiversity, climate) while others have started to show up
only later, once a critical mass of data seemed to have been attained
(e.g. ecosystems, suitability, endemism). Biodiversity information in
the form of standards-compliant DAK may thus already have become a
commodity enabling insight into an increasingly more complex and diverse
body of science. Paraphrasing Tennyson, more things were wrought by data
than TDWG dreamt of.
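The core of such a word-frequency analysis can be sketched in a few lines of Python; the input file (one title per line), stop-word list and tokenization below are illustrative assumptions, and the actual study also standardized and mapped terms before counting:

```python
# A minimal sketch of counting word frequencies across the titles of papers
# citing GBIF, after removing a small stop-word list.
import re
from collections import Counter

STOP_WORDS = {"the", "of", "and", "in", "a", "for", "on", "to", "with"}

def title_word_frequencies(path):
    """Count non-stop-word tokens across a file containing one title per line."""
    counts = Counter()
    with open(path, encoding="utf-8") as handle:
        for title in handle:
            tokens = re.findall(r"[a-z]+", title.lower())
            counts.update(t for t in tokens if t not in STOP_WORDS)
    return counts

# Example (placeholder file name):
# print(title_word_frequencies("gbif_citation_titles.txt").most_common(20))
```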
Abstract
Agile, interconnected and diverse communities of practice can serve as a
hedge on an uncertain world. We currently live in an era of populist
politics and diminishing government funding, challenging our collective
optimism for the future. However, the communities we build and
contribute to can be prepared and strengthened to address the challenges
ahead. How we choose to operate in this world of less funding is tied to
the collective impacts we all believe we can achieve by working
together. How we choose to work together and structure our communities
matters.
Abstract
Taxidermy made for display is often considered less significant in
museum research collections. This is because historical taxidermy
material often becomes disassociated from key data and, through the
rigours of public display, ends up in poor physical condition.
However, by tracing a specimen's biography as a living animal and
following its transition into a museum afterlife, much can be revealed
about the development of natural history collections and changing
attitudes towards animals.
This presentation will investigate several pieces of taxidermy in the
zoology collection of the Tasmanian Museum and Art Gallery (TMAG)
(http://www.tmag.tas.gov.au/collections_and_research/zoology/collections),
where research has uncovered surprising stories and helped reassess the
significance and cultural value of this material.
An unregistered lion head, identified as animal celebrity John Burns,
tells the story of the golden age of Australian and New Zealand
circuses, changing attitudes around animal ethics in the circus, and the
negotiations between scientific institutions in acquiring exotic
species in the late nineteenth century.
A collection of taxidermied domestic chickens from the 1940s is found to
mark the modernisation of the TMAG public displays in communicating
current research and the development of a dedicated museum education
unit.
The colourful afterlife of these specimens in the museum collection
highlights struggles with storage issues, changes in collecting
priorities and evolution of public display and education at TMAG.
Abstract
In France, a national information system on water withdrawals called
Banque Nationale des Prélèvements en Eau (BNPE) has been set up to
comply with the Water Framework Directive (WFD) and national Law on
Water and Aquatic Environments. The aims are to centralize information
on the volume of water withdrawals and to share it on the website
www.bnpe.eaufrance.fr, where data can both be viewed and exported
without restriction. BNPE shares data in a form that can be used for
water management studies, scientific research, or to assess impacts on
aquatic habitats.
THE BNPE PROJECT SCOPE
The BNPE is a part of the French Water Information System (SIE), set up
to share public data on water and aquatic environments*1. The BNPE
project is managed by the French Biodiversity Agency (AFB) and the
Adour-Garonne Water Agency, and is supervised by the French Ministry in
charge of Environment. Database and related tools were developed with
the French Geological Survey (BRGM).
To achieve its goals, the project mainly reuses information from Water
Agencies, based on taxes collected using the 'taker-payer' principle:
persons who take water from the natural environment have to pay. Data on
water withdrawals disseminated by BNPE can now be reused by land
managers, decision-makers and researchers thanks to a single point of
access to these data for all of France (metropolitan and overseas).
These data are:
Detailed data on water withdrawn: volume of water withdrawn (m³),
geographic coordinates of the water pump, water use (e.g. energy,
irrigation, drinking water supply, industry), and type of water
(groundwater; surface water: river, lake or estuary);
Aggregated data: syntheses available by year, geography, use or type
of water.
In 2018, BNPE shared data from 2008 to 2016.
CHALLENGES OF CENTRALIZATION AND REUSE OF DATA: FEEDBACK FROM THE
PROJECT
The BNPE project faced the challenges of centralization and reuse of
data at a national level by making the data available to everyone. The
reuse of data derived from taxes due to environmental issues is not
easy, even in an open data context. We identified two main issues:
The data standardization issue
The stakeholders of the project set up a dictionary to define*2 common
repositories and a data exchange format. This work was done with the
collaboration of the Sandre*3, the French National Service for Water
Data and Common Repositories Management. However, the definition of the
standard is too broad and producers encounter issues in standardizing
their data. This project shows us the need to define a limited core of
data concepts to share, which are very well defined and cannot be
misinterpreted. BNPE also focuses on the importance of using concepts
that already exist in the producer's information system. Centralization
and enrichment of datasets are two additional steps that need to be
differentiated for a project to succeed.
The challenge of reusing data
The project is confronting issues related to assembling a relevant
dataset of water withdrawals. Data from taxes paid by water takers lack
key environmental information, which limits their use for environmental
studies. For example, only 50% of water withdrawn is linked to a
specific river, lake or groundwater source. Moreover, because current
water use datasets are derived from taxes on withdrawals greater than
7,000 m³ per year, the data are missing for some withdrawals. AFB is
studying additional data sources to complete the dataset (e.g., local
authorities, crowdsourcing, spatial joining).
Abstract
The European Search Catalogue for Plant Genetic Resources, EURISCO,
provides information about more than 1.9 million accessions of crop
plants and their wild relatives, preserved ex situ by almost 400
institutes in Europe and beyond (Weise et al. 2017). EURISCO, which is
being maintained on behalf of the European Cooperative Programme for
Plant Genetic Resources, is based on a network of National Inventories
of 43 member countries. It represents an important effort for the
preservation of the world's agrobiological diversity by providing
information about the large genetic diversity kept by the collaborating
institutions.
Besides the classical passport data, in 2016, EURISCO started to
additionally collect phenotypic data about the documented germplasm
accessions. The selection of genebank material for both research and
breeding purposes is increasingly carried out through the selection of
specific phenotypic values, e.g. flowering time or plant height. Thus,
these data are of high importance to users of plant genetic resources
(PGR) since they determine the value of the respective germplasm.
However, because no commonly agreed standards exist within
the genebank community, this kind of data is very difficult to handle.
In this context, the challenges range from synonymous or homonymous
descriptor names and differing rating scales to differing or insufficient
amounts of metadata, hampering both integration and
cross-experiment comparison of data.
The presentation will illustrate the approach followed within EURISCO,
together with the challenges resulting therefrom. Using this as a solid
basis for a discussion about the utilization of this kind of data, the
presentation shall be regarded as a call for cooperation.
Abstract
Trait data in biology can be extracted from text and structured for
reuse within and across taxa. For example, body length is one trait
applicable to many species and "body length is about 170 cm" is one
trait data point for the human species. Trait data can be used in more
detailed analyses to describe species evolution and development
processes, so it has begun to be valued by more than taxonomists. The
EOL (Encyclopedia of Life) TraitBank provides an example of a trait
database.
Current trait databases are in their infancy. Most are based on
morphological data such as shape, color, and structural and sexual
characteristics. In fact, other data, such as behavioral and biological
characteristics, may be included in trait databases in a similar way.
To build a trait database, we constructed a controlled vocabulary list
to record the states of various terms. These terms may exhibit common
characteristics:
They can be grouped as conceptual (subject) and descriptive (delimiter)
terms. For example, in "the shoulder height is 65--70 cm", \"shoulder
height\" is the conceptual term and \"65--70 cm\" is the descriptive
term.
Conceptual terms may be part of an interdependent hierarchical
structure. Examples in morphology, physiology, and conservation or
protection status demonstrate how parts or systems may be broken into
smaller measurable (quantifiable) or enumerable pieces.
Descriptive terms modify or delimit the parameters of conceptual terms.
They may be numerical with distinguishing units, counts, or other
adjectives, or enumerable with special nouns.
Although controlled vocabularies about animals are complex, they can be
normalized using RDF (Resource Description Framework) and OWL (Web
Ontology Language) standards.
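As a minimal sketch of such a normalization (the namespace and term URIs below are hypothetical placeholders, not an existing vocabulary), a conceptual term and its descriptive value can be expressed as RDF triples, for instance with the Python rdflib library:

```python
# Minimal sketch: express "shoulder height is 65-70 cm" as RDF triples.
# The namespace and term URIs are hypothetical placeholders.
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF, RDFS

TRAIT = Namespace("http://example.org/animal-trait/")

g = Graph()
g.bind("trait", TRAIT)

taxon = URIRef("http://example.org/taxon/Capra_hircus")
measurement = URIRef("http://example.org/measurement/m1")

g.add((measurement, RDF.type, TRAIT.Measurement))
g.add((measurement, TRAIT.conceptualTerm, TRAIT.shoulderHeight))    # subject term
g.add((measurement, TRAIT.descriptiveTerm, Literal("65-70 cm")))    # delimiter term
g.add((measurement, TRAIT.measuredTaxon, taxon))
g.add((TRAIT.shoulderHeight, RDFS.subClassOf, TRAIT.BodyDimension))  # hierarchy

print(g.serialize(format="turtle"))
```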
Next, we extract traits from two main types of existing descriptions:
tabular data, which is more easily digested by machine, and
descriptive text, which is complex.
Traits in pure text often need to be extracted manually or by natural
language processing (NLP); sometimes machine learning methods can be
used. Moreover, different human languages may demand different
extraction methods.
Because the number of recordable traits far exceeds what current
collection records capture, the database structure should be optimized
for retrieval speed. For this reason, key-value databases are more
suitable than relational databases for storing trait data. EOL used
Virtuoso, a non-relational database, for TraitBank.
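A simple illustration of the key-value idea (not the actual TraitBank/Virtuoso schema; keys and records are invented) is to store each trait record under a composite taxon-plus-trait key so that look-ups avoid relational joins:

```python
# Minimal sketch of key-value trait storage; keys and records are illustrative only.
trait_store = {}  # in practice a key-value or graph database would be used

def put_trait(taxon, trait, value, unit=None, source=None):
    """Store one trait record under a composite key."""
    key = f"{taxon}|{trait}"
    trait_store.setdefault(key, []).append(
        {"value": value, "unit": unit, "source": source}
    )

def get_trait(taxon, trait):
    """Retrieve all records for one taxon/trait pair with a single key look-up."""
    return trait_store.get(f"{taxon}|{trait}", [])

put_trait("Homo sapiens", "body length", 170, unit="cm", source="example text")
print(get_trait("Homo sapiens", "body length"))
```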
Using existing, mature ontology tools and standards, we can construct
a preliminary workflow for animal trait data, but some tools and
specifications for data analysis and use must await further data
accumulation.
Abstract
The South African National Biodiversity Institute (SANBI) has initiated
the development of the National Biodiversity Information System to
provide access to integrated South African biodiversity information. The
aim of the project is to centrally manage all biodiversity information
to support researchers, conservationists, policy and decision-makers in
achieving their goals, support planners in making sensible decisions,
and help SANBI understand the anthropogenic impact on biodiversity. The
project is set to deliver a centralised web-based infrastructure to
capture, aggregate, manage, discover, analyse and visualise biodiversity
data and associated information through a suite of tools and spatial
layers. The infrastructure is built on a Microsoft technology stack with
a microservices component architecture
(http://microservices.io/patterns/microservices.html), which is vital
for building the application out of small collaborating services that
integrate into the wider enterprise system.
SANBI conducted a review of the data holdings of the individual herbaria
and museums in South Africa. The intention is to have a federated
approach to data management, exposing what is available as a collection
but ensuring that each individual natural science collection has full
ownership and management control over their data within a defined
framework and governed by internationally accepted data policies and
standards. The presentation highlights the opportunities and unexpected
difficulties with developing a national botanical and zoological
collections data management service in South Africa.
Abstract
The long-term lifecycle management of natural history data requires
careful planning. Elements that have a significant impact on this
planning include data quality, domain-specific requirements, and data
interoperability. Standards like Darwin Core (Wieczorek et al. 2012) are
built to be flexible, allowing institutions to share data quickly
without extensive modification of internal information management
processes. However, there is often limited consensus on the exact
meanings and use of key terms by various domains. If we want to increase
the quality, interoperability, and long-term health of collections data,
we must reassess how we record specimen data, paying special attention
to the terms we use and how we use them.
Here we share results from efforts to evaluate current data sharing
practices for data from paleontology collections. By analysing the use
of terms in Darwin Core, we are constructing a framework for how
paleontological data is shared, how terms are used across many
institutions, and where there are inconsistencies or a lack of terms to
support a fully robust record. We have also used data quality assessment
and validation tools developed by organizations like the Global
Biodiversity Information Facility (GBIF) to provide insight and testing
for term-specific requirements addressing quality on a more global scale
than might be the focus of any more locally driven data quality
assessment.
These assessments can guide the development of a new framework for
sharing paleontological data, enabling the community to collaborate and
find solutions to increase quality and interoperability. Additionally,
individual institutions can utilize the framework to enhance long-term
care of digital assets with global participation in mind.
Abstract
Since the Nagoya Protocol on Access to genetic resources and Benefit
Sharing (ABS) came into force in 2014, the conservation and safeguarding
of national biodiversity have been emphasized internationally. The
Government of South Korea is making significant efforts to integrate and
manage information pertaining to biological resources in line with this
global trend. However, connecting and sharing biodiversity data presents
certain challenges because the existing databases and information
systems are operated using different standards.
In the present study, we established an integrated management system for
freshwater biodiversity information, the Freshwater Biodiversity
Platform (FBP), to support the conservation and sustainable use of
biodiversity. This platform allows the management of various types of
biodiversity data, such as occurrences, habitats and genetics, for
freshwater species inhabiting South Korea. The data fields are based on
a global biodiversity data standard, Darwin Core, and national
biodiversity standards of South Korea in order to share our data more
efficiently, both nationally and internationally. It is important to
note that the platform deals with information related to the utilization
of biological resources as well as information representing the national
biodiversity. We have collected bibliographical data, such as papers and
patents, from databases, including information on the use of biological
resources. The data have been refined by applying a national species
list of South Korea and ontology terms from the Medical Subject Headings
(MeSH) to compile valuable information for biological industries.
Furthermore, our platform is open
source and is compatible with multiple language packs to facilitate the
availability of biodiversity data for other countries and institutions.
Currently, the Freshwater Biodiversity Platform is being used to collect
and standardize various types of existing freshwater biodiversity data
to build foundations for data management. Based on these data, we will
improve the platform by adding new systems that can analyze and release
data for public access. This platform will provide integrated
information on freshwater species from the Korean Peninsula to the world
and contribute to the conservation and sustainable use of biological
resources.
Abstract
Freshwater biodiversity is critically understudied in Rwanda, and to
date there has not been an efficient mechanism to integrate freshwater
biodiversity information or make it accessible to decision-makers,
researchers, private sector or communities, where it is needed for
planning, management and the implementation of the National Biodiversity
Strategy and Action Plan (NBSAP). A framework to capture and distribute
freshwater biodiversity data is crucial to understanding how economic
transformation and environmental change is affecting freshwater
biodiversity and resulting ecosystem services. To optimize conservation
efforts for freshwater ecosystems, detailed information is needed
regarding current and historical species distributions and abundances
across the landscape. From these data, specific conservation concerns
can be identified, analyzed and prioritized.
The purpose of this project is to establish and implement a long-term
strategy for freshwater biodiversity data mobilization, sharing,
processing and reporting in Rwanda. The expected outcome of the project
is to support the mandates of the Rwanda Environment Management
Authority (REMA), the national agency in charge of environmental
monitoring and the implementation of Rwanda's NBSAP, and the Center of
Excellence in Biodiversity and Natural Resources Management (CoEB). The
project also aligns with the mission of the Albertine Rift Conservation
Society (ARCOS) to enhance sustainable management of natural resources
in the Albertine rift region. Specifically, organizational structure,
technology platforms, and workflows for the biodiversity data capture
and mobilization are enhanced to promote data availability and
accessibility to improve Rwanda's NBSAP and support other
decision-making processes. The project is enhancing the capacity of
technical staff from relevant government and non-government institutions
in biodiversity informatics, strengthening the capacity of CoEB to
achieve its mission as the Rwandan national biodiversity knowledge
management center. Twelve institutions have been identified as data
holders and the digitization of these data using Darwin Core standards
is in progress, as well as data cleaning for the data publication
through the ARCOS Biodiversity Information System
(http://arbmis.arcosnetwork.org/). The release of the first national
State of Freshwater Biodiversity Report is the next step. CoEB is a
registered publisher to the Global Biodiversity Information Facility
(GBIF) and holds an Integrated Publishing Toolkit (IPT) account on the
ARCOS portal. This project was developed for the African Biodiversity
Challenge, a competition coordinated by the South African National
Biodiversity Institute (SANBI) and funded by the JRS Biodiversity
Foundation which supports on-going efforts to enhance the biodiversity
information management activities of the GBIF Africa network. This
project also aligns with SANBI's Regional Engagement Strategy, and
endeavors to strengthen both emerging biodiversity informatics networks
and data management capacity on the continent in support of sustainable
development.
Abstract
As a national center for managing biological data, the Korean
Bioinformation Center (KOBIC) provides capabilities and resources to
manage and standardize the explosively growing amount of biological data
from national Research and Development grants by developing a systematic
and integrative approach. The biological data include biological
material resources, genome data, and biodiversity data, such as
observation, collection, taxonomy, character, and genome information
for living organisms. The Korean government enacted legislation for the
collection, management and utilization of biological data in 2009 and,
as a follow-up, KOBIC has undertaken the mission to collect and
integrate the scattered biological data in Korea. We first created a
biological data format for exchanging data between government agencies.
After that, the Korean Bio-resource Information System (KOBIS) was
developed. KOBIS is an integrated information system for efficient
acquisition and systematic management of biological data. KOBIS contains
more than 109,000 species and 12.1 million occurrence records from 107
collaborating institutions across four ministries. KOBIS establishes a
catalog of scientific names by linking species information across
ministries. Its main function is integrated information search, with
results that include character information, bibliographic information,
electronic books, DNA classification, gene information, photographs, and
research achievements. We will continue to focus our efforts on the
management of KOBIS to facilitate information sharing, distribution, and
services for mining biological data.
KOBIS is available at http://www.kobis.re.kr.
Abstract
Primary biodiversity data, or occurrence data, are being produced at an
increasing rate and are used in numerous studies (Hampton et al. 2013,
La Salle et al. 2016). This data avalanche is a remarkable opportunity
but it comes with hurdles. First, available software solutions are rare
for very large datasets and those solutions often require significant
computer skills (Gaiji et al. 2013), while most biologists are not
formally trained in bioinformatics (List et al. 2017). Second, large
datasets are heterogeneous because they come from different producers
and they can contain erroneous data (Gaiji et al. 2013). Hence, they
need to be curated. In this context, we developed a biodiversity
occurrence curator designed to quickly handle large amounts of data
through a simple interface: the Darwin Core Spatial Processor (DwCSP).
DwCSP does not require the installation or use of third-party software
and has a simple graphical user interface that requires no computer
knowledge. DwCSP allows for the data enrichment of biodiversity
occurrences and also ensures data quality through outlier detection. For
example, the software can enrich a tabulated occurrence file (Darwin
Core, for instance) with spatial data from polygon files (e.g., Esri
shapefiles) or raster files (GeoTIFF). The speed of the enrichment
procedures is ensured through multithreading and optimized spatial
access methods (R-Tree indexes). DwCSP can also detect and tag outliers
based on their geographic coordinates or environmental variables. The
first type of outlier detection uses a computed distance between the
occurrence and its nearest neighbors, whereas the second type uses a
Mahalanobis distance (Mahalanobis 1936). One hundred thousand
occurrences can be processed by DwCSP in less than 20 minutes and
another test on forty million occurrences was completed in a few days on
a recent personal computer. DwCSP has an English interface and
documentation and will be available as a stand-alone Java Archive (JAR)
executable that runs on any computer with a Java environment
(version 1.8 or later).
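The environmental outlier test can be illustrated with a short sketch (not DwCSP's Java code; data and threshold below are invented): records whose Mahalanobis distance to the multivariate mean of the environmental variables exceeds a chi-square based cut-off are tagged as potential outliers.

```python
# Minimal sketch of Mahalanobis-distance outlier tagging on environmental variables.
# Data and threshold are illustrative; DwCSP itself is implemented in Java.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
env = rng.normal(size=(1000, 3))           # e.g. temperature, rainfall, elevation
env[0] = [8.0, -7.5, 9.0]                  # an artificial extreme record

mean = env.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(env, rowvar=False))
diff = env - mean
# Squared Mahalanobis distance of every record to the multivariate mean
d2 = np.einsum("ij,jk,ik->i", diff, cov_inv, diff)

# Under approximate multivariate normality, d2 follows a chi-square distribution
# with p degrees of freedom; records beyond a high quantile are tagged.
threshold = stats.chi2.ppf(0.999, df=env.shape[1])
outliers = np.flatnonzero(d2 > threshold)
print(f"Tagged {outliers.size} potential outliers, e.g. record indices {outliers[:5]}")
```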
Abstract
Museum-preserved samples are attracting attention as a rich resource for
DNA studies. Museomics aims to link DNA sequence data back to the museum
collection. Molecular biologists are interested in morphological
information including body size, pattern, and colors, and sequence data
have also become essential for biodiversity research as evidence for
species identification and phylogenetic analysis.
For more than 30 years, molecular data, such as DNA and protein
sequences, have been captured by the DNA Data Bank of Japan (DDBJ), the
European Bioinformatics Institute (EBI, UK), and the National Center for
Biotechnology Information (NCBI, US) under the International Nucleotide
Sequence Database Collaboration (INSDC). INSDC provides collected
molecular data to researchers as public databases including GenBank for
DNA sequences and Gene Expression Omnibus (GEO) for gene expression.
These three institutes synchronize archived data and publish all data on
an FTP (File Transfer Protocol) site so that it is available for big
data analysis.
In recent years, high-throughput sequencing technology, also called
next-generation sequencing (NGS) technology, has been widely utilized
for molecular biology including genomics, transcriptomics, and
metagenomics. Biodiversity researchers also focus on NGS data for DNA
barcoding and phylogenetic analysis as well as molecular biology.
Additionally, a portable NGS platform, MinION (Oxford Nanopore
Technologies), has been launched, enabling biodiversity researchers to
perform DNA sequencing in the field. Along with GenBank and GEO data,
INSDC accepts NGS data and provides a public primary database, called
the Sequence Read Archive (SRA). As of March 2018, 6.4 petabases of NGS
data were freely available under more than 130,000 projects in SRA. The
Database Center for Life Science (DBCLS) provides a search engine for
public NGS data, called DBCLS SRA (http://sra.dbcls.jp/) in
collaboration with DDBJ. SRA contains not only raw sequence reads and
processed data mapped to a genome, but also information on the
experimental design, including project types, sequencing platforms, and
sample species. Researchers can use this data to refine their search
results. We also linked publications referring to NGS data to the
corresponding SRA entries.
The mission of DBCLS is to accelerate the accessibility of life science
data. Collected data used to be described in Excel-readable tabular
formats, but these formats are difficult to merge with other databases
because of the ambiguity of labels. To overcome this difficulty, we
recently integrated life science data with Semantic Web technology. We
have held annual meetings, called BioHackathons, to integrate life
science data, in which researchers from all over the world participate.
The UniProt and Ensembl databases currently provide an RDF (Resource
Description Framework) version of curated protein and genome data,
respectively. In the biodiversity domain, there are many databases
such as GBIF (The Global Biodiversity Information Facility) for species
occurrence records, EoL (The Encyclopedia of Life) as a knowledge base
of all species, and BoL (The Barcode of Life) for DNA barcoding data.
RDF is utilized to describe Darwin Core-based data so that
bioinformatics and biodiversity informatics researchers can technically
merge both types of data. Currently, specimen data and DNA sequence data
are not linked. Museomics starts with cross-referencing specimen and
sequence IDs and with making data sources comply with existing
standards.
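As a minimal, hypothetical sketch of that cross-referencing step (all identifiers and column names below are invented, not an existing standard mapping), a specimen catalogue and a sequence-accession table can be joined on a shared voucher identifier:

```python
# Minimal sketch: join museum specimen records with sequence records on a shared
# voucher identifier. All identifiers and column names are hypothetical.
import pandas as pd

specimens = pd.DataFrame({
    "catalogNumber": ["MUSEUM-12345", "MUSEUM-67890"],
    "scientificName": ["Apis mellifera", "Bombus ardens"],
})
sequences = pd.DataFrame({
    "insdc_accession": ["LC000001", "LC000002"],
    "specimen_voucher": ["MUSEUM-12345", "MUSEUM-99999"],
})

linked = specimens.merge(
    sequences, left_on="catalogNumber", right_on="specimen_voucher", how="left"
)
# Specimens without a matching accession are candidates for future sequencing
# or for curatorial follow-up on identifier mismatches.
print(linked[["catalogNumber", "scientificName", "insdc_accession"]])
```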
Abstract
The Eastern Highlands of Zimbabwe is a biodiversity hotspot that forms
part of the Eastern Afromontane region, which has seen an increase in
human activities such as agriculture, illegal mining, and introduction
of invasive species. These anthropogenic activities have had negative
environmental consequences including land degradation and water
pollution, which have negatively impacted the quality of aquatic
habitats and biodiversity in the region. The region harbours several
freshwater species of conservation interest whose numbers and
distribution are little known. We also do not know the impacts of the
ongoing human activities and threats on the local wetland biodiversity
and the integrity of the ecosystem in the region. The relevant data on
wetland biodiversity from previous studies and surveys are also not
readily available to guide policies and conservation efforts in this
region.
With the aid of the Biodiversity Information for Development (BID)
program sponsored by the Global Biodiversity Information Facility (GBIF)
and the European Union (EU), a project titled 'Freshwater Biodiversity
of the Eastern Highlands of Zimbabwe: Assessing Conservation Priorities
Using Primary Species-Occurrence Data' has mobilized and digitized over
2,000 occurrence records on freshwater biodiversity, with a focus on
fish, invertebrates, amphibians and bird species in the region, since
October 2017. The project also makes use of biodiversity informatics
tools such as ecological niche modelling, to identify the important
sites for conservation of the freshwater biodiversity in this region.
The outputs will help to show policy makers, wildlife managers,
researchers and conservationists where to target resources and
conservation efforts. This will also help protect the biodiversity that
still exists in the unprotected wetlands of the Eastern Highlands of
Zimbabwe and that could be lost to human activities such as clearing for
agriculture.
Abstract
Recognizing the abundance and accumulation of information and data
on biodiversity that are still poorly exploited and even unfunded, the
REBIOMA project (Madagascar Biodiversity Networking), in collaboration
with partners, has developed an online data portal to provide easy
access to information and critical data, to support conservation
planning, and to support the expansion of scientific and professional
activities related to Madagascar's biodiversity.
The mission of the REBIOMA data portal is to serve quality-labeled,
up-to-date species occurrence data and environmental niche models for
Madagascar's flora and fauna, both marine and terrestrial. REBIOMA is a
project of the Wildlife Conservation Society Madagascar and the
University of California, Berkeley.
REBIOMA serves species occurrence data for marine and terrestrial
regions of Madagascar. Following upload, data is automatically validated
against a geographic mask and a taxonomic authority. Data providers can
decide whether their data will be public, private, or shared only with
selected collaborators. Data reviewers can add quality labels to
individual records, allowing selection of data for modeling and
conservation assessments according to quality. Portal users can query
data in numerous ways.
One of the key features of the REBIOMA web portal is its support for
species distribution models, created from taxonomically valid and
quality-reviewed occurrence data. Species distribution models are
produced for species for which there are at least eight, reliably
reviewed, non-duplicate (per grid cell) records. Maximum Entropy
Modeling (MaxEnt for short) is used to produce continuous distribution
models from these occurrence records and environmental data for
different eras: past (1950), current (2000), and future (2080). The
result is generally interpreted as a prediction of habitat suitability.
Results for each model are available on the portal and ready for
download as ASCII and HTML files.
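A simplified sketch of the record filter described above (grid size, file name, and column names are assumptions, not the portal's actual implementation): occurrences are deduplicated per grid cell, and only species with at least eight such reviewed records move on to modelling.

```python
# Minimal sketch: keep one record per species per grid cell, then retain only
# species with >= 8 deduplicated, reviewed records. Column names are assumptions.
import pandas as pd

CELL = 0.05  # hypothetical grid-cell size in decimal degrees

occ = pd.read_csv("occurrences.csv")           # hypothetical reviewed occurrence file
occ = occ[occ["qualityLabel"] == "reviewed"]   # keep quality-reviewed records only

# Assign each record to a grid cell and drop duplicates within a cell
occ["cell_x"] = (occ["decimalLongitude"] // CELL).astype(int)
occ["cell_y"] = (occ["decimalLatitude"] // CELL).astype(int)
dedup = occ.drop_duplicates(subset=["scientificName", "cell_x", "cell_y"])

# Only species with at least eight deduplicated records are candidates for MaxEnt
counts = dedup.groupby("scientificName").size()
modelable = counts[counts >= 8].index.tolist()
print(f"{len(modelable)} species have enough records for distribution modelling")
```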
The REBIOMA Data Portal address is http://data.rebioma.net, or visit
http://www.rebioma.net for more general information about the entire
REBIOMA project.
Abstract
Herbaria in Taiwan face critical data challenges:
Different taxonomic views prevent data exchange;
There is a lack of development practices to keep up with standards and
technological advances;
Data are disconnected from researchers' perspectives, so it is difficult
to demonstrate the value of taxonomists' activities, even though a few
herbaria have their specimen catalogues partially exposed in Darwin Core.
In consultation with the Herbarium of the Taiwan Forestry Research
Institute (TAIF), the Herbarium of the National Taiwan University (TAI)
and the Herbarium of the Biodiversity Research Center, Academia Sinica
(HAST), which together host the most important collections of the
vegetation on the island, we have planned the following activities to
address data
challenges:
Investigate a new data model for scientific names that will accommodate
different taxonomic views and create a web service for access to
taxonomic data;
Refactor existing herbarium systems to utilize the aforementioned
service so the three herbaria can share and maintain a standardized name
database;
Create a layer of Application Programming Interface (API) to allow
multiple types of accessing devices;
Conduct behavioral research regarding various personas engaged in the
curatorial workflow;
Create a unified front-end that supports data management, data
discovery, and data analysis activities with user experience
improvements.
To manage these developments at various levels, while maximizing the
contribution of participating parties, it is crucial to use a proven
methodological framework. As the creative industry has been a leader in
solution development, the concept of design thinking and the design
thinking process (Brown and Katz 2009) came onto our radar.
Design thinking is a systematic approach to handling problems and
generating new opportunities (Pal 2016). From requirement capture to
actual implementation, it helps consolidate ideas and identify agreed-on
key priorities by constantly iterating through a series of interactive
divergence and convergence steps, namely the following:
Empathize: A divergent step. We learn about our audience, which in this
case includes curators and visitors of the herbarium systems, about what
they do and how they interact with the system, and collate our findings.
Define: A convergent step. We construct a point of view based on
audience needs.
Ideate: A divergent step. We brainstorm and come up with creative
solutions, which might be novel or based on existing practice.
Prototype: A convergent step. We build representations of the chosen
idea from the previous step.
Test: Use the prototype to test whether the idea works. Then refine from
step 3 if the problems lie with the prototype, or even from step 1 if
the point of view needs to be revisited.
The benefits of adopting this process are:
Instead of "design for you", we "design together", which strengthens the
sense of community and helps communicate what the revision and
refactoring will achieve;
When put in context, increased awareness and understanding of
biodiversity data standards, such as Darwin Core (DwC) and Access to
Biological Collections Data (ABCD);
As we lend the responsibility of process control to an external
facilitator, we are able to focus during each step as a participant.
We illustrate how the planned activities are conducted through these
five iterative steps.
Abstract
GBIF Benin, hosted at the University of Abomey-Calavi, has published
more than 338,000 occurrence records in 87 datasets and checklists. It
has been a Global Biodiversity Information Facility (GBIF) node since
2004 and is a leader in several projects from the Biodiversity
Information for Development (BID) programme.
GBIF facilitates collaboration between nodes at different levels through
its Capacity Enhancement Support Programme (CESP)
(https://www.gbif.org/programme/82219/capacity-enhancement-support-programme).
One of the actions included in the CESP guidelines is called 'Mentoring
activities'. Its main goal is the transfer of knowledge between partners
such as information, technologies, experience, and best practices.
Sharing architecture and development effort is a key way to address some
technical challenges or impediments (hosting, staff turnover, etc.) that
GBIF nodes may face. The Atlas of Living Australia (ALA) team
developed a functionality called 'data hub'. It makes it possible to
create a standalone website with a dedicated occurrence search engine
that searches within a defined subset of data (e.g., a specific genus or
geographic area).
In 2017, GBIF Benin and GBIF France wanted to strengthen their
partnership and started a CESP project. One of the core objectives of
this project is the creation of the Atlas of Living Benin using ALA
modules. GBIF France developers, with the help of the GBIF Benin team,
are in the process of configuring a data hub that will give access to
Beninese data only, while at the same time Atlas of Living France will
give access to French data only. Both data portals will use the same
back end, therefore the same databases. Benin is the first African GBIF
node to implement this kind of infrastructure.
On this poster, we will present the specific architecture of the Atlas
of Living Benin and how we have managed to distinguish data coming from
Benin from data coming from France.
Abstract
The existing web representation of the Flora of North America (FNA)
project needs improvement. Despite being electronically available, it
has little more functionality than its printed counterpart. Over the
past few years, our team has been working diligently to build a new more
effective online presence for the FNA. The main objective is to
capitalize on modern Natural Language Processing (NLP) tools built for
biodiversity data (Explorer of Taxon Concepts or ETC; Cui et al. 2016),
and present the FNA online in both machine and human readable formats.
With machine-comprehensible data, the mobilization and usability of
flora treatments is enhanced and capabilities for data linkage to a
Biodiversity Knowledge Graph (Page 2016) are enabled. For example,
usability of treatments increases when morphological statements are
parsed into finely grained pieces of data using ETC, because these data
can be easily traversed across taxonomic groups to reveal trends.
Additionally, the development of new features in our online FNA is
facilitated by FNA data parsing and processing in ETC, including a
feature to enable users to explore all treatments and illustrations
generated by an author of interest. The current status of the ongoing
project to develop a Semantic MediaWiki (SMW) platform for the FNA is
presented here. New features recently implemented are introduced,
challenges in assembling the Semantic MediaWiki are discussed, and
future opportunities, which include the integration of additional floras
and data sources, are explored. Furthermore, implications of
standardization of taxonomic treatments, which work such as this
entails, will be discussed.
Abstract
In 2015, the global biodiversity information initiatives Biodiversity
Heritage Library (BHL), Barcode of Life Data systems (BoLD), Catalogue
of Life (CoL), Encyclopedia of Life (EOL), and the Global Biodiversity
Information Facility (GBIF) took the first step to work on the idea for
building a single shared authoritative nomenclature and taxonomic
foundation that could be used as a backbone to order and connect
biodiversity data across various domains. At present, the Catalogue of
Life is being used by BHL, BoLD, EOL, and GBIF, but each extend the CoL
with additional data to meet the specific backbone services required.
The goal of the CoL+ project is to innovate the CoL systems by
developing a new information technology infrastructure that includes
both the current Catalogue of Life and a provisional Catalogue of Life
(replacing the current GBIF backbone taxonomy), separates scientific
names and taxonomic concepts with associated unique identifiers, and
provides some (infrastructural) support for taxonomic and nomenclatural
content authorities to finish their work. The project's specific
objectives are to
establish a clearinghouse covering scientific names across all life;
provide a single taxonomic view grounded in the consensus classification
of the Catalogue of Life along with candidate taxonomic sources, show
differences between sources, and provide an avenue for feedback to
content authorities while allowing the broader community to contribute,
and
establish a partnership and governance, allowing a continuing commitment
after the project's end for a clearinghouse infrastructure and its
associated components, including a roadmap for future developments of
the infrastructure.
As a result of the project, we expect to have a shared information space
for names and taxonomy between the Catalogue of Life, nomenclator
content authorities (e.g. IPNI, ZooBank) and several global biodiversity
information initiatives.
Abstract
The 3i World Auchenorrhyncha database (http://dmitriev.speciesfile.org)
is being migrated into TaxonWorks (http://taxonworks.org) and comprises
nomenclatural data for all known Auchenorrhyncha taxa (leafhoppers,
planthoppers, treehoppers, cicadas, spittle bugs). Of all those
scientific names, 8,700 are unique genus-group names (which include
valid genera and subgenera as well as their synonyms). According to the
Rules of Zoological Nomenclature, a properly formed species-group name
when combined with a genus-group name must agree with the latter in
gender if the species-group name is or ends with a Latin or Latinized
adjective or participle. This poses a double challenge for researchers
describing new taxa or citing existing ones. For each species-group
name, knowledge of its part of speech is essential (nouns do not change
their form when associated with different generic names); for each
genus-group name, knowledge of its gender is essential. Every time a
species is transferred from one genus to another, its ending may need to
be transformed to form a proper new scientific name (a binominal name).
In modern-day practice, it is important, when establishing a new name,
to provide information about the etymology of this name and the ways it
should be used in future publications: the grammatical gender for a
genus, and the part of speech for a species. Older names often do not
provide enough information about their etymology to allow proper
construction of scientific names. That is why, in the literature, we can
find numerous cases where a scientific name is not formed in conformity
with the Rules of Nomenclature. An attempt was
made to resolve the etymology of the generic names in Auchenorrhyncha to
unify and clarify nomenclatural issues in this group of insects. In
TaxonWorks, the rules of nomenclature are defined using the NOMEN
ontology (https://github.com/SpeciesFileGroup/nomen).
Abstract
Compilation and retrieval of reliable data on biological interactions is
one of the critical bottlenecks affecting efficiency and statistical
power in testing ecological theories. TaxonWorks, a web-based workbench,
can facilitate such research by enabling the digitization of complex
biological interactions involving multiple species, individuals, and
trophic levels. These data can be further organized into spatial and
temporal axes, and annotated at the level of individual or grouped
interactions (e.g. singularly citing the combined elements of a
tritrophic interaction). The simple, customizable nature of these tools
ultimately reduces the time-consuming steps of gathering, cleaning, and
formatting datasets for subsequent exploration and analysis, while also
improving the asserted semantics.
An example use case is provided with a dataset of associations among
plants, pathogens and insect vectors. The curated data are accessed
through the JSON serving TaxonWorks API (Application Programming
Interface) by an R package. Analysis and visualization of the network
graphs persisted in TaxonWorks are demonstrated using core R
functionality and the igraph package (Csardi and Nepusz 2006).
TaxonWorks is open-source, collaboratively built software available at
http://taxonworks.org.
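Although the workflow described above uses an R package and igraph, the same pattern can be sketched in Python; the endpoint path, parameters, and JSON field names below are hypothetical assumptions, not the documented TaxonWorks API:

```python
# Hypothetical sketch: pull biological-association records as JSON and build a
# directed interaction graph. Endpoint, parameters, and field names are assumptions.
import networkx as nx
import requests

API = "https://example-taxonworks-instance.org/api/v1/biological_associations"
records = requests.get(API, params={"project_token": "TOKEN"}, timeout=30).json()

g = nx.DiGraph()
for rec in records:
    # e.g. a pathogen "infects" a plant, or an insect "vectors" a pathogen
    g.add_edge(rec["subject_name"], rec["object_name"], relation=rec["relationship"])

# Simple exploration: which taxa participate in the most interactions?
top = sorted(g.degree, key=lambda kv: kv[1], reverse=True)[:5]
print("Most connected taxa:", top)
```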
Abstract
As part of the Biodiversity Information System on Nature and Landscapes
(Système d'Informations Nature et Paysages, or SINP), the French
National Natural History Museum has been appointed by the French
ministry in charge of ecology to develop mechanisms for biodiversity
data exchange, especially taxon occurrences (there are also elements on
habitat occurrences, geo-heritage, etc.). Given that there are thousands
of different dataset sources, containing over 42 million records,
such a development raises the question of the underlying data quality.
To add complexity, there can be several layers of quality assurance: one
by the producer of the data, one by a regional node, and another one by
the national node.
Quality issues were addressed by a dedicated working
group, representative of biodiversity stakeholders in France. The
resulting documents focus on core methodology elements that characterize
a data quality process for, in the first instance, taxon occurrences
only. It may be extended to habitats, geology, etc. in the near future.
For scientific validation, two processes are used:
One automated process that uses expertise upstream (automated validation
based on previous databases created through the use of said expertise),
with several criteria such as comparison with a national taxonomic
reference database (TAXREF), and with species reference distributions.
The outcomes of this process will indicate error potential and can be
used to automatically flag data above a certain threshold for the
following process.
A second, manual process, that allows for further scrutiny in order to
reach a conclusive evaluation.
The combination of both processes allows experts to focus on data that
has a higher likelihood of being erroneous, thus saving time and
resources.
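As a highly simplified sketch of the automated step described above, assuming an in-memory extract of reference names and distributions (the data and flag names below are illustrative stand-ins, not TAXREF or the actual INPN implementation):

```python
# Minimal sketch of automated pre-validation: flag records whose name is not in the
# taxonomic reference or whose location falls outside the reference distribution.
# The reference data and flags are illustrative stand-ins only.
taxref_names = {"Lutra lutra", "Canis lupus"}            # extract of accepted names
reference_range = {"Lutra lutra": {"FR-BRE", "FR-NAQ"}}  # regions with known presence

def prevalidate(record):
    """Return a list of flags; an empty list means no automated doubt was raised."""
    flags = []
    if record["scientificName"] not in taxref_names:
        flags.append("name_not_in_reference")
    known = reference_range.get(record["scientificName"], set())
    if known and record["regionCode"] not in known:
        flags.append("outside_reference_distribution")
    return flags

record = {"scientificName": "Lutra lutra", "regionCode": "FR-COR"}
print(prevalidate(record))  # ['outside_reference_distribution'] -> send to experts
```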
One objective of the INPN (Inventaire National du Patrimoine Naturel, or
National Inventory of Natural Heritage), after one or both approaches,
is to have each record assigned a confidence level.
The poster will present the national scientific validation of data in
the SINP. It will show for whom and why it is done, whether the
expertise lies upstream or downstream (manual validation through expert
networks), what documents exist, and which attributes are being
considered for addition to the national standards so as to convey the
information derived from these processes.
Abstract
Web portals are commonly used to expose and share scientific data. They
enable end users to find, organize and obtain data relevant to their
interests. With the continuous growth of data across all science
domains, researchers commonly find themselves overwhelmed as finding,
retrieving and making sense of data becomes increasingly difficult.
Search engines can help find relevant websites, but the short
summaries they provide in results lists often give little indication of
how relevant a website is to specific research interests.
To yield better results, a strategy adopted by Google, Yahoo, Yandex and
Bing involves consuming structured content that they extract from
websites. Towards this end, the schema.org collaborative community
defines vocabularies covering common entities and relationships (e.g.,
events, organizations, creative works) (Guha et al. 2016). Websites can
leverage these vocabularies to embed semantic annotations within web
pages, in the form of markup using standard formats. Search engines, in
turn, exploit semantic markup to enhance the ranking of most relevant
resources while providing more informative and accurate summarization.
Additionally, adding such rich metadata is a step toward making data
FAIR, i.e. Findable, Accessible, Interoperable and Reusable.
Although schema.org encompasses terms related to data repositories,
datasets, citations, events, etc., it lacks specialized terms for
modeling research entities. The Bioschemas community (Garcia et al.
2017) aims to extend schema.org to support markup for Life Sciences
websites. A major pillar lies in reusing types from schema.org as well
as well-adopted domain ontologies, while only proposing a limited set of
new types. The goal is to enable semantic cross-linking between
knowledge graphs extracted from marked-up websites. An overview of the
main types is presented in Fig. 1. Bioschemas also provides profiles
that specify how to describe an entity of some type. For instance, the
protein profile requires a unique identifier, recommends listing
transcribed genes and associated diseases, and points to recommended
terms from the Protein Ontology and Semantic Science Integrated
Ontology.
The success of schema.org lies in its simplicity and the support by
major search engines. By extending schema.org, Bioschemas enables life
sciences research communities to benefit from a lightweight semantic
layer on websites and thus facilitates discoverability and
interoperability across them. From an initial pilot including just a few
bio-types such as proteins and samples, the Bioschemas community has
grown and is now opening up towards other disciplines. The biodiversity
domain is a promising candidate for such further extensions. We can
think of additional profiles to account for biodiversity-related
information. For instance, since taxonomic registers are the backbone of
many web portals and databases, new profiles could describe taxa and
scientific names while reusing well-adopted vocabularies such as Darwin
Core terms (Baskauf et al. 2016) or TDWG ontologies (TDWG Vocabulary
Management Task Group 2013). Fostering the use of such markup by web
portals reporting traits, observations or museum collections could not
only improve information discovery using search engines, but could also
be a key to spur large-scale biodiversity data integration scenarios.
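By way of illustration only, such markup could be embedded in a portal page as JSON-LD; the Taxon profile sketched here is a hypothetical Bioschemas-style extension mixing schema.org-style terms with Darwin Core, not an adopted specification:

```python
# Hypothetical JSON-LD markup for a taxon page; property choices are illustrative,
# not an adopted Bioschemas profile.
import json

taxon_markup = {
    "@context": {
        "@vocab": "https://schema.org/",
        "dwc": "http://rs.tdwg.org/dwc/terms/",
    },
    "@type": "Taxon",
    "name": "Lutra lutra (Linnaeus, 1758)",
    "dwc:scientificName": "Lutra lutra",
    "dwc:vernacularName": "Eurasian otter",
    "dwc:taxonRank": "species",
    "parentTaxon": {"@type": "Taxon", "dwc:scientificName": "Lutra"},
    "identifier": "https://example.org/taxon/60630",
}

# The resulting string would be placed in a <script type="application/ld+json">
# element of the portal page for search engines to harvest.
print(json.dumps(taxon_markup, indent=2))
```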
Abstract
BIOfid is a specialized information service currently being developed to
mobilize biodiversity data dormant in printed historical and modern
literature and to offer a platform for open access journals on the
science of biodiversity. Our team of librarians, computer scientists and
biologists produce high-quality text digitizations, develop new
text-mining tools and generate detailed ontologies enabling semantic
text analysis and semantic search by means of user-specific queries. In
a pilot project we focus on German publications on the distribution and
ecology of vascular plants, birds, moths and butterflies extending back
to the Linnaeus period about 250 years ago. The three organism groups
have been selected according to current demands of the relevant research
community in Germany. The text corpus defined for this purpose comprises
over 400 volumes with more than 100,000 pages to be digitized and will
be complemented by journals from other digitization projects,
copyright-free and project-related literature. With TextImager (Natural
Language Processing & Text Visualization) and TextAnnotator (Discourse
Semantic Annotation) we have already extended and launched tools that
focus on the text-analytical section of our project. Furthermore,
taxonomic and anatomical ontologies elaborated by us for the taxa
prioritized by the project's target group - German institutions and
scientists active in biodiversity research - are constantly improved and
expanded to maximize scientific data output. Our poster describes the
general workflow of our project, ranging from literature acquisition,
via software development, to data availability on the BIOfid web portal
(http://biofid.de/), and its implementation into existing platforms
that serve to promote global accessibility of biodiversity data.
Abstract
A new R package for biodiversity data cleaning, 'bdclean', was
initiated in the Google Summer of Code (GSoC) 2017 and is available on
GitHub. Several R packages have great data validation and cleaning
functions, but 'bdclean' provides features to manage a complete
pipeline for biodiversity data cleaning, from data quality explorations
to cleaning procedures and reporting. Users are able to go through the
quality control process in a very structured, intuitive, and effective
way. A modular approach to data cleaning functionality should make this
package extensible for many biodiversity data cleaning needs. Under GSoC
2018, 'bdclean' will go through a comprehensive upgrade. New features
will be highlighted in the demonstration.
Abstract
TaxonWorks (http://taxonworks.org) is an integrated workbench for
taxonomists and biodiversity scientists. It is designed to capture,
organize, and enrich data, share and refine it with collaborators, and
package it for analysis and publication. It is based on PostgreSQL
(database) and the Ruby programming language with the Ruby on Rails
framework for developing web applications
(https://github.com/SpeciesFileGroup/taxonworks). The TaxonWorks
community is built around an open software ecosystem that facilitates
participation at many levels. TaxonWorks is designed to serve both
researchers who create and curate the data, as well as technical users,
such as programmers and informatics specialists, who act as data
consumers. TaxonWorks provides researchers with robust, user-friendly
interfaces based on well-thought-out, customized workflows for efficient
and validated data entry. It provides technical users with database access
through an application programming interface (API) that serves data in
JSON format. The data model covers nearly all classes of data recorded
in modern taxonomic treatments and primary studies of biodiversity,
including nomenclature, bibliography, specimens and collecting events,
phylogenetic matrices, species descriptions, etc.
The nomenclatural classes are based on the NOMEN ontology
(https://github.com/SpeciesFileGroup/nomen).
Abstract
Providing data in a semantically structured format has become the gold
standard in data science. However, a significant amount of data is still
provided as unstructured text - either because it is legacy data or
because adequate tools for storing and disseminating data in a
semantically structured format are still missing. We have developed a
description module for Morph∙D∙Base, a semantic knowledge base for
taxonomic and morphologic data, that enables users to generate highly
standardized and formalized descriptions of anatomical entities using
free text and ontology-based descriptions. The main organizational
backbone of a description in Morph∙D∙Base is a partonomy, to which the
user adds all the anatomical entities of the specimen that they want to
describe. Each element of this partonomy is an instance of an ontology
class and can be further described in two different ways:
as a semantically enriched free-text description that is annotated with
terms from ontologies, and
semantically through defined input forms with a wide range of
ontology terms to choose from.
To facilitate the integration of the free text into a semantic context,
text can be automatically annotated using jAnnotator, a JavaScript
library that uses about 700 ontologies with more than 8.5 million
classes from the National Center for Biomedical Ontology (NCBO) BioPortal.
Users get to choose from suggested class definitions and link them to
terms in the text, resulting in a semantic markup of the text. This
markup may also include labels of elements that the user already added
to the partonomy. Anatomical entities marked in the text can be added to
the partonomy as new elements that can subsequently be described
semantically using the input forms. Each free text together with its
semantic annotations is stored following the W3C Web Annotation Data
Model standard (https://www.w3.org/TR/annotation-model). The whole
description, with the annotated free text and the formalized semantic
descriptions for each element of the partonomy, is saved in the
tuple store of Morph∙D∙Base.
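For illustration, a single stored free-text annotation might look like the following JSON; the identifiers and ontology class are placeholders, while the overall shape follows the W3C Web Annotation Data Model cited above:

```python
# Sketch of one annotation following the W3C Web Annotation Data Model.
# The description, annotation, and ontology-class URIs are placeholders.
import json

annotation = {
    "@context": "http://www.w3.org/ns/anno.jsonld",
    "id": "https://example.org/annotation/example-1",
    "type": "Annotation",
    "body": {
        # the ontology class the highlighted term was linked to (placeholder URI)
        "id": "https://example.org/anatomy-ontology/ForelimbClass",
        "purpose": "classifying",
    },
    "target": {
        "source": "https://example.org/description/example-free-text",
        "selector": {
            "type": "TextQuoteSelector",
            "exact": "forelimb",
        },
    },
}
print(json.dumps(annotation, indent=2))
```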
The demonstration is targeted at developers and users of data portals
and will give an insight into the semantic Morph∙D∙Base knowledge base
(https://proto.morphdbase.de) and jAnnotator
(http://git.morphdbase.de/christian/jAnnotator).
Abstract
Web APIs (Application Programming Interface) are a common means for Web
portals and data producers to enable HTTP-based, machine-processable
access to their data. They are a prominent source of information
pertaining to topics as diverse as scientific information, social
networks, entertainment or finance. The methods of Linked Data (Heath
and Bizer 2011) similarly aim to publish machine-readable data on the
Web, while connecting related resources within and between datasets,
thereby creating a large distributed knowledge graph. Today, the
biodiversity community is increasingly adopting the Linked Data
principles to publish data such as trait banks, museum collections and
taxonomic registers (Parr et al. 2016, Baskauf et al. 2016). However,
standard approaches are still missing to combine disparate
representations coming from both Linked Data interfaces and the manifold
Web APIs that were developed during the last two decades to expose
legacy biodiversity databases on the Web.
The SPARQL Micro-Service architecture (Michel et al. 2018) tackles the
goal of reconciling Linked Data interfaces and Web APIs. It proposes a
lightweight method to query a Web API using SPARQL (Harris and Seaborne
2013), the Semantic Web standard to query knowledge graphs expressed in
the Resource Description Framework (RDF). A SPARQL micro-service
provides access to a small RDF graph, typically resource-centric, that
it builds at run-time by transforming a fraction of the whole dataset
served by the Web API into RDF triples. Furthermore, Web APIs
traditionally rely on internal, proprietary resource identifiers that
are unsuited for use as Uniform Resource Identifiers (URIs). To address
this concern, a SPARQL micro-service can assign a URI to a Web API
resource, allowing an application to look up this URI and get a
description of the resource in return (this process is referred to as
dereferencing).
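A client-side sketch of querying such a service is shown below; the endpoint URL is a placeholder, and SPARQLWrapper is simply one common Python option for sending SPARQL queries, not part of the architecture described here:

```python
# Minimal sketch: query a (hypothetical) SPARQL micro-service endpoint that wraps a
# Web API and returns RDF about a taxon. Endpoint URL and query vocabulary are
# placeholders, not the actual services demonstrated.
from SPARQLWrapper import JSON, SPARQLWrapper

endpoint = "https://example.org/sparql-ms/eol-traits"   # hypothetical micro-service
sparql = SPARQLWrapper(endpoint)
sparql.setQuery("""
    PREFIX dwc: <http://rs.tdwg.org/dwc/terms/>
    SELECT ?predicate ?value WHERE {
        ?taxon dwc:scientificName "Delphinus delphis" ;
               ?predicate ?value .
    } LIMIT 20
""")
sparql.setReturnFormat(JSON)

results = sparql.query().convert()
for row in results["results"]["bindings"]:
    print(row["predicate"]["value"], "->", row["value"]["value"])
```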
In this demo, we wish to showcase the value of SPARQL micro-services in
the biodiversity domain. We first query TAXREF-LD, a Linked Data
representation of the French taxonomic register of living beings (Michel
et al. 2017), to retrieve information about a given taxon. Then, we
demonstrate how we can enrich our knowledge about this taxon with
various types of data retrieved on-the-fly from multiple Web APIs:
trait data from the Encyclopedia of Life trait bank (Parr et al. 2016),
articles or books from the Biodiversity Heritage Library,
audio recordings from the Macaulay scientific media archive,
photos from the Flickr photography social network, and
music tunes from MusicBrainz.
Different visualizations are demonstrated, ranging from raw RDF triples
to Web pages generated dynamically and integrating heterogeneous data,
as suggested in Fig. 1. Depending on the audience's interests, we shall
touch upon the alignment of Web APIs' proprietary vocabularies with
well-adopted thesauri or ontologies, or on more technical concerns, e.g.
related to the effort required to deploy a new SPARQL micro-service.
Abstract
In recent years, the natural history collections community has made
great progress in accelerating the pace of collection digitization and
global data-sharing. However, a common workflow bottleneck often occurs
in that period immediately following image capture but preceding image
submission to portals, a critical phase involving quality control, file
management, image processing, metadata capture, data backup, and
monitoring performance and progress.
While larger institutions have likely developed reliable, automated
workflows over time, small and medium institutions may not have the
expertise or resources to design and implement workflows that take full
advantage of automation opportunities. Without automation, these
institutions must invest many hours of manual effort to meet quality and
performance goals.
To address its own needs, BRIT developed a number of workflow automation
components, which coalesced over time into a suite of tools that operate
both on an image-capture station, as a client application, and on a
server that provides file storage and image processing features. Together,
these tools were created to meet the following goals:
Simplify file management and data preservation through automation
Quickly identify quality issues
Quickly capture skeletal metadata to facilitate later databasing
Significantly reduce time between image capture and online availability
Provide performance and quality monitoring and reporting
Enable easy configuration and maintenance of client and server
The client and server components together can be considered a
"digitization appliance": software integrated with the specific goal of
providing a comprehensive suite of digitization tools that can be
quickly and easily deployed on simple consumer hardware. We have made
this software available to the natural history collections community
under an open-source license at
https://github.com/BRITorg/digitization_appliance.
Abstract
The Specify Software Project (www.specifysoftware.org) has been funded
for 20 years by the University of Kansas and by grants from the U.S.
National Science Foundation. In 2018, the effort is pivoting from a
grant-funded project to a community-supported effort through the
establishment of a consortium of biological collection institutions.
Specify Collection Consortium software products will remain open source
and free to download and use. Consortium membership benefits will
include access to technical support services and seats on the Board of
Directors and advisory committees, groups that will determine priorities
for future products, platform capabilities, and technical support
services. In 2017 and 2018, we have been engaged in organizational
planning and development, modeling the Specify Collections Consortium on
examples of viable open-source and open-access consortia in other
research communities. Founding members of the Consortium in the U.S.
include the University of Michigan, University of Florida, and
University of Kansas. The Consortium's mission will be to support
collections institutions in mobilizing data from their holdings to
broader biological and computational initiatives to advance
collections-based research, while facilitating efficient data curation
and collection management. We will provide an update on our progress
with the Consortium's development and highlight new capabilities and
integration features of the Specify 6 & 7 software platforms.
Abstract
To improve access to biodiversity knowledge for diverse audiences, the
Encyclopedia of Life (EOL) aggregates materials from hundreds of content
providers. In addition to text, media, references, taxon names and
hierarchies, traits and other structured data are an increasingly
important component of EOL (TraitBank). Content priorities for TraitBank
include information about body size, geographic distribution, habitat,
trophic ecology, and biotic interactions in general. Our goal is to
summarize available data at the level of species and supraspecific taxa
and to achieve broad taxonomic coverage for high priority topics.
Integration of information from heterogeneous sources relies on a
variety of community standards (e.g., Dublin Core, Darwin Core, Audubon
Core) as well as post-hoc semantic annotations that standardize
terminology for traits and metadata and provide links to domain
ontologies and controlled vocabularies (e.g., Ontology of Biological
Attributes, Phenotypic Quality Ontology, Environment Ontology, Uber
Anatomy Ontology). Taxon names are mapped to a reference hierarchy that
leverages taxonomic information from many different resources (e.g.,
Catalogue of Life, World Register of Marine Species, Paleobiology
Database, National Center for Biotechnology Information). Name
reconciliation takes into account canonical name strings, authorities,
and synonym relationships as well as information about ranks and
hierarchies (parent/child taxa). In EOL version 3 this infrastructure
supports complex queries across EOL data sets, autogenerated natural
language descriptions of taxa, and knowledge-based recommender systems
for the exploration of content along multiple axes, including phylogeny,
ecology, life history, relevance to humans and other characteristics
derived from structured data. Most TraitBank data currently come from
published data compilations and databases of specialist projects, but
there are still significant gaps in coverage for many lesser known
groups. Recent advances in natural language processing, image analysis,
and machine learning technologies facilitate the automated extraction
and processing of data from unstructured text and images. This will soon
make it possible to recruit vast amounts of information from millions of
pages of taxonomic, ecological, and natural history literature available
in open access repositories like Biodiversity Heritage Library (BHL) and
Plazi. Natural history collections are another promising source of new
taxon information. Millions of museum specimens indexed by organizations
like the Global Biodiversity Information Facility (GBIF) and Integrated
Digitized Biocollections (iDigBio) already contribute significantly to
our understanding of species occurrences in space and time. But
specimens and associated labels and field notes can also provide
information about morphology, phenology, habitats, and biotic
interactions. Data mined from literature corpora or specimen collections
will generally lack detailed descriptions of what exactly was measured,
metadata about the data capture process, measurement accuracy, and other
important parameters. The integration of this information with data sets
from the primary literature therefore poses challenges that go beyond
the standardization of taxonomy and terminology. Leveraging data from a
wide variety of sources is, however, necessary to achieve a
comprehensive, interconnected biodiversity knowledge base that supports
the exploration of trait diversity across the tree of life.
Abstract
The World Flora Online (WFO) is primarily a data management project
initiated in 2012 in response to Target 1 of the Global Strategy for
Plant Conservation -- "To create an online flora of all known plants by
2020". A WFO Consortium of now 42 international partners has been
formed, with a governing Council and three Working Groups. The World
Flora Online Public Portal (www.worldfloraonline.org) was launched at
the International Botanical Congress in Shenzhen, China in July, 2017.
The baseline Public Portal was primarily populated with a taxonomic
backbone of information gathered from The Plant List augmented by newer
taxonomic sources like Solanaceae Source. To support all known plant
names in the WFO, including both vascular and non-vascular plants, new
WFO identifiers (WFOIDs) were created, which were also cross-referenced
to the International Plant Names Index (IPNI) identifiers for plant
names included there. The next phase of the World Flora Online involves
additional enhancement of the taxonomic backbone by engagement of new
plant Taxonomic Expert Networks (TENs) and acceleration of ingestion of
descriptive data from digital floras and monographs, and other sources
like International Union for Conservation of Nature (IUCN) threat
assessments and the Botanic Gardens Conservation International (BGCI)
Global Tree Assessment. Descriptive data can be text descriptions,
images, geographic distributions, identification keys, phylogenetic
trees, as well as atomized trait data like threat status, lifeform or
habitat. Initial digital descriptive datasets have been received by WFO
from Flora of Brazil, Flora of South Africa, Flora of China, Flora of
North Africa, Solanaceae Source and several others. The hard work is
underway to match the names associated with the submitted descriptions
to the names and WFOIDs in the World Flora Online taxonomic backbone and
then merge the descriptive data elements into the WFO database.
Numerous data tools have been adopted and created to accomplish the data
cleaning, standardization and transformation required before descriptive
data can be integrated. The WFO project has encountered considerable
variation among even the few datasets received so far, which highlights the need
for better standardization and controlled vocabularies for flora and
monographic descriptive data. This presentation will review some of the
issues identified by the project when merging descriptive data and some
potential gaps in the TDWG standards specifically for flora descriptive
data. Some opportunities for consideration by the TDWG Species
Information Interest Group will be presented.
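As a rough illustration of this matching-and-merging step, the sketch below attaches submitted descriptive records to backbone identifiers after a trivial normalization of the name strings; the WFOIDs, field names and matching rule are invented for the example and do not reflect the actual WFO tooling.

```python
# Minimal sketch of matching submitted description names to a taxonomic
# backbone and attaching descriptive data by identifier. The WFOID values,
# field names, and matching logic are illustrative assumptions only.

BACKBONE = {
    # normalized scientific name -> WFO identifier
    "solanum lycopersicum l.": "wfo-0000000001",
    "quercus robur l.": "wfo-0000000002",
}

def normalize(name: str) -> str:
    """Lower-case and collapse whitespace so trivially different strings match."""
    return " ".join(name.split()).lower()

def attach_descriptions(submissions, backbone=BACKBONE):
    """Return (matched, unmatched) descriptive records keyed by backbone ID."""
    matched, unmatched = {}, []
    for record in submissions:
        wfo_id = backbone.get(normalize(record["scientificName"]))
        if wfo_id is None:
            unmatched.append(record)       # needs expert review before ingestion
        else:
            matched.setdefault(wfo_id, []).append(record["description"])
    return matched, unmatched

subs = [{"scientificName": "Solanum lycopersicum  L.",
         "description": "Annual herb; fruit a red berry."},
        {"scientificName": "Solanum esculentum Dunal",
         "description": "Synonymous concept from a regional flora."}]
print(attach_descriptions(subs))
```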
Abstract
Species-level information, as an important component of the biodiversity
information landscape, is an area where some TDWG standards and
activities coincide. Plinian Core (Plinian Core Task Group 2018) is a
generalist specification that covers aspects such as species descriptions
and nomenclature, as well as many others (legal, conservation,
management, etc.). While the Plinian Core non-biological terms have no
counterpart in the TDWG developments, some of its biological terms do,
and that is the focus of this work. First, it must be noted that
Plinian Core relies on some TDWG standards for specific facets of
species information:
Standard: Darwin Core (Darwin Core maintenance group, Biodiversity
Information Standards (TDWG) 2014)
Elements: taxonConceptID, Hierarchy, MeasurementOrFact,
ResourceRelationship.
Standard: Ecological Metadata Language (EML project members 2011)
Elements: associatedParty, keywordSet, coverage, dataset
Standard: Encyclopedia of Life Schema (EOL Team 2012)
Elements: AncillaryData: DataObjectBase
Standard: Global Invasive Species Network (GISIN 2008)
Elements: origin, presence, persistence, distribution, harmful,
modified, startValidDate, endValidDate, countryCode, stateProvince,
county, localityName, language, citation, abundance...
Standard: Taxon Concept Schema (TCS) (Taxonomic Names and Concepts
interest group 2006)
Elements: scientificName
Given Plinian Core's direct dependency on these terms, they do not
pose any compatibility or interoperability problem. However, biological
descriptions -- especially structured ones -- are the subject of DELTA
(Dallwitz 2006) and the Structured Descriptive Data standard (SDD) (Hagedorn et
al. 2005), and are also covered by Plinian Core. This convergence presents
overlaps, mismatches and nuances, the discussion of which is the core of this
work.
Using some species descriptions as a test case and transforming them
between these standards (Plinian Core, DELTA, and SDD), we evaluate and
discuss the strengths and compatibility issues of these
specifications.
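For readers unfamiliar with structured descriptions, the following sketch shows the kind of transformation involved: a simple record of characters is serialized to an SDD-flavoured XML fragment. The element and attribute names are simplified placeholders rather than the actual Plinian Core, DELTA or SDD schemas.

```python
# Minimal sketch of serializing a structured character record into an
# SDD-like XML fragment. Element and attribute names are placeholders,
# not the actual Plinian Core, DELTA or SDD schemas.
import xml.etree.ElementTree as ET

description = {
    "scientificName": "Lynx pardinus",
    "characters": [
        {"name": "Body length", "value": "85-110", "unit": "cm"},
        {"name": "Habitat", "value": "Mediterranean scrubland"},
    ],
}

def to_sdd_like(record):
    """Serialize the structured characters as a small SDD-flavoured XML tree."""
    root = ET.Element("Dataset")
    taxon = ET.SubElement(root, "TaxonName")
    taxon.text = record["scientificName"]
    coded = ET.SubElement(root, "CodedDescription")
    for char in record["characters"]:
        el = ET.SubElement(coded, "Categorical", attrib={"character": char["name"]})
        state = ET.SubElement(el, "State")
        state.text = char["value"]
        if "unit" in char:
            state.set("unit", char["unit"])
    return ET.tostring(root, encoding="unicode")

print(to_sdd_like(description))
```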
Some operational aspects of Plinian Core in relation to GBIF's IPT
(GBIF Secretariat 2016) and the INSPIRE directive (European Commission
2007) are also reviewed.
Abstract
Taxonomic monographs are series of publications covering a higher
taxonomic group, with each monograph focusing on an individual species.
They are a compendium of the current state of research and knowledge
detailing many aspects of the species and are extensively used by
researchers, ornithologists and conservationists to learn what is
'currently' known about a species. Birds, being one of the more easily
seen and studied taxa, have a number of specialized taxonomic monographs
where data from a wide variety of disciplines are combined into a single
place and utilized for research and conservation management. Many of the
existing avian monographs have a regional or subdomain focus, such as
"Birds of the Western Palearctic" or "Catalan Breeding Bird Atlas
1999-2002" and monographs are sometimes focused on different user
communities, ranging from those with casual interest to professional
ornithologists and researchers.
The Cornell Lab of Ornithology maintains several monograph series. Merlin and
All About Birds include simplified information that is of interest to
the casual observer and Birds of North America and Neotropical Birds
Online are monographs with complete, detailed life histories, prepared
for ornithologists and active researchers. These monograph projects were
originally supported using different Content Management Systems, which
were difficult to maintain, made it hard to keep content current, and
provided no capacity for organizing and sharing content across
monograph projects. Bird taxonomies change annually, and the previous
systems had no capacity to intelligently manage taxonomic changes. To
solve these issues, we created a new Content Management System with
Taxonomic Concepts at its core. Reviewing a number of existing monograph
projects led us to create an underlying content structure that is
closely analogous to Plinian Core. The initial requirement to support multiple
monograph series, some focused on the professional community and others
focused on budding amateurs, presented challenges to creating a 'one
size fits all' model for structuring content that includes authoritative
articles covering most aspects of a species' life history, traditional
range maps, dynamic observation maps, relative abundance models, photos,
images, video and a bibliography. In this talk I'll present in detail
the Content Management System and the five underlying content models we have
developed. Four of these models are tied to the underlying
taxonomic concept, while the fifth is tied to taxonomic names.
Articles, multimedia (including traditional range maps), taxonomic
description and bibliography have long existed in print monographs, and
having these authored and displayed via the web makes it much simpler to
incorporate new information, keep the information current, and
publish the information to an existing standard. The incorporation of
dynamic content has only been possible with the advent of the web and
standards for the underlying Taxonomic Concepts. With four monographs
currently in production and several more in development, we've
encountered both advantages and disadvantages in using these models for
managing and serving monograph series. I will discuss these in detail
and compare the models with Plinian Core to highlight both fundamental
differences as well as common ground.
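The sketch below illustrates, in a few lines of Python, the distinction between content tied to a taxonomic concept and content tied to a name string; the class and field names are illustrative assumptions, not the actual data model of the Content Management System.

```python
# Minimal sketch of a content model keyed on taxonomic concepts rather than
# name strings. Class and field names are illustrative assumptions, not the
# actual schema used by the Lab's Content Management System.
from dataclasses import dataclass, field
from typing import List

@dataclass
class TaxonConcept:
    concept_id: str            # stable identifier that survives name changes
    accepted_name: str         # current accepted name under a given taxonomy

@dataclass
class Article:
    concept: TaxonConcept      # tied to the concept: follows taxonomic revisions
    section: str
    body: str

@dataclass
class NameUsage:
    name_string: str           # tied to the name itself, e.g. for nomenclatural notes
    citations: List[str] = field(default_factory=list)

concept = TaxonConcept("concept:001", "Setophaga coronata")
habitat = Article(concept, "Habitat", "Breeds in coniferous and mixed forests.")
usage = NameUsage("Dendroica coronata", ["Historical usage prior to the genus change."])
print(habitat.concept.accepted_name, "|", usage.name_string)
```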
Abstract
Aiming at promoting interaction among researchers and the integration of
data from their pollen collections, herbaria and bee collections, RCPol
was created in 2013. In order to structure RCPol work, researchers and
collaborators have organized information on Palynology and trophic
interactions between bees and plants. During the project development,
different computing tools were developed and provided on RCPol website
(http://rcpol.org.br), including: interactive keys with multiple inputs
for species identification (http://chaves.rcpol.org.br); a glossary of
palynology-related terms
(http://chaves.rcpol.org.br/profile/glossary/eco); a plant-bee
interactions database (http://chaves.rcpol.org.br/interactions); and a
data quality tool (http://chaves.rcpol.org.br/admin/data-quality). Those
tools were developed in partnership with researchers and collaborators
from Escola Politécnica (USP) and other Brazilian and foreign
institutions that act on palynology, floral biology, pollination, plant
taxonomy, ecology, and trophic interactions. The interactive keys are
organized in four branches: palynoecology, paleopalynology,
palynotaxonomy and spores. This information is collaboratively
digitized and managed using standardized Google Spreadsheets. All the
information is assessed by a data quality assurance tool (based on the
conceptual framework of the TDWG Biodiversity Data Quality Interest Group;
Veiga et al. 2017) and curated by palynology experts. In total, the project has
published 1,774 specimen records, 1,488 species records (automatically
generated by merging specimen records with the same scientific name),
656 interaction records, 370 glossary term records and 15 institution
records, all of them translated from the original language (usually
Portuguese or English) to Portuguese, English and Spanish. During the
projectʼs first three years, 106 partners, among researchers and
collaborators from 28 institutions in Brazil and abroad, actively
participated in the project. An important part of the project's
activities involved training researchers and students in palynology,
data digitization and the use of the system. To date, six training
courses have reached 192 people.
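The following sketch illustrates the kind of record-level checks such a data quality tool can run over digitized spreadsheet rows; the field names, controlled vocabulary and rules are invented for the example and are not the RCPol implementation.

```python
# Minimal sketch of record-level validation over collaboratively digitized
# spreadsheet rows. The field names and rules are illustrative assumptions,
# not the RCPol data quality tool.

CONTROLLED_POLLEN_UNITS = {"monad", "dyad", "tetrad", "polyad"}
REQUIRED_FIELDS = ("scientificName", "catalogNumber", "pollenUnit")

def validate_row(row: dict) -> list:
    """Return a list of human-readable quality assertions for one record."""
    issues = []
    for field_name in REQUIRED_FIELDS:
        if not row.get(field_name):
            issues.append(f"missing required field: {field_name}")
    unit = row.get("pollenUnit", "").strip().lower()
    if unit and unit not in CONTROLLED_POLLEN_UNITS:
        issues.append(f"value not in controlled vocabulary: pollenUnit={unit!r}")
    return issues

row = {"scientificName": "Mimosa pudica", "catalogNumber": "", "pollenUnit": "Tetrade"}
print(validate_row(row))
# -> flags the empty catalogNumber and the non-standard vocabulary term
```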
Abstract
The Australian Department of the Environment and Energy (DoEE) is
working with the Atlas of Living Australia (ALA) and the Biodiversity and
Climate Change Virtual Laboratory (BCCVL), together with two state environment
departments (New South Wales and Queensland), to develop a standard
framework for modelling threatened species distributions for use in
policy and environmental decision-making.
In addition, DoEE is working with seven state and territory environment
departments to implement a common assessment method (CAM) for the
assessment and listing of nationally threatened species. The method is
based on the IUCN Red List criteria. Each Australian jurisdiction has
traditionally used its own assessment method, including categories,
criteria, thresholds, definitions and scales of assessment to list
threatened species within their jurisdiction. The CAM is a standardised
method for species assessed for listing at the national level. Through
cross-jurisdictional collaboration, this will improve the efficiency of
the assessment process and facilitate consistency across jurisdictional
lists.
The BCCVL includes linkages to species observations on the ALA and users
are able to add their own data including contextual and species data.
The project aims to create a secure environment where
cross-jurisdictional collaboration can occur both on the standardisation
of methodologies for creating species distributions and the integration
of data. The project also aims to provide a secure platform for
jurisdictions to contribute sensitive observations not available through
the ALA and take into consideration expert feedback on the distribution
of species.
The project will provide a public-facing platform whereby species
distribution models (SDMs) can be published. This platform will be
searchable by area, species or contributor. All
outputs will be scientifically robust, repeatable, maintainable, open
and transparent. The increased validity and robustness of models will lead
to better-informed decisions relating to the impacts of development and
the conservation of species.
Abstract
How do you successfully engage volunteers in citizen science projects?
In recent years, citizen science has grown considerably in popularity,
resulting in rapid increases in the number of citizen science and
crowdsourcing projects and providing cost-effective means for scientists
to gather more data over broader spatial ranges to tackle research
questions in a wide variety of scientific, conservation, and
environmental fields (Bonney et al. 2016, Aceves-Bueno et al. 2017). While
the proliferation of such projects has produced a growing abundance of
citizen scientist-generated data and published research informed by
citizen science methods (Follett and Strezov 2015), this also means that
volunteers have a greater number of projects competing for their time.
When faced with an increasingly crowded landscape, how can you generate
interest in a citizen science or crowdsourcing project and maintain
contributions over the project's lifetime?
The Biodiversity Heritage Library (BHL) supports a variety of citizen
science and crowdsourcing projects, from transcribing field notes to
tagging scientific illustrations with taxonomic names on Flickr and
enhancing data for 19th-century periodicals through its
Zooniverse-based Science Gossip project. Through a variety of outreach
strategies including collaborative social media campaigns, partnerships
with citizen science communities, and interactive incentives, BHL has
successfully engaged volunteers with diverse projects to enrich the
library's data and increase discoverability of its collections.
This presentation will discuss outreach strategies for citizen science
projects that BHL has undertaken to further support research initiatives
with our content. In addition, the presentation will share
lessons-learned and offer suggestions that attendees can apply to their
own citizen science engagement efforts.
Abstract
Biodiversity literature and archival collections are not only
indispensable in taxonomic research; they also provide crucial information
for understanding museums' natural history collections. Literature
and archives document collecting events resulting in specimen
collections, contain original descriptions based on those specimens, and
provide a wealth of other contextual information for the study of life
on earth. The Biodiversity Heritage Library is committed to improving
research efficiency by providing open access to a growing body of
biodiversity literature and archives. While descriptive metadata is
widely available for both specimen collections (e.g., Darwin Core) and
literature (e.g., MARCXML), connections between the two collection types
cannot generally be found at these descriptive levels, thus hindering
efficient discovery of relevant materials. The integration of name
finding services, powered by the Global Names Architecture, provides a
significant value-add through page-level access to mentions of a given
taxon name. Yet how might one search based on a museum code, a common
name, or a place name? This presentation will share how BHL's top
technical priorities for 2018 will help facilitate more efficient
searching and discovery of information in the pages of the BHL corpus.
Specifically, updates on BHL's top two priorities -- implementation of
full-text search and incorporation of available crowdsourced
transcriptions -- will be covered.
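As a simple illustration of why page-level full-text indexing enables searches on museum codes, common names or place names, the sketch below builds a toy inverted index over OCR'd page text; it is not BHL's actual search infrastructure.

```python
# Minimal sketch of full-text indexing over OCR'd page text, illustrating
# why page-level search enables queries on museum codes, common names or
# place names. This is a toy inverted index, not BHL's search stack.
from collections import defaultdict
import re

pages = {
    "page/101": "Description of Puma concolor collected near Bogota, USNM 12345.",
    "page/102": "Notes on the cougar, a wide-ranging American cat.",
}

index = defaultdict(set)
for page_id, text in pages.items():
    for token in re.findall(r"[A-Za-z0-9]+", text.lower()):
        index[token].add(page_id)

def search(query: str) -> set:
    """Return pages containing every token of the query."""
    tokens = [t.lower() for t in query.split()]
    results = [index.get(t, set()) for t in tokens]
    return set.intersection(*results) if results else set()

print(search("USNM 12345"))   # museum code -> {'page/101'}
print(search("cougar"))       # common name -> {'page/102'}
```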
Abstract
The classification of living things depends upon the literature. Access
to this literature is essential to taxonomic research and to our
understanding of biodiversity. There have been tremendous efforts to
digitise the world's biodiversity literature; the Biodiversity Heritage
Library (BHL) alone has uploaded over 54 million pages, all of which are
freely accessible online. Our scientific literature is far more
accessible than it has ever been, but that does not mean it is easily
discoverable. Much of the taxonomic literature online remains outside
the linked network of scholarly research. But that is rapidly changing.
Taxonomic aggregators are an invaluable source of authoritative
information on species names and their hierarchical classification. It
is critical that this information includes citations for taxonomic
descriptions, that these citations link to the published literature
online and that (wherever possible) the citations include DOIs (Digital
Object Identifiers). The DOI is an essential part of a publication's
bibliographic metadata and should be included (as a live link) in any
reference to that content.
However, the definitive (DOI'd) versions of recent publications are
frequently behind paywalls. And, while much of the historic literature
available online is open access, commercial publishers are uploading
out-of-copyright publications onto their own websites, assigning DOIs to
"their" definitive versions (the versions that must be cited in other
publications, as per DOI requirements) and then locking the definitive
versions behind paywalls. This is perfectly within their rights. DOIs
may be assigned to legacy publications retrospectively, providing that:
a) the party assigning them owns the rights for the content, or has
permission from the rights holder to assign a DOI, and b) the
publication does not already have a DOI. If there are no rights attached
to a piece of content, anyone can assign a DOI to it.
This means that citation traffic from the bibliographies of current
publications is increasingly directed towards commercial publishers'
websites, rather than towards open access versions, such as those freely
available on the Biodiversity Heritage Library (BHL). However, taxonomic
aggregators are not bound by the same obligations as publishers and may
therefore choose to link to any online version of a publication
(although the DOI should still be included in the citation).
Many taxonomic aggregators link to the literature available on BHL. The
taxonomic name profiles in EOL (Encyclopedia of Life), GBIF (Global
Biodiversity Information Facility) and ALA (Atlas of Living Australia)
each contain a BHL bibliography: a list of links to the pages in BHL
that contain an identified mention of that taxon name. However, the
lists of returned results can be long, and they may or may not include
the citations for accepted names, synonyms and taxon concepts. Some
biodiversity aggregators feature these key citations on the names pages
(or tabs) of taxon profiles. However, where these do exist, they are
usually plain text rather than links.
BHL is now registering DOIs for the content it hosts and is creating
landing pages for articles, containing the full bibliographic metadata,
including (where applicable) the DOI. Articles are now discoverable by
article title, keywords within titles (scientific names, locations,
traits, etc.), author names and DOIs, and can be easily linked to (via
their landing pages) by other parties.
This paper will examine the issues, benefits and complexities associated
with linking to definitive versions, the difference between easy and
open access, the ethics of putting out-of-copyright content behind
paywalls, and the future of creating order amongst the massively
expanding resource of literature online.
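For context, the sketch below shows one standard way to retrieve bibliographic metadata for a DOI through content negotiation against doi.org; the DOI shown is a placeholder and network access is required.

```python
# Minimal sketch of retrieving bibliographic metadata for a DOI via the
# content negotiation that the DOI resolver supports for Crossref and
# DataCite DOIs. Substitute a real DOI; the placeholder below will not resolve.
import json
import urllib.request

def fetch_csl_metadata(doi: str) -> dict:
    """Ask https://doi.org/<doi> for CSL-JSON bibliographic metadata."""
    req = urllib.request.Request(
        f"https://doi.org/{doi}",
        headers={"Accept": "application/vnd.citationstyles.csl+json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.load(resp)

if __name__ == "__main__":
    meta = fetch_csl_metadata("10.1234/placeholder-doi")   # placeholder DOI
    print(meta.get("title"), "|", meta.get("container-title"))
```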
Abstract
The Biodiversity Heritage Library (BHL) provides open access to over 54
million pages of biodiversity literature. Much of this literature is
either in the public domain or is licensed for reuse under the Creative
Commons framework. Anyone can therefore freely reuse much of the
information and data provided by BHL. This presentation will outline how
the work of a citizen scientist using BHL content might benefit research
scientists. It will discuss how a citizen scientist can reuse and link
BHL literature and data in Wikipedia and Wikidata. It will explain the
research efficiencies that can be obtained through this reuse and
linking, for example through the consolidation of database identifiers.
The presentation will outline the subsequent reuse, by the Google search
engine, of the BHL data added to Wikipedia and Wikidata. It will
discuss an example of the linking of this information in the citizen
science observation platform iNaturalist. The presentation will explain
how BHL, as a result of its open reuse licensing of information and
data, helps in the creation of more accurate citizen science generated
biodiversity data and assists with the wider and more effective
dissemination of biodiversity information.
Abstract
A program to integrate species diversity information systems was
launched by the Chinese Academy of Sciences (CAS) in January 2018, with
funding from the CAS Earth project, a Strategic Priority Research
Program of CAS. The program will create a series of data products, such
as China flora online, species catalogues, distribution maps, software
tools for data mining and knowledge discovery based on big data and
artificial intelligence technology, and a service platform and portal
highlighting species diversity information in China. The products and
platform will provide robust data to support decision-making on
biodiversity conservation, fundamental research on biodiversity
evolution and spatial patterns, and species identification for citizen
science. China flora online will include 35,000 species of higher plants
in China and an online editing environment for botanists to maintain the
floral records. The trait database will include structured data of
animals, plants and fungi, such as weight, height, length, color and
shape of organisms. This species catalogue will be the annually updated
version of the Catalogue of Life, China. The distribution maps will show
the spatial pattern for each species of vertebrate animal and higher
plant. Cell phone apps will help users to easily and quickly identify
plants in the field. The mechanism and workflow for data collection,
integration, public sharing and quality control will be built up in the
next few years.
Abstract
Due to the recent establishment of the Global Genome Biodiversity
Network (GGBN) data portal, we have extended Specify collections
management software (http://www.sustain.specifysoftware.org/) to more
effectively manage, publish, and integrate tissue and DNA extract data
by adding support for the GGBN data schema. Specify's database design
now includes a number of data fields and tables prescribed in GGBN
standard vocabularies. We also realigned some of the underlying table
relationships to address the needs of specimen curation and collection
transactions for extract and tissue samples. Specify now also supports
"Next Generation" sequencing metadata with fields to record NCBI SRA ID
numbers for web-linking tissue and extract metadata to entries in the
NCBI SRA databases.
With the ongoing evolution of the TDWG Darwin Core (DwC) standard for
specimen data exchange, we generalized Specify 7's data publishing
capabilities to export collections data to any DwC or other
standards-based exchange schema. This generic, external schema-mapping
capability enables Specify collections to design and map data packages
to integrate their data with any community aggregator or collaborative
project database based on Darwin Core or another community-standard
format. The development of these versatile new integration capabilities
was in collaboration with, and through financial support from GGBN. This
talk will highlight these changes in the context of delivery of museum
tissue and extract data records to the GGBN data portal for aggregation.
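The following sketch illustrates the general idea of mapping internal collection fields onto the terms of an export schema chosen at run time; the field and term names are illustrative and do not reproduce Specify's actual database model or the GGBN vocabulary.

```python
# Minimal sketch of a generic schema-mapping step: internal collection
# database fields are mapped onto the terms of an export schema chosen at
# run time. Field and term names are illustrative, not Specify's actual model.
import csv, io, sys

INTERNAL_RECORD = {
    "catalog_no": "KU:Ti:4521",
    "taxon_fullname": "Peromyscus maniculatus",
    "prep_type": "tissue",
    "sra_accession": "SRR0000000",   # placeholder accession
}

EXPORT_MAPPINGS = {
    "dwc": {"catalog_no": "catalogNumber", "taxon_fullname": "scientificName",
            "prep_type": "preparations"},
    "ggbn_like": {"catalog_no": "catalogNumber", "prep_type": "materialSampleType",
                  "sra_accession": "associatedSequences"},   # illustrative terms
}

def export(record: dict, schema: str) -> str:
    """Write one record as CSV using the term names of the chosen schema."""
    mapping = EXPORT_MAPPINGS[schema]
    row = {term: record[field] for field, term in mapping.items() if field in record}
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(row))
    writer.writeheader()
    writer.writerow(row)
    return buf.getvalue()

sys.stdout.write(export(INTERNAL_RECORD, "dwc"))
sys.stdout.write(export(INTERNAL_RECORD, "ggbn_like"))
```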
Abstract
The Genomic Observatories Metadatabase (GeOMe, http://www.geome-db.org/)
is an open access repository for geographic and ecological metadata
associated with biosamples and genetic data. It contributes to the
informatics stack -- Biocode Commons -- of the Genomic Observatories
Network
(https://gigascience.biomedcentral.com/articles/10.1186/2047-217X-3-2).
The GeOMe project interface enables administrators to plan and execute
field-based sample collection efforts. GeOMe projects specify a core set
of sample metadata fields based on community standard vocabularies and
also include plugins for associating samples with photos, subsamples,
NextGen sequence metadata, and permits. Users can upload their own
expedition-specific metadata, which contributes to the overall project
dataset while providing the user a convenient method for updating and
refining their contributed data. GeOMe provides connection points to the
Global Biodiversity Information Facility and archived genetic data
stored in the National Center for Biotechnology Information's (NCBI's)
Sequence Read Archive (SRA), linking specimens and sequences via unique
persistent identifiers.
Abstract
Genomic research depends upon access to DNA or tissue collected and
preserved according to high-quality standards. At present, the
collections in most natural history museums do not sufficiently address
these standards. In response to these challenges, natural history
museums, culture collections, herbaria, botanical gardens and others
have started to build high-quality biodiversity biobanks. Unfortunately,
information about these collections remains fragmented, scattered and
largely inaccessible. Without a central registry of relevant
institutions, it is difficult and time-consuming to locate the needed
samples.
The Global Genome Biodiversity Network (GGBN) was created to fill this
gap by establishing a central access point for locating samples meeting
quality standards for genome-scale applications, while complying with
national and international legislation and conventions (e.g. the Nagoya
Protocol). The GGBN is growing rapidly, currently has 70 members, and
works closely with GBIF, SPNHC, CETAF, INSDC, BOLD, ESBB,
ISBER, GSC and others to reach its goals.
Knowledge of biodiversity biobank content is urgently needed to enable
concerted efforts and strategies in collecting and sampling new material
and making access and benefit-sharing (ABS) a reality. GGBN provides an infrastructure for making
genomic samples discoverable and accessible.
While respecting national law, GGBN requires that its members comply
with the provisions of the Nagoya Protocol. Thus researchers,
collection-holding institutions, and networks should adopt a common Best
Practice approach to manage ABS, as has been developed by GGBN. A Code
of Conduct, recommendations for implementing the Code of Conduct (the
Best Practices), and implementation tools, such as standard Material
Transfer Agreements (MTAs) and mandatory and recommended data fields in
collection databases, will aid compliance. This talk
provides an overview of GGBN and comprises updates on GGBN's best
practices on ABS and the Nagoya Protocol, with examples of their use and
applicability.
Abstract
Arctos (https://arctosdb.org), an online collection management
information system, was developed in 1999 to manage museum specimen data
and to make those data publicly available. The portal
(arctos.database.museum) now serves data on over 3.5 million cataloged
specimens from more than 130 collections throughout North America in an
instance at the Texas Advanced Computing Center. Arctos also is a
community of museum professionals that collaborates on museum best
practices and works together to improve Arctos data richness and
functionality for online museum data streaming. In 2017, three large
Arctos genomics collections at the Museum of Southwestern Biology (MSB),
Museum of Vertebrate Zoology, Berkeley (MVZ), and University of Alaska
Museum of the North (UAM), received support from GGBN to create a
pipeline for publishing data from Arctos to the GGBN portal.
Modifications to Arctos included standardization of controlled
vocabulary for tissues; changes to the data structure and code tables
with regard to permit information, container history, part attributes,
and sample quality; implementation of interfaces and protocols for
parent-child relationships between tissues, tissue subsamples, and DNA
extracts; and coordination with the Darwin Core (DwC) community to ensure that all
GGBN data standards and formatting are included in the standard DwC
export in order to finalize the pipeline to GGBN. The addition of these
three primary Arctos biorepositories to the GGBN network will add over
750,000 tissue and DNA records representing over 11,000 species and 667
families. These voucher-based archives represent primarily vertebrate
taxa, with growing collections of arthropods, endoparasites, and
incipient collections of microbiome and environmental samples associated
with online media and linked to GenBank and other external databases.
The high-quality data in Arctos complement and significantly extend
existing GGBN holdings, and the establishment of an Arctos-GGBN pipeline
also will facilitate future collaboration between more Arctos
collections and GGBN.
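A minimal sketch of the parent-child idea follows: a tissue, a tissue subsample and a DNA extract are linked so that the provenance chain can be reported; the structure and identifiers are illustrative assumptions, not the Arctos data model.

```python
# Minimal sketch of parent-child relationships between a cataloged specimen's
# parts: tissue -> tissue subsample -> DNA extract. Structure, identifiers and
# field names are illustrative assumptions, not the Arctos data model.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Part:
    part_id: str
    part_type: str                     # e.g. "tissue", "tissue subsample", "DNA extract"
    parent: Optional["Part"] = None
    children: List["Part"] = field(default_factory=list)

    def derive(self, part_id: str, part_type: str) -> "Part":
        """Create a child part that records its provenance."""
        child = Part(part_id, part_type, parent=self)
        self.children.append(child)
        return child

    def lineage(self) -> List[str]:
        """Walk back to the original part, e.g. for reporting to an aggregator."""
        chain, node = [], self
        while node is not None:
            chain.append(f"{node.part_type} ({node.part_id})")
            node = node.parent
        return chain

tissue = Part("MSB:Mamm:12345-heart", "tissue")
sub = tissue.derive("MSB:Mamm:12345-heart-a", "tissue subsample")
extract = sub.derive("MSB:Mamm:12345-dna-1", "DNA extract")
print(" <- ".join(extract.lineage()))
```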
Abstract
The GGBN Data Standard
(https://terms.tdwg.org/wiki/GGBN_Data_Standard) provides a platform
based on a documented agreement to promote the efficient sharing and
usage of genomic sample material and associated specimen information in
a consistent way. It builds upon existing standards commonly used within
the community, extending them with the capability to exchange data on
tissue, environmental and DNA samples as well as sequences. The standard
has been recently extended to support environmental DNA and High
Throughput Sequencing (HTS) library samples. Both eDNA and HTS library
sample use cases have been published in the GGBN Sandbox
(http://sandbox.ggbn.org) and will be presented here. The use case
collection is documented in the GGBN wiki
(http://wiki.ggbn.org/ggbn/Use_Case_Collection).
In addition, a general overview of the GGBN Data Portal
(http://www.ggbn.org) will be given. Based on ABCD, DwC and the GGBN
Data Standard, the GGBN Data Portal is the gateway to standardized access
to DNA, tissue and environmental samples and their associated specimens.
The third core piece of GGBN is the GGBN Document Library
(https://library.ggbn.org), today containing more than 300 documents
about research, management and legal aspects of biodiversity biobanks.
We will provide an overview of covered topics and gaps that the
community can help to fill.
Finally, an outlook on goals and priority tasks for the next two years
will be given.
Abstract
The Open Biodiversity Knowledge Management System (OBKMS) is an
end-to-end, eXtensible Markup Language (XML)- and Linked Open Data
(LOD)-based ecosystem of tools and services that encompasses the entire
process of authoring, submission, review, publication, dissemination,
and archiving of biodiversity literature, as well as the text mining of
published biodiversity literature (Fig. 1). These capabilities lead to
the creation of interoperable, computable, and reusable biodiversity
data with provenance linking facts to publications.
OBKMS is the result of a joint endeavour by Plazi and Pensoft lasting
many years. The system was developed with the support of several
biodiversity informatics projects -- initially ViBRANT (Virtual Biodiversity
Research and Access Network for Taxonomy), followed by
pro-iBiosphere, the European Biodiversity Observation Network (EU BON), and
Biosystematics, Informatics and Genomics of the Big 4 Insect Groups
(BIG4). The system includes the following key components:
ARPHA Journal Publishing Platform: a journal publishing platform based
on the TaxPub XML extension for National Library of Medicine (NLM)'s
Journal Publishing Document Type Definition (DTD) (Version 3.0). Its
advanced ARPHA-BioDiv component deals with integrated biodiversity data
and narrative publishing (Penev et al. 2017).
GoldenGATE Imagine: an environment for marking up, enhancing, and
extracting text and data from PDF files, supporting the TaxonX XML
schema. It has specific enhancements for articles containing
descriptions of taxa ("taxonomic treatments") in the field of
biological systematics, but its core features may be used for general
purposes as well.
Biodiversity Literature Repository (BLR): a public repository hosted at
Zenodo (CERN) for published articles (PDF and XML) and images extracted
from articles.
Ocellus/Zenodeo: a search interface for the images stored at BLR.
TreatmentBank: an XML-based repository for taxonomic treatments and data
therein extracted from literature.
The OpenBiodiv knowledge graph: a biodiversity knowledge graph built
according to LOD principles. It uses the RDF data model and the SPARQL
Protocol and RDF Query Language (SPARQL), is open to the public, and is
powered by the OpenBiodiv-O ontology (Senderov et al. 2018).
OpenBiodiv portal:
Semantic search and browser for the biodiversity knowledge graph.
Multiple semantic apps packaging specific views of the biodiversity
knowledge graph.
Supporting tools:
Pensoft Markup Tool (PMT)
ARPHA Writing Tool (AWT)
ReFindit
R libraries for working with RDF and for converting XML to RDF
(ropenbio, RDF4R).
Plazi RDF converter, web services and APIs.
As part of OBKMS, Plazi and Pensoft offer the following services beyond
supplying the software toolkit:
Digitization through imaging and text capture of paper-based or
digitally born (PDF) legacy literature.
XML markup of both legacy and newly published literature (journals and
books).
Data extraction and markup of taxonomic names, literature references,
taxonomic treatments and organism occurrence records.
Export and storage of text, images, and structured data in data
repositories.
Linking and semantic enhancement of text and data, bibliographic
references, taxonomic treatments, illustrations, organism occurrences
and organism traits.
Re-packaging of extracted information into new, user-demanded outputs
via semantic apps at the OpenBiodiv portal.
Re-publishing of legacy literature (e.g., Flora, Fauna, and Mycota
series, important biodiversity monographs, etc.).
Semantic open access publishing (including data publishing) of journals
and books.
Integration of biodiversity information from legacy and newly published
literature into interoperable biodiversity repositories and platforms
(Global Biodiversity Information Facility (GBIF), Encyclopedia of Life
(EOL), Species-ID, Plazi, Wikidata, and others).
In this presentation we make the case for why OpenBiodiv is an essential
tool for advancing biodiversity science. Our argument is that through
OpenBiodiv, biodiversity science makes a step towards the ideals of open
science (Senderov and Penev 2016). Furthermore, by linking data from
various silos, OpenBiodiv allows for the discovery of hidden facts.
A particular example of how OpenBiodiv can advance biodiversity science
is demonstrated by OpenBiodiv's solution to "taxonomic anarchy"
(Garnett and Christidis 2017). "Taxonomic anarchy" is a term coined by
Garnett and Christidis to denote the instability of taxonomic names as
symbols for taxonomic meaning. They propose an "authoritarian"
top-down approach to stabilize the naming of species. OpenBiodiv, on the
other hand, relies on taxonomic concepts as integrative units and
therefore integration can occur through alignment of taxonomic concepts
via Region Connection Calculus (RCC-5) (Franz and Peet 2009). The
alignment is \"democratically\" created by the users of system but no
consensus is forced and \"anarchy\" is avoided by using unambiguous
taxonomic concept labels (Franz et al. 2016) in addition to Linnean
names.
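To make the concept-alignment idea concrete, the sketch below records RCC-5 articulations between taxonomic concept labels and applies one toy inference rule; it is illustrative only and not the OpenBiodiv reasoner.

```python
# Minimal sketch of recording RCC-5 articulations between taxonomic concepts,
# the integrative units OpenBiodiv relies on. The relations and the toy
# inference rule below are illustrative; this is not the OpenBiodiv reasoner.
from enum import Enum

class RCC5(Enum):
    EQUALS = "=="            # concepts circumscribe the same set of organisms
    INCLUDES = ">"           # first concept properly includes the second
    INCLUDED_IN = "<"        # first concept is properly included in the second
    OVERLAPS = "><"          # concepts share some but not all members
    DISJOINT = "!"           # concepts share no members

# Articulations between concept labels of the form "Name sec. Treatment"
articulations = {
    ("Aus bus sec. Smith 1990", "Aus bus sec. Jones 2005"): RCC5.INCLUDED_IN,
    ("Aus bus sec. Jones 2005", "Aus cus sec. Jones 2005"): RCC5.DISJOINT,
}

def compose(r1: RCC5, r2: RCC5) -> set:
    """Tiny composition-table fragment: what can follow from r1 then r2?"""
    if r1 == RCC5.EQUALS:
        return {r2}
    if r1 == RCC5.INCLUDED_IN and r2 == RCC5.DISJOINT:
        return {RCC5.DISJOINT}   # a subset of something disjoint is also disjoint
    return set(RCC5)             # otherwise: unconstrained in this toy fragment

inferred = compose(articulations[("Aus bus sec. Smith 1990", "Aus bus sec. Jones 2005")],
                   articulations[("Aus bus sec. Jones 2005", "Aus cus sec. Jones 2005")])
print("Aus bus sec. Smith 1990 vs. Aus cus sec. Jones 2005:", inferred)
```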
Abstract
The temporality of specimens is an often overlooked but essential
part of using aggregated biodiversity occurrences for research,
especially when millions of these occurrences exist in deep time.
Presently in Darwin Core, there are terms for describing the geological
context of specimens, which is needed for paleontological specimens.
However, information about the contextual absolute date associated with
a specimen, and how that date was generated, is not supported in Darwin
Core, but would strongly enhance usability for research. Providers do
occasionally try provisioning this information, but it is currently
hidden in a few different Darwin Core fields, making it hard to discover
and nearly impossible to search for in biodiversity portals. Here we
provide an overview of where absolute date content for paleontological
and archaeological specimens is currently found in published specimen
records. We will then introduce a working Darwin Core extension that
focuses on chronometric content, and demonstrate the use of this
extension with published datasets from the zooarchaeological and
paleontological communities. This advancement will allow providers
to make these crucial data available, and researchers to easily find the
temporal range associated with an occurrence, evaluate how this range
was determined, and compile occurrences based on their shared ages,
helping to streamline the research process.
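A minimal sketch of how such chronometric content could travel alongside an occurrence record in a star-schema extension follows; the term names are illustrative assumptions rather than the published extension.

```python
# Minimal sketch of an occurrence record carrying chronometric age data in a
# star-schema extension row, as used in Darwin Core Archives. The extension
# term names below are illustrative assumptions, not the published extension.

occurrence = {
    "occurrenceID": "urn:uuid:example-occurrence-1",
    "scientificName": "Bison antiquus",
    "basisOfRecord": "FossilSpecimen",
}

chronometric_age = {
    "occurrenceID": occurrence["occurrenceID"],   # foreign key back to the core row
    "chronometricAgeProtocol": "radiocarbon dating of bone collagen",
    "earliestChronometricAge": 11200,             # years before present
    "latestChronometricAge": 10800,
    "chronometricAgeUncertaintyInYears": 150,
}

def temporal_range(ext_row: dict) -> tuple:
    """Return the (oldest, youngest) bounds a researcher could filter on."""
    return (ext_row["earliestChronometricAge"], ext_row["latestChronometricAge"])

print(occurrence["scientificName"], temporal_range(chronometric_age))
```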
Abstract
Important initiatives, such as the Convention on Biological Diversity's
(CBD) Aichi Targets and the United Nations' 2030 Agenda for Sustainable
Development (and its Sustainable Development Goals), highlight the urgent
need to stop the continuous and increasing loss of biodiversity. That
requires an increase in the knowledge that will allow for sustainable
use of natural resources. To accomplish that, detailed studies are
needed to evaluate multiple species and regions. These studies demand
great effort from professionals, searching for species and/or observing
their behavior. In this case, the use of new monitoring devices could be
beneficial in data collection and identification, optimizing the
specialist effort to detect and observe species in situ. With the
advance of technology platforms for developing connected devices and
sensors, associated with the evolution of the Internet of Things (IoT)
concepts, and the advances of unmanned aerial vehicles (UAVs) and
wireless sensor networks (WSNs), new scenarios in biodiversity studies
are possible. The technology available now could allow studies applying
relatively cheap sensors with long-range (approx. 15 km), low-power,
low-bit-rate communication and up to 10-year battery life, using a Low
Power Wide Area Network (LPWAN) and with the capacity to run bioacoustic or
image-processing detection. Platforms like Raspberry Pi or any other
with signal processing capabilities can be applied (Hodgkinson and Young
2016). Sensor protocols applied in IoT networks are usually
simple and flexible. Common semantics and metadata definitions are
necessary to extract information and representations to construct
complex networks. Some of these metadata definitions can be adopted from
the current Darwin Core schema. However, Darwin Core evolved based on
enterprise technologies (e.g. XML) and relational database definitions,
which usually need machines with significant bandwidth to transmit data.
Today the technology scenario is taking another route, going from
centralized to distributed architectures, occasionally applying
non-relational and distributed databases, ready to deal with
synchronization and eventual consistency problems. These distributed
databases are usually employed to construct complex networks, where
relation restrictions are not mandatory or, sometimes, even desired
(Baggio et al. 2016). With these new techniques becoming a reality in
biodiversity conservation studies, new metadata definitions are
necessary. These new metadata definitions need to standardize and create a shared
vocabulary that includes requirements for device information exchange,
data analytics, and model generation. Also, these new definitions could
incorporate the Essential Biodiversity Variables (EBVs) concept, which aims
to identify the minimum set of variables that can be used to inform
scientists, managers and decision makers (Haase et al. 2018). For this
reason, we propose the insertion of EBV definitions in the construction
of sensor integration metadata and models characterization inside the
Darwin Core metadata definitions (Fig. 1).
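The sketch below illustrates the kind of mapping envisaged here: a compact sensor payload is expanded server-side into a Darwin Core-like record carrying an EBV-style measurement; the payload layout, term names and EBV label are assumptions for the example.

```python
# Minimal sketch of expanding a compact LPWAN sensor payload into a Darwin
# Core-like occurrence record plus an EBV-style measurement. The payload
# layout, term names and EBV label are illustrative assumptions.
import json
import struct
from datetime import datetime, timezone

# Hypothetical 16-byte payload: device id (uint16), unix time (uint32),
# latitude and longitude (int32, degrees * 1e5), detection confidence (uint8),
# species code (uint8) -- small enough for a low-bit-rate uplink.
payload = struct.pack(">HIiiBB", 7, 1724054400, 3412345, -3854321, 87, 3)

SPECIES_CODES = {3: "Ramphastos toco"}   # lookup table held server-side

def decode(frame: bytes) -> dict:
    device, ts, lat, lon, conf, code = struct.unpack(">HIiiBB", frame)
    return {
        "occurrenceID": f"sensor:{device}:{ts}",
        "eventDate": datetime.fromtimestamp(ts, tz=timezone.utc).isoformat(),
        "decimalLatitude": lat / 1e5,
        "decimalLongitude": lon / 1e5,
        "scientificName": SPECIES_CODES.get(code, "unknown"),
        "identificationRemarks": f"on-device acoustic detection, confidence {conf}%",
        "measurementType": "species occurrence (EBV: species distribution)",
    }

print(json.dumps(decode(payload), indent=2))
```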
Abstract
The Specialized Information Service Biodiversity Research (BIOfid;
http://biofid.de/) has recently been launched to mobilize valuable
biodiversity data hidden in German print sources of the past 250 years.
The partners involved in this project started digitisation of the
literature corpus envisaged for the pilot stage and provided novel
applications for natural language processing and visualization. In order
to foster development of new text mining tools, the Senckenberg
Biodiversity Informatics team focuses on the design of ontologies for
taxa and their anatomy. We present our progress for the taxa prioritized
by the target group for the pilot stage, i.e. for vascular plants, moths
and butterflies, as well as birds. With regard to our text corpus, a key
aspect of our taxonomic ontologies is the inclusion of German vernacular
names. For this purpose we assembled a taxonomy ontology for vascular
plants by synchronizing taxon lists from the Global Biodiversity
Information Facility (GBIF) and the Integrated Taxonomic Information
System (ITIS) with K.P. Buttler's Florenliste von Deutschland
(http://www.kp-buttler.de/florenliste/). Hierarchical classification of
the taxonomic names and class relationships focus on rank and status
(validity vs. synonymy). All classes are additionally annotated with
details on scientific name, taxonomic authorship, and source. Taxonomic
names for birds are mainly compiled from ITIS and the International
Ornithological Congress (IOC) World Bird List, for moths and butterflies
mainly from GBIF, both lists being classified and annotated accordingly.
We intend to cross-link our taxonomy ontologies with the Environment
Ontology (ENVO) and anatomy ontologies such as the Flora Phenotype
Ontology (FLOPO). For moths and butterflies we started to design the
Lepidoptera Anatomy Ontology (LepAO) on the basis of the already
available Hymenoptera Anatomy Ontology (HAO). LepAO is planned to be
interoperable with other ontologies in the framework of the OBO Foundry.
A main modification of HAO is the inclusion of German anatomical terms
from published glossaries that we add as scientific and vernacular
synonyms to make use of already available identifiers (URIs) for
corresponding English terms. International collaboration with the
founders of HAO and teams focusing on other insect orders such as
beetles (ColAO) aims at the development of a unified Insect Anatomy
Ontology. By restricting itself to terms applicable to all insects, the
unified Insect Anatomy Ontology is intended to establish a basis for
accelerating the design of more specific anatomy ontologies for any
particular insect order. The advancement of such ontologies aligns with
current needs to make knowledge accumulated in descriptive studies on
the systematics of organisms accessible to other domains. In the context
of BIOfid, our ontologies provide exemplars of how semantic queries of
yet untapped data relevant for biodiversity studies can be achieved for
literature in non-English languages. Furthermore, BIOfid will serve as
an open access platform for professional international journals
facilitating non-commercial publishing of biodiversity and
biodiversity-related data.
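As a small illustration of attaching German vernacular names to taxon classes, the sketch below uses the rdflib library to annotate a class with scientific and vernacular labels; the URIs and choice of annotation properties are assumptions, not the BIOfid ontologies themselves.

```python
# Minimal sketch of annotating a taxon class with a German vernacular name
# alongside the scientific name. The namespace URI and the choice of
# annotation properties are illustrative assumptions, not the BIOfid ontologies.
from rdflib import Graph, Namespace, Literal
from rdflib.namespace import RDF, RDFS, OWL, SKOS

EX = Namespace("https://example.org/taxon-sketch/")   # placeholder namespace

g = Graph()
g.bind("skos", SKOS)
g.bind("ex", EX)

taxon = EX["Fagus_sylvatica"]
g.add((taxon, RDF.type, OWL.Class))
g.add((taxon, RDFS.label, Literal("Fagus sylvatica L.")))
g.add((taxon, SKOS.altLabel, Literal("Rotbuche", lang="de")))       # vernacular name
g.add((taxon, SKOS.altLabel, Literal("European beech", lang="en")))
g.add((taxon, RDFS.subClassOf, EX["Fagus"]))                        # rank hierarchy

print(g.serialize(format="turtle"))
```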
Abstract
Field data collection by Citizen Scientists has been hugely assisted by
the rapid development and spread of smart phones as well as apps that
make use of the integrated technologies contained in these devices. We
can improve the quality of the data by increasing utilisation of the
device's in-built sensors and improving the software user interface.
Improvements to data timeliness can be made by integrating directly with
national and international biodiversity repositories, such as the Atlas
of Living Australia (ALA).
I will present two Citizen Science apps that we developed for the
conservation of two of Australia's iconic species -- the koala and the
echidna. First is the Koala Counter app used in the Great Koala Count 2
-- a two-day Blitz-style population census. The aim was to improve both
the recording of citizen science effort as well as to improve the
recording of "absence" data which would improve population modelling.
Our solution was to increase the transparent use of the phone sensors as
well as to provide an easy-to-use interface. Second is the
EchidnaCSI app -- an observational tool for collecting sightings and
samples of echidna.
From a software developer's perspective, I will provide details on
multi-platform app development as well as collaboration and integration
with the Australian national biodiversity repository -- the Atlas of
Living Australia. Preliminary analysis regarding data quality will be
presented along with lessons learned and paths for future research. I
also seek feedback and further ideas on possible enhancements or
modifications that might usefully be made to improve these techniques.
Abstract
Scratchpads are an online Virtual Research Environment (VRE) for
biodiversity scientists, allowing anyone to share their data and create
their own research networks (http://scratchpads.eu/). In operation since
2007, the platform has supported more than 1,000 communities in their
efforts to share, manage and aggregate information on the natural world.
Funded through a series of European Commission and United Kingdom
research council grants, the platform reached a height of popularity in
2014 with more than 14,500 users, but high levels of usage, coupled with
the difficulty of sustaining external funding, led to a significant
decline in the quality of service provision and support available to the
project. Consequently, the Scratchpads service was closed to new
communities in October 2016 and was managed on an essential care and
maintenance basis until new permanent funding became available in
December 2017. Despite these challenges, the Scratchpads system continues
to be used by a loyal community of taxonomists and systematists. As part
of our efforts to stabilise the platform and develop a sustainable
future for its users, we present our findings from an in-depth analysis
of Scratchpad usage metrics and user behaviour. We investigate the
growth of the Scratchpads since their inception; how global taxonomic
concepts have been generated, used and adapted; the geographical and
taxonomic coverage of Scratchpads; the functionality most popular with
users, and those features that failed to gain traction with the
community; and finally how aggregated data was used and modified by
select user communities. Our presentation examines the challenges of
maintaining a complex digital project once funding expires and the
initial project team disperses. We conclude with a summary of the
Scratchpad software development roadmap based on this quantitative
analysis of user behaviour. This analysis is informing the future of the
Scratchpads system and identifying how VREs for the biodiversity data
community might be developed to provide a more integrated and
sustainable solution to the problem of community management for
biodiversity data.
Abstract
The quality of data produced by citizen science (CS) programs has been
called into question by academic scientists, governments, and
corporations. Their doubts arise because they perceive CS groups as
intruding on the rightful opportunities of standard science and industry
organizations, because of a normal skepticism of novel approaches, and
because of a lack of understanding of how CS produces data.
I propose a three-pronged strategy to overcome these objections and
improve trust in CS data.
Develop methods for CS programs to advertise their efforts in data
quality control and quality assurance (QC/QA). As a first step, the PPSR
Core could incorporate a field that would allow programs to point to
webpages that document the QC/QA practices of each program. It is my
experience that many programs think carefully about data quality, but
the CS community currently lacks an established protocol to share this
information.
Define and implement best practices for generating biodiversity data
using different methods. Wiggins et al. (2011) published a list of
approaches that can be used for QC/QA in CS projects, but how these
approaches should be implemented has not been systematically
investigated.
Measure and report data quality. If one takes the point of view that
citizen science is akin to a new category of scientific instruments,
then the ideas of instrument measurement and calibration can be applied
to CS. Scientists are well aware that any instrument needs to be calibrated
before its efficacy can be established. However, because CS is a new
approach, the specific procedures needed for different kinds of programs
are just now being worked out for the first time.
The strategy outlined above faces some specific challenges. Citizen
science biodiversity programs must address two important problems that
standard scientific entities encounter when sampling and monitoring
biodiversity. The first is correctly identifying species. For citizens
this can be a problem because they often do not have the training and
background of scientist teams. Likewise, it may be difficult for CS
projects to manage updating and maintaining the taxonomies of the
species being investigated. A second set of challenges is the diverse
kinds of biodiversity data collected by CS programs. For instance,
Notes from Nature transcribes the labels of museum specimens, Snapshot
Serengeti identifies species of large mammals from camera trap
photographs, iNaturalist collects images of species and then has a
crowdsourced identification process, while eBird collects observations
of birds that are immediately filtered with computer algorithms for
review by the observer and, if subsequently flagged, reviewed by a local
expert. Each of these programs likely requires a different set of best
practices and methods to measure data quality.
Abstract
Pl@ntNet is an international initiative that was the first to attempt
to combine the force of citizen networks with automated
identification tools based on machine learning technologies (Joly et al.
2014). Launched in 2009 by a consortium involving research institutes in
computer sciences, ecology and agriculture, it was the starting point of
several scientific and technological productions (Goëau et al. 2012)
which finally led to the first release of the Pl@ntNet app (iOS in
February 2013 (Goëau et al. 2013) and Android (Goëau et al. 2014) the
following year). Initially based on 800 plant species, the app was
progressively enlarged to thousands of species of the European, North
American and tropical regions. Nowadays, the app covers more than 15,000
species and is adapted to 22 regional and thematic contexts, such as the
Andean plant species, the wild salads of southern Europe, the indigenous
tree species of South Africa, the flora of the Indian Ocean Islands,
the New Caledonian Flora, etc. The app is translated into 11 languages and
is used by more than 3 million end-users all over the world,
mostly in Europe and the US.
The analysis of the data collected by Pl@ntNet users, which represent
more than 24 million observations to date, has high potential
for different ecological and management questions. A recent work
(Botella et al. 2018), in particular, showed that the stream of
Pl@ntNet observations could allow fine-grained and regular monitoring
of some species of interest, such as invasive ones. However, this
requires cautious considerations about the contexts in which the
application is used. In this talk, we will synthesize the results of
this study and present another one related to phenology. Indeed, as the
phenological stage of the observed plants is also recorded, these data
offer a rich and unique material for phenological studies at large
geographical or taxonomical scale. We will share preliminary results
obtained on some important pantropical species (such as Melia
azedarach L. and Lantana camara L.), for which we have detected
significant intercontinental phenological patterns in the project
data.
Abstract
Many organisations running citizen science projects don't have access
to, or the knowledge and means to develop, databases and apps for their
projects. Some are also concerned about long-term data management and
about how to make the data that they collect accessible and impactful in
terms of scientific research, policy and management outcomes. To solve
these issues, the Atlas of Living Australia (ALA) has developed
BioCollect. BioCollect is a sophisticated yet simple-to-use tool that
has been built in collaboration with hundreds of real users who are
actively involved in field data capture. It has been developed to
support the needs of scientists, ecologists, citizen scientists and
natural resource managers in the field-collection and management of
biodiversity, ecological and natural resource management (NRM) data.
BioCollect is a cloud-based facility hosted by the ALA and also includes
associated mobile apps for offline data collection in the field.
BioCollect provides form-based structured data collection for:
Ad-hoc survey-based records;
Method-based systematic structured surveys; and
Activity-based projects such as natural resource management intervention
projects (e.g. revegetation, site restoration, seed collection, weed and
pest management, etc.).
This session will cover how BioCollect is being used for citizen science
in Australia and some of the features of the tool.
Abstract
eBird is a global citizen science project that gathers observations of
birds. The project has been making a considerable contribution to the
collection and sharing of bird observations, even in the data-poorest
countries, and is accelerating the accumulation of bird records
globally. On 22 March 2018 eBird surpassed ½ billion bird observations.
A primary component of ensuring the best quality data is the network of
more than 1300 volunteer reviewers who scour incoming data for accuracy.
Reviewers provide active feedback to participants on everything from
bird identification to best practices for data collection. Since eBird's
inception in 2002, almost 23 million observations have been reviewed,
requiring more than 190,000 hours of effort by reviewers. In this
presentation we review how eBird recruits expert reviewers, describe
their responsibilities, and offer some insight in new developments to
improve the reviewing process.
How are reviewers recruited? There are three primary methods used
to identify new reviewers. First, if we don't have any active
participants in a region (e.g., Kamchatka, Russia), eBird staff search
birding listservs to find an individual who is reporting a lot of
high-quality observations from the area. We then contact those
individuals and offer them the opportunity to review records for the
region. This option has the lowest likelihood of success. Second, if an
individual is submitting a lot of records to eBird from a region that
needs a reviewer we contact them and request their participation. Third,
in much of the world eBird has partner groups. These partner
organizations (e.g., Taiwan, Spain, India, Portugal, Australia, and all
of the Western Hemisphere) recruit their own reviewers. The third method
is the most effective way to gain expert participation.
What does a reviewer do? eBird reviewers work to improve eBird data in
three primary areas. First, they develop and manage the eBird checklist
filters for a region. These filters generate a checklist of birds for a
particular time and location, and determine what records get flagged for
further review. Second, they review flagged records: if an eBird
participant tries to report a species that is not on the checklist, or if
the number of individuals of a species exceeds the filter limit, the
record is flagged, and reviewers contact the observer and request further
documentation. Currently, 57% of all records that are evaluated by
reviewers are validated. Finally, eBird reviewers validate whether the
participant is eBirding correctly. That is, are they correctly filling
out the information on when, where, and how they went birding? It has
been our experience that different types of reviewers are required to
effectively review eBird submissions: those who are good at reviewing
bird records and those who are good at educating observers on how to
participate.
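The sketch below captures the filter idea in miniature: a region/season filter defines expected species and maximum counts, and submissions outside those limits are flagged for a reviewer; the filter values are invented and this is not eBird's implementation.

```python
# Minimal sketch of the checklist-filter idea: a region/month filter defines
# which species are expected and a maximum count, and submissions that fall
# outside those limits are flagged for reviewer attention. The filter values
# are invented for illustration; this is not eBird's actual implementation.

FILTER = {
    # (region, month) -> {species: maximum expected count}
    ("Tompkins_NY", 6): {"Setophaga coronata": 20, "Cardinalis cardinalis": 30},
}

def flag_records(region: str, month: int, checklist: dict) -> list:
    """Return (species, reason) pairs that need reviewer follow-up."""
    limits = FILTER.get((region, month), {})
    flagged = []
    for species, count in checklist.items():
        if species not in limits:
            flagged.append((species, "species not on the regional checklist"))
        elif count > limits[species]:
            flagged.append((species, f"count {count} exceeds filter limit {limits[species]}"))
    return flagged

submission = {"Setophaga coronata": 45, "Pharomachrus mocinno": 1}
print(flag_records("Tompkins_NY", 6, submission))
```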
What are future plans? eBird will move towards more effective reviewer
teams, where the volume of observations can be split amongst a number of
individuals with different strengths, allowing identification experts to
focus on observation-level ID issues and strong communicators to focus
on working with contributors on checklist-level best practices.
Currently, a single eBird review platform handles a broad array of
different reviewing functions. It is our intent to split some of these
functions into multiple platforms. For example, right now all review
happens at the database level of the 'observation': a record of a taxon
at a date and location. Plans are underway to develop tools that will
allow reviewers to work at the entire checklist level (i.e., to more
easily review the accuracy of how all the observations during a
checklist event were submitted), which will enable much more effective
review of checklist-level data quality concerns.
Abstract
Volunteers, researchers and citizen scientists are important
contributors to observation and monitoring databases. Their
contributions thus become part of a global digital data pool that forms the basis for important and powerful tools for conservation, research, education and policy. With the data contributed by citizen scientists also come concerns about data completeness and quality. For data generated by citizen scientists, taxonomic bias effects, where certain species (groups) are underrepresented in observations, are even stronger than for professionally collected data. Identification tools that help
citizen scientists to access more difficult, underrepresented groups,
can help to close this gap.
We are exploring the possibilities of using artificial intelligence for
automatic species identification as a tool to support the registration
of field observations. Our aim is to offer nature enthusiasts the
possibility of automatically identifying species, based on photos they
have taken as part of an observation. Furthermore, by allowing them to
register these identifications as part of the observation, we aim to
enhance the completeness and quality of the observation database. We
will demonstrate the use of automatic species recognition as part of the
process of observation registration, using a recognition model that is
based on deep learning techniques.
We investigated the automatic species recognition using deep learning
models trained with observation data of the popular website
Observation.org (https://observation.org/). At Observation.org data
quality is ensured by a review process of all observations by experts.
Using the pictures and corresponding validated metadata from their
database, models were developed covering several species groups. These
techniques were based on earlier work that culminated in ObsIdentify, a free offline mobile app for identifying species based on pictures taken in the field. The models are also made available as an API web service, which allows for identification by submitting a photo through common HTTP communication, essentially like uploading it through a webpage. This web service was implemented in the observation entry workflows of Observation.org. By providing an automatically generated taxonomic identification with each image, we expect to stimulate existing citizen scientists to generate a larger number of observations covering a broader range of taxa. Additionally, we hope to motivate new citizen scientists to start contributing.
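As an illustration of the web-service workflow, the following Python sketch posts a photo over HTTP and reads back candidate identifications; the endpoint URL, field names and response format are assumptions for illustration, not the actual Observation.org API.

# Hypothetical client for an image-recognition web service over HTTP.
import requests

API_URL = "https://example.org/api/identify"  # placeholder endpoint, not the real service

def identify_photo(path):
    with open(path, "rb") as image:
        response = requests.post(API_URL, files={"image": image}, timeout=30)
    response.raise_for_status()
    # Assume the service returns a ranked list of candidate taxa with scores.
    return response.json()

# for candidate in identify_photo("observation_photo.jpg"):
#     print(candidate["scientificName"], candidate["probability"])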
Additionally, we investigated the use of image recognition for the identification of species in a photo other than the primary subject, for example the identification of the host plant in photos of insects. The Observation.org database contains many such photos that are associated with a single species observation, even though additional species are present in the photo but remain unidentified.
Combining object detection to detect individual species with species
recognition models opens up the possibility of automatically identifying
and counting these species, enhancing the quality of the observations.
In the presentation we will share the initial results of this application of deep learning technology, and discuss the possibilities
and challenges.
Abstract
Specimen labels are written in numerous languages and accurate
interpretation requires local knowledge of place names, vernacular names
and people's names. In many countries more than one language is in
common usage. Belgium, for example, has three official languages.
Crowdsourcing has helped many collections digitize their labels and
generates useful data for science. Furthermore, direct engagement of the
public with a herbarium increases the collection's visibility and
potentially reinforces a sense of common ownership. For these reasons we
built DoeDat, a multilingual crowdsourcing platform forked from Digivol
of the Australian Museum (Figs 1, 2). Some of the useful features we
inherited from Digivol include a georeferencing tool, configurable
templates, simple project management and individual institutional
branding.
Running a multilingual website does increase the work needed to set up and manage projects, but we hope to gain from the broader engagement we can attract. Currently, we are focusing our work on Belgian collections where Dutch and French are the primary languages, but in the future we may expand our languages when we work on our international collections. We also hope that we can eventually merge our code with that of Digivol, so that we can both benefit from each other's developments.
Abstract
The implementation of Citizen Science in biodiversity studies has led
the general public to engage in environmental actions and to contribute
to the conservation of natural resources (Chandler et al. 2017).
Smartphones have become part of the daily lives of millions of people,
allowing the general public to collect data and conduct automatic
measurements at a very low cost. Indeed, a series of Citizen Science
mobile applications have allowed citizens to rapidly record specimen
observations and contribute to the development of large biodiversity databases around the world. Citizen Science applications have a
multitude of purposes, as well as target a variety of taxa, biological
questions and geographical regions.
Brazil is a megadiverse country that includes many threatened species
and biomes. Conservation efforts are urgent and the engagement of civil society is critical. Brazilian dry and wet forests are dominated
by members of the plant family Bignoniaceae, all of which are
characterized by beautiful trumpet-shaped flowers and a big-bang
flowering strategy. Species of the Neotropical Bignoniaceae trees are
popularly known in Brazil as "Ipê" and are broadly cultivated throughout
the country due to the showy flowers and strong wood. Different species
have different flower colors, making their identification relatively easy. The showy and colorful flowers are greatly admired by the local population and the media. Flowering of "Ipês" is triggered by dry climate, lower temperatures and increasing daylight, making this group
an excellent model for phenological and climatic studies involving
Citizen Science.
Here, we developed a multi-platform mobile application focused on the
plant family Bignoniaceae that allows users to contribute phenological
data for species from this plant family. More specifically, through this
application the user is able to provide data about specimen locations,
phenology and date, all of which can be validated by a photograph. This
platform is based on React Native, a hybrid app framework that helps developers reuse code across multiple mobile platforms, making development much more efficient and allowing effort to be focused on the user experience. This technology uses JavaScript as its programming language and Facebook's React as its basis for development. The system is similar to other citizen science apps such as iNaturalist: observations are ranked and improved through positive feedback from the community, strengthening the network of interactions between users and encouraging active participation. The application also allows users to access all previously stored observations and, in turn, suggest improvements to a particular observation.
Furthermore, observations without a correct ID can be stored until
others can suggest a correct identification, maximizing the value of
individual observations and data gathered.
An important aspect of this mobile application is the participation of a
network of experts on this plant family, allowing a rapid and accurate
verification of individual observations. This team of Bignoniaceae
experts is also able to make full use of the data gathered by
correlating climate and phenological patterns. Results from these
analyses are provided to the citizens gathering the data which will, in
turn, stimulate the collection of new data, especially in poorly sampled
locations. This is a very dynamic mobile application, that aims to
engage the civil society with true scientific research, stimulating the
management of natural resources and conservation efforts. Through this
mobile app, we hope to engage the general public into biodiversity
studies by improving their knowledge on an iconic group of Brazilian
plants, while contributing data for scientific studies. The system is
expected to be released in May and will be available at
ipesdobrasil.org.br.
Abstract
The Online Pollen Catalogs Network (RCPol) (http://rcpol.org.br) was
conceived to promote interaction among researchers and the integration
of data from pollen collections, herbaria and bee collections. In order
to structure RCPol work, researchers and collaborators have organized
information on Palynology in four branches: palynoecology,
paleopalynology, palynotaxonomy and spores. This information is
collaboratively digitized and managed using standardized Google
Spreadsheets. These datasets are assessed by the RCPol palynology
experts and when a dataset is compliant with the RCPol data quality
policy, it is published to http://chaves.rcpol.org.br.
Data quality assessment used to be performed manually by the experts and
was time-consuming and inconsistent in detecting data quality problems
such as incomplete and inconsistent information. In order to support
data quality assessment in a more automated and effective way, we are
developing a data quality tool which implements a series of mechanisms
to measure, validate and improve completeness, consistency, conformity,
accessibility and uniqueness of data, prior to a manual expert
assessment. The system was designed according to the conceptual
framework proposed by Task Group 1 of the Biodiversity Data Quality
Interest Group (Veiga et al. 2017). For each sheet in the Google
Spreadsheet, the system generates a set of assertions of measures,
validations and amendments for the records (rows) and datasets (sheets),
according to a profile defined for RCPol. The profile follows the
policies of data quality measurement, validation and enhancement. The
data quality measurement policy encompasses the dimensions of
completeness, consistency, conformity, accessibility and uniqueness.
RCPol uses a quality assurance approach: only data that are compliant
with all the quality requirements are published in the system.
Therefore, its data quality validation policy only considers datasets
with 100% completeness, consistency, conformity, accessibility and
uniqueness. In order to improve the quality in each relevant dimension,
a set of enhancements was defined in the data quality enhancement
policy. Based on this RCPol profile, the system is able to generate
reports that contain measures, validations and amendments assertions
with the method and tool used to generate the assertion. This web-based
system can be tested at http://chaves.rcpol.org.br/admin/data-quality
with the dataset https://docs.google.com/spreadsheets/u/1/d/1gH0aa2qqnAgfAixGom3Gnx6Qp91ZvWhUHPb_QeoIreQ. This system is able to assure that only data
compliant with the data quality profile defined by RCPol are fit for use
and can be published.
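The sketch below illustrates, in Python, the kind of record-level assertions the tool generates (a completeness measure, a validation and a suggested amendment); the field names and the quality-assurance rule are simplified assumptions, not the actual RCPol profile.

# Illustrative sketch, not the RCPol implementation.
REQUIRED_FIELDS = ["scientificName", "pollenGrainSize", "imageUrl"]  # hypothetical profile

def assess_record(row_number, record):
    """Generate measure, validation and amendment assertions for one spreadsheet row."""
    missing = [f for f in REQUIRED_FIELDS if not record.get(f)]
    return {
        "row": row_number,
        "measures": {"completeness": 1 - len(missing) / len(REQUIRED_FIELDS)},
        "validations": {"ALL_REQUIRED_FIELDS_PRESENT": not missing},
        "amendments": {"suggestion": f"fill in {', '.join(missing)}"} if missing else {},
    }

def dataset_is_publishable(records):
    """Quality assurance: publish only if every record passes every validation."""
    return all(assess_record(i, r)["validations"]["ALL_REQUIRED_FIELDS_PRESENT"]
               for i, r in enumerate(records, start=1))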
This system contributes significantly to decreasing the workload of the
experts. Some data may still contain values that cannot be easily
automatically assessed, e.g. validating whether the content of an image matches the respective scientific name, so expert manual assessment remains
necessary. After the system reports that data are compliant with the
profile, a manual assessment must be performed by the experts, using the
data quality report as support, and only after that will the data be
published. The next steps include archival of the data quality reports
in a database, improving the web interface to enable searching and
sorting of assertions, and to provide a machine readable interface for
the data quality reports.
Abstract
Task Group 2 of the TDWG Data Quality Interest Group aims to provide a
standard suite of tests and resulting assertions that can assist with
filtering occurrence records for as many applications as possible.
Currently 'data aggregators' such as the Global Biodiversity Information
Facility (GBIF), the Atlas of Living Australia (ALA) and iDigBio run
their own suite of tests over records received and report the results of
these tests (the assertions); there is, however, no standard reporting mechanism. We reasoned that the availability of an internationally
agreed set of tests would encourage implementations by the aggregators,
and at the data sources (museums, herbaria and others) so that issues
could be detected and corrected early in the process.
All the tests are limited to Darwin Core terms. The ~95 tests, refined from over 250 in use around the world, were classified into four output types: validations, notifications, amendments and measures. Validations test one or more Darwin Core terms, for example, that
dwc:decimalLatitude is in a valid range (i.e. between -90 and +90
inclusive). Notifications report a status that a user of the record
should know about, for example, if there is a user-annotation associated
with the record. Amendments are made to one or more Darwin Core terms
when the information across the record can be improved, for example, if
there is no value for dwc:scientificName, it can be filled in from a
valid dwc:taxonID. Measures report values that may be useful for
assessing the overall quality of a record, for example, the number of
validation tests passed.
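As a hedged illustration of these output types, the Python sketch below shows one validation and one amendment in the spirit of the examples above; the function names, response labels and taxon lookup are illustrative, not the official test definitions.

def validation_decimallatitude_inrange(record):
    """VALIDATION: dwc:decimalLatitude must be between -90 and 90 inclusive."""
    try:
        latitude = float(record["dwc:decimalLatitude"])
    except (KeyError, TypeError, ValueError):
        return "PREREQUISITES_NOT_MET"  # value missing or not a number
    return "COMPLIANT" if -90 <= latitude <= 90 else "NOT_COMPLIANT"

def amendment_scientificname_from_taxonid(record, taxon_lookup):
    """AMENDMENT: propose dwc:scientificName from a resolvable dwc:taxonID."""
    if record.get("dwc:scientificName"):
        return {}  # nothing to amend
    name = taxon_lookup.get(record.get("dwc:taxonID"))
    return {"dwc:scientificName": name} if name else {}

record = {"dwc:decimalLatitude": "95.2", "dwc:taxonID": "urn:lsid:example:42"}
print(validation_decimallatitude_inrange(record))  # NOT_COMPLIANT
print(amendment_scientificname_from_taxonid(record, {"urn:lsid:example:42": "Apus apus"}))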
Evaluation of the tests was complex and time-consuming, but the
important parameters of each test have been consistently documented.
Each test has a globally unique identifier, a label, an output type, a
resource type, the Darwin Core terms used, a description, a dimension
(from the Framework on Data Quality from TG1), an example, references,
implementations (if any), test-prerequisites and notes. For each test,
generic code is being written that should be easy for institutions to
implement -- be they aggregators or data custodians.
A valuable product of the work of TG2 has been a set of general
principles. One example is "Darwin Core terms are either:
literal verbatim (e.g., dwc:verbatimLocality) and cannot be assumed
capable of validation,
open-ended (e.g., dwc:behavior) and cannot be assumed capable of
validation, or
bounded by an agreed vocabulary or extents, and therefore capable of
validation (e.g., dwc:countryCode)".
Another is "criteria for including tests is that they are informative,
relatively simple to implement, mandatory for amendments and have power
in that they will not likely result in 0% or 100% of all record hits." A
third: "Do not ascribe precision where it is unknown."
GBIF, the ALA and iDigBio have committed to implementing the tests once
they have been finalized. We are confident that many museums and
herbaria will also implement the tests over time. We anticipate that
demonstration code and a test dataset that will validate the code will
be available on project completion.
Abstract
In the process of sharing information, it is of the highest importance that
we utilize common codes and signifiers, so that communication is
effective. This process presents a series of complexities that are
related to capturing and transmitting the meaning of the information
despite homonymy, polysemy and synonymy. Biodiversity data sharing is
not exempt from these challenges and understanding the meaning often
requires expert knowledge. For communication to be effective, and
therefore for data to be of maximal re-use, we need common vocabularies
that unequivocally refer us to the same concepts.
The community has agreed upon some vocabularies to structure shared
information, i.e., biodiversity data standards such as the Darwin Core
standard (Wieczorek et al. 2012). The terms in Darwin Core can be
thought of as the names of the columns in a spreadsheet. For example,
there are terms such as genus, stateProvince, sex, etc. This allows us
to capture and share information which we agree belongs under one of
those terms. However, we have not yet reached an agreement on how to
express the permitted values under all those terms, that is,
vocabularies of values. As a simple example, we agree that if we have a
record of an organism that is a female, we will share the fact that it
is a female under the "sex" term, but we could represent female with the
values "female", "fem.", "f.", and other possible abbreviation and
language variants. Other more complex examples, bound to expert
knowledge, include biological taxonomies and how we name distinct
species and species concepts.
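A minimal Python sketch of the idea, assuming an illustrative (not community-agreed) mapping for dwc:sex:

# Illustrative mapping of verbatim values to a controlled vocabulary for dwc:sex.
SEX_VOCABULARY = {
    "female": "female", "fem.": "female", "f.": "female", "f": "female",
    "hembra": "female",  # Spanish
    "male": "male", "m.": "male", "m": "male",
    "macho": "male",     # Spanish
}

def normalise_sex(verbatim):
    """Return the controlled value, or None if the verbatim value is not yet mapped."""
    return SEX_VOCABULARY.get(verbatim.strip().lower())

print(normalise_sex("Fem."))  # female
print(normalise_sex("♀"))     # None: this symbol is not yet in the vocabulary

Values that cannot be mapped make a record harder to discover and reuse, which is exactly the data quality issue described above.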
While many vocabularies exist in the community, we currently do not
possess a full suite of vocabularies of values that apply uniformly
across the biodiversity data community and there is no single repository
to explore the available resources. While some of the available
vocabularies are discipline-specific, many that could be applied more
broadly remain independent and scattered. Additionally, similar lists of
terms that refer to the same concepts can be found in different
languages, but disconnected from one another.
The lack of or non-adherence to vocabularies of values constitutes a
data quality issue, as the heterogeneity in the data renders data less
discoverable and difficult to use. Capturing information in myriad ways risks incomplete and inaccurate transmission of that information. If we cannot be certain that a particular value
unambiguously refers to a particular concept, we cannot assert that a
record containing that value could reliably be used for a particular
purpose. In this context, the construction and use of vocabularies of
values, including the explicit declaration of usage, is a data quality
issue.
From the TDWG Data Quality Interest Group we have begun to tackle this
problem, with the aim of creating a suitable environment for thought and
development of vocabularies of values. Accordingly, a new task group has
been constituted, whose main goals are to:
prepare a scoping document in which we will determine the types of
vocabularies needed (including multi-lingual approaches) and the
strategy for organizing the construction and/or management of
new/existing vocabularies;
develop a common repository to store vocabularies and/or link to
existing ones;
develop best practices for building TDWG vocabularies; and
develop an exemplary vocabulary following the standard format.
This will provide the community with a framework to work on and build
upon vocabularies of values in a way that would allow better
understanding and maximal interoperability.
Abstract
As the world strives towards achieving Sustainable Development Goals,
development planners both at national and local levels have now come to
understand the importance of informed decision-making. Natural resources
management is one of the areas where careful planning is required to
ensure sustainable use of and maximum benefit from the services we get
from ecosystems.
In developing countries, the scarcity of resources (both in terms of
funding and skills) constitutes the main hindrance to the generation of
accurate and timely data and information that would guide planning and
implementation of development strategies. As a result, decisions are
taken on an ad-hoc basis and without possibility of appreciating the
long-term effect of these decisions.
In that regard, Albertine Rift Conservation Society (ARCOS) has
developed a participatory and cost-effective framework to monitor the
status and trends of biodiversity and ecosystem services at the
landscape level and to assess the socio-economic conditions that affect
them.
The approach termed "Integrated Landscape Assessment and Monitoring --
ILAM" uses the Driver-Pressure-State-Impact-Response model and applies a
simple indicators framework that allows teams to collect needed data in a rapid and cost-effective way (Burkhard and Müller 2008).
This approach is flexible enough to be adaptable to the available time
and funding resources and is therefore very suitable to be applied in
the context of the developing world including east-African countries.
This flexibility ranges from the use of GIS and remote sensing techniques
combined with thorough biodiversity field surveys to simple rapid
assessment of key indicators using smaller teams and for short periods
of time in the field.
Since 2013, ARCOS has been biennially conducting ILAM studies in its
five focal landscapes in Rwanda, Uganda and Burundi and the results have
influenced major decisions such as the designation of at least two
wetlands as Ramsar sites and the upgrade of one forest as a national
park.
In addition to this, other planning processes have been informed by the
results of these studies, such as the process to develop the new Rwandan
National Strategy for Transformation for 2017--2024 and the development
of the districts' strategic plans for 2018--2024.
Currently the biodiversity data generated through these studies are being published through the Global Biodiversity Information Facility (GBIF) for wider
access by researchers and educators in the region and a portal, the
ARCOS Biodiversity Information Management System (ARBIMS), has been
established to facilitate sharing of data and information to guide
planning and decision-making in the region.
Abstract
Species-level observational data comprise the largest and
fastest-growing part of the Global Biodiversity Information Facility
(GBIF). The largest single contributor of species observations is eBird,
which so far has contributed more than 361 million records to GBIF.
eBird engages a vast network of human observers (citizen-scientists) to
report bird observations, with the goal of estimating the range,
abundance, habitat preferences, and trends of bird species at high
spatial and temporal resolutions across each species' entire life-cycle.
Since its inception, eBird has focused on improving the data quality of
its observations, primarily in two areas:
ensuring that participants describe how they gathered their observations,
and
ensuring that all observations are reviewed for accuracy.
In this presentation I will review how this is done in eBird.
Standardized Data Collection. eBird gathers bird observations based on how bird watchers typically observe birds; the unit of data collection is the "checklist" of zero or more species, including a count of individuals for each species observed. Participants choose the location
where they made their observations and submit their checklists via
Mobile Apps (50% of all submissions) or the website (50% of all
submissions). All checklists are submitted in a standard format
identifying where, how, and with whom they made their observations.
Mobile apps precisely record locations, the track taken, and the
distance they traveled while making the observations. The start time and
duration of surveys are also recorded. All observers must report whether
they reported all the birds they detected and identified, which allows
analysts to infer absence of birds if they were not reported. All data
are stored within an Oracle data management framework.
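For illustration, the sketch below models the fields such a standardized checklist carries and how a complete checklist supports inferring absences; the field names are assumptions, not the eBird schema.

# Hypothetical model of an eBird-style checklist, for illustration only.
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Tuple

@dataclass
class Checklist:
    location: Tuple[float, float]      # (latitude, longitude) chosen by the observer
    start_time: datetime
    duration_minutes: int
    distance_km: float                 # track length recorded by the mobile app
    observers: List[str]
    protocol: str                      # e.g. "traveling" or "stationary"
    complete: bool                     # did the observer report ALL species detected?
    counts: Dict[str, int] = field(default_factory=dict)  # species -> individuals

    def inferred_absences(self, regional_species: List[str]) -> List[str]:
        """Species treated as absent; only meaningful for complete checklists."""
        if not self.complete:
            return []
        return [s for s in regional_species if s not in self.counts]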
Data Accuracy. The most significant data quality challenge for species
observations is detecting and correctly identifying organisms to
species. The issue involves how to handle both false positives --- the
misidentification of an observed organism, and false negatives---failing
to report a species that was present. The most egregious false positives
can be identified as anomalies that fall outside the norm of occurrence
for a species at a particular time or space. However, false positives
can also be misidentifications of common species. These challenges are
addressed by:
Data-driven filters. eBird's existing data can identify and flag
potentially erroneous records at increasingly fine spatial, temporal,
and user-specific scales. These filters can identify outliers and likely
errors, which are the foundation of the eBird review process. By using
the vetted data to identify outliers, data quality checks run against
expected occurrence probabilities at very fine scales and identify
anomalies during data submission (including on mobile devices).
Incorporate observer expertise scores. Observer differences are the
largest source of variability in eBird data. Assessment of observer
metrics, and the inclusion of these data in species distribution models,
improves analysis output and model performance.
Expert reviewer network. More than 2000 volunteers review records
identified by the data-driven filters and contact data submitters to
confirm their observations. The existing data quality process functions
globally. Currently the approach is focused on misidentified birds, but
in the future will also involve collection event issues (e.g., issues
with protocol, location, or methodology), sensitive species, exotic
species, and better handling of widely observed individual rarities.
Additional tools are also to be developed to help editors improve
efficiency and better prioritize review.
In 2017, 4,107,757 observations representing 4.6% of all eBird records
submitted were flagged for review by the data-driven filters. Of these
records 57.4% were validated and 42.6% were invalidated.
Abstract
From 81 study sites across the United States, the US National Ecological
Observatory Network (NEON) generates >75,000 samples per year. Samples range from soil and dust deposition material to tissue samples (e.g., small mammals and fish), DNA extracts, and whole organisms (e.g., ground beetles and ticks). Samples are collected, processed, and documented
according to protocols that are standardized across study sites and
according to the needs of the ecological research community for future
studies. NEON has faced numerous challenges with managing data related
to these many diverse physical samples, particularly when data are
gathered at numerous steps throughout processing. Here, we share these
challenges as well as solutions, including innovative semantically
driven software tools and processing pipelines that manage data from
each sample's point of collection to its ultimate fate (consumption,
archive facility, or partnering data repository) while maintaining links
across sample hierarchies.
Abstract
What is a provider (or consumer) of biodiversity data to think when one
quality assessment tool asserts that a particular problem exists in
their data, while a different tool asserts that this problem is not
present? Is there a problem with their data? Is there a problem with one
of the tools? The Biodiversity Data Quality Task Group 2 is developing a
suite of standardized descriptions of tests (validations, measures,
amendments) of biodiversity data, implementations of which would be
expected to provide consistent assertions about a particular data set so
that input of identical data sets into two different test suite
implementations will produce the same results (for some meaning of "the
same").
Development of standard test definitions is a big step in the direction
of consistency. More is needed. Clear and detailed specifications for
each test will help. For example, data might have suitable quality for
global change analysis if collecting dates have a temporal resolution of
one year or less. One implementer's test may check if the event date
has a duration of 365 days or less, another might account for leap days,
another might test if the data can be unambiguously binned into single
years. For some data, each implementation will produce different
assertions about the record. If the standard test specification states
which of these meanings apply, then correct implementations should make
identical assertions. To tell, however, if two implementations of a
suite of tests will produce the same result for identical inputs, we need two things: a set of tests (of the tests) and an understanding of what it means for results to be the same. It is
expected that there will be changes in the results of tests of
scientific names over time, and that different authorities will have
different opinions about that set of scientific names. One element of
"the same" is an expectation that results will be the same when test
implementations are run at the same time and with the same
configuration, but not necessarily otherwise.
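The divergence described above can be made concrete with three small Python implementations of the same informal rule ("temporal resolution of one year or less"); this is an illustration of the ambiguity, not any proposed standard test.

from datetime import date

def resolution_ok_365(start: date, end: date) -> bool:
    """Implementation A: duration of 365 days or less."""
    return (end - start).days <= 365

def resolution_ok_leap_aware(start: date, end: date) -> bool:
    """Implementation B: at most one calendar year, accounting for leap days."""
    try:
        one_year_later = start.replace(year=start.year + 1)
    except ValueError:  # start date is 29 February
        one_year_later = start.replace(year=start.year + 1, day=28)
    return end <= one_year_later

def resolution_ok_single_year(start: date, end: date) -> bool:
    """Implementation C: the event can be unambiguously binned into one year."""
    return start.year == end.year

s, e = date(2017, 7, 1), date(2018, 6, 30)
print(resolution_ok_365(s, e), resolution_ok_leap_aware(s, e), resolution_ok_single_year(s, e))
# True True False: three correct-looking implementations, two different assertions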
Consider tests at three levels: First, tests of the internals of a test,
separate from the fitness for use framework (Veiga et al. 2017) or
serialization of test results. At this first level, unit tests are very
appropriate, but these are tightly coupled to the language of
implementation and the unit testing framework, and to the internal
details of the implementation. Unit tests are very effective for
software quality control, but not particularly portable. Second,
consider tests of the output of a suite of tests. At this level (of
integration tests), we are tightly coupled to both the fitness for use
framework and the serialization, and the meaning of "the same" is
important. Different software implementations may be expected to have
different orders of output for the same input, and human readable
comments would be expected to vary (e.g. with internationalization).
Identity of machine readable assertions but in varying orders should be
tolerable, but this is not easily accomplished. Implementation at this
level is difficult. Third, consider tests of the framework output of a
particular test. Order becomes unimportant, only machine readable
framework assertions can be considered, and this is probably the level
to target for testing. Input data for tests could be synthetic, real, or
modified real data. Real data has the advantage of being realistic, but
it is difficult to find real data which contains single issues. Clean
real data into which synthetic error conditions have been introduced is
enticing for test purposes, but risks confusion with real data, so I
propose some standard values for certain Darwin Core terms for
identifying synthetic data.
Abstract
The ability to communicate and assess the quality and fitness for use of
data is crucial to ensure maximum utility and re-use. Data consumers
have certain requirements for the data they seek and need to be able to
check if a data set conforms with these requirements. Data publishers
aim to provide data with the highest possible quality and need to be
able to identify potential errors that can be addressed with the
available information at hand. The development and adoption of data
publication guidelines is one approach to define and meet those
requirements. However, the use of a guideline, the mapping decisions,
and the requirements a dataset is expected to meet, are generally not
communicated with the provided data. Moreover, these guidelines are
typically intended for humans only.
In this talk, we will present 'whip': a proposed syntax for data
specifications. With whip, one can define column-based constraints for
tabular (tidy) data using a number of rules, e.g. how data is structured
following Darwin Core, how a term uses controlled vocabulary values, or
what the expected minimum and maximum values are. These rules are human-
and machine-readable, which communicates the specifications and allows them to be validated automatically in pipelines for data publication and quality assessment, such as Kurator. Whip can be formatted as a (YAML)
text file that can be provided with the published data, communicating
the specifications a dataset is expected to meet. The scope of these
specifications can be specific to a dataset, but can also be used to
express expected data quality and fitness for use of a publisher,
consumer or community, allowing bottom-up and top-down adoption. As
such, these specifications are complementary to the core set of data
quality tests as currently under development by the TDWG Biodiversity
Data Quality Interest Group Task Group 2. Whip rules are currently generic, but more
specific ones can be defined to address requirements for biodiversity
information.
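The Python sketch below conveys the flavour of column-based specifications and automated checking; the rule names and checking code are an illustrative approximation, not whip's actual YAML syntax (see the whip documentation for that).

# Illustrative approximation of whip-style column constraints, not whip itself.
SPEC = {
    "sex": {"allowed": ["female", "male", "undetermined"]},
    "decimalLatitude": {"min": -90, "max": 90},
    "basisOfRecord": {"allowed": ["HumanObservation", "PreservedSpecimen"]},
}

def check_row(row, spec=SPEC):
    """Yield (column, message) for every rule violated in one tabular row."""
    for column, rules in spec.items():
        value = row.get(column)
        if "allowed" in rules and value not in rules["allowed"]:
            yield column, f"{value!r} not in allowed values"
        if "min" in rules and value is not None and float(value) < rules["min"]:
            yield column, f"{value!r} below minimum {rules['min']}"
        if "max" in rules and value is not None and float(value) > rules["max"]:
            yield column, f"{value!r} above maximum {rules['max']}"

print(list(check_row({"sex": "fem.", "decimalLatitude": "95.0",
                      "basisOfRecord": "HumanObservation"})))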
Abstract
Georeferencing helps to fill in biodiversity information gaps, allowing
biodiversity data to be represented spatially to allow for valuable
assessments to be conducted. The South African National Biodiversity
Institute has embarked on a number of projects that have required the
georeferencing of biodiversity data to assist in assessments for
redlisting of species and measuring the protection levels of species.
Data quality is an important aspect of biodiversity information. Due to a lack of standardisation in collection and recording methods, historical biodiversity data collections pose a challenge when it comes to ascertaining fitness for use or determining the quality of data. The quality of historical locality information recorded in biodiversity data collections faces particular scrutiny regarding fitness for use, as this information is critical in performing assessments. A lack of descriptive locality information, or ambiguous locality information, renders most historical biodiversity records unfit for use. Georeferencing should essentially
improve the quality of biodiversity data, but how do you measure the
fitness for use of georeferenced data?
Through the use of the Darwin Core term coordinateUncertaintyInMeters, georeferenced data can be queried to investigate and determine the quality of the georeferenced data produced. My presentation will cover the scope of ascertaining georeferenced data quality through the use of the Darwin Core term coordinateUncertaintyInMeters, the impacts of using
a controlled vocabulary in representing the
coordinateUncertaintyInMeters, and will highlight how SANBI's
georeferencing efforts have contributed to data quality within the
management of biodiversity information.
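A minimal sketch of such a fitness-for-use query in Python; the 2000 m threshold is an arbitrary example, not a SANBI policy.

def fit_for_use(records, max_uncertainty_m=2000):
    """Keep georeferenced records whose stated uncertainty meets a user's threshold."""
    for record in records:
        value = record.get("coordinateUncertaintyInMeters")
        if value in (None, ""):
            continue  # no uncertainty stated, so fitness cannot be assessed
        try:
            uncertainty = float(value)
        except ValueError:
            continue  # non-numeric values (e.g. free text) are excluded
        if uncertainty <= max_uncertainty_m:
            yield record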
Abstract
As part of the Biodiversity Information System on Nature and Landscapes
(SINP), the French National Natural History Museum has been appointed by the French ministry in charge of ecology to develop biodiversity data exchanges. Given there are, quite literally, thousands of different
sources, such a development brings into question the underlying quality
of data. To add complexity, there can be several layers of quality: one
being appraised by the producer himself, one by a regional node, and one
by the national node.
The approach to quality issues was addressed by a dedicated working
group, representative of biodiversity stakeholders in France. The
resulting documents focus on core methodology elements that characterize
a data quality process for taxon occurrences only in the first instance (it may be extended to habitats, geology, etc. in the near future).
Three processes are covered, ensuring:
data conformity by checking for the presence of compulsory elements or
that a given attribute is of the right type,
data consistency by checking information against other information (for example, an end date has to be later than a start date; both checks are sketched in code after this list),
and scientific validation, through either manual (use of expertise) or
automated (comparison with knowledge databases) means, or even a
combined approach that provides users with a quality appraisal of said
data.
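A minimal Python sketch of the first two checks; the compulsory-element list follows Darwin Core names but is an assumption for illustration, not the SINP standard itself.

from datetime import date

COMPULSORY = ["scientificName", "eventDate", "decimalLatitude", "decimalLongitude"]

def conformity_errors(record):
    """Conformity: compulsory elements are present and of the right type."""
    errors = [f"missing {f}" for f in COMPULSORY if not record.get(f)]
    for f in ("decimalLatitude", "decimalLongitude"):
        try:
            float(record.get(f, ""))
        except (TypeError, ValueError):
            errors.append(f"{f} is not a number")
    return errors

def consistency_errors(record):
    """Consistency: information checked against other information in the record."""
    start, end = record.get("startDate"), record.get("endDate")
    if start and end and date.fromisoformat(end) < date.fromisoformat(start):
        return ["endDate is earlier than startDate"]
    return []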
Within the SINP, only data that has passed conformity and consistency
tests can be exchanged, whatever the level of scientific validation. For example, should no expert exist for a specific taxon group, unvalidated data can be shared.
For scientific validation, two processes are used, one automatic that
uses several criteria such as comparison with a national taxonomic
reference database (TAXREF), and with species reference maps. The
combination of all these elements can be used to automatically flag data
for a second, deeper, manual process that allows for further scrutiny in
order to reach a conclusive evaluation. This allows experts to work only
on "doubtful" data, thus saving time.
In the future, other criteria that are currently used with the manual approach, such as congruity, data scarcity on a given
species, determination difficulty, existence of associated proof
(specimen, picture...), knowledge of the ability of the observer,
databases on most frequent determination errors etc., could be added to
the automatic process.
Some elements must be included in the data to allow for comprehensive
testing, and have been included in a national data standard so that the
result of the validation process can be shared with users, allowing them
to judge how the data is fit for their use.
The presentation will deal with how such work was undertaken and how conformity, consistency and scientific validation have been treated and issues solved by the workgroup (for example, a potential backlog of 40 million data records). The presentation will also show how the required
elements could be integrated into the French national standard.
Abstract
The success of Darwin Core and ABCD Schema as flexible standards for
sharing specimen data and species occurrence records has enabled GBIF to
aggregate around one billion data records. At the same time, other
thematic, national or regional aggregators have developed a wide range
of other data indexes and portals, many of which enrich the data by
interpreting and normalising elements not currently handled by GBIF or
by linking other data from geospatial layers, trait databases, etc.
Unfortunately, although each of these aggregators has specific strengths
and supports particular audiences, this diversification produces many
weaknesses and deficiencies for data publishers and for data users,
including: incomplete and inconsistent inclusion of relevant datasets;
proliferation of record identifiers; inconsistent and bespoke workflows
to interpret and standardise data; absence of any shared basis for
linked open data and annotations; divergent data formats and APIs; lack
of clarity around provenance and impact; etc.
The time is ripe for the global community to review these processes.
From a technical standpoint, it would be feasible to develop a shared,
integrated pipeline which harvested, validated and normalised all
relevant biodiversity data records on behalf of all stakeholders. Such a
system could build on TDWG expertise to standardise data checks and all
stages in data transformation. It could incorporate a modular structure
that allowed thematic, national or regional networks to generate
additional data elements appropriate to the needs of their users, but
for all of these elements to remain part of a single record with a
single identifier, facilitating a much more rigorous approach to linked
open data. Most of the other issues we currently face around
fitness-for-use, predictability and repeatability, transparency and
provenance could be supported much more readily under such a model.
The key challenges that would need to be overcome would be around social
factors, particularly to deliver a flexible and appropriate governance
model and to allow research networks, national agencies, etc. to embed
modular components within a shared workflow. Given the urgent need to
improve data management to support Essential Biodiversity Variables and
to deliver an effective global virtual natural history collection, we
should review these challenges and seek to establish a data management
and aggregation architecture that will support us for the coming
decades.
Abstract
Digitized natural history data are enabling a broad range of innovative
studies of biodiversity. Large-scale data aggregators such as Global
Biodiversity Information Facility (GBIF) and Integrated Digitized
Biocollections (iDigBio) provide easy, global access to millions of
specimen records contributed by thousands of collections. A developing
community of eager users of specimen data -- whether locality, image,
trait, etc. -- is perhaps unaware of the effort and resources required
to curate specimens, digitize information, capture images, mobilize
records, serve the data, and maintain the infrastructure (human and
cyber) to support all of these activities. Tracking of specimen
information throughout the research process is needed to provide
appropriate attribution to the institutions and staff that have supplied
and served the records. Such tracking may also allow for annotation and
comment on particular records or collections by the global community.
Detailed data tracking is also required for open, reproducible science.
Despite growing recognition of the value and need for thorough data
tracking, both technical and sociological challenges continue to impede
progress. In this talk, I will present a brief vision of how application
of a DOI to each iteration of a data set in a typical research project
could provide attribution to the provider, opportunity for comment and
annotation of records, and the foundation for reproducible science based
on natural history specimen records. Sociological change -- such as
journal requirements for data deposition of all iterations of a data set
-- can be accomplished using community meetings and workshops, along
with editorial efforts, as were applied to DNA sequence data two decades
ago.
Abstract
DiSSCo (The Distributed System of Scientific Collections) is a Research
Infrastructure (RI) aiming at providing unified physical
(transnational), remote (loans) and virtual (digital) access to the
approximately 1.5 billion biological and geological specimens in
collections across Europe. DiSSCo represents the largest ever formal
agreement between natural science museums (114 organisations across 21
European countries). With political and financial support across 14
European governments and a robust governance model DiSSCo will deliver,
by 2025, a series of innovative end-user discovery, access,
interpretation and analysis services for natural science collections
data.
As part of DiSSCo's developing data model, we evaluate the application
of Digital Objects (DOs), which can act as the centrepiece of its
architecture. DOs have bit-sequences representing some content, are
identified by globally unique persistent identifiers (PIDs) and are
associated with different types of metadata. The PIDs can be used to
refer to different types of information such as locations, checksums,
types and other metadata to enable immediate operations. In the world of
natural science collections, currently fragmented data classes (inter
alia genes, traits, occurrences) that have been derived from the study of physical specimens can be re-united as parts in a virtual container
(i.e., as components of a Digital Object). These typed DOs, when
combined with software agents that scan the data offered by
repositories, can act as complete digital surrogates of the physical
specimens.
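As a schematic illustration of such a typed object, the Python sketch below groups a persistent identifier, integrity information and linked data classes into one container; the attribute names and the example PID are illustrative, not the DiSSCo specification.

from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class DigitalSpecimenObject:
    pid: str                              # globally unique persistent identifier
    object_type: str                      # e.g. "DigitalSpecimen"
    checksum: str                         # integrity information for the bit-sequences
    content_locations: List[str]          # where the bit-sequences can be fetched
    metadata: Dict[str, str] = field(default_factory=dict)
    # PIDs of currently fragmented data classes re-united in this virtual container
    linked_parts: Dict[str, List[str]] = field(default_factory=dict)

specimen = DigitalSpecimenObject(
    pid="20.5000.1025/abc123",            # hypothetical Handle-style PID
    object_type="DigitalSpecimen",
    checksum="sha256:0e5751c0...",        # truncated for readability
    content_locations=["https://repository.example.org/objects/abc123"],
    linked_parts={"genes": ["pid-of-sequence"], "occurrences": ["pid-of-occurrence"]},
)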
In this paper we:
investigate the architectural and technological applicability of DOs for
large scale data RIs for bio- and geo-diversity,
identify benefits and challenges of a DO approach for the DiSSCo RI and
describe key specifications (incl. metadata profiles) for a
specimen-based new DO type.
Abstract
Collections, aggregators, data re-packagers, publishers, researchers,
and external user groups form a complex web of data connections and
pipelines. This forms the natural history infrastructure essential for
collections use by an ever increasing and diverse external user
community. We have made great strides in developing the individual
actors within this system and we are now well poised to utilize these
capabilities to address big picture questions. We need to continue work
on the individual aspects, but the focus now needs to be on integration
of the functionality provided by the actors involved in the pipeline to
facilitate the transfer of data between them with as few human
interventions as possible. In order for the system to function
efficiently and to the benefit of all parties, information, data, and
resources need not only to be integrated efficiently but also to flow in the
reverse direction (attribution) to facilitate collections advocacy and
sustainability. There are unrealized benefits to collections from
inclusion into aggregators and subsequent use by researchers and
publishers. A recent National Science Foundation (NSF) funded Research
Coordination Network (RCN) Biodiversity Collections Network (BCoN) needs
assessment workshop identified a possible solution to the integration
and attribution of collections data and specimen information using a
suite of unique, persistent identifiers for specimen records
(Universally Unique Identifiers or UUIDs), datasets (Digital Object
Identifiers or DOIs) and institutions/collections (Cool Uniform Resource
Identifiers or Cool URIs). This talk will highlight this potential
workflow and the work needed to achieve this solution while soliciting
participation from actors in the pipeline and the community at large.
Abstract
Increasing the number of occurrence records available for biodiversity
research requires developing efficient pipelines from collectors and
observers to data aggregators and then marketing those pipelines to
biodiversity researchers. To be effective, these pipelines must
recognize that in many countries, internet access is slow, intermittent,
or expensive; cell phone internet access may be more common but many
people cannot afford the costs associated with using a cell phone for
databasing. The pipelines must also make it easy for users to provide
high quality data that conforms to international biodiversity data
standards. Marketing of these pipelines should include building
understanding of these standards and enable data providers to benefit
almost immediately from their contributions. Symbiota has succeeded in
making over 32 million specimen records available but most come from the
United States, a country with fast and reliable internet access in most
regions. We have established two Symbiota-based websites, OpenHerbarium
and OpenZooMuseum, to enable collectors and collections in Old World
countries that lack a national network, to become contributors to and
participants in the global biodiversity data sharing community. Talking
with biodiversity researchers in such countries has clarified the many
impediments to data sharing faced by their collectors and collections.
In this presentation, we shall describe the steps we have taken, and are
proposing to take, to improve the pipeline for collectors and
collections in countries with poor internet access.
Abstract
VertNet (vertnet.org) is a collaborative project that makes biodiversity
data free and available on the web. VertNet is also a tool designed to
help people discover, improve, and publish biodiversity data. It is also
the core of a collaboration between hundreds of biocollections that
contribute biodiversity data and work together to improve it. VertNet
has its genesis in the late 1990s and the very beginnings of vertebrate
collections data sharing, and is nearing its 20th birthday. The small
team that coordinates VertNet efforts long recognized the value of
archival versions of VertNet data separate from individual published
Darwin Core Archives. Here we describe why we produce what we call
"snapshots" of the VertNet index. To understand the snapshots, it is
important to also know how the VertNet indexing process works, which
includes efforts at better flagging record types and special content of
particular value to data consumers. We provide a brief explanation of
the process we developed for creating these snapshots, focusing on how
to assure their citation and licensing, and how to decide the scope of
different snapshots. We also discuss the collaborative process of
deciding infrastructure for archiving those snapshots, and our thinking
about timing of new snapshots. In particular, we cover the use of Google
BigQuery to produce snapshots and CyVerse as infrastructure for archival
storage.
Abstract
The South African Institute for Aquatic Biodiversity (SAIAB) operates
several research platforms, which may be used by the broader South
African research community (e.g. a marine research vessel and a remotely
operated underwater vehicle). SAIAB's Enterprise-grade data centre,
along with expertise in systems administration and biodiversity
information management, allow the institute to offer a Biodiversity
Information Management Platform.
Data hosted by SAIAB are replicated across three data centres, each at least 250 m from the others and operating independently.
Infrastructure at two data centres replicates in real time, forming a
high availability cluster. The third data centre is dedicated to storing
backups. High-capacity tape backup will be added in the near future. As
an additional measure, cloud storage is used to store daily extracts of
Specify databases, which are retained for one year.
In the first instance, the Platform aims to provide SAIAB researchers
and associates with biodiversity data curation services. This begins
with support for the SAIAB Collections Division, to ensure that voucher
specimens, tissue samples and associated media are accurately catalogued
and can be easily retrieved. Biodiversity data curation is broader than
this. It also means that any biodiversity data/metadata (records of
species, events, occurrences/observations and traits) can potentially be
curated using Specify Software, and standardised and published (subject
to relevant policies) to the GBIF Data Portal using the GBIF Integrated
Publishing Toolkit. The use of Specify Software to curate biodiversity
data that do not represent voucher specimens (e.g. underwater images and
video) is a new research project within SAIAB, which has the potential
to be extended beyond SAIAB.
A new national initiative, the Natural Science Collections Facility
(NSCF), was launched in 2017 to reinvigorate natural science museums
across the country, to halt deterioration of specimens and improve
capacity for specimen and data curation.
In support of the NSCF, the SAIAB platform is offered to natural science
museums in South Africa (excluding herbaria, which are all part of or
affiliated with SANBI, and therefore accommodated by a different
system). Each museum will be provided with a webserver, Specify 7
database, Specify web portal and IPT server.
In offering this platform to the broader South African Biodiversity
Science community, SAIAB is primarily motivated by the potential for
collaborative research in capacity development for biodiversity data
curation / information management, using Specify Software. The first
research project will examine participating museums' capacity to use the
Specify Workbench sustainably, to import new voucher/occurrence records
generated by fieldwork. The requisite training to enhance this potential
will be provided.
The Natural Science Collections Facility (NSCF) is an important
collaborator in the context of enhancing the general state of South
Africa's specimen collections, and the Specify Collections Consortium is another important collaborator, specifically for support.
Abstract
Managing digital data for long-term archival and disaster recovery is a
key component of our collective responsibility in managing digital data
and metadata. As more and more data are collected digitally and as the
metadata for traditional museum collections becomes both digitized and
more comprehensive, the need to ensure that these data are safe and
accessible in the long term becomes essential. Unfortunately, disasters
do occur and many irreplaceable datasets on biodiversity have been
permanently lost. Maintaining a long-term archive and putting in place
reliable disaster recovery processes can be prohibitively expensive,
both in the cost of hardware and software as well as the costs of
personnel to manage and maintain an archival system. Traditionally,
storing digital data for the long term and ensuring the data are
loss-less, safe and completely recoverable when a disaster occurs has
been managed on-premises with a combination of on-site and off-site
storage. This requires complex data workflows to ensure that all data
are securely and redundantly stored in multiple highly dispersed
locations to minimize the threat of data loss due to local or regional
disasters. Files are often moved multiple times across operating systems
and media types on their way to and from a deep archive, increasing the
risk of file integrity issues. With the recent advent of an array of
Cloud Services from organizations such as Amazon, Microsoft and Google
to more focused offerings from Iron Mountain, Atempo and others, we have
a number of options for long term archival of digital data. Deep archive
solutions, storage where retrieval expected only in the case of a
disaster, are offered by many of these organizations at a rate
substantially less than their normal data storage fees.
The most basic requirement for an archival system is storing multiple
replicates of the data in geographically isolated locations with a
mechanism for guaranteeing file integrity, usually using a checksum
algorithm. Additional components that are integral to a robust archive
include a simple metadata search and reliable retrieval.
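A minimal Python sketch of the checksum requirement described above, assuming SHA-256 and a simple manifest of expected values:

import hashlib

def sha256sum(path, chunk_size=1024 * 1024):
    """Compute a SHA-256 checksum without loading the whole file into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_replicas(manifest):
    """manifest maps file paths (across replicas) to their expected checksums."""
    return {path: sha256sum(path) == expected for path, expected in manifest.items()}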
In this presentation, we'll discuss the need for long term archive and
disaster recovery capabilities, detail the current best practices of
data archival systems and review a variety of archival options that have
become available with Cloud Services.
Abstract
The Cornell Lab of Ornithology gathers, utilizes and archives a wide
variety of digital assets ranging from details of a bird observation to
photos, video and sound recordings. Some of these datasets are fairly
small, while others are hundreds of terabytes. In this presentation we
will describe how the Lab archives these datasets to ensure the data are
both loss-less and recoverable in the case of a widespread disaster, how
the archival strategy has evolved over the years and explore in detail
the current hybrid cloud storage management system.
The Lab runs eBird and several other citizen science programs focused on
birds where individuals from around the globe enter their sightings into
a centralized database. The eBird project alone stores over 500,000,000
observations and the underlying database is over a terabyte in size.
Birds of North America, Neotropical Birds and All About Birds are online
species accounts comprising a wide range of authoritative life history
articles maintained in a relatively small database. Macaulay Library is
the world's largest image, sound and video archive with over 6,000,000
cuts totaling nearly 100 TB of data. The Bioacoustics Research Program
utilizes automated recording units (SWIFTs) in the forests of the US,
jungles of Africa and in all seven oceans to record the environment.
These units record 24 hours a day and gather a tremendous amount of raw
data, over 200 TB to date with an expected rate of an additional 100TB
per year. Lastly, BirdCams run by the Lab add a steady stream of media detailing the reproductive cycles of a number of species. The Lab is committed to making these archives of the natural world available for research and conservation today. More importantly, ensuring these data exist and are accessible in 100 years is a critical component of the Lab's data strategy.
The data management system for these digital assets has been completely
overhauled to handle the rapidly increasing volume and to utilize
on-premises systems and cloud services in a hybrid cloud storage system
to ensure data are archived in a manner that is redundant, loss-less and
insulated from disasters yet still accessible for research. With
multimedia being the largest and most rapidly growing block of data,
cost rapidly becomes a constraining factor of archiving these data in
redundant, geographically isolated facilities. Datasets with a smaller
footprint, such as eBird and the species accounts, allow for a wider variety of solutions, as cost is less of a factor. Using different methods to take
advantage of differing technologies and balancing cost vs recovery
speed, the Lab has implemented several strategies based on data
stability (eBird data are constantly changing), retrieval frequency
required for research and overall size of the dataset. We utilize Amazon
S3 and Glacier as our media archive, we tag each media in Glacier with a
set of basic DarwinCore metatdata fields that key back to a master
metadata database and numerous project specific databases. Because these
metadata databases are much smaller in size, yet critical in searching
and retrieval of a required media file, they are archived differently
with up to the minute replication to prevent any data loss due to an
unexpected disaster. Because the media files themselves are tagged with a
standard set of basic metadata, specific media files and their basic
metadata can still be retrieved even if the metadata databases become
unavailable.
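As a minimal sketch of this approach (assuming the AWS command line interface; the bucket name, object key and tag values below are placeholders rather than the Lab's actual configuration), a single media file could be archived to Glacier-class storage and tagged with a few basic Darwin Core fields as follows:
# Archive one media file to Glacier-class storage in S3 (placeholder bucket and key)
aws s3 cp ML12345678.wav s3://example-media-archive/macaulay/ML12345678.wav \
    --storage-class GLACIER
# Attach basic Darwin Core tags that key back to the metadata databases
aws s3api put-object-tagging \
    --bucket example-media-archive \
    --key macaulay/ML12345678.wav \
    --tagging '{"TagSet":[{"Key":"dwc:catalogNumber","Value":"ML12345678"},{"Key":"dwc:scientificName","Value":"Turdus migratorius"},{"Key":"dwc:eventDate","Value":"2018-05-21"}]}'
Retrieval then works in the opposite direction: even without the metadata databases, the tags on each object identify what the file is and which record it belongs to.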
This system has allowed the Lab to place hundreds of terabytes of data
into long-term archive, store them in redundant, geographically isolated
locations, and provide for complete disaster recovery of the data and
metadata.
Abstract
Validation using schemas and tools like the Darwin Core Archive
Validator from GBIF is mainly seen as a method of checking data quality
and fitness for use, but it is also important for long-term preservation.
We may like to think that our present (meta)data standards and formats
are made for eternity, but in reality we know that standards evolve,
formats change (some even become obsolete with time), and so do our
needs for storage, searching and future dissemination for re-use. So we
might eventually come to a point where transformation of our archival
records and migration to other formats will be necessary. This could
also mean that even if the AIPs (the Archival Information Packages) stay
the same in storage, the DIPs (the Dissemination Information Packages)
that we want to extract from the archive are subject to changes of
format. Further, in order for archival information packages to be
self-sustainable as required in the OAIS model, it is important to take
interdependencies between individual files in the information packages
into account, as early as the time of ingest and validation of the SIPs
(the Submission Information Packages), and along the line at different
points of necessary transformation / migration (from SIP to AIP, from
AIP to DIP etc.) to counter obsolescence. Validation schemas and
transformation code should also be archived together with the AIPs. By
ensuring compliance with standards, these tools are essential in
controlling uniformity of records in a collection, for future needs of
transformation and migration to new, sustainable formats. An example is
given of the problems encountered in transforming only a small,
relatively well-defined collection of about 1000 archival items, but with
substantial variations between them, due to a lack of effective input
constraints and validation at ingest.
A further assessment is made of validation errors encountered in some
Darwin Core Archives comprising thousands of records from some hundred
published datasets, and how these errors might affect a future potential
transformation / migration effort. Migration efforts must necessarily be
general in scope, while errors in datasets from non-compliance with
standards risk being reinforced or aggravated in the transformation
process, making the information contained in the resulting records more
difficult to interpret. The conclusion is that efforts should be made,
e.g. by embedding validation measures into upload forms and other
methods of information transfer (e.g. FTP, OAI-PMH), to ensure
compliance with standards that is as close as possible, already at the
time of ingest.
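As a minimal illustration of embedded validation at ingest (assuming a Darwin Core Archive whose meta.xml descriptor is checked against a locally stored copy of the Darwin Core text schema, here named tdwg_dwc_text.xsd), a submission could be accepted or rejected with a single schema-validation call:
# Validate the archive descriptor of an incoming SIP before accepting it
xmllint --noout --schema tdwg_dwc_text.xsd dwca/meta.xml \
  && echo "SIP accepted for ingest" \
  || echo "SIP rejected: fix validation errors and resubmit"
The same command, archived alongside the AIP together with the schema itself, documents exactly which constraints the records were required to meet.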
Abstract
Biodiversity Information Serving our Nation - BISON (bison.usgs.gov) is
the U.S. node to the Global Biodiversity Information Facility
(gbif.org), containing more than 375 million documented locations for
all species in the U.S. It is hosted by the United States Geological
Survey (USGS) and includes a web site and an application programming
interface (API) that apps and other websites can use for free. With this massive
database one can see not only the 15 million records for nearly 10
thousand non-native species in the U.S. and its territories, but also
their relationship to all of the other species in the country as well as
their full national range. Leveraging this huge resource and its
enterprise level cyberinfrastructure, USGS BISON staff have created a
value-added feature by labeling non-native species records, even where
contributing datasets have not provided such labels.
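As a hedged illustration only (the endpoint and parameter names below are recalled from the public BISON documentation and should be verified at bison.usgs.gov before use), a search for occurrence records of one non-native species might look like this:
# Query the BISON search API for zebra mussel occurrence records (assumed parameters)
curl -s "https://bison.usgs.gov/api/search.json?species=Dreissena%20polymorpha&type=scientific_name&count=10" \
  | python -m json.tool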
Based on our ongoing four-year compilation of non-native species
scientific names from the literature, specific examples will be shared
about the ambiguity and evolution of terms that have been discovered, as
they relate to invasiveness, impact, dispersal, and management. The idea
of incorporating these terms into an invasive species extension to
Darwin Core has been discussed by Biodiversity Information Standards
(TDWG) working group participants since at least 2005. One roadblock to
the implementation of this standard's extension has been the diverse
terminology used to describe the characteristics of biological
invasions, terminology which has evolved significantly over the past
decade.
Abstract
Reducing the damage caused by invasive species requires a community
approach informed by rapidly mobilized data. Even if local stakeholders
work together, invasive species do not respect borders, and national,
continental and global policies are required. Yet, in general, data on
invasive species are slow to be mobilized, often of insufficient quality
for their intended application and distributed among many stakeholders
and their organizations, including scientists, land managers, and
citizen scientists. The Belgian situation is typical. We struggle with
the fragmentation of data sources and restrictions to data mobility.
Nevertheless, there is a common view that the issue of invasive alien
species needs to be addressed. In 2017 we launched the Tracking Invasive
Alien Species (TrIAS) project, which envisages a future where alien
species data are rapidly mobilized, the spread of exotic species is
regularly monitored, and potential impacts and risks are rapidly
evaluated in support of policy decisions (Vanderhoeven et al. 2017).
TrIAS is building a seamless, data-driven workflow, from raw data to
policy support documentation. TrIAS brings together 21 different
stakeholder organizations that cover all organisms in the
terrestrial, freshwater and marine environments. These organizations
also include those involved in citizen science, research and wildlife
management.
TrIAS is an Open Science project and all the software, data and
documentation are being shared openly (Groom et al. 2018). This means
that the workflow can be reused as a whole or in part, either after the
project or in different countries. We hope to prove that rapid data
workflows are not only an indispensable tool in the control of invasive
species, but also for integrating and motivating the citizens and
organizations involved.
Abstract
The Global Register of Introduced and Invasive Species (GRIIS) presents
annotated country checklists of introduced and invasive species.
Annotations include higher taxonomy of the species, synonyms,
environment/system in which the species occurs, and its biological
status in that country. Invasiveness is classified by evidenced impact
in that country. Draft country checklists are subjected to a process of
validation and verification by networks of country experts. Challenges
encountered across the world include confusion with alien/invasive
species terminology, classification of the 'invasive' status of an alien
species and issues with taxonomic synonyms.
Abstract
North America's Great Lakes contain 21% of the planet's fresh water, and
their protection is a matter of national security to both the USA &
Canada. One of the greatest threats to the health of this unparalleled
natural resource is invasion by non-indigenous species, several of which
already have had catastrophic impacts on property values, the fisheries,
shipping, and tourism industries, and continue to threaten the survival
of native species and wetland ecosystems.
The Great Lakes Invasives Network is a consortium (20 institutions) of
herbaria and zoology museums from among the Great Lakes states of
Minnesota, Wisconsin, Illinois, Indiana, Michigan, Ohio, and New York
created to better document the occurrence of selected non-indigenous
species and their congeners in space and time by imaging and providing
online access to the information on the specimens of the critical
organisms. The list of non-indigenous species (1 alga, 42 vascular
plants, 22 fish, and 13 mollusks) to be digitized was generated by
conducting a query of all fish, plants, algae, and mollusks present in
the database of GLANSIS -- the Great Lakes Aquatic Nonindigenous Species
Information System -- maintained by the National Oceanic and Atmospheric
Administration (NOAA). The network consists of collections at 20
institutions, including 4 of the 10 largest herbaria in North America,
each of which curates 1-7 million specimens (NY, F, MICH, and WIS).
Eight of the nation's largest zoology museums are also represented,
several of which (e.g., Ohio State and U of Minnesota) are
internationally recognized for their fish and mollusk collections.
Each genus includes at least one species that is considered a Great
Lakes non-indigenous taxon -- several have many, whereas others have
congeners on "watchlists", meaning that they have not arrived in the
Great Lakes Basin yet, but have the potential to do so, especially in
light of human activity and climate change. Because the introduction and
spread of these species, their close relatives, and hybrids into the
region is known to have occurred almost entirely from areas in North
America outside of the Basin, our effort will include non-indigenous
specimens collected from throughout North America.
Digitized specimens of Great Lakes non-indigenous species and their
congeners will allow for more accurate identification of invasive
species and hybrids from their non-invasive relatives by a wider
audience of end users. The metadata derived from digitized specimens of
Great Lakes non-indigenous species and their congeners will help
biologists to track, monitor, and predict the spread of invasive species
through space and time, especially in the face of a more rapidly
changing climate in the upper Midwest. Altogether, consortium members
will digitize >2 million individual specimens from >860,000
sheets/lots of non-indigenous species and their congeneric taxa. Data
and metadata are uploaded to the Great Lakes Invasives Network, a
Symbiota portal (GreatLakesInvasives.org), and ingested by iDigBio
(iDigBio.org), the national resource for Advancing Digitization of
Biodiversity Collections (ADBC).
Several initiatives are already in place to alert citizens to the
dangers of spreading aquatic invasive species among our nation's
waterways, but this project is developing complementary scientific and
educational tools for scientists, students, wildlife officers, teachers,
and the public who have had little access to images or data derived
directly from preserved specimens of invasive species collected over the
past three centuries.
Abstract
Agriculture and Agri-Food Canada (AAFC) is home to numerous specimen and
environmental collections generating highly relational data sets that
are analyzed using molecular methods (Sanger and NGS). The need to have
a system to properly manage these data sets and to capture accurate,
standardized metadata over entire laboratory workflows has been a
long-term strategic vision of the Biodiversity group at AAFC. Without
robust tracking, many difficulties arise when trying to publish or
submit data to external repositories. Even knowing what work has been
carried out on individual collection records over a researcher's career
becomes a demanding task, if the information is retrievable at all. SeqDB
was built to resolve these issues by centralizing, standardizing and
improving the availability and data quality of source specimen
collection data that is being studied using molecular methods. SeqDB
also facilitates integration with tools and external repositories in
order to take the burden off researchers and technicians having to
create adequate systems to track and mobilize their data sets, allowing
them to focus on research and collection management.
The development of SeqDB aligns with agile development methodologies and
attempts to fulfill rapidly emerging needs from genetics and genomics
research, which can evolve and fade quickly at times or be without clear
requirements. The success of SeqDB as an application supporting DNA
sequencing workflows has put it in the same space as other monolithic
architectures before it. As the feature set to support the application
continues to increase, the ratio of software developers to operations
and maintenance staff is difficult to rebalance in our organisation. In
an effort to manage the scope for the project and ensure we are able to
continue to deliver on our mandate, the sequence tracking workflows of
the application will become part of the DINA ecosystem ("DIgital
information system for NAtural history data", https://dina-project.net).
Other functions of SeqDB, such as collections management and taxonomy
tree curation, will be replaced with the DINA modules implementing these
functions.
In order to allow SeqDB to become a module of DINA, it has been decided
to refactor the application to base it on a Service Oriented
Architecture. By doing so, all molecular data of SeqDB will be exposed
as JSON API Web Services (JavaScript object notation application
programming interface) allowing other modules, user interfaces and the
current SeqDB application to communicate in a standardised way. The new
architecture will also bring an important technology upgrade for SeqDB
where the front end will eventually become a project in itself.
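As a hypothetical sketch of what such a JSON API web service could look like once the refactoring is complete (the host name and resource path below are invented for illustration and are not the project's actual endpoints), a client module would request molecular data using standard JSON API conventions:
# Fetch a hypothetical molecular-data resource, asking for related records to be included
curl -s -H "Accept: application/vnd.api+json" \
  "https://seqdb.example.org/api/v1/pcr-batches/123?include=reactions" \
  | python -m json.tool
The value of the convention is that any DINA-compliant module or user interface can consume such responses without knowing anything about SeqDB's internals.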
Abstract
As the biodiversity community increasingly adopts Semantic Web (SW)
standards to represent taxonomic registers, trait banks or museum
collections, some questions come up relentlessly: How to model the data?
For what goals? Can the same model fulfill different goals?
So far, the community has mostly considered the SW standards through
their most salient manifestation: the Web of Linked Data (Heath and
Bizer 2011). Indeed, the 5-star Linked Data principles are geared
towards the building of a large, distributed knowledge graph that may
successfully fulfill biodiversity's need for interoperability and data
integration. However, the SW addresses a much broader set of problems
involving automatic reasoning. For instance, reasoners can exploit
ontological knowledge to improve query answering, leverage class
definitions to infer class subsumption relationships, or classify
individuals i.e. compute instance relationships between individuals and
classes by applying reasoning techniques on class definitions and
instance descriptions (Shearer et al. 2008).
Whether a \"thing\" should be modelled as a class or a class instance
has been debated at length in the SW community, and the answer is often
a matter of perspective. In the context of taxonomic registers for
example, the NCBI Organismal Classification (Federhen 2012) and
Vertebrate Taxonomy Ontology (Midford et al. 2013) represent taxa as
classes in the Ontology Web Language (OWL). By contrast, other
initiatives represent taxa as instances of various classes, e.g. the
SKOS Concept class (skos:Concept) in the AGROVOC thesaurus (Caracciolo
et al. 2013) (we speak of the instances as SKOS concepts), the Darwin
Core taxon class (dwc:Taxon) in Encyclopedia of Life (Parr et al. 2016),
or classes depicting taxonomic ranks in GeoSpecies, DBpedia and the BBC
Wildlife Ontology. Such modelling discrepancies impede linking congruent
taxa throughout taxonomic registers. Indeed, one can state the
equivalence between two classes (with owl:equivalentClass) or two class
instances (with owl:sameAs, skos:exactMatch, etc.), but good practices
discourage the alignment of classes with class instances (Baader et al.
2003).
Recently, Darwin Core's popularity has fostered the modeling of taxa as
instances of class dwc:Taxon (Senderov et al. 2018, Parr et al. 2016).
In this context, pragmatism may incline a Linked Data provider to comply
with this majority trend to ensure maximum interlinking. Although
technically and conceptually valid, this choice entails certain
drawbacks. First, considering a taxon only as an instance misses the
fact that it is a set of biological individuals with common
characteristics. An OWL class exactly captures this semantics through
the set of necessary and sufficient conditions that an individual must
meet to be a class member. In turn, an OWL reasoner can leverage this
knowledge to perform query answering, compute subsumption or instance
relationships. By contrast, taxa depicted by class instances are not
defined but described by stating their properties. Hence the second
drawback: unless we develop bespoke reasoners, there is not much a
standard OWL reasoner can deduce from instances.
Yet, some works have demonstrated the effectiveness of logic
representation and reasoning capabilities, e.g. computing the alignments
of two primate classifications (Franz et al. 2016) using generic
reasoners that nevertheless require proprietary input formats. OWL
reasoners are typically designed to solve such classification problems.
They may leverage taxonomic ontologies to compute alignments with other
ontologies or apply reasoning to individuals' properties to infer their
species. Hence, pragmatically following the instance-based approach may
indeed maximize interlinking in the short term, but bears the risk of
denying ourselves potentially desirable use cases in the longer term. We
believe that developing class-based ontologies for biodiversity should
help leverage the SW's extensive theoretical and practical works to
tackle a variety of use cases that so far have been addressed with
bespoke solutions.
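To make the contrast concrete, the following minimal sketch (Turtle written to a file from a shell heredoc; the example URIs are invented) shows the same taxon modelled once as an OWL class and once as an instance of dwc:Taxon:
cat > taxon-modelling.ttl <<'EOF'
@prefix owl:  <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix dwc:  <http://rs.tdwg.org/dwc/terms/> .
@prefix ex:   <http://example.org/taxa/> .
# (a) Taxon as an OWL class: individual organisms can be inferred to be
#     members, and subsumption against other class hierarchies can be computed.
ex:DelphinusDelphis a owl:Class ;
    rdfs:subClassOf ex:Delphinus .
# (b) The same taxon as an instance of dwc:Taxon: described by its properties,
#     easy to interlink, but opaque to standard OWL classification.
ex:delphinus_delphis a dwc:Taxon ;
    dwc:scientificName "Delphinus delphis" ;
    dwc:taxonRank "species" .
EOF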
Abstract
The DINA Consortium ("DIgital information system for NAtural history
data", https://dina-project.net,Fig. 1 was formed in order to provide a
framework for like-minded large natural history collection-holding
institutions to collaborate through a distributed Open Source
development model to produce a flexible and sustainable collection
management system. Target collections include zoological, botanical,
mycological, geological and paleontological collections, living
collections, biodiversity inventories, observation records, and
molecular data.
The DINA system is architected as a loosely-coupled set of several
web-based modules. The conceptual basis for this modular ecosystem is a
compilation of comprehensive guidelines for Web application programming
interfaces (APIs) to guarantee the interoperability of its components.
Thus, all DINA components can be modified or even replaced by other
components without crashing the rest of the system as long as they are
DINA compliant. Furthermore, the modularity enables the institutions to
host only the components they need. DINA focuses on an Open Source
software philosophy and on community-driven open development, so the
contributors share their development resources and expertise outside of
their own institutions.
One of the overarching reasons to develop a new collection management
system is the need to better model complex relationships between
collection objects (typically specimens) involving their derivatives,
preparations and storage. We will discuss enhancements made in the DINA
data model to better represent these relationships and the influence it
has on the management of these objects, and on the sharing of
information. Technical detail of various components of the DINA system
will be shown in other talks in this symposium followed by a discussion
session.
Abstract
The DINA Symposium ("DIgital information system for NAtural history
data", https://dina-project.net) ends with a plenary session involving
the audience to discuss the interplay of collection management and
software tools. The discussion will touch on different areas and issues
such as:
(1) Collection management using modern technology:
How should and could collections be managed using current technology --
What is the ultimate objective of using a new collection management
system?
How should traditional management processes be changed?
(2) Development and community
Why are there so many collection management systems?
Why is it so difficult to create one system that fits everyone's
requirements?
How could the community of developers and collection staff be built
around the DINA project in the future?
(3) Features and tools
How to identify needs that are common to all collections?
What are the new tools and technologies that could facilitate collection
management?
How could those tools be implemented as DINA compliant services?
(4) Data
What data must be captured about collections and specimens?
What criteria need to be applied in order to distinguish essential and
"nice-to-have" information?
How should established data standards (e.g. Darwin Core & ABCD (Access
to Biological Collection Data)) be used to share data from rich and
diverse data models?
In addition to the plenary discussion around these questions, we will
agree on a streamlined format for continuing the discussion in order to
write a white paper on these topics. The results and outcome of the
session will constitute the basis of the paper and will be subsequently
refined.
Abstract
In order to ensure long-term commitment to the DINA project ("DIgital
information system for NAtural history data", https://dina-project.net),
it is essential to continuously deliver features of high value to the
user community. This is also what agile software development methods try
to achieve by emphasizing early delivery, rapid response to changes and
close collaboration with users (see for example the Manifesto for Agile
Software Development at http://agilemanifesto.org). We will give a brief
overview on how current development of the DINA collection management
system core is guided by agile principles. The mammal collection at the
Swedish Museum of Natural History will be used as an example.
Developing a cross-disciplinary collection management system is a
complex task that poses many challenges: Which features should we focus
on? What kinds of data should the system ultimately support? How can the
system be flexible but still easy to use? Since we cannot do everything
at once, we work towards a minimum viable product (MVP) that contains
just enough features at a time to bring value to selected target users.
In the mammal collection case, the MVP is the simplest product that is
able to replace the functions of the current system used for managing
the collection. As we begin to work with other collections, new MVPs are
defined and used to guide further development. Thus, the set of features
available will increase with each MVP, benefiting both new and current
users.
Another big challenge is the migration of legacy data, which is
labor-intensive and involves standardizing data that are not compatible with
the new system. To address these issues, we aim to build a flexible data
model that allows less structured data to coexist with more complex,
highly structured data. Migration should thus not require extensive data
standardization, transformation and cleaning. The plan is to instead
offer tools for transforming and cleaning the data after they have been
imported. With the data in place, it will be easier for the user to
provide feedback and suggest new features.
Abstract
The DINA system ("DIgital information system for NAtural history data",
https://dina-project.net) consists of several web-based services that
fulfill specific tasks. Most of the existing services cover a single
core feature of the collection management system and can be used
either as integrated components in the DINA environment, or as
stand-alone services.
In this presentation, individual services will be highlighted, as they
represent technically interesting approaches and practical solutions for
daily challenges in collection management, data curation and migration
workflows. The focus will be on the following topics: (1) a generic
reporting and label printing service, (2) practical decisions on
taxonomic references in collection data and (3) the generic management
and referencing of related research data and metadata:
Reporting as presented in this context is defined as an extraction and
subsequent compilation of information from the collection management
system rather than just summarizing statistics. With this quite broad
understanding of the term, the DINA Reports & Labels Service (Museum für
Naturkunde Berlin 2018) can assist in several different collection
workflows such as generating labels, barcodes, specimen lists, vouchers,
paper loan forms etc. As it is based on customizable HTML templates, it
can even be used for creating customized web forms for any kind of
interaction (e.g. annotations).
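A minimal sketch of the underlying idea, not of the actual DINA Reports & Labels Service (the field names and values below are placeholders): specimen data are substituted into a customizable HTML template to produce a printable label or form.
# Write a small HTML label template with placeholder fields
cat > label-template.html <<'EOF'
<div class="label">
  <strong>{{scientificName}}</strong><br/>
  {{catalogNumber}}<br/>
  {{locality}}
</div>
EOF
# Fill the template for one specimen (placeholder values)
sed -e 's/{{scientificName}}/Parus major/' \
    -e 's/{{catalogNumber}}/NRM-540001/' \
    -e 's/{{locality}}/Uppland, Sweden/' \
    label-template.html > label-NRM-540001.html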
Many collection management systems try to cope with taxonomic issues,
because in practice taxonomy is used not only for determinations, but
also for organizing the collections and categorizing storage units (e.g.
"Coleoptera hall"). Addressing taxonomic challenges in a collection
management system can slow down development and add complexity for the
users. The DINA system uncouples these issues through a simple taxonomic
service used solely for the assignment of names to specimens, for example
determinations. This draws a clear line between collection management
and taxonomic research; the latter can be supported in a separate
service.
While the digitization of collection data and workflows proceeds,
linking related data is essential for data management and enrichment. In
many institutions, research data are disconnected from the collection
specimen data because their type and structure cannot be easily included
in the collection management databases. With the DINA Generic Data
Module (Museum für Naturkunde Berlin 2017) a service exists that allows
for attaching any relational data structures to the DINA system. It can
also be used as a standalone service that accommodates structured data
within a DINA compliant interface for data management.
Abstract
The large efforts to document and map aboveground biodiversity have
helped to elucidate ecological and evolutionary mechanisms and
processes, predict responses to global change, and identify potential
management options in response to those changes. Yet these concepts have
mostly been applied to aboveground plant and animal communities, while
microbial diversity remains difficult to incorporate. The ability to
integrate microbial sequence data into an accessible global
infrastructure has previously been limited by a few key factors: First,
most microbial diversity remains undescribed and unknown; there is
just an enormous amount of biodiversity. Second, there is a lack of
congruence between the many disparate microbial datasets (e.g. taxonomy,
phylogeny, and methodological biases), which limits the ability to
monitor and quantify global patterns of the terrestrial microbiome.
Finally, there is a lack of coordination and networking between
scientists studying microbes. In this presentation I will discuss two
case studies that highlight how we can begin to link microbial data to
the already well-established macro-knowledge and other environmental
databases (like global carbon maps).
Study 1 -- a mega meta-analysis: The emergence of high-throughput DNA
sequencing methods provides unprecedented opportunities to further
unravel microbial ecology and its worldwide role from human health to
ecosystem functioning. However, in spite of the abundance of sequencing
studies, combining data from multiple individual studies to address
macroecological questions of bacterial diversity remains methodically
challenging and plagued with biases. While previous meta-analysis
efforts have focused on diversity measures or abundances of major taxa,
in a recent study^(1)^ we show that disparate amplicon sequence data can
be combined at the taxonomy-based level to assess bacterial community
structure. Using a machine learning approach, we found that rarer taxa
are more important for structuring soil communities than abundant taxa.
We concluded that combining data from independent studies can be used to
explore novel patterns in bacterial communities, identify potential
'indicator' taxa with an important role in structuring communities, and
propose new hypotheses on previously overlooked factors that shape
microbial biogeography.
Study 2 -- a global soil biodiversity database: Greater access to
microbial data is an important next step for biodiversity research and
conservation, and for understanding the ecology and evolution of
microbial communities. In collaboration with the Global Soil
Biodiversity Initiative and the German Biodiversity Synthesis Centre
(sDIV) we outlined steps that must be taken to ensure microbial sequence
data can be included in global measures and maps of biodiversity^(2)^.
Here I will discuss how the plant-associated microbiome is an optimal
starting point to synthesize microbial sequence data on an open and
global platform. The plant-microbiome is an optimal model system that
goes across scales and time, can act as a bridge between microorganisms
and macroorganisms, and offers an opportunity to more thoroughly explore the
synthesis of global microbial sequence data (for a global soil
biodiversity database). Beyond expanding primary research, the patterns
discovered in a synthesis of plant-microbiome can be used to explore and
guide ecosystem restoration and sustainability. Overall, a better
understanding of microbial biodiversity will help to predict
consequences of (human-induced) global changes and facilitate
conservation and adaptation responses.
(1) Ramirez, K.S., C.G. Knight et al. and F.T. de Vries (2017).
Detecting macroecological patterns in bacterial communities across
independent studies of global soils. Nature Microbiology.
(2) Ramirez, K.S., M. Döring, N. Eisenhauer, C. Gardi, J. Ladau, J.W.
Leff, G. Lentendu, Z. Lindo, M.C. Rillig, D. Russell, S. Scheu, M.G. St.
John, F.T. de Vries, T. Wubet, W.H. van der Putten, D.H. Wall, (2015).
Towards a global platform for linking soil biodiversity data. Frontiers
in Ecology and Evolution 3(91). doi: 10.3389/fevo.2015.00091
Abstract
Traditionally, taxonomic characterisation of organisms has relied on
their morphology; however, molecular methods are increasingly used to
monitor and assess biodiversity and ecosystem health. Approaches such as
DNA amplicon diversity assessments are a particularly useful tool when
morphology-based taxonomy is difficult or taxa are morphologically
ambiguous, for example for freshwater bacteria and fungi as well as many
freshwater invertebrate species. DNA metabarcoding provides the ability
to distinguish cryptic taxa (which can differ markedly in their
ecological requirements and tolerances) and in addition it can provide
valuable insights into the genetic and ecological diversity of taxa and
ecosystems. While DNA metabarcoding has been used mostly on tissue of
sampled specimens, recent years have seen an increased use of
metabarcoding on environmental DNA samples: DNA extracted not from
sampled specimens, but from the surrounding soil or water. However, the
ability of metabarcoding of specimens and metabarcoding of environmental
DNA (eDNA) to assess biodiversity and the impact of anthropogenic
stressors on freshwater ecosystems is largely understudied. In this
talk, several studies that document the advantages and still open
challenges of (e)DNA metabarcoding for assessing impacts of
environmental stressors on aquatic ecosystems will be presented. These
studies, performed in Europe and New Zealand, integrate impacts across
different biotic groups, i.e. look at stressor effects on bacterial,
protist, fungal and macroinvertebrate communities. Specifically, we use
various case studies from freshwater ecosystems to address the following
questions:
whether eDNA samples, which can be relatively quickly obtained from the
water, can act as reliable proxies for catchment-level stressor impacts
by comparing these to DNA obtained from local bulk samples, and
whether DNA metabarcoding data can also provide quantitative information
rather than only presence-absence data.
In view of the case studies presented, a perspective on the urgent next
steps that need to be taken in order to include genetic tools in routine
biomonitoring will be derived and linked to the vision of the
international network DNAqua-Net.
Abstract
Adventitious roots in canopy soils associated with silver beech
(Lophozonia menziesii (Hook.f.) Heenan & Smissen (Nothofagaceae)) form
ectomycorrhizal associations. We used amplicon sequencing of the
internal transcribed spacer 2 region to compare diversity of
ectomycorrhizal fungal species in canopy and terrestrial sites. The
study data are archived as an NCBI BioProject (accession PRJNA421209),
with the raw DNA sequence reads available from the NCBI Sequence Read
Archive (accession SRA637723). Community composition of canopy
ectomycorrhizal fungi was significantly different to the terrestrial community composition,
with several abundant ectomycorrhizal species significantly more
represented in the terrestrial soil than the canopy soil. Additionally,
we found evidence that an introduced ectomycorrhizal species was present
in these native forest soils. We identified OTUs in two ways: (i) by
manually curated BLAST searching of the NCBI nr database, and (ii) by
comparison with Species Hypotheses on UNITE v.7.2. We desired to make
species identifications where we could be reasonably confident they were
robust, but had to avoid making identifications when an incorrect name
could have implications for biosecurity or our understanding of
biodiversity and biogeography. We found some UNITE Species Hypotheses
included sequences of more than one taxon, which we were able to
separate and distinguish by phylogenetic analysis. Consequently we
exercised caution in reporting names based on the Species Hypotheses.
Using data from this case study, we will illustrate the achievements and
challenges faced in identifying species of ectomycorrhizal fungi from
DNA barcodes. Most DNA sequences of ectomycorrhizal fungi matched
closely New Zealand voucher specimens stored in either the New Zealand
Fungal Herbarium (PDD) or the Otago Regional Herbarium (OTA), which
facilitated the validation of identifications. In the case of PDD
specimens, collection and DNA data were linked via the Systematics
Collections Data database (https://scd.landcareresearch.co.nz). We are
working towards a similar database for OTA specimens, using the Specify
6 database platform.
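As a hedged sketch of one step in such an identification workflow (the file names and parameter choices are illustrative, not those used in the study), representative OTU sequences could be compared against the NCBI nucleotide collection with a remote BLAST search, and the resulting top hits then curated manually:
# Remote BLAST of representative OTU sequences; keep the top hits in tabular form
blastn -query otu_representatives.fasta -db nt -remote \
       -outfmt "6 qseqid sacc pident length evalue stitle" \
       -max_target_seqs 5 -out otu_blast_hits.tsv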
Abstract
Several national and international environmental laws require countries
to meet clearly defined targets with respect to the ecological status of
aquatic ecosystems. In Europe, the EU-Water Framework Directive (WFD;
2000/60/EC) represents such a detailed piece of legislation. The WFD
requires the European member countries to achieve at least a 'good'
ecological status of all surface waters by the year 2027 at the latest.
In order to assess the ecological status of a given water body
under the WFD, data on its aquatic biodiversity are obtained and
compared to a reference status. The mismatch between these two metrics
is then used to derive the respective ecological status class. While the
workflow to carry out the assessment is well established, it relies only
on a few biological groups (typically fish, macroinvertebrates and a few
algal taxa such as diatoms), is time-consuming and remains at a lower
taxonomic resolution, so that the identifications can be done routinely
by non-experts with an acceptable learning curve. Here, novel genetic
and genomic tools provide new solutions to speed up the process and
allow a much greater proportion of biodiversity to be included in the
assessment process. Further, results are easily comparable through the
genetic 'barcodes' used to identify organisms.
The aim of the large international COST Action DNAqua-Net
(http://dnaqua.net/) is to develop strategies on how to include novel
genetic tools in bioassessment of aquatic ecosystems in Europe and
beyond and how to standardize these among the participating countries.
It is the ambition of the network to have these new genetic tools
accepted in future legal frameworks such as the EU-Water Framework
Directive (WFD; 2000/60/EC) and the Marine Strategy Framework Directive
(2008/56/EC). However, a prerequisite is that various aspects, ranging
from the validation and completion of DNA barcode reference databases,
to the lab and field protocols, to the analysis processes as well as the
subsequently derived biotic indices and metrics, are dealt with and
commonly agreed upon. Furthermore, many pragmatic questions such as
adequate short and long-term storage of samples or specimens for further
processing or to serve as an accessible reference also need to be
addressed. In Europe, the conformity and backward compatibility of the
new methods with the existing legislation and workflows are further of
high importance. Without rigorous harmonization and inter-calibration
concepts, the implementation of the powerful new genetic tools will be
substantially delayed in real-world legal framework applications.
After a short introduction on the structure and vision of DNAqua-Net, we
discuss how the DNAqua-Net community considers possibilities to include
novel DNA-based approaches into current bioassessment and how formal
standardization e.g. through the framework of CEN (The European
Committee for Standardization) may aid in that process (Hering et al.
2018, Leese et al. 2016, Leese et al. 2018). Further, we explore how TDWG
data standards can further facilitate swift adoption of the genetic
methods in routine use. We further present potential impacts of the
legislative requirements of the Nagoya Protocol on the exchange of
genetic resources and their implications for biomonitoring. Last but not
least, we will touch upon the rather unexpected influence that the new
General Data Protection Regulation (GDPR) may have on the bioassessment
work in practice.
Abstract
Although they are hyperdiverse and intensively studied, parasites
present major challenges when it comes to phylogenetics, taxonomy, and
biodiversity informatics. The collection of any parasitic organism
entails the linking of at least two specimens - the parasite and the
host. If the parasite has a complex life cycle, then this becomes
further complicated by requiring the linking of three or more hosts,
such as the parasite, its intermediate host (vector) and its definitive
host(s). Parasites are sometimes collected as a byproduct of another
collection event and are not studied immediately, which has the
potential to disconnect them further in terms of information content and
continuity. The converse is also common: parasites can be collected
by parasitologists who do not necessarily take host vouchers or
incorporate host taxonomy, let alone other metadata for these events.
Using the specific example of the malaria parasites (Order Haemosporida),
I will present examples of the challenges that have accompanied
the study of these parasites, including issues of delimiting species,
phylogenetic study (with genetic oddities that are unique to these
organisms), and the taxonomic quandaries that we now find ourselves in,
along with other problems of maintaining continuity of information in a
group that is both biologically diverse and medically important.
Abstract
Madagascar is one of the world's hottest biodiversity hotspots and a
natural laboratory for evolutionary research. Tenrecs (Tenrecidae; 32
currently recognized species) -- small placental mammals endemic to
Madagascar -- colonized the island \>35 million years ago and have
evolved a stunning range of behaviors and morphologies, including
heterothermic species; species with hedgehog-like spines; and fossorial,
aquatic, and scansorial ecotypes. In 2016, we produced the first
taxonomically complete phylogeny of tenrecs, which has served as a
framework for studying morphological evolution, phylogeography, and
species limits. Most recently, we have built on this phylogeny to
incorporate an enormous database of genetic, morphometric, and
geographic data from >800 vouchered tenrec specimens. These data have
revealed interesting and unexpected aspects of their evolutionary
history, including decoupled diversification of the cranium and
postcranium. Using a machine learning approach, we have also uncovered
numerous new, cryptic species in the family Tenrecidae. As phylogenetic
and phenotypic data become more readily available through online
repositories, we expect that the same approaches can be applied to other
taxonomic groups, providing unprecedented resolution of the tree of life.
# Loop over all abstract paths (listed below) and write the output to abstracts.txt
for abstract in $(cat TDWG_abstracts.txt); do
  # Strip the path down to just the abstract number
  anum=$(echo $abstract | sed 's/\/article\///g;s/\/download\/xml\///g')
  # Download the XML representation of the abstract
  wget "https://biss.pensoft.net${abstract}" -O $anum.xml
  # Extract just the abstract text from the XML using XPath and convert it to Markdown
  xmllint --xpath "/article/front/article-meta/abstract" $anum.xml | pandoc --from html --to markdown >> abstracts.txt
done
/article/27339/download/xml/
/article/26369/download/xml/
/article/25437/download/xml/
/article/26922/download/xml/
/article/26860/download/xml/
/article/26516/download/xml/
/article/26323/download/xml/
/article/26304/download/xml/
/article/26262/download/xml/
/article/26235/download/xml/
/article/26177/download/xml/
/article/26080/download/xml/
/article/26075/download/xml/
/article/25842/download/xml/
/article/25738/download/xml/
/article/25661/download/xml/
/article/25577/download/xml/
/article/25223/download/xml/
/article/26168/download/xml/
/article/27244/download/xml/
/article/26490/download/xml/
/article/26367/download/xml/
/article/26286/download/xml/
/article/26104/download/xml/
/article/26102/download/xml/
/article/25960/download/xml/
/article/25864/download/xml/
/article/25828/download/xml/
/article/25890/download/xml/
/article/25885/download/xml/
/article/25724/download/xml/
/article/25723/download/xml/
/article/25881/download/xml/
/article/25836/download/xml/
/article/25876/download/xml/
/article/25564/download/xml/
/article/25560/download/xml/
/article/25535/download/xml/
/article/25481/download/xml/
/article/26122/download/xml/
/article/25852/download/xml/
/article/26731/download/xml/
/article/25869/download/xml/
/article/25693/download/xml/
/article/25658/download/xml/
/article/25165/download/xml/
/article/25641/download/xml/
/article/25586/download/xml/
/article/25700/download/xml/
/article/25298/download/xml/
/article/26749/download/xml/
/article/25651/download/xml/
/article/25289/download/xml/
/article/25525/download/xml/
/article/25282/download/xml/
/article/25748/download/xml/
/article/25694/download/xml/
/article/25653/download/xml/
/article/25585/download/xml/
/article/26665/download/xml/
/article/25838/download/xml/
/article/25450/download/xml/
/article/25439/download/xml/
/article/25394/download/xml/
/article/25268/download/xml/
/article/25148/download/xml/
/article/25582/download/xml/
/article/25657/download/xml/
/article/25608/download/xml/
/article/25438/download/xml/
/article/25395/download/xml/
/article/25351/download/xml/
/article/25324/download/xml/
/article/25317/download/xml/
/article/25310/download/xml/
/article/25176/download/xml/
/article/26808/download/xml/
/article/26060/download/xml/
/article/25474/download/xml/
/article/25456/download/xml/
/article/26836/download/xml/
/article/25840/download/xml/
/article/25812/download/xml/
/article/25811/download/xml/
/article/25805/download/xml/
/article/25642/download/xml/
/article/24749/download/xml/
/article/25306/download/xml/
/article/24930/download/xml/
/article/25647/download/xml/
/article/25646/download/xml/
/article/25635/download/xml/
/article/25580/download/xml/
/article/25579/download/xml/
/article/26009/download/xml/
/article/25983/download/xml/
/article/25982/download/xml/
/article/25953/download/xml/
/article/25604/download/xml/
/article/25936/download/xml/
/article/25776/download/xml/
/article/25739/download/xml/
/article/25727/download/xml/
/article/25698/download/xml/
/article/25589/download/xml/
/article/25614/download/xml/
/article/25478/download/xml/
/article/25409/download/xml/
/article/25345/download/xml/
/article/25343/download/xml/
/article/26514/download/xml/
/article/25969/download/xml/
/article/25415/download/xml/
/article/25410/download/xml/
/article/25990/download/xml/
/article/25488/download/xml/
/article/25487/download/xml/
/article/25486/download/xml/
/article/25121/download/xml/
/article/24991/download/xml/
/article/27087/download/xml/
/article/26658/download/xml/
/article/26615/download/xml/
/article/26471/download/xml/
/article/25728/download/xml/
/article/25914/download/xml/
/article/25664/download/xml/
/article/26561/download/xml/
/article/25699/download/xml/
/article/27251/download/xml/
/article/25762/download/xml/
/article/25833/download/xml/
/article/25749/download/xml/
/article/25637/download/xml/
/article/25261/download/xml/
/article/25260/download/xml/
/article/29123/download/xml/
/article/28479/download/xml/
/article/28364/download/xml/
/article/28131/download/xml/
/article/28158/download/xml/