Check whether the number of occurrences for some species is the same across 6 retrieval channels: the GBIF website interface, a GBIF csv download, a GBIF DwCA download, the csv as retrieved by `rgbif`, the DwCA as retrieved by `rgbif`, and a download triggered from R via `rgbif::occ_download()`.
---
title: "Compare CSV and DwC files from the GBIF website, GBIF manual downloads and `rgbif` automatic downloads"
author:
- Damiano Oldoni
date: "October 16, 2018"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    number_sections: true
---
# Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Load libraries

Load libraries:

```{r load_libraries}
# Tidyverse packages
library(dplyr)
library(purrr)
library(readr)
# GBIF related packages
library(rgbif)
```
# Define species under study

We are interested in the following alien species:

| species | kingdom | GBIF backbone key |
| --- | --- | --- |
| Baccharis halimifolia | Plantae | 3129663 |
| Impatiens glandulifera | Plantae | 2891770 |
| Impatiens capensis | Plantae | 2891774 |
| Hydrocotyle ranunculoides | Plantae | 7978544 |
| Branta canadensis | Animalia | 5232437 |
| Harmonia axyridis | Animalia | 4989904 |
# Number of occurrences: GBIF website interface

| GBIF backbone key | species | n_online |
| --- | --- | --- |
| 2891770 | Impatiens glandulifera | 7612 |
| 2891774 | Impatiens capensis | 111 |
| 3129663 | Baccharis halimifolia | 193 |
| 4989904 | Harmonia axyridis | 18592 |
| 7978544 | Hydrocotyle ranunculoides | 1889 |

Moreover, online queries return the same number of occurrences whether the key is passed as `species_key` or as `taxon_key`. For example, the following two queries:

- https://www.gbif.org/occurrence/search?country=BE&species_key=2891770
- https://www.gbif.org/occurrence/search?country=BE&taxon_key=2891770

return the same number of occurrences.
# Download data from GBIF as csv files

Species with related GBIF download keys:

```{r species_under_study_csv}
alien_species <- data.frame(
  species = c("Baccharis halimifolia",
              "Impatiens glandulifera",
              "Impatiens capensis",
              "Hydrocotyle ranunculoides",
              "Harmonia axyridis"),
  taxonKey = c(3129663,
               2891770,
               2891774,
               7978544,
               # 5232437, uncomment if problem with db "0003489-181003121212138" is solved
               4989904),
  gbif_download_key = c("0006601-181003121212138", # Baccharis halimifolia
                        "0006598-181003121212138", # Impatiens glandulifera
                        "0006594-181003121212138", # Impatiens capensis
                        "0006579-181003121212138", # Hydrocotyle ranunculoides
                        "0006589-181003121212138"), # Harmonia axyridis
  stringsAsFactors = FALSE
) %>%
  as_tibble()
```
Get csv files if not already downloaded:

```{r}
csv_files <- map_chr(alien_species$gbif_download_key, function(x) {
  paste0("../data/interim/", x, ".csv")
})
map2(csv_files, alien_species$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    # Download the zip archive, then extract the csv file it contains
    occ <- occ_download_get(key = y,
                            overwrite = TRUE,
                            path = "../data/interim/")
    fn <- paste0(y, ".csv")
    unzip(zipfile = occ, files = fn,
          exdir = "../data/interim")
  }
})
```
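The "skip if the file is already there" pattern above recurs in every download step of this document. As a minimal sketch, it could be factored into a small helper. `fetch_if_missing()` is a hypothetical name, not an `rgbif` function, and the demonstration uses a temporary file instead of a real GBIF download so that it runs offline.

```r
# Hypothetical helper (not part of rgbif): run `fetch` only when `path`
# does not exist yet. `fetch` is any function that creates the file at `path`.
fetch_if_missing <- function(path, fetch) {
  if (file.exists(path)) {
    message("File ", path, " already exists.")
    return(invisible(FALSE))
  }
  # Make sure the target directory exists before fetching
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  fetch(path)
  invisible(TRUE)
}

# Offline demonstration: a temporary file stands in for a download
demo_path <- file.path(tempdir(), "fetch_demo", "demo.csv")
unlink(demo_path)
first_call <- fetch_if_missing(demo_path, function(p) writeLines("a\tb", p))
second_call <- fetch_if_missing(demo_path, function(p) writeLines("a\tb", p))
```

In this sketch the first call creates the file and the second call leaves it untouched. In the workflow above, `fetch` would wrap the `occ_download_get()` and `unzip()` calls.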
Import csv to data frame in R:

```{r create_occ_df_from_csv}
occ_df <- map_df(csv_files, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(recordNumber = col_character(),
                              catalogNumber = col_character()))
})
```
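The explicit `col_types` are presumably there because a column such as `recordNumber` can look numeric in one file and alphanumeric in another; if the type were guessed per file, binding the rows in `map_df()` could then fail. The sketch below illustrates the idea on two tiny temporary files whose rows are made up for the purpose.

```r
library(readr)
library(purrr)

# Two tiny tab-delimited files with made-up rows: `recordNumber` looks numeric
# in the first file and alphanumeric in the second
file_a <- tempfile(fileext = ".csv")
file_b <- tempfile(fileext = ".csv")
writeLines(c("speciesKey\tspecies\trecordNumber", "1\tSpecies a\t12"), file_a)
writeLines(c("speciesKey\tspecies\trecordNumber", "2\tSpecies b\tA12"), file_b)

# Forcing `recordNumber` to character gives both files the same column type,
# so the rows can be bound together
toy_occ <- map_df(c(file_a, file_b), function(x) {
  read_delim(file = x, delim = "\t",
             col_types = cols(recordNumber = col_character()))
})
```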
Number of occurrences per species:

```{r}
count_occ_df <- occ_df %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_csv = n)
count_occ_df
```
# Download data from GBIF as Darwin Core Archive

Species with related GBIF download keys:

```{r species_under_study_dwc}
alien_species_dwc <- alien_species %>%
  mutate(gbif_download_key = c("0006667-181003121212138",
                               "0006669-181003121212138",
                               "0006670-181003121212138",
                               "0006673-181003121212138",
                               "0006671-181003121212138"))
```

```{r darwin_core}
csv_files_dwc <- map_chr(alien_species_dwc$gbif_download_key, function(x) {
  paste0("../data/interim/", x, ".csv")
})
map2(csv_files_dwc, alien_species_dwc$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    fn <- "occurrence.txt"
    unzip(zipfile = paste0("../data/interim/", y, ".zip"), files = fn,
          exdir = "../data/interim")
    file.rename(from = "../data/interim/occurrence.txt",
                to = x)
  }
})
```
Import csv from DwC to data frame in R:

```{r create_occ_df_dwc}
occ_df_dwc <- map_df(csv_files_dwc, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(taxonID = col_character(),
                              recordNumber = col_character(),
                              organismQuantity = col_character(),
                              catalogNumber = col_character()))
})
```

Number of occurrences per species:

```{r}
count_occ_df_dwc <- occ_df_dwc %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_dwc = n)
count_occ_df_dwc
```

Compare:

```{r compare1}
comparison_table <- count_occ_df %>%
  full_join(count_occ_df_dwc,
            by = c("speciesKey", "species"))
comparison_table
```
# Download csv data from GBIF via `rgbif`

Download csv data via `rgbif::occ_download_get()`:

```{r get_data_via_rgbif}
rgbif_csv_files <- map_chr(alien_species$gbif_download_key, function(x) {
  paste0("../data/interim/rgbif/", x, ".csv")
})
map(alien_species$gbif_download_key, function(x) {
  if (file.exists(paste0("../data/interim/rgbif/", x, ".zip"))) {
    paste0("Zip file ", x, ".zip already exists.")
  } else {
    occ_download_get(key = x, path = "../data/interim/rgbif/")
  }
})
```
Import csv to data frame in R:

```{r create_occ_df_from_csv_rgbif}
# Extract the csv file from each zip archive if not already done
map2(rgbif_csv_files, alien_species$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    fn <- paste0(y, ".csv")
    unzip(zipfile = paste0("../data/interim/rgbif/", y, ".zip"), files = fn,
          exdir = "../data/interim/rgbif")
  }
})
rgbif_occ_df <- map_df(rgbif_csv_files, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(recordNumber = col_character(),
                              catalogNumber = col_character()))
})
```
Number of occurrences per species:

```{r count_rgbif_csv}
count_rgbif_occ_df <- rgbif_occ_df %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_csv_rgbif = n)
count_rgbif_occ_df
```

```{r compare2}
comparison_table <- comparison_table %>%
  full_join(count_rgbif_occ_df,
            by = c("speciesKey", "species"))
comparison_table
```
# Download DwC archives from GBIF via `rgbif`

Download DwC archives via `rgbif::occ_download_get()` and extract csv occurrence data from them:

```{r get_dwc_data_via_rgbif}
rgbif_csv_files_dwc <- map_chr(alien_species_dwc$gbif_download_key, function(x) {
  paste0("../data/interim/rgbif/", x, ".csv")
})
map2(rgbif_csv_files_dwc, alien_species_dwc$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    occ <- occ_download_get(key = y,
                            overwrite = TRUE,
                            path = "../data/interim/rgbif/")
    fn <- "occurrence.txt"
    unzip(zipfile = occ, files = fn,
          exdir = "../data/interim/rgbif")
    file.rename(from = "../data/interim/rgbif/occurrence.txt",
                to = x)
  }
})
```
Import the extracted csv files to a data frame in R:

```{r create_occ_df_from_dwc_via_rgbif}
rgbif_occ_df_dwc <- map_df(rgbif_csv_files_dwc, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(recordNumber = col_character(),
                              catalogNumber = col_character(),
                              taxonID = col_character(),
                              organismQuantity = col_character()))
})
```

Number of occurrences per species:

```{r}
count_rgbif_occ_df_dwc <- rgbif_occ_df_dwc %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_dwc_rgbif = n)
count_rgbif_occ_df_dwc
```

Compare:

```{r compare3}
comparison_table <- comparison_table %>%
  full_join(count_rgbif_occ_df_dwc,
            by = c("speciesKey", "species"))
comparison_table
```
# Download data previously queried via `rgbif::occ_download()`

We can also use `rgbif` to build the query within R, using the function `occ_download()`. We triggered a download for the same species, which received the key `0008278-181003121212138`. We download the related DwC archive and extract the file containing the occurrences:

```{r retrieve_download}
gbif_key <- "0008278-181003121212138"
rgbif_query_csv_files_dwc <- paste0("../data/interim/rgbif/rgbif_query_",
                                    gbif_key, ".csv")
# the downloaded zip file goes in subdirectory ../data/interim/rgbif
file_path <- "../data/interim/rgbif/"
if (!dir.exists(file_path)) {
  dir.create(file_path, recursive = TRUE)
}
occ <- occ_download_get(key = gbif_key, overwrite = TRUE, path = file_path)
fn <- "occurrence.txt"
unzip(zipfile = occ, files = fn,
      exdir = "../data/interim/rgbif/.")
file.rename(from = "../data/interim/rgbif/occurrence.txt",
            to = rgbif_query_csv_files_dwc)
```
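The request that produced this key is not shown in the document. As a rough sketch, a comparable request might be assembled as below, assuming a recent `rgbif` version that provides the `pred()` and `pred_in()` helpers. The wrapper name `request_dwca_download()` is hypothetical, the country filter `BE` is an assumption taken from the website queries earlier on, and the credentials are placeholders for a GBIF account.

```r
# Hypothetical wrapper: assembles and sends a DwCA download request to GBIF.
# The country filter is an assumption; credentials must come from the caller.
request_dwca_download <- function(taxon_keys, user, pwd, email,
                                  country = "BE") {
  rgbif::occ_download(
    rgbif::pred_in("taxonKey", taxon_keys),
    rgbif::pred("country", country),
    format = "DWCA",
    user = user, pwd = pwd, email = email
  )
}

# Taxon keys of the five species used in this document
taxon_keys <- c(3129663, 2891770, 2891774, 7978544, 4989904)

# Not run: needs a GBIF account and network access
# download_request <- request_dwca_download(taxon_keys, user = "...",
#                                           pwd = "...", email = "...")
```

Once GBIF has prepared the archive, the key of the request can be passed to `occ_download_get()` as in the chunk above.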
Import text file to R:

```{r import_to_R_df}
rgbif_query_occ_df_dwc <- read_delim(
  file = rgbif_query_csv_files_dwc, delim = "\t",
  escape_double = FALSE, trim_ws = TRUE,
  col_types = cols(recordNumber = col_character(),
                   catalogNumber = col_character(),
                   taxonID = col_character(),
                   organismQuantity = col_character()))
```

Number of occurrences per species:

```{r n_occ_per_species_query_rgbif}
count_rgbif_query_occ_df_dwc <- rgbif_query_occ_df_dwc %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_dwc_rgbif_query = n)
count_rgbif_query_occ_df_dwc
```

Compare:

```{r compare4}
comparison_table <- comparison_table %>%
  full_join(count_rgbif_query_occ_df_dwc,
            by = c("speciesKey", "species"))
comparison_table
```
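Instead of comparing the columns by eye, the agreement between channels can also be checked programmatically. The sketch below flags rows in which the `n_*` columns are not all equal. `flag_mismatches()` is a hypothetical helper, and the toy table uses made-up counts, not GBIF results.

```r
library(dplyr)

# Hypothetical helper: adds `all_equal`, TRUE when every `n_*` column holds the
# same value in that row. A missing count (NA) is treated as a mismatch.
flag_mismatches <- function(tbl) {
  tbl <- ungroup(tbl)
  counts <- as.matrix(select(tbl, starts_with("n_")))
  mutate(tbl, all_equal = apply(counts, 1, function(x) length(unique(x)) == 1))
}

# Toy stand-in for `comparison_table`; the counts are made up for illustration
toy_comparison <- tibble(
  speciesKey = c(1, 2, 3),
  species = c("Species a", "Species b", "Species c"),
  n_csv = c(10L, 20L, 30L),
  n_dwc = c(10L, 20L, 31L),
  n_csv_rgbif = c(10L, NA, 30L)
)
flagged <- flag_mismatches(toy_comparison)
```

Applied to the real table, `flag_mismatches(comparison_table)` would be expected to return `TRUE` in every row if the channels agree.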
# Conclusion

We can conclude that the manually downloaded occurrence data (csv or DwCA) and the downloads via `rgbif::occ_download_get()` (csv or DwCA) are consistent: no data loss was detected. The same holds when the query is assembled in R and sent to GBIF via the `rgbif::occ_download()` function.