@damianooldoni
Last active October 18, 2018 11:56
Check whether the number of occurrences for some species is the same across 6 retrieval channels: the GBIF website interface, a GBIF CSV download, a GBIF DwC-A download, the CSV as retrieved by `rgbif`, the DwC-A as retrieved by `rgbif`, and a download whose query is triggered from R with `rgbif::occ_download()`.
---
title: "Compare CSV and DwC files from the GBIF website, GBIF manual downloads and `rgbif` automatic downloads"
author:
- Damiano Oldoni
date: "October 16, 2018"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    number_sections: true
---
# Setup
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
## Load libraries
Load libraries:
```{r load_libraries}
# Tidyverse packages
library(dplyr)
library(purrr)
library(readr)
# GBIF related packages
library(rgbif)
```
# Define species under study
We are interested in the following alien species:
species | kingdom | GBIF backbone key
--- | --- | ---
Baccharis halimifolia | Plantae | 3129663
Impatiens glandulifera | Plantae | 2891770
Impatiens capensis | Plantae | 2891774
Hydrocotyle ranunculoides | Plantae | 7978544
Branta canadensis | Animalia | 5232437
Harmonia axyridis | Animalia | 4989904
# Number of occurrences: GBIF website interface
GBIF backbone key | species | n_online
--- | --- | ---
2891770 | Impatiens glandulifera | 7612
2891774 | Impatiens capensis | 111
3129663 | Baccharis halimifolia | 193
4989904 | Harmonia axyridis | 18592
7978544 | Hydrocotyle ranunculoides | 1889
Moreover, online queries return the same number of occurrences whether `taxon_key` or `species_key` is given. For example, these two queries return the same count:
- https://www.gbif.org/occurrence/search?country=BE&species_key=2891770
- https://www.gbif.org/occurrence/search?country=BE&taxon_key=2891770
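
The same check could in principle be scripted against the GBIF occurrence search API instead of the website. The following is only a sketch: the helper names `gbif_count_url()` and `gbif_count()` are hypothetical, the API parameter names `taxonKey` and `speciesKey` are assumed to mirror the website's `taxon_key` and `species_key`, and the online call needs the `jsonlite` package, so it is left commented out.

```r
# Build a GBIF occurrence search API URL that asks only for the total count.
# `key_name` is "taxonKey" or "speciesKey" (assumed API parameter names).
gbif_count_url <- function(key, key_name = "taxonKey", country = "BE") {
  paste0(
    "https://api.gbif.org/v1/occurrence/search",
    "?country=", country,
    "&", key_name, "=", format(key, scientific = FALSE),
    "&limit=0" # no records needed, only the count
  )
}

# Fetch the count from the JSON response (needs internet access and jsonlite).
gbif_count <- function(url) {
  jsonlite::fromJSON(url)$count
}

url_taxon <- gbif_count_url(2891770, "taxonKey")
url_species <- gbif_count_url(2891770, "speciesKey")

# Uncomment to compare the two counts online:
# identical(gbif_count(url_taxon), gbif_count(url_species))
```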
# Download data from GBIF as csv files
Species with related GBIF download keys:
```{r species_under_study_csv}
alien_species <- data.frame(
species = c("Baccharis halimifolia",
"Impatiens glandulifera",
"Impatiens capensis",
"Hydrocotyle ranunculoides",
"Harmonia axyridis"),
taxonKey = c(3129663,
2891770,
2891774,
7978544,
# 5232437, uncomment if problem with db "0003489-181003121212138" is solved
4989904),
gbif_download_key = c("0006601-181003121212138", # Baccharis halimifolia
"0006598-181003121212138", # Impatiens glandulifera
"0006594-181003121212138", # Impatiens capensis
"0006579-181003121212138", # Hydrocotyle ranunculoides
"0006589-181003121212138"), # Harmonia axyridis
stringsAsFactors = FALSE
) %>%
as_tibble()
```
Get csv files if not already downloaded:
```{r}
csv_files <- map_chr(alien_species$gbif_download_key, function(x) {
paste0("../data/interim/", x, ".csv")
})
map2(csv_files, alien_species$gbif_download_key, function(x, y) {
if (file.exists(x)) {
paste0("Text file ",x," already exists.")
} else {
occ <- occ_download_get(key = y,
overwrite = TRUE,
path = "../data/interim/")
fn <- paste0(y, ".csv")
unzip(zipfile = occ, files = fn,
exdir = "../data/interim")
}
})
```
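
The "skip if the file already exists" pattern recurs in several chunks of this document. A minimal sketch of how it might be factored into one helper is shown here; `ensure_file()` and its `fetch` argument are hypothetical names, not part of `rgbif`, and the example uses a temporary file rather than a real GBIF download.

```r
# Hypothetical helper: create `path` by calling `fetch(path)` only when the
# file does not exist yet. Returns TRUE if the file was created, FALSE if skipped.
ensure_file <- function(path, fetch) {
  if (file.exists(path)) {
    message("File ", path, " already exists.")
    return(invisible(FALSE))
  }
  # make sure the target directory exists before fetching
  dir.create(dirname(path), recursive = TRUE, showWarnings = FALSE)
  fetch(path)
  invisible(TRUE)
}

# Usage with a stand-in for the download and unzip step
tmp <- file.path(tempdir(), "ensure_file_demo", "occurrence.csv")
unlink(tmp) # start from a clean state
first_call <- ensure_file(tmp, function(p) writeLines("gbifID\tspecies", p))
second_call <- ensure_file(tmp, function(p) writeLines("gbifID\tspecies", p))
```

In the chunks of this document, `fetch` would wrap the `occ_download_get()` and `unzip()` calls.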
Import csv to data frame in R:
```{r create_occ_df_from_csv}
occ_df <- map_df(csv_files, function(x) {
read_delim( file = x, delim = "\t",
escape_double = FALSE, trim_ws = TRUE,
col_types = cols(recordNumber = col_character(),
catalogNumber = col_character()))
})
```
Number of occurrences per species:
```{r}
count_occ_df <- occ_df %>%
group_by(speciesKey, species) %>%
count() %>%
rename(n_csv = n)
count_occ_df
```
# Download data from GBIF as Darwin Core Archive
Species with related GBIF download keys:
```{r species_under_study_dwc}
alien_species_dwc <- alien_species %>%
mutate(gbif_download_key = c("0006667-181003121212138",
"0006669-181003121212138",
"0006670-181003121212138",
"0006673-181003121212138",
"0006671-181003121212138"))
```
```{r darwin_core}
csv_files_dwc <- map_chr(alien_species_dwc$gbif_download_key, function(x) {
paste0("../data/interim/", x, ".csv")
})
map2(csv_files_dwc, alien_species_dwc$gbif_download_key, function(x, y) {
if (file.exists(x)) {
paste0("Text file ",x," already exists.")
} else {
fn <- "occurrence.txt"
unzip(zipfile = paste0("../data/interim/", y, ".zip"), files = fn,
exdir = "../data/interim")
file.rename(from = "../data/interim/occurrence.txt",
to = x)
}
})
```
Import csv from DwC to data frame in R:
```{r create_occ_df_dwc}
occ_df_dwc <- map_df(csv_files_dwc, function(x) {
read_delim( file = x, delim = "\t",
escape_double = FALSE, trim_ws = TRUE,
col_types = cols(taxonID = col_character(),
recordNumber = col_character(),
organismQuantity = col_character(),
catalogNumber = col_character()))
})
```
Number of occurrences per species:
```{r}
count_occ_df_dwc <- occ_df_dwc %>%
group_by(speciesKey, species) %>%
count() %>%
rename(n_dwc = n)
count_occ_df_dwc
```
Compare:
```{r compare1}
comparison_table <- count_occ_df %>%
full_join(count_occ_df_dwc,
by = c("speciesKey", "species"))
comparison_table
```
# Download csv data from GBIF via `rgbif`
Download csv data via `rgbif::occ_download_get()`:
```{r get_data_via_rgbif}
rgbif_csv_files <- map_chr(alien_species$gbif_download_key, function(x) {
paste0("../data/interim/rgbif/", x, ".csv")
})
map(alien_species$gbif_download_key, function(x) {
if (file.exists(paste0("../data/interim/rgbif/", x, ".zip"))) {
paste0("Zip file ",x,".zip already exists.")
} else {
occ_download_get(key = x, path = "../data/interim/rgbif/")
}
})
```
Import csv to data frame in R:
```{r create_occ_df_from_csv_rgbif}
map2(rgbif_csv_files, alien_species$gbif_download_key, function(x, y) {
if (file.exists(x)) {
paste0("Text file ",x," already exists.")
} else {
fn <- paste0(y,".csv")
unzip(zipfile = paste0("../data/interim/rgbif/", y, ".zip"), files = fn,
exdir = "../data/interim/rgbif")
}
})
rgbif_occ_df <- map_df(rgbif_csv_files, function(x) {
read_delim(file = x, delim = "\t",
escape_double = FALSE, trim_ws = TRUE,
col_types = cols(recordNumber = col_character(),
catalogNumber = col_character()))
})
```
Number of occurrences per species:
```{r count_rgbif_csv}
count_rgbif_occ_df <- rgbif_occ_df %>%
group_by(speciesKey, species) %>%
count() %>%
rename(n_csv_rgbif = n)
count_rgbif_occ_df
```
```{r compare2}
comparison_table <- comparison_table %>%
full_join(count_rgbif_occ_df,
by = c("speciesKey", "species"))
comparison_table
```
# Download DwC archives from GBIF via `rgbif`
Download DwC archives via `rgbif::occ_download_get()` and extract csv occurrence data from them:
```{r get_dwc_data_via_rgbif}
rgbif_csv_files_dwc <- map_chr(alien_species_dwc$gbif_download_key, function(x) {
paste0("../data/interim/rgbif/", x, ".csv")
})
map2(rgbif_csv_files_dwc, alien_species_dwc$gbif_download_key, function(x, y){
if (file.exists(x)) {
paste0("Text file ",x," already exists.")
} else {
occ <- occ_download_get(key = y,
overwrite = TRUE,
path = "../data/interim/rgbif/")
fn <- "occurrence.txt"
unzip(zipfile = occ, files = fn,
exdir = "../data/interim/rgbif")
file.rename(from = "../data/interim/rgbif/occurrence.txt",
to = x)
}
})
```
Import the extracted csv file to data frame in R:
```{r create_occ_df_from_dwc_via_rgbif}
rgbif_occ_df_dwc <- map_df(rgbif_csv_files_dwc, function(x) {
read_delim( file = x, delim = "\t",
escape_double = FALSE, trim_ws = TRUE,
col_types = cols(recordNumber = col_character(),
catalogNumber = col_character(),
taxonID = col_character(),
organismQuantity = col_character()))
})
```
Number of occurrences per species:
```{r}
count_rgbif_occ_df_dwc <- rgbif_occ_df_dwc %>%
group_by(speciesKey, species) %>%
count() %>%
rename(n_dwc_rgbif = n)
count_rgbif_occ_df_dwc
```
Compare:
```{r compare3}
comparison_table <- comparison_table %>%
full_join(count_rgbif_occ_df_dwc,
by = c("speciesKey", "species"))
comparison_table
```
# Download data previously queried via `rgbif::occ_download()`
Queries can also be assembled within R and sent to GBIF with `rgbif::occ_download()`. We triggered such a download for the same species; its key is `0008278-181003121212138`. We download the related DwC archive and extract the file containing the occurrences:
```{r retrieve_download}
gbif_key <- "0008278-181003121212138"
rgbif_query_csv_files_dwc <- paste0("../data/interim/rgbif/rgbif_query_",
gbif_key, ".csv")
# the downloaded zip file is stored in ../data/interim/rgbif/
file_path <- "../data/interim/rgbif/"
if (!dir.exists(file_path)) {
  dir.create(file_path, recursive = TRUE)
}
occ <- occ_download_get(key = gbif_key, overwrite = TRUE, path = file_path)
fn <- "occurrence.txt"
unzip(zipfile = occ, files = fn,
exdir = "../data/interim/rgbif")
file.rename(from = "../data/interim/rgbif/occurrence.txt",
to = rgbif_query_csv_files_dwc)
```
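
The chunk above only fetches an existing download; the request that produced the key is not shown in this document. A hypothetical sketch of how such a request might look with the predicate helpers of recent `rgbif` versions follows. The helper name `request_download()`, the `country = "BE"` filter (taken from the website queries above) and the use of the `GBIF_USER`, `GBIF_PWD` and `GBIF_EMAIL` environment variables are assumptions, and the call itself is left commented out because it needs GBIF credentials.

```r
# Taxon keys of the five species analysed in this document
taxon_keys <- c(3129663, 2891770, 2891774, 7978544, 4989904)

# Hypothetical helper: request one DwC-A download covering all keys.
# The country filter is an assumption, not taken from the original query.
request_download <- function(keys, country = "BE") {
  rgbif::occ_download(
    rgbif::pred_in("taxonKey", keys),
    rgbif::pred("country", country),
    format = "DWCA",
    user = Sys.getenv("GBIF_USER"),
    pwd = Sys.getenv("GBIF_PWD"),
    email = Sys.getenv("GBIF_EMAIL")
  )
}

# Not run: needs rgbif, GBIF credentials and internet access.
# download <- request_download(taxon_keys)
# rgbif::occ_download_wait(download)
```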
Import text file to R:
```{r import_to_R_df}
rgbif_query_occ_df_dwc <- read_delim(
file = rgbif_query_csv_files_dwc, delim = "\t",
escape_double = FALSE, trim_ws = TRUE,
col_types = cols(recordNumber = col_character(),
catalogNumber = col_character(),
taxonID = col_character(),
organismQuantity = col_character()))
```
Number of occurrences per species:
```{r _n_occ_per_species_query_rgbif}
count_rgbif_query_occ_df_dwc <- rgbif_query_occ_df_dwc %>%
group_by(speciesKey, species) %>%
count() %>%
rename(n_dwc_rgbif_query = n)
count_rgbif_query_occ_df_dwc
```
Compare:
```{r compare4}
comparison_table <- comparison_table %>%
full_join(count_rgbif_query_occ_df_dwc,
by = c("speciesKey", "species"))
comparison_table
```
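
Rather than comparing the columns by eye, the agreement between channels could be checked programmatically. The sketch below uses base R only and assumes that every count column starts with `n_`, as in `comparison_table`; the function name `counts_agree()` is hypothetical and the toy table contains made-up counts that are not results of this analysis.

```r
# Return one logical per row: TRUE when all count columns (names starting
# with "n_") hold the same value. A missing count is treated as disagreement.
counts_agree <- function(tbl) {
  tbl <- as.data.frame(tbl) # drop tibble grouping, if any
  count_cols <- grep("^n_", names(tbl), value = TRUE)
  counts <- as.matrix(tbl[, count_cols, drop = FALSE])
  unname(apply(counts, 1, function(row) !anyNA(row) && all(row == row[1])))
}

# Toy table with made-up counts, for illustration only
toy <- data.frame(
  speciesKey = c(1, 2, 3),
  species = c("A", "B", "C"),
  n_csv = c(10, 20, 30),
  n_dwc = c(10, 21, NA),
  stringsAsFactors = FALSE
)
agreement <- counts_agree(toy)
```

Applied to the real data, `all(counts_agree(comparison_table))` would be `TRUE` only if every channel reports the same count for every species.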
# Conclusion
We can conclude that the manually downloaded occurrence data (CSV or DwC-A) and the downloads via `rgbif::occ_download_get()` (CSV or DwC-A) are consistent: no data loss was detected. The same holds when the query is assembled in R and sent to GBIF via `rgbif::occ_download()`.