---
title: "Compare CSV and DwC files from the GBIF website, GBIF manual downloads and `rgbif` automatic downloads"
author:
- Damiano Oldoni
date: "October 16, 2018"
output:
  html_document:
    toc: true
    toc_depth: 3
    toc_float: true
    number_sections: true
---

# Setup

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```

## Load libraries

Load libraries:

```{r load_libraries}
# Tidyverse packages
library(dplyr)
library(purrr)
library(readr)
# GBIF related packages
library(rgbif)
```


# Define species under study

We are interested in the following alien species:

species | kingdom | GBIF backbone key
--- | --- | ---
Baccharis halimifolia | Plantae | 3129663
Impatiens glandulifera | Plantae | 2891770
Impatiens capensis | Plantae | 2891774
Hydrocotyle ranunculoides | Plantae | 7978544
Branta canadensis | Animalia | 5232437
Harmonia axyridis | Animalia | 4989904

# Number of occurrences: GBIF website interface

GBIF backbone key | species | n_online
--- | --- | ---
2891770 | Impatiens glandulifera | 7612
2891774 | Impatiens capensis | 111
3129663 | Baccharis halimifolia | 193
4989904 | Harmonia axyridis | 18592
7978544 | Hydrocotyle ranunculoides | 1889

Moreover, online queries return the same number of occurrences whether the taxon key or the species key is given. For example, the following two queries return the same number of occurrences:

- https://www.gbif.org/occurrence/search?country=BE&species_key=2891770
- https://www.gbif.org/occurrence/search?country=BE&taxon_key=2891770
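
The same comparison can be scripted against the GBIF occurrence search API instead of the website. The following is a minimal sketch, not part of the original analysis: the helper names `gbif_count_url()` and `gbif_count()` are hypothetical, the `jsonlite` package is an extra dependency assumed to be installed, and the API parameters are written in camelCase (`taxonKey`, `speciesKey`) whereas the website URLs above use snake_case.

```{r api_count_sketch}
# Hypothetical helper: build an occurrence search URL that asks only for the
# record count (limit = 0 means no records are returned, only metadata)
gbif_count_url <- function(key_name, key, country = "BE") {
  paste0("https://api.gbif.org/v1/occurrence/search",
         "?country=", country,
         "&", key_name, "=", key,
         "&limit=0")
}

# Hypothetical helper: fetch the URL and return the `count` field of the response
# (needs network access and the jsonlite package)
gbif_count <- function(url) {
  jsonlite::fromJSON(url)$count
}

url_taxon <- gbif_count_url("taxonKey", "2891770")
url_species <- gbif_count_url("speciesKey", "2891770")

# Not run here, as it needs network access:
# identical(gbif_count(url_taxon), gbif_count(url_species))
```

Counts on GBIF change over time, so values fetched today need not match the ones reported in the table above.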

# Download data from GBIF as csv files

Species with related GBIF download keys:

```{r species_under_study_csv}
alien_species <- data.frame(
  species = c("Baccharis halimifolia",
              "Impatiens glandulifera",
              "Impatiens capensis",
              "Hydrocotyle ranunculoides",
              "Harmonia axyridis"),
  taxonKey = c(3129663,
               2891770,
               2891774,
               7978544,
               # 5232437, uncomment if problem with db "0003489-181003121212138" is solved
               4989904),
  gbif_download_key = c("0006601-181003121212138", # Baccharis halimifolia
                        "0006598-181003121212138", # Impatiens glandulifera
                        "0006594-181003121212138", # Impatiens capensis
                        "0006579-181003121212138", # Hydrocotyle ranunculoides
                        "0006589-181003121212138"), # Harmonia axyridis
  stringsAsFactors = FALSE
  ) %>%
  as_tibble()
```

Get csv files if not already downloaded:

```{r}
csv_files <- map_chr(alien_species$gbif_download_key, function(x) {
  paste0("../data/interim/", x, ".csv")
})

# Download and unzip each csv file only if it is not present yet
map2(csv_files, alien_species$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    occ <- occ_download_get(key = y,
                            overwrite = TRUE,
                            path = "../data/interim/")
    fn <- paste0(y, ".csv")
    unzip(zipfile = occ, files = fn,
          exdir = "../data/interim")
  }
})
```

Import the csv files into a data frame in R:

```{r create_occ_df_from_csv}
occ_df <- map_df(csv_files, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(recordNumber = col_character(),
                              catalogNumber = col_character()))
})
```

Number of occurrences per species:

```{r}
count_occ_df <- occ_df %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_csv = n)
count_occ_df
```
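
To put these counts next to the website counts listed earlier, the website table can be encoded as a data frame and joined on `speciesKey` and `species`. This is a minimal sketch: the `n_online` values are copied from the table in the section on the GBIF website interface, while the object `count_online` and the helper `add_online_counts()` are hypothetical names introduced here for illustration.

```{r online_counts_sketch}
library(dplyr)

# Website counts, copied from the table in the section on the GBIF website
count_online <- tibble(
  speciesKey = c(2891770, 2891774, 3129663, 4989904, 7978544),
  species = c("Impatiens glandulifera",
              "Impatiens capensis",
              "Baccharis halimifolia",
              "Harmonia axyridis",
              "Hydrocotyle ranunculoides"),
  n_online = c(7612, 111, 193, 18592, 1889)
)

# Hypothetical helper: add the website counts to any per-species count table
# that has the columns speciesKey and species
add_online_counts <- function(counts, online = count_online) {
  counts %>%
    ungroup() %>%
    full_join(online, by = c("speciesKey", "species"))
}
```

Calling `add_online_counts(count_occ_df)` would then show `n_csv` and `n_online` side by side for each species.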

# Download data from GBIF as Darwin Core Archive

Species with related GBIF download keys:

```{r species_under_study_dwc}
alien_species_dwc <- alien_species %>%
  mutate(gbif_download_key = c("0006667-181003121212138",
                               "0006669-181003121212138",
                               "0006670-181003121212138",
                               "0006673-181003121212138",
                               "0006671-181003121212138"))
```

```{r darwin_core}
csv_files_dwc <- map_chr(alien_species_dwc$gbif_download_key, function(x) {
  paste0("../data/interim/", x, ".csv")
})

# The DwC archives (zip files) are expected in ../data/interim/;
# no download is attempted in this chunk
map2(csv_files_dwc, alien_species_dwc$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    fn <- "occurrence.txt"
    unzip(zipfile = paste0("../data/interim/", y, ".zip"), files = fn,
          exdir = "../data/interim")
    file.rename(from = "../data/interim/occurrence.txt",
                to = x)
  }
})
```

Import the occurrence files extracted from the DwC archives into a data frame in R:

```{r create_occ_df_dwc}
occ_df_dwc <- map_df(csv_files_dwc, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(taxonID = col_character(),
                              recordNumber = col_character(),
                              organismQuantity = col_character(),
                              catalogNumber = col_character()))
})
```

Number of occurrences per species:

```{r}
count_occ_df_dwc <- occ_df_dwc %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_dwc = n)
count_occ_df_dwc
```

Compare:

```{r compare1}
comparison_table <- count_occ_df %>%
  full_join(count_occ_df_dwc,
            by = c("speciesKey", "species"))
comparison_table
```

# Download csv data from GBIF via `rgbif`

Download csv data via `rgbif::occ_download_get()`:

```{r get_data_via_rgbif}
rgbif_csv_files <- map_chr(alien_species$gbif_download_key, function(x) {
  paste0("../data/interim/rgbif/", x, ".csv")
})
# Create the target directory if it does not exist yet
if (!dir.exists("../data/interim/rgbif")) {
  dir.create("../data/interim/rgbif", recursive = TRUE)
}
map(alien_species$gbif_download_key, function(x) {
  if (file.exists(paste0("../data/interim/rgbif/", x, ".zip"))) {
    paste0("Zip file ", x, ".zip already exists.")
  } else {
    occ_download_get(key = x, path = "../data/interim/rgbif/")
  }
})
```

Unzip the csv files and import them into a data frame in R:

```{r create_occ_df_from_csv_rgbif}
map2(rgbif_csv_files, alien_species$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    fn <- paste0(y, ".csv")
    unzip(zipfile = paste0("../data/interim/rgbif/", y, ".zip"), files = fn,
          exdir = "../data/interim/rgbif")
  }
})

rgbif_occ_df <- map_df(rgbif_csv_files, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(recordNumber = col_character(),
                              catalogNumber = col_character()))
})
```

Number of occurrences per species:

```{r count_rgbif_csv}
count_rgbif_occ_df <- rgbif_occ_df %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_csv_rgbif = n)
count_rgbif_occ_df
```

```{r compare2}
comparison_table <- comparison_table %>%
  full_join(count_rgbif_occ_df,
            by = c("speciesKey", "species"))
comparison_table
```

# Download DwC archives from GBIF via `rgbif`

Download DwC archives via `rgbif::occ_download_get()` and extract the occurrence data from them:

```{r get_dwc_data_via_rgbif}
rgbif_csv_files_dwc <- map_chr(alien_species_dwc$gbif_download_key, function(x) {
  paste0("../data/interim/rgbif/", x, ".csv")
})
map2(rgbif_csv_files_dwc, alien_species_dwc$gbif_download_key, function(x, y) {
  if (file.exists(x)) {
    paste0("Text file ", x, " already exists.")
  } else {
    occ <- occ_download_get(key = y,
                            overwrite = TRUE,
                            path = "../data/interim/rgbif/")
    fn <- "occurrence.txt"
    unzip(zipfile = occ, files = fn,
          exdir = "../data/interim/rgbif")
    file.rename(from = "../data/interim/rgbif/occurrence.txt",
                to = x)
  }
})
```

Import the extracted occurrence files into a data frame in R:

```{r create_occ_df_from_dwc_via_rgbif}
rgbif_occ_df_dwc <- map_df(rgbif_csv_files_dwc, function(x) {
  read_delim(file = x, delim = "\t",
             escape_double = FALSE, trim_ws = TRUE,
             col_types = cols(recordNumber = col_character(),
                              catalogNumber = col_character(),
                              taxonID = col_character(),
                              organismQuantity = col_character()))
})
```

Number of occurrences per species:

```{r}
count_rgbif_occ_df_dwc <- rgbif_occ_df_dwc %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_dwc_rgbif = n)
count_rgbif_occ_df_dwc
```

Compare:

```{r compare3}
comparison_table <- comparison_table %>%
  full_join(count_rgbif_occ_df_dwc,
            by = c("speciesKey", "species"))
comparison_table
```

# Download data previously queried via `rgbif::occ_download()`

We can also use `rgbif` to query data from within R, via the function `occ_download()`. We triggered a download for the same species, which received the key `0008278-181003121212138`. We download the related DwC archive and extract the file containing the occurrences:

```{r retrieve_download}
gbif_key <- "0008278-181003121212138"
# The downloaded zip file goes in subdirectory ../data/interim/rgbif
file_path <- "../data/interim/rgbif"
rgbif_query_csv_files_dwc <- file.path(file_path,
                                       paste0("rgbif_query_", gbif_key, ".csv"))
# Create the directory (and any missing parents) if it does not exist yet
if (!dir.exists(file_path)) {
  dir.create(file_path, recursive = TRUE)
}
occ <- occ_download_get(key = gbif_key, overwrite = TRUE, path = file_path)
fn <- "occurrence.txt"
unzip(zipfile = occ, files = fn, exdir = file_path)
file.rename(from = file.path(file_path, fn),
            to = rgbif_query_csv_files_dwc)
```
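
The query that produced this download key is not shown in this document. As a rough sketch only, a comparable download could be triggered as follows with a recent version of `rgbif`. The assumptions here are: the predicate helpers `pred_in()` and `pred()` are available (they were added to `rgbif` after this analysis was written), the filter on country `BE` matches the website queries above, and the wrapper name `trigger_dwca_download()` is hypothetical. The call used in 2018 may have looked different.

```{r trigger_download_sketch}
# Backbone taxon keys of the five species analysed in this document
taxon_keys <- c(3129663, 2891770, 2891774, 7978544, 4989904)

# Hypothetical wrapper: request a Darwin Core Archive download for the given
# taxon keys, restricted to one country (assumed here to be Belgium)
trigger_dwca_download <- function(keys, country = "BE", user, pwd, email) {
  rgbif::occ_download(
    rgbif::pred_in("taxonKey", keys),
    rgbif::pred("country", country),
    format = "DWCA",
    user = user, pwd = pwd, email = email
  )
}

# Not run here: needs a GBIF account and network access
# download <- trigger_dwca_download(
#   taxon_keys,
#   user = Sys.getenv("GBIF_USER"),
#   pwd = Sys.getenv("GBIF_PWD"),
#   email = Sys.getenv("GBIF_EMAIL")
# )
```

GBIF prepares such a download asynchronously, so the resulting key can only be passed to `occ_download_get()` once the download is ready.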

Import text file to R:

```{r import_to_R_df}
rgbif_query_occ_df_dwc <- read_delim(
  file = rgbif_query_csv_files_dwc, delim = "\t",
  escape_double = FALSE, trim_ws = TRUE,
  col_types = cols(recordNumber = col_character(),
                   catalogNumber = col_character(),
                   taxonID = col_character(),
                   organismQuantity = col_character()))
```

Number of occurrences per species:

```{r n_occ_per_species_query_rgbif}
count_rgbif_query_occ_df_dwc <- rgbif_query_occ_df_dwc %>%
  group_by(speciesKey, species) %>%
  count() %>%
  rename(n_dwc_rgbif_query = n)
count_rgbif_query_occ_df_dwc
```

Compare:

```{r compare4}
comparison_table <- comparison_table %>%
  full_join(count_rgbif_query_occ_df_dwc,
            by = c("speciesKey", "species"))
comparison_table
```
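
Rather than comparing the columns by eye, the check can be automated. The sketch below assumes that all count columns start with `n_`, as they do in the comparison table; the helper name `find_inconsistent_counts()` is hypothetical, and the toy table contains invented values used only to show the behaviour.

```{r consistency_check_sketch}
library(dplyr)

# Hypothetical helper: return the rows of a comparison table whose count
# columns (names starting with "n_") disagree or contain missing values
find_inconsistent_counts <- function(tbl) {
  tbl <- ungroup(tbl)
  counts <- as.matrix(select(tbl, starts_with("n_")))
  # A row is consistent when no count is missing and all equal the first one
  consistent <- apply(counts, 1, function(row) {
    !anyNA(row) && all(row == row[1])
  })
  tbl[!consistent, ]
}

# Toy table with invented values (not GBIF results): Species B disagrees
toy_counts <- tibble(
  speciesKey = c(1, 2),
  species = c("Species A", "Species B"),
  n_csv = c(10L, 20L),
  n_dwc = c(10L, 19L)
)
inconsistent <- find_inconsistent_counts(toy_counts)
inconsistent
```

Applied to the real data, `find_inconsistent_counts(comparison_table)` should return zero rows if all download routes agree.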

# Conclusion

We can conclude that the manually downloaded occurrence data (csv or DwCA) and the downloads via `rgbif::occ_download_get()` (csv or DwCA) are consistent: no data loss was detected. The same holds true when the query is assembled in R and sent to GBIF via the `rgbif::occ_download()` function.