@sckott
Forked from ibartomeus/clean_species
Last active October 5, 2021 14:19
Cleaning species taxonomy using taxize. I want to correct synonyms and typos and drop incomplete cases.
# I have >1000 bee names to check, so I want to automate taxize for:
# fixing misspellings when possible
# updating synonyms to accepted names
# keeping ONLY accepted species (fully resolved)
# As taxize has many functions, my approach may not be optimal; comments welcome.
# If you only want to use the function, skip to the end,
# where I placed a wrap-up function.
# Example: good, synonym, typo, nonexistent, genus only.
species <- c("Osmia rufa", "Osmia bicornis", "Osmia ruffa",
"Osmia wikifluqie", "Osmia sp.")
# I want to correct synonyms and typos and drop incomplete cases.
library(taxize)
library(dplyr)
# First - fix synonyms
temp <- synonyms(species, db="itis")
synonym_ids <- grep(pattern = "acc_name", temp) #is this the optimal solution?
accepted_names <- unlist(lapply(temp[synonym_ids], '[', "acc_name"), use.names = FALSE)
species[synonym_ids] <- accepted_names
# Honestly, taxize is great, but doing things like updating synonyms takes
# quite a lot of thought to code (greps and lapplys!). Or am I missing something?
#> Scott: I agree that this could be easier
#> Opened a new issue (https://github.com/ropensci/taxize/issues/533) to make it
#> easier to go from synonyms() output to extract names into a vector (or possibly
#> optionally add to a data.frame). Also, different synonyms() sources have different
#> output data, which adds complexity
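# A possible alternative to the grep() step above, as a minimal sketch only:
# test each list element explicitly for an "acc_name" column.
# ASSUMPTION: `mock_syn` is a hand-made stand-in, NOT real synonyms() output,
# and `mock_species` is hypothetical; real output differs between sources.
mock_species <- c("Osmia rufa", "Osmia bicornis", "Osmia sp.")
mock_syn <- list(
  data.frame(acc_name = "Osmia bicornis", stringsAsFactors = FALSE),
  data.frame(syn_name = "Osmia rufa", stringsAsFactors = FALSE),
  NA
)
# TRUE only for elements that are data frames carrying an accepted name
has_acc <- vapply(mock_syn,
                  function(x) is.data.frame(x) && "acc_name" %in% names(x),
                  logical(1))
# overwrite just those entries with their first accepted name
mock_species[has_acc] <- vapply(mock_syn[has_acc],
                                function(x) x$acc_name[1], character(1))
mock_species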
# Second - fix misspellings
species2 <- unique(species)
temp <- gnr_resolve(species2, best_match_only = TRUE, canonical = TRUE)
temp
# quite good, but matched name can be a genus only...
#> Scott: what do you mean here?
#> I see that your nonexistent taxon "Osmia wikifluqie" returns
#> just "Osmia". Did you expect something else?
species2 <- temp$matched_name2
# Here we will need to recover repeated species in an efficient way, as they are dropped.
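# One way to recover the repeats, as a minimal base-R sketch: index the
# per-unique results back onto the original vector with match().
# ASSUMPTION: all names below are hypothetical, and `resolved_unique` is
# hardcoded illustrative data, not actual resolver output.
original_names <- c("Osmia bicornis", "Osmia bicornis", "Osmia ruffa",
                    "Osmia wikifluqie", "Osmia sp.")
unique_names <- unique(original_names)
# one resolved name per unique input, in the same order as unique_names
resolved_unique <- c("Osmia bicornis", "Osmia rufa", "Osmia", "Osmia")
# match() gives, for each original name, its position in unique_names
resolved_all <- resolved_unique[match(original_names, unique_names)]
resolved_all
# This relies on the resolver returning exactly one row per unique input, in
# input order; if rows can be missing, match on the returned input-name column.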
# Third - keep only accepted names.
itis_acceptname(get_tsn(species2))
x <- get_tsn(species2) # assumption: x was undefined in the original; taken to be the TSNs
vapply(x, itis_acceptname, "")
# error due to "not found" species having incompatible outputs
#> Scott: the species2 vector has all accepted names when I run through this.
#> Also, itis_acceptname is not vectorized, so I showed a vapply example above
#> Ahhhh, I see that you used a loop below, but lapply/vapply is probably easier
#> perhaps we should make itis_acceptname
#> vectorized (https://github.com/ropensci/taxize/issues/534)
out <- list()
for(i in seq_along(species2)){
out[[i]] <- itis_acceptname(get_tsn(species2[i]))
}
# All accepted, which is not what I want.
# This provides nicer output and can be used to drop unknown species AND keep synonyms.
taxas <- tax_name(query = species2, get = "species", verbose = TRUE)
# Fails because not all names have a species. It works in a for loop.
out <- list()
for(i in seq_along(species2)){
out[[i]] <- tax_name(species2[i], get = "species")
}
out2 <- plyr::ldply(out, data.frame)
# logical indexing also works when nothing is NA; -which() would then drop every element
species2[!is.na(out2$species)]
# Note: using genus does not work, because all names have a genus now.
taxas <- tax_name(query = species2, get = "genus", verbose = TRUE)
#> Scott: What taxonomic names did you want to end up with?
#> That will help me find the best solution
# session info
devtools::session_info()
Session info -----------------------------------------------------
setting value
version R version 3.3.0 Patched (2016-05-09 r70593)
system x86_64, darwin13.4.0
ui RStudio (0.99.896)
language (EN)
collate en_US.UTF-8
tz America/Los_Angeles
date 2016-05-17
Packages ---------------------------------------------------------
package * version date source
ape * 3.4 2015-11-29 CRAN (R 3.3.0)
assertthat 0.1 2013-12-06 CRAN (R 3.3.0)
bold * 0.3.5 2016-03-28 local
chron 2.3-47 2015-06-24 CRAN (R 3.3.0)
codetools 0.2-14 2015-07-15 CRAN (R 3.3.0)
crayon 1.3.1 2015-07-13 CRAN (R 3.3.0)
curl 0.9.7 2016-04-10 CRAN (R 3.3.0)
data.table * 1.9.6 2015-09-19 CRAN (R 3.3.0)
devtools * 1.11.1 2016-04-21 CRAN (R 3.3.0)
digest 0.6.9 2016-01-08 CRAN (R 3.3.0)
foreach * 1.4.3 2015-10-13 CRAN (R 3.3.0)
httr * 1.1.0 2016-01-28 CRAN (R 3.3.0)
iterators 1.0.8 2015-10-13 CRAN (R 3.3.0)
jsonlite * 0.9.20 2016-05-10 CRAN (R 3.3.0)
lattice 0.20-33 2015-07-14 CRAN (R 3.3.0)
magrittr 1.5 2014-11-22 CRAN (R 3.3.0)
memoise 1.0.0 2016-01-29 CRAN (R 3.3.0)
nlme 3.1-128 2016-05-10 CRAN (R 3.3.0)
plyr * 1.8.3 2015-06-12 CRAN (R 3.3.0)
R6 2.1.2 2016-01-26 CRAN (R 3.3.0)
Rcpp 0.12.5 2016-05-14 CRAN (R 3.3.0)
rentrez 1.0.2 2016-04-21 CRAN (R 3.3.0)
reshape 0.8.5 2014-04-23 CRAN (R 3.3.0)
reshape2 * 1.4.1 2014-12-06 CRAN (R 3.3.0)
rncl 0.6.0 2015-07-22 CRAN (R 3.3.0)
rotl * 3.0.0 2016-04-26 CRAN (R 3.3.0)
roxygen2 5.0.1 2015-11-11 CRAN (R 3.3.0)
rredlist * 0.1.0 2016-01-26 CRAN (R 3.3.0)
rstudioapi 0.5 2016-01-24 CRAN (R 3.3.0)
stringi 1.0-1 2015-10-22 CRAN (R 3.3.0)
stringr * 1.0.0 2015-04-30 CRAN (R 3.3.0)
taxize * 0.7.6.9100 <NA> local
testthat * 1.0.2 2016-04-23 CRAN (R 3.3.0)
withr 1.0.1 2016-02-04 CRAN (R 3.3.0)
XML 3.98-1.4 2016-03-01 CRAN (R 3.3.0)
xml2 * 0.1.2 2015-09-01 CRAN (R 3.3.0)
@ibartomeus

Thanks a lot! A couple of answers:

> I see that your nonexistent taxon "Osmia wikifluqie" returns just "Osmia". Did you expect something else?

I expected it to return nothing (i.e. to flag the name as wrong). But returning the genus is OK and probably makes sense. It's fine as long as you know this is the behaviour.

> Scott: What taxonomic names did you want to end up with? That will help me find the best solution

The code works fine for me and returns what I want. Thanks for the vapply() suggestion and for the issues, which I think will improve usability.

@fdschneider

Hi Scott, Hi Ignasi,
Thanks for sharing this. I wrote a function that aims to do pretty much the same thing using taxize. The input is a vector of species names and the output is a table of accepted species names according to the GBIF backbone taxonomy. I think it does not work for higher taxa, because I filter the 'rank' field to 'species', but this could be fixed.

Comments are welcome:
https://gist.github.com/fdschneider/69e61b14c12ccdda780fbc1c5f0a4f1c
