Cleaning species taxonomy using taxize. I want to correct synonyms and typos and drop incomplete cases.
# I have >1000 bee names to check, so I want to automate taxize for:
#   fixing misspellings when possible
#   updating synonyms to accepted names
#   keeping ONLY accepted species (fully resolved)
# As taxize has many functions I may not be using it optimally; comments welcome.
# If you only want to use the function, skip to the end,
# where I placed a wrap-up function.

# Example: good, synonym, typo, nonexistent, genus only.
species <- c("Osmia rufa", "Osmia bicornis", "Osmia ruffa",
             "Osmia wikifluqie", "Osmia sp.")
# I want to correct synonyms and typos and drop incomplete cases.

library(taxize)
library(dplyr)

# First - fix synonyms
temp <- synonyms(species, db = "itis")
synonym_ids <- grep(pattern = "acc_name", temp) # is this the optimal solution?
accepted_names <- unlist(lapply(temp[synonym_ids], '[', "acc_name"), use.names = FALSE)
species[synonym_ids] <- accepted_names
# Honestly, taxize is great, but doing things like updating synonyms needs
# quite a lot of thought to code (greps and lapplys!). Or am I missing something?
#> Scott: I agree that this could be easier
#> Opened a new issue (https://github.com/ropensci/taxize/issues/533) to make it
#> easier to go from synonyms() output to extract names into a vector (or possibly
#> optionally add to a data.frame). Also, different synonyms() sources have different
#> output data, which adds complexity
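
# A minimal sketch of a more explicit alternative to the grep()/lapply() step above.
# Assumptions (not confirmed against the taxize docs): each element of the
# synonyms() result is either a data.frame with an `acc_name` column (the input
# was a synonym) or something without that column. `mock_syn` is made-up data so
# the example runs offline, and `extract_accepted` is a hypothetical helper,
# not a taxize function.
extract_accepted <- function(syn_list, original) {
  accepted <- vapply(syn_list, function(x) {
    if (is.data.frame(x) && "acc_name" %in% names(x) && nrow(x) > 0) {
      as.character(x$acc_name[1])  # take the first accepted name listed
    } else {
      NA_character_                # no accepted name reported for this input
    }
  }, character(1), USE.NAMES = FALSE)
  # Keep the original name wherever no accepted name was found
  ifelse(is.na(accepted), original, accepted)
}

mock_syn <- list(
  "Osmia rufa"     = data.frame(acc_name = "Osmia bicornis", stringsAsFactors = FALSE),
  "Osmia bicornis" = data.frame(syn_name = "Osmia rufa", stringsAsFactors = FALSE),
  "Osmia sp."      = NA
)
extract_accepted(mock_syn, names(mock_syn))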

# Second - fix misspellings
species2 <- unique(species)
temp <- gnr_resolve(species2, best_match_only = TRUE, canonical = TRUE)
temp
# Quite good, but the matched name can be a genus only...
#> Scott: what do you mean here?
#> I see that your nonexistent taxon "Osmia wikifluqie" returns
#> just "Osmia". Did you expect something else?
species2 <- temp$matched_name2
# Here we will need to recover repeated species in an efficient way, as they are dropped.
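
# One possible way to recover the repeated species (a sketch, not part of taxize):
# resolve each unique name once, then use match() to map the results back onto
# the original vector. `original` and `resolved` are made-up stand-ins so the
# example runs without calling gnr_resolve().
original <- c("Osmia rufa", "Osmia ruffa", "Osmia rufa")
uniq     <- unique(original)               # one entry per distinct name
resolved <- c("Osmia rufa", "Osmia rufa")  # pretend resolver output, one per unique name
resolved[match(original, uniq)]            # same length and order as `original`
# With the real gnr_resolve() output, something along these lines may work,
# assuming its user_supplied_name column holds the names exactly as submitted:
# temp$matched_name2[match(species, temp$user_supplied_name)]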

# Third - keep only accepted names.
x <- get_tsn(species2)
itis_acceptname(x)
# error due to "not found" species having incompatible outputs
vapply(x, itis_acceptname, "")
#> Scott: the species2 vector has all accepted names when I run through this.
#> Also, itis_acceptname is not vectorized, so I showed a vapply example above
#> Ahhhh, I see that you used a loop below, but lapply/vapply is probably easier
#> perhaps we should make itis_acceptname
#> vectorized (https://github.com/ropensci/taxize/issues/534)
out <- list()
for (i in seq_along(species2)) {
  out[[i]] <- itis_acceptname(get_tsn(species2[i]))
}
# All accepted, which is not what I want.

# This provides nicer output and can be used to drop unknown species AND keep synonyms.
taxas <- tax_name(query = species2, get = "species", verbose = TRUE)
# Fails because not all names have a species; it works in a for loop.
out <- list()
for (i in seq_along(species2)) {
  out[[i]] <- tax_name(species2[i], get = "species")
}
out2 <- plyr::ldply(out, data.frame)
# Logical indexing instead of -which(): the latter returns an empty vector
# when no species is NA.
species2[!is.na(out2$species)]
# Note: using genus does not work, because all names have a genus now.
taxas <- tax_name(query = species2, get = "genus", verbose = TRUE)
#> Scott: What taxonomic names did you want to end up with?
#> That will help me find the best solution

# session info
devtools::session_info()
# Session info -----------------------------------------------------
#  setting  value
#  version  R version 3.3.0 Patched (2016-05-09 r70593)
#  system   x86_64, darwin13.4.0
#  ui       RStudio (0.99.896)
#  language (EN)
#  collate  en_US.UTF-8
#  tz       America/Los_Angeles
#  date     2016-05-17
#
# Packages ---------------------------------------------------------
#  package    * version    date       source
#  ape        * 3.4        2015-11-29 CRAN (R 3.3.0)
#  assertthat   0.1        2013-12-06 CRAN (R 3.3.0)
#  bold       * 0.3.5      2016-03-28 local
#  chron        2.3-47     2015-06-24 CRAN (R 3.3.0)
#  codetools    0.2-14     2015-07-15 CRAN (R 3.3.0)
#  crayon       1.3.1      2015-07-13 CRAN (R 3.3.0)
#  curl         0.9.7      2016-04-10 CRAN (R 3.3.0)
#  data.table * 1.9.6      2015-09-19 CRAN (R 3.3.0)
#  devtools   * 1.11.1     2016-04-21 CRAN (R 3.3.0)
#  digest       0.6.9      2016-01-08 CRAN (R 3.3.0)
#  foreach    * 1.4.3      2015-10-13 CRAN (R 3.3.0)
#  httr       * 1.1.0      2016-01-28 CRAN (R 3.3.0)
#  iterators    1.0.8      2015-10-13 CRAN (R 3.3.0)
#  jsonlite   * 0.9.20     2016-05-10 CRAN (R 3.3.0)
#  lattice      0.20-33    2015-07-14 CRAN (R 3.3.0)
#  magrittr     1.5        2014-11-22 CRAN (R 3.3.0)
#  memoise      1.0.0      2016-01-29 CRAN (R 3.3.0)
#  nlme         3.1-128    2016-05-10 CRAN (R 3.3.0)
#  plyr       * 1.8.3      2015-06-12 CRAN (R 3.3.0)
#  R6           2.1.2      2016-01-26 CRAN (R 3.3.0)
#  Rcpp         0.12.5     2016-05-14 CRAN (R 3.3.0)
#  rentrez      1.0.2      2016-04-21 CRAN (R 3.3.0)
#  reshape      0.8.5      2014-04-23 CRAN (R 3.3.0)
#  reshape2   * 1.4.1      2014-12-06 CRAN (R 3.3.0)
#  rncl         0.6.0      2015-07-22 CRAN (R 3.3.0)
#  rotl       * 3.0.0      2016-04-26 CRAN (R 3.3.0)
#  roxygen2     5.0.1      2015-11-11 CRAN (R 3.3.0)
#  rredlist   * 0.1.0      2016-01-26 CRAN (R 3.3.0)
#  rstudioapi   0.5        2016-01-24 CRAN (R 3.3.0)
#  stringi      1.0-1      2015-10-22 CRAN (R 3.3.0)
#  stringr    * 1.0.0      2015-04-30 CRAN (R 3.3.0)
#  taxize     * 0.7.6.9100 <NA>       local
#  testthat   * 1.0.2      2016-04-23 CRAN (R 3.3.0)
#  withr        1.0.1      2016-02-04 CRAN (R 3.3.0)
#  XML          3.98-1.4   2016-03-01 CRAN (R 3.3.0)
#  xml2       * 0.1.2      2015-09-01 CRAN (R 3.3.0)
Hi Scott, Hi Ignasi,
Thanks for sharing this. I wrote a function that aims to do pretty much the same thing using taxize. The input is a vector of species names and the output is a table of accepted species names according to the GBIF backbone taxonomy. I think it does not work for higher taxa, because I filter the 'rank' field to 'species', but this could be fixed.
Comments are welcome:
https://gist.github.com/fdschneider/69e61b14c12ccdda780fbc1c5f0a4f1c
Thanks a lot! A couple of answers:
> I see that your nonexistent taxon "Osmia wikifluqie" returns just "Osmia". Did you expect something else?
I expected it not to return anything (i.e. to flag the name as wrong). But returning the genus is OK and probably makes sense. It's fine as long as you know this is the behaviour.
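
If genus-only matches need to be dropped rather than kept, one option is to count the words in each matched name. This is a minimal sketch, assuming the resolved names are plain canonical strings like those in `matched_name2`; `matched` is made-up data and `is_binomial` is a hypothetical helper, not part of taxize:

```r
# Pretend resolver output: two full species names and one genus-only match
matched <- c("Osmia rufa", "Osmia bicornis", "Osmia")

# TRUE when a name has at least two whitespace-separated words (genus + epithet)
is_binomial <- function(x) {
  lengths(strsplit(trimws(x), "\\s+")) >= 2
}

matched[is_binomial(matched)]
```

Note that an unresolved input such as "Osmia sp." also has two words, so this check only makes sense on names that have already been through the resolver.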
> Scott: What taxonomic names did you want to end up with? That will help me find the best solution
The code works fine for me and returns what I want. Thanks for the vapply() suggestion and for the issues, which I think will improve usability.