@bpbond
Last active March 4, 2023 13:49
Extract coauthor names from downloaded ORCID data
# BBL 2023-03-03
library(bib2df)
library(stringr)  # str_unique() requires stringr >= 1.5.0
# Read in Bibtex file downloaded from ORCID
x <- bib2df("~/Downloads/works.bib")
# Usually we're only interested in the last five years or so;
# adjust the cutoff year below as needed
x$YEAR <- as.numeric(x$YEAR)
x <- subset(x, YEAR >= 2018)
# Extract the authors, trimming and collapsing stray whitespace so that
# the word-based splitting below is reliable
coauthors <- str_squish(unlist(x$AUTHOR))
# Some coauthors are listed {last, first} and some {first last}
# Identify which is which and standardize
lastfirst <- grepl(",", coauthors, fixed = TRUE)
# The last word is taken as the family name; everything before it as given names
lastnames <- word(coauthors[!lastfirst], -1)
firstnames <- word(coauthors[!lastfirst], start = 1, end = -2)
coauthors[!lastfirst] <- paste(lastnames, firstnames, sep = ", ")
# Remove duplicates, accounting for case, sort, and capitalize consistently
coauthors <- sort(str_unique(coauthors, ignore_case = TRUE))
coauthors <- str_to_title(coauthors)
# The resulting list isn't perfect: two-word family names will be mishandled,
# coauthors listed sometimes by initials and sometimes by full names will be
# duplicated, etc. It also has no institution info :( but it's a start
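
# A minimal sketch of the standardization steps above, run on a small made-up
# vector so the behavior (and the caveats) can be checked without a BibTeX file.
# The names in `demo` are hypothetical and not taken from any ORCID record;
# this assumes stringr is already loaded, as at the top of this script.
demo <- c("Smith, Jane", "John Q. Public", "jane smith", "Ludwig van Beethoven")
demo_lastfirst <- grepl(",", demo, fixed = TRUE)
# Rebuild {first last} entries as {last, first}, treating the last word as the
# family name
demo[!demo_lastfirst] <- paste(
  word(demo[!demo_lastfirst], -1),
  word(demo[!demo_lastfirst], start = 1, end = -2),
  sep = ", "
)
# Deduplicate ignoring case, sort, and capitalize consistently
demo <- str_to_title(sort(str_unique(demo, ignore_case = TRUE)))
print(demo)
# "Jane Smith" in its two spellings should collapse to a single entry, while
# "Ludwig van Beethoven" shows the two-word family name caveat: only
# "Beethoven" is treated as the family name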