@douglasrizzo
Last active June 27, 2022 00:19
Merging inconsistent author names in a bib file
@reox

reox commented Jan 20, 2020

This is a nice approach!

I think co-authors would also be a good signal for matching names. The problem is authors who share the same initials when only one of them appears with a full name in your database. It gets especially nasty if A. Smith is expanded to Ann B. Smith when the paper was actually written by Andrew C. Smith.
With the co-authors, you could build a score for how confident you are in each name replacement.
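
A minimal sketch of such a score, assuming each bib entry has already been parsed into a plain list of author names. The function names, the made-up author lists, and the choice of Jaccard overlap as the score are all illustrative assumptions, not part of the gist:

```python
from collections import defaultdict


def coauthor_sets(papers):
    """Map each author name to the set of names it appears alongside."""
    coauthors = defaultdict(set)
    for authors in papers:
        for name in authors:
            coauthors[name].update(a for a in authors if a != name)
    return coauthors


def score_candidates(short_name, candidates, papers):
    """Score each candidate full name by Jaccard overlap of co-author sets."""
    coauthors = coauthor_sets(papers)
    known = coauthors.get(short_name, set())
    scores = {}
    for candidate in candidates:
        other = coauthors.get(candidate, set())
        union = known | other
        scores[candidate] = len(known & other) / len(union) if union else 0.0
    return scores


# Made-up author lists, one list per bib entry.
papers = [
    ["A. Smith", "J. Doe", "M. Rossi"],
    ["Ann B. Smith", "J. Doe", "M. Rossi"],
    ["Andrew C. Smith", "K. Tanaka"],
]
scores = score_candidates("A. Smith", ["Ann B. Smith", "Andrew C. Smith"], papers)
print(scores)
```

With this toy data the score favours Ann B. Smith, because she shares both co-authors with A. Smith. In a real database the co-author names would themselves need normalizing first, so the score is only as good as that step.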

@douglasrizzo
Author

You mean, I check if A. Smith has more co-authors in common with Ann B. Smith or Andrew C. Smith and select the correct full name for A. Smith based on that? That also seems like a good heuristic, but the bib file would need to have all of these examples, whereas in the current approach, I just work with names, first letters and accents...

By the way, I did not end up applying this to my database. But it was a nice experiment.

@reox

reox commented Jan 21, 2020

> You mean, I check if A. Smith has more co-authors in common with Ann B. Smith or Andrew C. Smith and select the correct full name for A. Smith based on that?

yes, exactly!

> but the bib file would need to have all of these examples

That is true. But you could at least emit a warning when multiple replacements are found for a single name, and leave that name unreplaced.
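
A rough sketch of that guard, assuming names are non-empty and written in "First Last" order. The helper names are hypothetical, and the matching is deliberately simplified to the first initial plus the surname, so it is not the gist's own matching logic:

```python
import warnings


def initials_match(short_name, full_name):
    """True if short_name ("A. Smith") could abbreviate full_name."""
    short_parts, full_parts = short_name.split(), full_name.split()
    if short_parts[-1].lower() != full_parts[-1].lower():
        return False  # surnames differ
    return short_parts[0][0].lower() == full_parts[0][0].lower()


def safe_expand(short_name, known_full_names):
    """Expand short_name only when exactly one full name matches it."""
    candidates = sorted(
        name
        for name in known_full_names
        if name != short_name and initials_match(short_name, name)
    )
    if len(candidates) == 1:
        return candidates[0]
    if len(candidates) > 1:
        warnings.warn(
            f"{short_name!r} is ambiguous: {candidates}; leaving it unchanged"
        )
    return short_name  # ambiguous or unknown: keep the original
```

For example, `safe_expand("A. Smith", {"Ann B. Smith", "Andrew C. Smith"})` warns and returns `"A. Smith"` untouched, while `safe_expand("J. Doe", {"Jane Doe"})` returns `"Jane Doe"`.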

I did not end up applying this to my database

I would also only run this as a test, since I know my database contains such names. Sometimes it is very confusing because both people work in the same field...

@reox

reox commented Feb 25, 2020

It occurred to me that one really good solution would be ORCID. I think many articles already include their authors' ORCID iDs in the PDF.
Hence, it would be possible to grep them from the files and then have a ground truth for the names.
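
A small sketch of the grep step, assuming the PDF text has already been extracted to a string by some other tool (that part is not shown). The function names are made up for illustration; the check digit follows the ISO 7064 MOD 11-2 scheme that ORCID iDs use, which helps discard 16-digit strings that merely look like an iD:

```python
import re

# Four groups of four characters; the last one may be "X".
ORCID_RE = re.compile(r"\b(\d{4}-\d{4}-\d{4}-\d{3}[\dX])\b")


def orcid_checksum_ok(orcid):
    """Validate the final character with ISO 7064 MOD 11-2."""
    digits = orcid.replace("-", "")
    total = 0
    for ch in digits[:-1]:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    expected = "X" if result == 10 else str(result)
    return digits[-1] == expected


def find_orcids(text):
    """Return every well-formed ORCID iD in text, in order of appearance."""
    return [m for m in ORCID_RE.findall(text) if orcid_checksum_ok(m)]


# 0000-0002-1825-0097 is ORCID's sample iD; the second string is bogus.
sample = "See https://orcid.org/0000-0002-1825-0097 and 0000-0002-1825-0098."
print(find_orcids(sample))
```

This only finds the iDs. Tying each iD to the right author name on the page is a separate problem that depends on how the PDF is laid out.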

@douglasrizzo
Author

@reox the ORCID is indeed the solution to this problem, as it would work as an ID or primary key for all authors. Unfortunately, it is very rare to find ORCID information in bib files. We would need to:

  • have the PDF files of all the papers whose authors' names we want to normalize;
  • assume the PDFs contain the authors' ORCIDs; and
  • find a way to scrape them from PDF files with different formatting.

Maybe if there were a service that returned author information given a paper's DOI or title, we would have an easier time. I know JabRef can fetch paper information, sometimes including full author names, from a paper's DOI.
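
One service along these lines is Crossref's public REST API, which is not mentioned in the thread and is suggested here only as a possibility. A request to `https://api.crossref.org/works/<DOI>` returns JSON whose `message.author` list usually carries `given` and `family` fields and sometimes an `ORCID` URL. A rough sketch, with hypothetical function names and a made-up sample record:

```python
import json
import urllib.parse
import urllib.request


def authors_from_crossref_message(message):
    """Turn a Crossref work record into (full name, ORCID or None) pairs."""
    authors = []
    for entry in message.get("author", []):
        parts = (entry.get("given"), entry.get("family"))
        name = " ".join(p for p in parts if p)
        orcid = entry.get("ORCID")
        if orcid:
            orcid = orcid.rsplit("/", 1)[-1]  # strip the URL prefix
        authors.append((name or entry.get("name", ""), orcid))
    return authors


def fetch_authors(doi):
    """Look up a DOI on Crossref (needs network access)."""
    url = "https://api.crossref.org/works/" + urllib.parse.quote(doi)
    with urllib.request.urlopen(url, timeout=30) as response:
        payload = json.load(response)
    return authors_from_crossref_message(payload["message"])


# Made-up record shaped like the "message" part of a Crossref response.
sample_message = {
    "author": [
        {
            "given": "Ann B.",
            "family": "Smith",
            "ORCID": "http://orcid.org/0000-0002-1825-0097",
        },
        {"given": "J.", "family": "Doe"},
    ]
}
print(authors_from_crossref_message(sample_message))
```

Many records still list only initials and no ORCID, so this would complement the name-matching heuristic rather than replace it.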

@mhoban

mhoban commented Jun 27, 2022

@douglasrizzo thanks for this! I adapted your code to use with pyzotero so it wouldn't be necessary to go through a bibtex intermediary. https://gist.github.com/mhoban/3564f789a934028f9898b0a316588dd1
