@daeh
Last active April 12, 2024 19:44
Import Papers 3 library into Zotero
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
r"""Script to facilitate the import of a Readcube Papers 3 library into Zotero

__Purpose of this script__

If you export your Readcube (Mekentosj) Papers3 library as a BibTeX file, the file paths to the PDFs are not formatted
correctly for Zotero to import them.

The specific issues include that:
    * Papers3 does not export the file paths in a way that Zotero can understand.
    * Papers3 does not export the paths to supplementary files, so only the primary PDF is imported into Zotero.
    * Papers3 will export the primary PDF multiple times so you'll end up with multiple copies of the same PDF in Zotero.
    * Papers3 includes superfluous supplementary files that you typically don't want to import into Zotero (e.g. *.html and
    *.webarchive files).

This script will take the BibTeX file you exported from Papers3 and modify the file paths so that they can be imported into
Zotero.

__Usage__

This script takes as input a BibTeX library exported from readcube/mekentosj Papers3 and outputs a BibTeX library for Zotero
to import.

The script preserves your Papers citekeys, adds supplementary files from the Papers3 Library, removes duplicate links to
PDFs, and removes extraneous *.html and *.webarchive files that are often created by importing articles into Papers from
a web browser.

__Instructions__

* Make sure Better BibTeX is installed in Zotero beforehand if you want to preserve the Papers citekeys.

* Export your Papers3 library as a *.bib file.
    Export > BibTeX Library
    Make sure to set the "BibTex Record" option to "Complete". This will cause Papers to include the paths to the main PDF
    (or whatever) file in the *.bib export.

* Run this script with Python 3.7 or higher to generate the file 'zotero_import.bib' in the same location as the BibTeX
library export.

    * You can pass the script the paths to the Papers3 library and the BibTeX library export as command line arguments,
    e.g.:
        python Papers3_to_Zotero.py --papers "~/Documents/Library.papers3" --bibtex "~/Desktop/Library.bib"

    * Or you can modify the script by updating the 'papers_lib_hardcoded' and 'bibtex_lib_hardcoded' variables with the
    paths to your Papers3 library and the BibTeX library that you just exported. E.g.:
        papers_lib_hardcoded = "~/Documents/User Library/Library.papers3"  ### Path to Papers3 Library
        bibtex_lib_hardcoded = "~/Desktop/full_library_export.bib"  ### Path to Papers BibTeX library export

* Running the script will generate a new BibTeX file, 'zotero_import.bib', in the same location as the BibTeX library
export.

* Import the generated 'zotero_import.bib' file into Zotero.

* Be sure to check the 'Import errors found:' file if Zotero generates one (if it exists, it will be in whatever folder you
imported the library to; sort by title to find it).

* Also check that special characters in titles and journal names were imported correctly. Sometimes '{\&}' in the
zotero_import.bib will be imported as '<span class="nocase">&</span>'. I'm not sure why or when this happens. You can
search for "</span>" to check.

__NOTE__

The Collections groupings are not preserved with this method. This is one way to manually get your Papers3 Collections into
Zotero after following the above instructions:

* Export each collection as a BibTeX library ("Export" set to "Selected Collection" and "BibTex Record" set to "Standard").
This will prevent any file paths from being included in the *.bib file.

* Import that *.bib file directly to Zotero with the option to "Place imported collections and items into new collection"
selected.

* Then merge the duplicate records. That will give you a new collection with links to the right papers from your Zotero
library.

* In this strategy, you have to do that for each one of your Papers3 Collections. Not ideal but maybe tolerable.

__Author__

Dae Houlihan

__Source__

https://gist.github.com/daeh/abc6d46d897b58a657699fa1a408573e
"""

import argparse
import re
import sys
from pathlib import Path
from warnings import warn


def main(papers=None, bibtex=None):
    ################################################
    ### Update these paths or pass via command line:
    ################################################

    ### Path to Papers3 Library ###
    papers_lib_hardcoded = "~/Documents/Library.papers3"

    ### Path to the BibTeX export of the Papers3 Library ###
    bibtex_lib_hardcoded = "~/Desktop/library.bib"

    ################################################

    # Command line arguments take precedence over the hardcoded paths
    papers_lib = papers_lib_hardcoded if papers is None else papers
    bibtex_lib = bibtex_lib_hardcoded if bibtex is None else bibtex

    papers_library = Path(papers_lib).expanduser()
    bibtex_library = Path(bibtex_lib).expanduser()

    # Library path with parentheses escaped, for use inside the regexes below
    papers_library_string = str(papers_library).replace(r"(", r"\(").replace(r")", r"\)") + r"/"

    if papers_library_string[-9:] != ".papers3/":
        raise Exception(
            f"The variable 'papers_library' should end with '.papers3' but is rather: \n\t{str(papers_library)}"
        )

    if not papers_library.is_dir():
        raise Exception(
            f"The path you provided to the Papers3 library does not seem to exist or is not a directory: \n\t{str(papers_library)}"
        )

    if not (bibtex_library.is_file() and bibtex_library.suffix == ".bib"):
        raise Exception(
            f"The path you provided to the BibTeX Library file you exported from Papers3 does not seem to exist or is not a '.bib' file: \n\t{str(bibtex_library)}"
        )

    out, missing = list(), list()

    with open(bibtex_library, "r") as btlib:
        for line in btlib:
            if line.startswith("file = {"):
                # Collapse doubled braces, then drop a duplicated link to the same file
                templine = re.sub(r"^file = {{(.*?)}},?", r"file = {\1},", line, flags=re.M)
                newline = re.sub(r"^file = {(.*?);(\1)},?", r"file = {\1},", templine, flags=re.M)
                assert ";" not in newline  # assert that this line references only one file

                # Extract the path of the primary file relative to the Papers3 library
                search_str = r"^file = {.*?:" + papers_library_string + r"(.*?\..*?):(.*?/.*?)},?"
                filepath_relative = re.search(search_str, newline)
                assert isinstance(
                    filepath_relative, re.Match
                ), f"Unable to match regex expression:: \n{search_str} \nwith entry from BibTex:: \n{newline}"

                primary_file_path = papers_library / filepath_relative.group(1)
                if not primary_file_path.is_file():
                    warn(f"The linked file was not found: {primary_file_path}", UserWarning)
                    missing.append(primary_file_path)

                # Collect supplementary files stored next to the primary file
                supp_files = list()
                for dir_extra in ["Supplemental", "Media"]:
                    supp_dir = primary_file_path.parents[0] / dir_extra
                    if supp_dir.exists():
                        for x in supp_dir.iterdir():
                            if (
                                x.is_file()
                                and x.suffix not in [".html", ".webarchive"]
                                and str(x) != str(primary_file_path)
                            ):
                                supp_files.append(x)

                if len(supp_files) > 0:
                    # Reopen the file field and append one entry per supplementary file
                    search_str_supp = (
                        r"(^file = {.*?:" + papers_library_string + r".*?\..*?:application/.*?)},?"
                    )
                    primary_line = re.search(search_str_supp, newline)
                    assert isinstance(
                        primary_line, re.Match
                    ), f"Unable to match regex expression:: \n{search_str_supp} \nwith entry from BibTex:: \n{newline}"
                    newline = primary_line.group(1)
                    for x in supp_files:
                        print(f"adding supplementary file for {x.name}")
                        newline += f';{x.with_suffix("").name + " Supp" + x.suffix}:{x}:application/{x.suffix}'
                    newline += "},\n"

                out.append(newline)
            else:
                out.append(line)

    ### New BibTeX record to import into Zotero
    modified_lib = bibtex_library.parents[0] / "zotero_import.bib"
    with open(modified_lib, "w", encoding="utf-8") as outfile:
        for item in out:
            outfile.write(item)

    if missing:
        print("\n\nList of missing files::\n")
        for mf in missing:
            print(mf)
        print(
            f"\n\nScript completed but {len(missing)} files referenced in the BibTeX library were not located. They are listed above."
        )
    else:
        print(
            f"\n\nScript appears to have completed successfully. You can now import this file into Zotero (make sure Better BibTeX is already installed): \n\t{str(modified_lib)}"
        )

    return 0


def _cli():
    parser = argparse.ArgumentParser(
        description=__doc__, formatter_class=argparse.ArgumentDefaultsHelpFormatter, argument_default=argparse.SUPPRESS
    )
    parser.add_argument("-p", "--papers", help="Path to Papers3 Library")
    parser.add_argument("-b", "--bibtex", help="Path to the BibTeX export")
    args = parser.parse_args()
    return vars(args)


if __name__ == "__main__":
    sys.exit(main(**_cli()))
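
To get a feel for what the script does to each `file = {...}` line before it touches a real library, the two substitutions can be tried in isolation. This is a minimal sketch: the helper name `normalize_file_line` and the sample entry are hypothetical, and the `name:path:type` layout of the sample is an assumption inferred from the script's regexes rather than from a real Papers3 export.

```python
import re


def normalize_file_line(line):
    """Apply the script's two clean-up substitutions to one 'file = {...}' line."""
    # "file = {{...}}," becomes "file = {...},"
    line = re.sub(r"^file = {{(.*?)}},?", r"file = {\1},", line, flags=re.M)
    # "file = {X;X}," (the same link listed twice) becomes "file = {X},"
    line = re.sub(r"^file = {(.*?);(\1)},?", r"file = {\1},", line, flags=re.M)
    return line


# Hypothetical entry; a real export will contain your own paths.
entry = "Smith 2019.pdf:/Users/me/Library.papers3/Smith/Smith 2019.pdf:application/pdf"
sample = "file = {{" + entry + ";" + entry + "}},\n"

cleaned = normalize_file_line(sample)
print(cleaned)
```

If the cleaned line still contains a `;`, the script's assertion that the line references only one file would fail, so this is a quick way to check a problematic entry by hand.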
@daeh
Author

daeh commented Oct 6, 2022

Hi @cl-bu — It has been a while since I wrote this so it's possible something about Papers3 has changed. I can take a quick look and see if there's a simple fix if you upload the Library.bib file. If you're up for it, you can also send me a zip of Library.papers3 via a filesharing service: daeda [at] mit.edu (depending on where the error is coming from I might not need the library, but it could help me figure out if there's a simple solution).

@cl-bu

cl-bu commented Oct 6, 2022

Hi @daeh,
thank you - that is very kind! I will send you the .bib file in a couple of minutes, and am currently uploading my Papers library to my university's file sharing service. The .zip file is roughly 60 gigabytes, so it will take some time before I can send you a link.

I hope that your script can help me solve my biggest headache regarding the transfer of my library from Papers 3 to Zotero. I have a rather large library (ca. 6,300 entries) with supplemental files attached to many of these entries. When I export my library to a bibtex file (set to complete record), two problems arise: (1) All supplemental files are listed as .pdf files (even though they are usually .docx files), and (2) the respective file path to these supplemental files omits the folder in which those files are actually stored (e.g., the correct file path for a supplemental file ought to be: /Users/myusername/Materialien/Library.papers3/authorname/supplemental/, but it always omits the "/supplemental" part of the path).

I am working in Chinese studies, therefore many library entries are in Chinese - not sure if this could cause problems when running the script.

Again, many thanks!

@daeh
Author

daeh commented Oct 7, 2022

@cl-bu I updated the script to be more robust to your file names.

One of the reasons I wrote this script was that Papers3 was not exporting the paths to supplementary files, only the primary file. This script scans your Papers3 database and adds all the supplementary files to the zotero_import.bib file that the script generates.

Are you saying that Papers3 is exporting the supplementary files path but is changing the filenames of the supplementary files? If so, that would be a new Papers3 behavior that I haven't seen. But I suspect that what you're seeing is the original behavior, and the script should work fine now that it can handle your filenames. If you get the current script off the gist, and then run

python Papers3_to_Zotero.py --papers "~/Materialien/Library.papers3" --bibtex "~/Desktop/Library.bib"

the zotero_import.bib file should include all of your supplementary files, including the .docx documents. Let me know if that doesn’t work.
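
The scan described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the script itself: the function name `find_supplementary_files` and the throwaway folder layout are hypothetical, while the `Supplemental` and `Media` folder names and the skipped `.html` and `.webarchive` suffixes are taken from the script.

```python
import tempfile
from pathlib import Path

SKIPPED_SUFFIXES = {".html", ".webarchive"}


def find_supplementary_files(primary_file):
    """Return files in the 'Supplemental' and 'Media' folders next to primary_file."""
    primary_file = Path(primary_file)
    found = []
    for folder_name in ("Supplemental", "Media"):
        folder = primary_file.parent / folder_name
        if not folder.is_dir():
            continue
        for candidate in sorted(folder.iterdir()):
            if candidate.is_file() and candidate.suffix not in SKIPPED_SUFFIXES:
                found.append(candidate)
    return found


# Build a temporary folder layout (hypothetical names) to try the function on.
with tempfile.TemporaryDirectory() as tmp:
    author_dir = Path(tmp) / "Library.papers3" / "Smith"
    (author_dir / "Supplemental").mkdir(parents=True)
    primary = author_dir / "Smith 2019.pdf"
    primary.touch()
    (author_dir / "Supplemental" / "appendix.docx").touch()
    (author_dir / "Supplemental" / "snapshot.webarchive").touch()

    names = [p.name for p in find_supplementary_files(primary)]
    print(names)
```

Because the lookup is based only on the folder that holds the primary file, the file type of the supplement (.docx, .txt, .djvu) does not matter, as long as it is not one of the skipped suffixes.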

@cl-bu

cl-bu commented Oct 7, 2022

@daeh,
thank you for the updated script! I will let you know how it works.

In Papers3, if I select the command Export > BibTex Library from the "File" drop-down menu, the resulting document does indeed only contain .pdf filenames (irrespective of their actual format, usually .docx, but on occasion also .txt or .djvu). The exported file path to supplemental files is invariably wrong, always lacking the last "/supplemental/" folder part.

Unfortunately, I am using the last and most recent version of Papers3 (3.4.25). Readcube does not support this app anymore; besides, you also cannot import .docx and other non-PDF files into their own Readcube Papers 4 app (I have been in contact with Readcube's support team). I am not sure if I could find an older version of Papers3 online.

Within the package contents of the Papers3 app, there is a file called MTExporterBibTeX.nib (the file path is /Applications/Papers\ 3\ (Legacy).app/Contents/Resources/English.lproj/MTExporterBibTeX.nib), which I guess might be responsible for producing a .bib file of my Papers Library, but I neither know the file format nor do I know how to make changes to it.

@daeh
Author

daeh commented Oct 7, 2022

@cl-bu One of the reasons I wrote this script was that Papers3 was not exporting the paths to supplementary files, only the primary file. This script scans your Papers3 database and adds all the supplementary files to the zotero_import.bib file that the script generates.

I suspect this is what you're seeing and the script should work fine now that it can handle your filenames. If you get the current script off the gist, and then run

python Papers3_to_Zotero.py --papers "~/Materialien/Library.papers3" --bibtex "~/Desktop/Library.bib"

the zotero_import.bib file should include all of your supplementary files, including the .docx documents. Let me know if that doesn’t work.

@cl-bu

cl-bu commented Oct 8, 2022

@daeh, now everything worked perfectly - many thanks for updating the script! I had to tinker with the folder organization of my Papers3 library. I used to organize my files on the basis of author name only, so too many files without clear authorship (when no author was given, etc.) were collected in a folder called "Unknown", which messed up the Python script's assignment of supplementary files. After implementing a more specific folder organization scheme (organized on the basis of authorship, editorship, source, and year), I am getting very good results. Thanks to you I am now finally able to migrate my Papers3 library to Zotero in a way that it remains usable for me.
Next, I will use your above-described method for exporting my Papers 3 collections (unfortunately quite a lot...).
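
The "Unknown" folder problem follows from how the script looks for supplements: it checks the `Supplemental` and `Media` folders next to each primary file, so entries that share a folder would each pick up the same supplementary files. A rough way to spot such folders before running the script is to group the primary file paths by parent folder. This is only a sketch; the function name `folders_with_multiple_primaries` and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def folders_with_multiple_primaries(primary_paths):
    """Map each folder that holds more than one primary file to those file names."""
    by_folder = defaultdict(list)
    for raw_path in primary_paths:
        path = PurePosixPath(raw_path)
        by_folder[str(path.parent)].append(path.name)
    return {folder: names for folder, names in by_folder.items() if len(names) > 1}


# Hypothetical primary file paths, e.g. collected from the 'file' fields of the export.
paths = [
    "/Users/me/Library.papers3/Unknown/Report A.pdf",
    "/Users/me/Library.papers3/Unknown/Report B.pdf",
    "/Users/me/Library.papers3/Smith/2019/Smith 2019.pdf",
]

shared = folders_with_multiple_primaries(paths)
print(shared)
```

Folders reported here are the ones where a more specific organization scheme in Papers3 (for example by author and year) is likely to help.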

@chandrachekuri

This is 2024 but I still managed to export from my old Papers3 database/application and import to Zotero. I just wanted to add my note of thanks.

@daeh
Author

daeh commented Apr 12, 2024

Wow. 5 years is a much longer lifespan for this than I expected. Glad it was useful!
