Skip to content

Instantly share code, notes, and snippets.

@SiccarPoint
Last active October 12, 2022 09:09
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save SiccarPoint/2729c64e4a5fa009d0e92fda68b95ddf to your computer and use it in GitHub Desktop.
Save SiccarPoint/2729c64e4a5fa009d0e92fda68b95ddf to your computer and use it in GitHub Desktop.
Clean up an imported reference xml in Zotero that contains bad html tags from e.g. Papers3 export
# Guaranteed for mac only due to syntax of sed function, but changes for
# other OS should be minor.
# Ensure you have sqlite3 on your machine.
# Remember to set unix permissions on this file before running:
# chmod 755 fix_html_xml.sh
# and run:
# ./fix_html_xml.sh
# This script should run clean! If you see errors, it won't have worked.
# Will probably take a while to run (mins)
# see also https://forums.zotero.org/discussion/77910/switching-to-zotero-from-papers for other cleanup advice
# First, from Papers, export library as xml.
# Script assumes you then import the xml into Zotero to produce the imperfect
# db as zotero.sqlite, though it should be fairly obvious how to modify this
# to work on the native xml export.
# Make sure zotero is closed or the db will be write-protected.
# Note that even after cleaning, the export-import process is not perfect;
# notably, book chapters get corrupted such that the Book name itself is
# removed from the record. This probably requires manual correction after
# import.
# Create a backup of the old zotero.sqlite:
cp zotero.sqlite zotero_backup_at_import.sqlite
# then copy it to plain text sql:
sqlite3 zotero.sqlite .dump > zotero.sql
# or alternatively (interactive):
# sqlite3 "zotero.sqlite"
# # in the prompt, do
# # > .output zotero.sql
# # > .dump
# # > .exit
# Now replace the bad HTML tags in the file
# First few of these may exist if you've exported you library before:
# print "&<"
sed -i "" -e "s/&amp;lt;/</g" zotero.sql
# print "&>"
sed -i "" -e "s/&amp;gt;/>/g" zotero.sql
# print "&-"
sed -i "" -e "s/&amp;ndash;/-/g" zotero.sql # is - correct?
# print "&quot"
sed -i '' -e 's/&amp;quot;/"/g' zotero.sql # note quote change
# print "&&"
sed -i '' -e 's/&amp;amp;/\&/g' zotero.sql # escape the &
# then
# print "<"
sed -i "" -e "s/&lt;/</g" zotero.sql
# print ">"
sed -i "" -e "s/&gt;/>/g" zotero.sql
# print "-"
sed -i "" -e "s/&ndash;/-/g" zotero.sql # is - correct?
# print "quot"
sed -i '' -e 's/&quot;/"/g' zotero.sql
# print "&'"
sed -i "" -e "s/&amp;apos;/''/g" zotero.sql
# ^double quote loads as single in sqlite db
# print "'"
sed -i "" -e "s/&apos;/''/g" zotero.sql
# print "& alone"
sed -i "" -e "s/&amp;/\&/g" zotero.sql # final clean up of true &!
# Remove the old db (remember, it's backed up)
rm zotero.sqlite
sqlite3 zotero_in.sqlite < zotero.sql
# then rename zotero_in.sqlite as zotero.sqlite, retaining a backup
cp zotero_in.sqlite zotero.sqlite
# Once happy you can safely remove zotero.sql and zotero_in.sqlite:
rm zotero.sql
rm zotero_in.sqlite
# Probably safer to retain zotero_backup_at_import.sqlite
# If any HTML tags survive, modify the above or delete by hand in Zotero
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment