@SimonSapin
Created December 21, 2020 10:51
Make a PDF file editable in LibreOffice, using Poppler and Inkscape
#!/usr/bin/env python3
"""
Make a PDF file editable in LibreOffice.

LibreOffice Draw can import and export PDF files, effectively making it a PDF editor.
However, text in imported documents often looks broken:
for example, it is rendered with a different font and overflows into the next column.

What I suspect happens is that PDF files typically embed every font they use,
but LibreOffice’s internal document model does not support embedded fonts,
so the importer tries to find a similar system font.
That system font can have different metrics,
such that a given run of text takes more horizontal space at the same "font size".

Inkscape can also import and export PDF files, also effectively making it a PDF editor.
Its importer can optionally convert all text glyphs into their outlines,
working around any font matching issue
at the cost of making the file size larger and the existing text not (easily) editable.
However, Inkscape’s internal document model does not know about pages,
so it can only work with one page at a time.

This script combines both by using:

* Inkscape’s headless mode and command-line interface to convert glyphs to outlines
  in a given PDF file, one page at a time, in a temporary directory.
* Poppler’s command-line tools to find out the number of pages
  and combine the resulting one-page PDF files into one multi-page PDF file.
* LibreOffice in headless mode to convert that PDF file to an OpenDocument Graphics file
  in the original directory, with a given suffix added to the file name.
* LibreOffice in GUI mode in a detached process to open the new file and let you start editing.
  We could merge these last two steps and open the PDF file in GUI mode,
  but separating them allows the “Save” button to do the right thing.

The following executables are expected to be in `$PATH`:

* `libreoffice`
* `inkscape`
* `pdfinfo` and `pdfunite` from Poppler. Some distributions have them in a `poppler-utils` package.
"""

import os
import sys
import subprocess
import tempfile


def main(filename, suffix):
    """
    Example usage:

        pdfedit.py Important_document.pdf editable

    … creates `Important_document - editable.odg` and opens it in LibreOffice.
    """
    # dirname() returns an empty string for a bare file name;
    # use the current directory in that case.
    destination_dir = os.path.dirname(filename) or "."
    assert filename.endswith(".pdf")
    name = os.path.basename(filename)[:-len(".pdf")]
    destination_name = f"{name} - {suffix}"
    destination_extension = "odg"
    destination = os.path.join(destination_dir, f"{destination_name}.{destination_extension}")
    assert not os.path.exists(destination)

    # Ask Poppler for the page count.
    info = subprocess.check_output(["pdfinfo", filename]).decode("utf8")
    for line in info.splitlines():
        if line.startswith("Pages:"):
            page_count = int(line.split()[1])
            break
    else:
        sys.exit("Could not find a page count in the output of pdfinfo")

    with tempfile.TemporaryDirectory() as tmp:
        # Convert glyphs to outlines with Inkscape, one page at a time.
        page_numbers = range(1, page_count + 1)
        page_pdfs = [os.path.join(tmp, f"page_{n}.pdf") for n in page_numbers]
        for n, pdf in zip(page_numbers, page_pdfs):
            subprocess.check_call([
                "inkscape",
                "--pdf-poppler",
                f"--pdf-page={n}",
                "--export-filename=" + pdf,
                filename,
            ])

        # Combine the one-page files into a single multi-page PDF file.
        united = os.path.join(tmp, destination_name + ".pdf")
        subprocess.check_call(["pdfunite"] + page_pdfs + [united])

        # Convert to OpenDocument Graphics next to the original file.
        subprocess.check_call([
            "libreoffice",
            "--convert-to", destination_extension,
            "--outdir", destination_dir,
            united,
        ])
        # Open the result in the GUI, detached from this process.
        subprocess.Popen(["libreoffice", destination], start_new_session=True)


if __name__ == "__main__":
    main(*sys.argv[1:])
@SimonSapin
Author

Left as an exercise for the reader: run Inkscape processes for each page in parallel.
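
One possible take on that exercise, as a minimal sketch and not part of the script itself: a thread pool can start one Inkscape process per page, since each thread only waits for its subprocess to finish. The names `inkscape_command`, `convert_pages`, `max_workers` and `run` are hypothetical. The sketch assumes that several Inkscape instances can run at the same time without interfering with each other, which has not been verified here.

```python
import os
import subprocess
from concurrent.futures import ThreadPoolExecutor


def inkscape_command(filename, page_number, output_pdf):
    """Build the same Inkscape invocation the script uses for one page."""
    return [
        "inkscape",
        "--pdf-poppler",
        f"--pdf-page={page_number}",
        "--export-filename=" + output_pdf,
        filename,
    ]


def convert_pages(filename, tmp, page_count, max_workers=None, run=subprocess.check_call):
    """Convert all pages concurrently; return the output paths in page order.

    `run` is a hypothetical hook so the function can be exercised
    without Inkscape installed. By default it starts a real subprocess.
    """
    page_numbers = range(1, page_count + 1)
    page_pdfs = [os.path.join(tmp, f"page_{n}.pdf") for n in page_numbers]
    commands = [
        inkscape_command(filename, n, pdf)
        for n, pdf in zip(page_numbers, page_pdfs)
    ]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # list() consumes every result, so an exception raised by any
        # command is re-raised here instead of being silently dropped.
        list(pool.map(run, commands))
    return page_pdfs
```

In `main`, the `for n, pdf in zip(...)` loop could then be replaced by `page_pdfs = convert_pages(filename, tmp, page_count)`. Passing a small `max_workers` value limits how many Inkscape processes run at once, which may matter for memory use on long documents.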
