@fionn
Last active June 23, 2024 17:25
Find metadata of images embedded in PDFs
venv/
__pycache__/
.mypy_cache/
export PIP_DISABLE_PIP_VERSION_CHECK=1

venv: requirements.txt requirements_dev.txt
	@python3 -m venv $@
	@. $@/bin/activate && pip install -r $< -r requirements_dev.txt
	@echo "enter virtual environment: source $@/bin/activate"
#!/usr/bin/env python3

import sys
import hashlib
from io import BytesIO
from pathlib import Path

import pymupdf # type: ignore
from PIL import Image, UnidentifiedImageError

def main() -> None:
    """Entry point"""
    document_path = Path(sys.argv[1]).expanduser()
    # pymupdf opens the file itself, so pass the path rather than a text-mode file object
    with pymupdf.open(document_path) as document:
        print(document.get_xml_metadata())

        # Digests of image streams already reported, used to skip duplicates
        hashes: set[bytes] = set()
        for xref in range(1, document.xref_length()):
            if stream := document.xref_stream(xref):
                if document.xref_get_key(xref, "Subtype") == ("name", "/Image"):
                    checksum = hashlib.md5(stream)
                    if checksum.digest() not in hashes:
                        hashes.add(checksum.digest())
                        print(xref, " \t", checksum.hexdigest())
                        try:
                            image = Image.open(BytesIO(stream))
                            if exif := image.getexif():
                                print(exif)
                        except UnidentifiedImageError:
                            print("Unidentified image type")

if __name__ == "__main__":
    main()
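
The duplicate check in the script above can be pulled out into a small helper, which makes it easier to see what gets skipped. This is only a sketch using the standard library: `unique_streams` and the sample data are hypothetical names for illustration and are not part of the gist. It keeps MD5, as the script does, purely for deduplication.

```python
import hashlib
from collections.abc import Iterable, Iterator


def unique_streams(streams: Iterable[tuple[int, bytes]]) -> Iterator[tuple[int, str]]:
    """Yield (xref, hex digest) for the first occurrence of each distinct stream."""
    seen: set[bytes] = set()
    for xref, data in streams:
        checksum = hashlib.md5(data)  # same hash as the script, for deduplication only
        if checksum.digest() in seen:
            continue  # identical bytes were already reported under an earlier xref
        seen.add(checksum.digest())
        yield xref, checksum.hexdigest()


# Hypothetical sample: xrefs 7 and 9 carry identical bytes, so 9 is skipped
sample = [(7, b"jpeg-bytes"), (9, b"jpeg-bytes"), (12, b"png-bytes")]
for xref, digest in unique_streams(sample):
    print(xref, "\t", digest)
```

In the script, the pairs would come from `document.xref_stream(xref)` for each image xref instead of the hard-coded sample.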

fionn commented Jun 18, 2024

Run

pip install -r requirements.txt

to install dependencies. Then execute ./pdf-exif.py path/to/file.pdf, replacing the path with that of your PDF.


fionn commented Jun 23, 2024

To see metadata, run pdfinfo -meta path/to/file.pdf or exiftool path/to/file.pdf.

To strip document-level metadata (this only suffices if embedded objects such as images carry no metadata of their own!), run

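exiftool -all:all="" path/to/file.pdf                     # to remove metadata pointers (edits the file in place)
mutool clean -gggglcs path/to/file.pdf path/to/clean.pdf  # to garbage collect, possibly unnecessary (without an output path, mutool writes out.pdf in the current directory)
qpdf --linearize path/to/clean.pdf path/to/out.pdf        # to actually remove the metadata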

and test:

  • smoke test with pdfinfo -meta path/to/out.pdf,
  • open the binary file (with e.g. vim -b or xxd), search for metadata tags and ensure there are no matches.
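
The manual search in the second check can be approximated with a short script. This is a rough sketch, not a complete audit: the `MARKERS` list is an assumed selection of common PDF, XMP and Exif tokens, and `find_markers` and `scan_file` are hypothetical helper names. Like a search in vim or xxd, it only sees raw bytes, so anything inside a compressed stream will not match.

```python
from pathlib import Path

# Assumed marker list: common PDF/XMP/Exif metadata tokens, extend as needed
MARKERS: tuple[bytes, ...] = (
    b"/Metadata",
    b"<x:xmpmeta",
    b"<?xpacket",
    b"/Author",
    b"/Creator",
    b"/Producer",
    b"Exif",
)


def find_markers(data: bytes, markers: tuple[bytes, ...] = MARKERS) -> dict[bytes, list[int]]:
    """Return the byte offsets at which each marker occurs in data."""
    hits: dict[bytes, list[int]] = {}
    for marker in markers:
        start = data.find(marker)
        while start != -1:
            hits.setdefault(marker, []).append(start)
            start = data.find(marker, start + 1)
    return hits


def scan_file(path: Path) -> dict[bytes, list[int]]:
    """Scan the raw bytes of a file for the default markers."""
    return find_markers(path.read_bytes())


# Hypothetical in-memory sample standing in for a real PDF
sample = b"%PDF-1.7\n1 0 obj << /Producer (demo) >> endobj\n"
print(find_markers(sample))
```

For a real file, call `scan_file(Path("path/to/out.pdf"))`; an empty dictionary means none of the listed markers appear in the uncompressed bytes.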

You can then add metadata such as Title, Subject and so on using exiftool.


Tools (recommended to install with Homebrew): pdfinfo, exiftool, mutool, qpdf.
