Last active
June 23, 2024 17:25
Find metadata of images embedded in PDFs
.gitignore:

venv/
__pycache__/
.mypy_cache/
Makefile:

export PIP_DISABLE_PIP_VERSION_CHECK=1

# Recipes run under /bin/sh, which may not provide `source`, so use `.` instead.
venv: requirements.txt requirements_dev.txt
	@python3 -m venv $@
	@. $@/bin/activate && pip install -r $< -r requirements_dev.txt
	@echo "enter virtual environment: source $@/bin/activate"
pdf-exif.py:

#!/usr/bin/env python3

import sys
import hashlib
from io import BytesIO
from pathlib import Path

import pymupdf  # type: ignore
from PIL import Image, UnidentifiedImageError


def main() -> None:
    """Entry point"""
    document_path = Path(sys.argv[1]).expanduser()
    # pymupdf opens the file by path itself; no separate file handle is needed.
    document = pymupdf.open(document_path)

    # Document-level XMP metadata.
    print(document.get_xml_metadata())

    hashes: set[bytes] = set()  # digests of image streams already reported
    for xref in range(1, document.xref_length()):
        if stream := document.xref_stream(xref):
            if document.xref_get_key(xref, "Subtype") == ("name", "/Image"):
                checksum = hashlib.md5(stream)
                # Report each distinct image once, even if several xrefs share it.
                if checksum.digest() not in hashes:
                    hashes.add(checksum.digest())
                    print(xref, " \t", checksum.hexdigest())
                    try:
                        image = Image.open(BytesIO(stream))
                        if exif := image.getexif():
                            print(exif)
                    except UnidentifiedImageError:
                        print("Unidentified image type")


if __name__ == "__main__":
    main()
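The duplicate check inside the loop can be tried on its own, without a PDF. The following is a minimal sketch and not part of the gist: `unique_streams` is a hypothetical helper name, and the byte strings stand in for the image streams that `document.xref_stream` would return.

```python
import hashlib
from collections.abc import Iterable, Iterator


def unique_streams(streams: Iterable[bytes]) -> Iterator[tuple[str, bytes]]:
    """Yield (hex digest, stream) for each stream not seen before.

    Hypothetical helper: mirrors the script's MD5-based duplicate check.
    """
    seen: set[bytes] = set()
    for stream in streams:
        checksum = hashlib.md5(stream)
        if checksum.digest() in seen:
            continue  # identical bytes were already reported
        seen.add(checksum.digest())
        yield checksum.hexdigest(), stream


# Stand-in data: two distinct "images", one of them repeated.
samples = [b"image-a", b"image-b", b"image-a"]
for digest, data in unique_streams(samples):
    print(digest, len(data))
```

MD5 serves here only as a fast fingerprint for byte equality, as in the script; it is not a security measure.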
requirements.txt:

pymupdf
pillow
requirements_dev.txt:

pylint
mypy
To see metadata, run pdfinfo -meta path/to/file.pdf or exiftool path/to/file.pdf.
To strip metadata (if embedded objects do not contain metadata!),

    exiftool -all:all="" path/to/file.pdf              # to remove metadata pointers
    mutool clean -gggglcs path/to/file.pdf             # to garbage collect, possibly unnecessary
    qpdf --linearize path/to/file.pdf path/to/out.pdf  # to actually remove the metadata

and test:
- smoke test with pdfinfo -meta path/to/out.pdf,
- open the binary file (with e.g. vim -b or xxd), search for metadata tags and ensure there are no matches.
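The manual search in the second test step can also be scripted. This is a rough sketch under stated assumptions: `find_markers` and the `MARKERS` list are hypothetical and were chosen for illustration, not taken from the comment above. Like a search in vim -b or xxd, it only sees bytes stored uncompressed in the file.

```python
# Assumed marker list; extend it with whatever tags matter for your files.
MARKERS: tuple[bytes, ...] = (b"<x:xmpmeta", b"<?xpacket", b"/Metadata", b"Exif")


def find_markers(
    data: bytes, markers: tuple[bytes, ...] = MARKERS
) -> dict[bytes, list[int]]:
    """Return each marker found in data with the byte offsets of its matches."""
    found: dict[bytes, list[int]] = {}
    for marker in markers:
        offset = data.find(marker)
        while offset != -1:
            found.setdefault(marker, []).append(offset)
            offset = data.find(marker, offset + 1)
    return found


# In-memory stand-in for the contents of a PDF file.
sample = b"%PDF-1.7 <x:xmpmeta> ... Exif Exif"
for marker, offsets in find_markers(sample).items():
    print(marker.decode(), offsets)
```

To check a real file, pass it the result of `Path("path/to/out.pdf").read_bytes()` (with `from pathlib import Path`); an empty dictionary means none of the listed markers were found.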
You can then add metadata such as Title, Subject and so on using exiftool.
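As an illustration of that last step, here is a small sketch that builds the exiftool command line for setting tags. `exiftool_set_command` is a hypothetical helper, and the title and subject values are placeholders; the only exiftool syntax it relies on is the -Tag=value form.

```python
import shlex


def exiftool_set_command(pdf_path: str, **tags: str) -> list[str]:
    """Build an exiftool argv that writes each tag as -Tag=value."""
    args = ["exiftool"]
    for tag, value in tags.items():
        args.append(f"-{tag}={value}")
    args.append(pdf_path)
    return args


# Placeholder values, for illustration only.
command = exiftool_set_command(
    "path/to/out.pdf", Title="Annual report", Subject="Finance"
)
print(shlex.join(command))  # shell-quoted form, suitable for copy and paste
```

Passing the list to `subprocess.run(command, check=True)` would execute it without going through a shell, so values containing spaces need no extra quoting.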
Tools (recommended to install with Homebrew):
Run make venv to install dependencies. Then change the path and execute with ./pdf-exif.py path/to/file.pdf.