Last active
June 23, 2024 17:25
Find metadata of images embedded in PDFs
venv/
__pycache__/
.mypy_cache/
export PIP_DISABLE_PIP_VERSION_CHECK=1

venv: requirements.txt requirements_dev.txt
	@python3 -m venv $@
	@. $@/bin/activate && pip install -r $< -r requirements_dev.txt
	@echo "enter virtual environment: source $@/bin/activate"
#!/usr/bin/env python3

import sys
import hashlib
from io import BytesIO
from pathlib import Path

import pymupdf  # type: ignore
from PIL import Image, UnidentifiedImageError


def main() -> None:
    """Entry point"""
    document_path = Path(sys.argv[1]).expanduser()
    # pymupdf opens the file itself, so pass the path rather than a text-mode handle
    document = pymupdf.open(document_path)

    # Document-level XMP metadata
    print(document.get_xml_metadata())

    # Digests of image streams already reported, to skip duplicates
    hashes: set[bytes] = set()
    for xref in range(1, document.xref_length()):
        if stream := document.xref_stream(xref):
            if document.xref_get_key(xref, "Subtype") == ("name", "/Image"):
                checksum = hashlib.md5(stream)
                if checksum.digest() not in hashes:
                    hashes.add(checksum.digest())
                    print(xref, " \t", checksum.hexdigest())
                    try:
                        image = Image.open(BytesIO(stream))
                        if exif := image.getexif():
                            print(exif)
                    except UnidentifiedImageError:
                        print("Unidentified image type")


if __name__ == "__main__":
    main()
# requirements.txt
pymupdf
pillow
# requirements_dev.txt
pylint
mypy
To see metadata, run

    pdfinfo -meta path/to/file.pdf

or

    exiftool path/to/file.pdf

To strip metadata (this only works if the embedded objects do not contain metadata themselves!), rewrite the file to path/to/out.pdf with the tools listed below and test the result with

    pdfinfo -meta path/to/out.pdf

You can also open the output in a binary-safe viewer (vim -b or xxd), search for metadata tags and ensure there are no matches.

You can then add metadata such as Title, Subject and so on using exiftool.

Tools (recommended to install with Homebrew):
- exiftool (brew: exiftool),
- mutool (brew: mupdf-tools),
- qpdf (brew: qpdf).
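The manual check described above (opening the file with vim -b or xxd and searching for metadata tags) can be roughly automated. The following is a minimal sketch using only the standard library. The function name `find_markers` and the `MARKERS` list are assumptions made for illustration, not part of the gist, and the list is not exhaustive. It only sees bytes that are stored uncompressed, so metadata inside compressed object streams will not show up and a clean result is not proof that the file is clean.

```python
# Hypothetical marker list: byte patterns that commonly accompany metadata.
# Extend or trim it for the tags you care about.
MARKERS: tuple[bytes, ...] = (
    b"<x:xmpmeta",     # start of an XMP packet
    b"/Author",        # Info dictionary keys
    b"/Creator",
    b"/Producer",
    b"/CreationDate",
    b"/ModDate",
    b"Exif\x00\x00",   # Exif header inside embedded JPEG data
)


def find_markers(
    data: bytes, markers: tuple[bytes, ...] = MARKERS
) -> dict[bytes, list[int]]:
    """Map each marker that occurs in data to the byte offsets where it starts."""
    found: dict[bytes, list[int]] = {}
    for marker in markers:
        offsets: list[int] = []
        start = data.find(marker)
        while start != -1:
            offsets.append(start)
            start = data.find(marker, start + 1)
        if offsets:
            found[marker] = offsets
    return found


# Usage, with a placeholder path:
#   from pathlib import Path
#   hits = find_markers(Path("path/to/out.pdf").read_bytes())
#   An empty dict means none of the listed markers appear in the raw bytes.
```

The offsets make it easy to jump to the same location in xxd and inspect what is actually there.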