@hubgit
Last active December 16, 2023 14:09
Remove metadata from a PDF file, using exiftool and qpdf. Note that embedded objects may still contain metadata.

Anonymising PDFs

PDF metadata

Metadata in PDF files can be stored in at least two places:

  • the Info Dictionary, a limited set of key/value pairs
  • XMP packets, which contain RDF statements expressed as XML
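As a rough sketch, both stores can be listed side by side with exiftool, assuming it is installed and on the PATH. The -a, -G1 and -s options (keep duplicate tags, show the group of each tag, use short tag names) are standard exiftool options; the function name list_pdf_metadata is only illustrative. In the output, Info Dictionary entries should appear under the PDF group and XMP properties under the XMP-* groups.

```shell
#!/bin/bash
# Illustrative helper (hypothetical name): list every tag exiftool can find,
# prefixed with the group it came from, so Info Dictionary values and XMP
# values can be compared.
list_pdf_metadata() {
  # -a  : keep duplicate tags (the same value is often stored in both places)
  # -G1 : print the specific group name in front of each tag
  # -s  : print short tag names instead of descriptions
  exiftool -a -G1 -s "$1"
}

# Usage: list_pdf_metadata example.pdf
```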

PDF files

A PDF file contains a) objects and b) pointers to those objects.

When information is added to a PDF file, it is appended to the end of the file and a pointer is added.

When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.

To remove previously-deleted data, the PDF file must be rebuilt.
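One way to get a hint that a file has been updated in place is to count its %%EOF markers, since each appended revision normally ends with its own marker. The sketch below is only a heuristic: the function name count_pdf_revisions is made up for illustration, and a linearized file legitimately contains two markers even when it holds no old data.

```shell
#!/bin/bash
# Heuristic sketch (hypothetical helper name): count the %%EOF markers in a PDF.
# A count above 1 suggests that earlier revisions may still be in the file.
count_pdf_revisions() {
  # -a treats the binary file as text; -c prints the number of matching lines
  grep -a -c '%%EOF' "$1"
}

# Usage: count_pdf_revisions example.pdf
```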

pdftk

pdftk can be used to update the Info Dictionary of a PDF file. See pdftk-unset-info-dictionary-values.php below for an example. As noted in the pdftk documentation, though, pdftk does not alter XMP metadata.

exiftool

exiftool can be used to read/write XMP metadata from/to PDF files.

  • exiftool -all:all => read all the tags.
  • exiftool -all:all= => remove all the tags.

exiftool -all:all= also removes the pointer to the Info Dictionary, but the dictionary itself remains in the file until the PDF is rebuilt.
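The exiftool documentation notes that its PDF edits are written as an incremental update and are therefore reversible. The sketch below shows how the "removed" metadata could be brought back, which is why a rebuild step is still needed afterwards. It assumes exiftool is on the PATH; the function name restore_pdf_metadata is hypothetical.

```shell
#!/bin/bash
# Sketch (hypothetical helper name): undo exiftool's own edits to a PDF.
# Deleting the PDF-update group discards the incremental update that exiftool
# appended, so the original metadata becomes visible again.
restore_pdf_metadata() {
  exiftool -PDF-update:all= "$1"
}

# Usage: restore_pdf_metadata example.pdf
```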

qpdf

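qpdf can be used to linearize PDF files (qpdf --linearize $FILE $OUTPUT), which optimises them for fast web loading and removes any orphan data. Note that qpdf writes the result to a separate output file ($OUTPUT here) rather than modifying $FILE in place.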
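A minimal sketch of rebuilding a file "in place" is to let qpdf write to a temporary file and swap it in only on success. The function name rebuild_pdf is made up for illustration, and the sketch treats any non-zero qpdf exit status (including warnings) as a failure, which is a deliberately cautious assumption.

```shell
#!/bin/bash
# Sketch (hypothetical helper name): rebuild a PDF with qpdf and replace the
# original only if qpdf reports success.
rebuild_pdf() {
  local src="$1" tmp
  # create the temporary file next to the original so mv stays on one filesystem
  tmp="$(mktemp "${src}.XXXXXX")" || return 1
  if qpdf --linearize "$src" "$tmp"; then
    mv "$tmp" "$src"
  else
    # leave the original untouched and clean up the partial output
    rm -f "$tmp"
    return 1
  fi
}

# Usage: rebuild_pdf example.pdf
```

Recent qpdf releases also offer a --replace-input option for the same purpose; the temporary-file approach is used here because it does not depend on the installed version.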

Embedded objects

After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of embedded objects, use exiftool -extractEmbedded -all:all $FILE.

<?php
$file = 'example.pdf';

// get the current metadata
$command = sprintf('pdftk %s dump_data', escapeshellarg($file));
$output = array();
$return = null;
exec($command, $output, $return);
//print_r($output);
if ($return) {
    throw new Exception('There was an error reading metadata from the PDF file');
}

// set any metadata values to null
foreach ($output as $index => $line) {
    if (strpos($line, 'InfoValue:') === 0) {
        $output[$index] = 'InfoValue:';
    }
}

// write the updated metadata to a file
$metadataFile = tempnam(sys_get_temp_dir(), 'pdf-meta-');
file_put_contents($metadataFile, implode("\n", $output));

// create a new PDF using the updated metadata
$tmpFile = tempnam(sys_get_temp_dir(), 'pdf-tmp-');
$command = sprintf(
    'pdftk %s update_info %s output %s',
    escapeshellarg($file),
    escapeshellarg($metadataFile),
    escapeshellarg($tmpFile)
);
$output = array();
$return = null;
exec($command, $output, $return);
if ($return) {
    throw new Exception('There was an error writing metadata to the PDF file');
}

// replace the original PDF with the new one and clean up the temporary file
rename($tmpFile, $file);
unlink($metadataFile);

#!/bin/bash
FILE=example.pdf
# read tags from the original PDF
#exiftool -all:all "$FILE"
# remove tags (XMP + metadata) from the PDF
# (exiftool keeps a backup with the old metadata as "${FILE}_original"; delete it afterwards)
exiftool -all:all= "$FILE"
# linearize the file to remove orphan data (qpdf needs a separate output file)
qpdf --linearize "$FILE" "$FILE.tmp" && mv "$FILE.tmp" "$FILE"
# read XMP from the modified PDF
#exiftool -all:all "$FILE"
# read all strings from the modified PDF
#strings "$FILE" > "$FILE.txt"
# read XMP from embedded objects in the modified PDF
#exiftool -extractEmbedded -all:all "$FILE"

@TiffanyNerd

Hello,

I’ve just discovered cpdf when I stumbled upon your discussion here!

Thank you @verlanmar Cpdf is absolutely amazing!!!

To make all the modifications I need to PDF files, I usually use Infix Pro, Acrobat X Pro, BeCyPDFMetaEdit, qpdf, ExifTool, pdftk, and probably something else I cannot recall!

None of the above can modify the original File ID, and I’ve just discovered that cpdf can do this along with many other interesting things, so this is very exciting!

But I’ve encountered a strange issue with one of my modified PDF files. I had used Infix Pro to modify some text in the PDF file, and that works great, except that Infix Pro leaves a lot of traces. If I open my PDF file in Notepad, I can see all the object streams, one after the other, documenting all the Infix changes:

0 obj
<<
/AcroForm 3 0 R
/Infix <<
/Changes [ 4 0 R 5 0 R 6 0 R 7 0 R … etc

This is soon followed by an endless list of object streams that record the date/time stamp of each modification and the user's name (my name), for example:

0 obj
<<
/ModDate (D:20181110085910)
/Pages (1)
/User (my name)

endobj4

My only solution to "sanitizing" and thus removing this information is to open my modified PDF file in Adobe Acrobat Reader and then simply Print as Adobe PDF. This creates a new PDF file that inherits zero object streams from my modified PDF, and also comes with a new File ID (DocumentID and InstanceID identical). The downside to this “Print as Adobe PDF” method is that sometimes the rendered quality is not good enough, even if I set all the possible printing quality options to the best possible, with no image compressions etc.

I think that I’ve tried all possible solutions through cpdf, but I’m unable to permanently remove the object streams that had been injected by Infix. I've tried many commands described in the cpdf manual, such as garbage collection, not preserving object streams, creating and not preserving object streams, removing metadata, copying File ID, creating new PDF through cpdf then merging with my modified PDF...

At one stage, I thought that some manipulation had worked, because I opened the cpdf output file in Notepad and all I could see was what looked like Chinese script: total gibberish, but at least it was unreadable! However, I then opened this output PDF file in BeCyPDFMetaEdit, entered all the metadata I needed there, such as Author, Creation Date, etc., and saved it. When I opened it again in Notepad, all the Infix object streams had resurfaced, and the Chinese script was totally gone!

Does anyone have an explanation for this, or a solution? I would like to continue using BeCyPDFMetaEdit as the very last step of the modification process, as it’s much faster to type all the metadata modifications into the little GUI (so more user-friendly). And even if I don't use the BeCy GUI, I would still like to be reassured that the object streams are gone for good and cannot be recovered as easily as by running the file through BeCy.

Thanks very much for your help!

@Moon1moon

Hi, do you know how good this tool is for removing metadata?
https://github.com/szTheory/exifcleaner

@Korb

Korb commented Feb 10, 2023

pdftk does not alter XMP metadata.

exiftool (...) does not completely remove it

qpdf (...) removes any orphan data

So is the author of README.md saying that none of the three tools can remove all metadata from a PDF document? Or that only using them together can do it? Or something else?

@dpanic

dpanic commented May 10, 2023

These methods don't seem to remove EXIF data from images embedded within a PDF. For example, the adobe photoshop editing history in a JPEG.

@naught101 Can you please provide such a PDF file as an example? I want to implement that.

@jonluca

jonluca commented Jun 27, 2023

@dpanic what happened to apdf

@dpanic

dpanic commented Jun 27, 2023

@dpanic what happened to apdf

Had to take it down because it is part of a commercial project I am building ... the NDA won't allow me to do that, sorry

@thieu1995

If you have a problem submitting a PDF to arXiv, you don't need to do all of that hard work.
I just found a way to do it (worked on 27/07/2023) using Foxit Reader (free version) on Windows: open your PDF file, press Ctrl+P to print, select the printer named "Microsoft Print to PDF", choose the path for the new PDF file, and upload that file to arXiv.