@rbrito
Forked from hubgit/README.md
Last active November 2, 2015 16:18
Remove metadata from a PDF file, using exiftool and qpdf. Note that embedded objects may still contain metadata.

Anonymising PDFs

PDF metadata

Metadata in PDF files can be stored in at least two places:

  • the Info Dictionary, a limited set of key/value pairs
  • XMP packets, which contain RDF statements expressed as XML
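As a rough first check for both kinds of metadata, the raw bytes of a file can be searched for an /Info reference and for the header of an XMP packet. This is only a sketch: has_info_reference and has_xmp_packet are hypothetical helper names, a grep that supports -a (as GNU and BSD grep do) is assumed, and anything stored inside compressed streams will not be found this way.

```shell
#!/bin/sh

# Succeed if FILE contains an uncompressed reference to an Info Dictionary.
# (has_info_reference is a hypothetical helper, not part of any tool.)
has_info_reference() {
    # -a: treat binary data as text; -q: quiet; -F: match a fixed string
    grep -a -q -F '/Info' -- "$1"
}

# Succeed if FILE contains the header of an uncompressed XMP packet.
# (has_xmp_packet is a hypothetical helper, not part of any tool.)
has_xmp_packet() {
    grep -a -q -F '<?xpacket begin' -- "$1"
}
```

For example, has_xmp_packet example.pdf && echo "XMP packet found". A negative result does not prove the metadata is absent, only that it is not visible as plain bytes.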

PDF files

A PDF file contains a) objects and b) pointers to those objects.

When information is added to a PDF file, it is appended to the end of the file and a pointer is added.

When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.

To remove previously-deleted data, the PDF file must be rebuilt.
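One rough way to see whether a file has been updated by appending is to count its end-of-file markers, since each appended revision normally ends with its own %%EOF line. The sketch below assumes a grep that supports -a (as GNU and BSD grep do); count_pdf_revisions is a hypothetical helper name, and the count is only a heuristic (a linearized file, for instance, usually contains two markers even with no appended updates).

```shell
#!/bin/sh

# count_pdf_revisions FILE
# Print the number of lines in FILE that contain a %%EOF marker.
# (count_pdf_revisions is a hypothetical helper, not part of any tool.)
count_pdf_revisions() {
    # -a: treat binary data as text; -c: count matching lines
    grep -a -c -F '%%EOF' -- "$1"
}
```

Running count_pdf_revisions example.pdf before and after editing the metadata gives a quick indication of whether the edit was appended to the file rather than replacing the old data.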

pdftk

pdftk can be used to update the Info Dictionary of a PDF file. See pdftk-unset-info-dictionary-values.php below for an example. As noted in the pdftk documentation, though, pdftk does not alter XMP metadata.

exiftool

exiftool can be used to read/write XMP metadata from/to PDF files.

  • exiftool -all:all => read all the tags.
  • exiftool -all:all= => remove all the tags.

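exiftool -all:all= also removes the pointer to the Info Dictionary, but does not remove the dictionary itself: the change is appended to the file as an update, so the old values remain in the file until it is rebuilt.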

qpdf

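qpdf can be used to linearize PDF files (qpdf --linearize $FILE $OUTPUT, where $OUTPUT is the name of the new file, since qpdf writes its result to a separate output file), which optimises them for fast web loading and removes any orphan data.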

Embedded objects

After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of embedded objects, use exiftool -extractEmbedded -all:all $FILE.

<?php

$file = 'example.pdf';

// get the current metadata
$command = sprintf('pdftk %s dump_data', escapeshellarg($file));
$output = array();
$return = null;
exec($command, $output, $return);
//print_r($output);

if ($return) {
    throw new Exception('There was an error reading metadata from the PDF file');
}

// set any metadata values to null
foreach ($output as $index => $line) {
    if (strpos($line, 'InfoValue:') === 0) {
        $output[$index] = 'InfoValue:';
    }
}

// write the updated metadata to a file
$metadataFile = tempnam(sys_get_temp_dir(), 'pdf-meta-');
file_put_contents($metadataFile, implode("\n", $output));

// create a new PDF using the updated metadata
$tmpFile = tempnam(sys_get_temp_dir(), 'pdf-tmp-');
$command = sprintf(
    'pdftk %s update_info %s output %s',
    escapeshellarg($file),
    escapeshellarg($metadataFile),
    escapeshellarg($tmpFile)
);
$output = array();
$return = null;
exec($command, $output, $return);

if ($return) {
    throw new Exception('There was an error writing metadata to the PDF file');
}

// clean up the temporary files
rename($tmpFile, $file);
unlink($metadataFile);

#!/bin/bash

FILE=example.pdf

# read tags from the original PDF
#exiftool -all:all "$FILE"

# remove tags (XMP + metadata) from the PDF
# (by default exiftool keeps a copy of the original as "${FILE}_original")
exiftool -all:all= "$FILE"

# linearize the file to remove orphan data
# (qpdf needs a separate output file, so write to a temporary file and then replace the input)
qpdf --linearize "$FILE" "$FILE.tmp" && mv "$FILE.tmp" "$FILE"

# read XMP from the modified PDF
#exiftool -all:all "$FILE"

# read all strings from the modified PDF
#strings "$FILE" > "$FILE.txt"

# read XMP from embedded objects in the modified PDF
#exiftool -extractEmbedded -all:all "$FILE"