Metadata in PDF files can be stored in at least two places:
- the Info Dictionary, a limited set of key/value pairs
- XMP packets, which contain RDF statements expressed as XML
A PDF file contains a) objects and b) pointers to those objects.
When information is added to a PDF file, it is appended to the end of the file and a pointer is added.
When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.
To remove previously-deleted data, the PDF file must be rebuilt.
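A rough way to check whether a file carries appended revisions is to count its end-of-file markers, since each incremental save normally appends a new cross-reference section that ends in `%%EOF`. The sketch below is only a heuristic: the function names are made up for illustration, and a linearized file can contain more than one marker even when nothing was ever appended.

```python
from pathlib import Path


def count_eof_markers(data: bytes) -> int:
    """Count the '%%EOF' markers in raw PDF bytes.

    More than one marker hints that the file was saved incrementally
    (or is linearized), so older data may still be present in it.
    """
    return data.count(b"%%EOF")


def count_eof_markers_in_file(path: str) -> int:
    """Convenience wrapper (hypothetical name) that reads the file first."""
    return count_eof_markers(Path(path).read_bytes())
```

A count above one does not prove that deleted data is still there; it only suggests that rebuilding the file is worth doing.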
pdftk can be used to update the Info Dictionary of a PDF file. See pdftk-unset-info-dictionary-values.php
below for an example. As noted in the pdftk documentation, though, pdftk does not alter XMP metadata.
exiftool can be used to read/write XMP metadata from/to PDF files.
exiftool -all:all => read all the tags.
exiftool -all:all= => remove all the tags.
exiftool -all:all= also removes the pointer to the Info Dictionary, but does not completely remove the dictionary itself.
qpdf can be used to linearize PDF files (qpdf --linearize $FILE), which optimises them for fast web loading and removes any orphan data.
After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of embedded objects, use exiftool -extractEmbedded -all:all $FILE.
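The exiftool and qpdf steps described above could be chained roughly as follows. This is a sketch under assumptions, not a tested recipe: the function names are hypothetical, the `-overwrite_original` option is an addition (without it exiftool leaves a backup copy of the input next to it), and qpdf is given a separate output file.

```python
import subprocess
from typing import List


def build_sanitize_commands(src: str, dst: str) -> List[List[str]]:
    """Return the command lines for a strip-then-rebuild pass (hypothetical helper)."""
    return [
        # 1. Remove all tags; -overwrite_original is an assumption that
        #    avoids leaving a "<src>_original" backup behind.
        ["exiftool", "-all:all=", "-overwrite_original", src],
        # 2. Rebuild the file so that orphaned data is dropped.
        ["qpdf", "--linearize", src, dst],
        # 3. List whatever tags remain, including those of embedded objects.
        ["exiftool", "-extractEmbedded", "-all:all", dst],
    ]


def sanitize(src: str, dst: str) -> None:
    """Run each step in order and stop at the first failure."""
    for cmd in build_sanitize_commands(src, dst):
        subprocess.run(cmd, check=True)
```

Calling `sanitize("in.pdf", "out.pdf")` modifies `in.pdf` in place during the first step, so it is safer to run it on a copy. The output of the last step still needs to be read by a person to judge whether anything sensitive remains.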
bluesceada:
Unfortunately, exiftool was never a truly sanitizing approach because of a documented limitation: http://www.sno.phy.queensu.ca/%7Ephil/exiftool/ - “Writer Limitations: PDF - The original metadata is never actually removed.”
But
qpdf --pages myfile.pdf 1-z -- --empty clean-myfile.pdf
/* creates a new (empty) PDF document from scratch and adds all pages (1-z) of the original PDF file to it */
does the trick as far as the top-level (i.e. file-level) metadata is concerned. It does not clean the metadata of embedded objects.
(Remark 1: Instead of the qpdf command above, it is also possible to use
pdftk myfile.pdf cat 1-end output clean-myfile.pdf
Remark 2: On MS Windows, you can use BeCyPDFMetaEdit to obtain the same result, but for PDF versions above 1.6 the result is not guaranteed.)
bertalanimre:
This could perhaps be done by filtering the PDF file through a stream editor (sed, tr?) capable of deleting everything between (and including) the "<x:xmpmeta" and "</x:xmpmeta>" strings. I have never needed this, though, so I have never tried it.
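A minimal sketch of that filtering idea, with one change: instead of deleting the bytes, it overwrites them with spaces so that the file length and the byte offsets of all other objects stay the same. It assumes the XMP packet is stored uncompressed and unencrypted, which is not guaranteed, and it has not been checked against real-world files. The function name is hypothetical.

```python
import re

# Matches from the opening "<x:xmpmeta" through the closing "</x:xmpmeta>",
# across line breaks, as few bytes as possible per match.
XMP_PACKET = re.compile(rb"<x:xmpmeta.*?</x:xmpmeta>", re.DOTALL)


def blank_xmp_packets(data: bytes) -> bytes:
    """Overwrite every XMP packet with spaces, keeping the byte length unchanged."""
    return XMP_PACKET.sub(lambda m: b" " * len(m.group(0)), data)
```

Blanking rather than deleting avoids invalidating the cross-reference offsets and the stream lengths. Some readers may still complain about a metadata stream that no longer contains valid XML, so rebuilding the file with qpdf or pdftk remains the more robust route.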