Metadata in PDF files can be stored in at least two places:
- the Info Dictionary, a limited set of key/value pairs
- XMP packets, which contain RDF statements expressed as XML
A PDF file contains a) objects and b) pointers to those objects (the cross-reference table).
When information is added to a PDF file, it is appended to the end of the file and a pointer is added.
When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.
To remove previously-deleted data, the PDF file must be rebuilt.
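The append-only behaviour described above can be checked roughly by scanning the raw bytes of a file. The following is a minimal sketch, not a real PDF parser: the function names `summarize_pdf_bytes` and `find_literal` are hypothetical, the sample bytes are a toy fragment rather than a valid PDF, and anything stored inside compressed streams will not be visible to it.

```python
import re
from collections import Counter


def summarize_pdf_bytes(data: bytes) -> dict:
    """Return rough, text-level hints about leftover data in a PDF."""
    # Each save normally ends with %%EOF, so several markers hint at
    # appended updates (only a hint: other layouts can repeat it too).
    eof_markers = data.count(b"%%EOF")
    # Uncompressed object headers look like "12 0 obj".
    numbers = [int(n) for n, _gen in re.findall(rb"(\d+)\s+(\d+)\s+obj\b", data)]
    counts = Counter(numbers)
    # The same object number appearing twice suggests an older copy remains.
    repeated = sorted(n for n, c in counts.items() if c > 1)
    return {
        "eof_markers": eof_markers,
        "object_headers": len(numbers),
        "repeated_object_numbers": repeated,
    }


def find_literal(data: bytes, needle: bytes) -> list:
    """Return the byte offsets at which needle occurs in the raw file."""
    return [m.start() for m in re.finditer(re.escape(needle), data)]


if __name__ == "__main__":
    # Toy bytes: an original save followed by an appended update that
    # replaces object 1 but leaves the old copy in place.
    original = b"%PDF-1.4\n1 0 obj\n<< /User (alice) >>\nendobj\n%%EOF\n"
    update = b"1 0 obj\n<< /User (redacted) >>\nendobj\n%%EOF\n"
    data = original + update
    print(summarize_pdf_bytes(data))
    print(find_literal(data, b"(alice)"))
```

If the old value still turns up in the raw bytes after an edit, the file has probably not been rebuilt yet.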
pdftk can be used to update the Info Dictionary of a PDF file. See pdftk-unset-info-dictionary-values.php below for an example.
As noted in the pdftk documentation, though, pdftk does not alter XMP metadata.
exiftool can be used to read/write XMP metadata from/to PDF files.
exiftool -all:all => read all the tags.
exiftool -all:all= => remove all the tags.
exiftool -all:all= also removes the pointer to the Info Dictionary, but does not remove the dictionary data itself.
qpdf can be used to linearize PDF files (qpdf --linearize $FILE $OUTPUT, where $OUTPUT is the new file to write), which optimises them for fast web loading and removes any orphan data.
After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of embedded objects, use exiftool -extractEmbedded -all:all $FILE.
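The exiftool and qpdf steps above could be chained from a script. This is a hedged sketch only: the helper names (`strip_metadata_cmd`, `linearize_cmd`, `read_embedded_cmd`, `clean_pdf`) are hypothetical, it assumes exiftool and qpdf are installed and on the PATH, and it uses only the options mentioned in the notes plus qpdf's usual input/output file arguments. By default exiftool edits the file in place and keeps a backup copy with an `_original` suffix.

```python
import subprocess


def strip_metadata_cmd(pdf):
    """exiftool call that removes all tags it can write (edits pdf in place)."""
    return ["exiftool", "-all:all=", str(pdf)]


def linearize_cmd(src, dst):
    """qpdf call that rewrites src into a new, linearized file dst."""
    return ["qpdf", "--linearize", str(src), str(dst)]


def read_embedded_cmd(pdf):
    """exiftool call that lists tags, including those of embedded objects."""
    return ["exiftool", "-extractEmbedded", "-all:all", str(pdf)]


def clean_pdf(src, dst, run=subprocess.run):
    """Strip metadata, rebuild the file, then list whatever tags remain.

    `run` is injectable so the sequence can be inspected without the tools.
    """
    commands = [
        strip_metadata_cmd(src),
        linearize_cmd(src, dst),
        read_embedded_cmd(dst),
    ]
    for cmd in commands:
        run(cmd, check=True)
    return commands


if __name__ == "__main__":
    # Dry run: print the commands instead of executing them.
    clean_pdf("input.pdf", "output.pdf", run=lambda cmd, **kw: print(" ".join(cmd)))
```

The last command is there as a check, since the notes above warn that new XMP metadata may appear after the qpdf step.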
Hello,
I’ve just discovered cpdf when I stumbled upon your discussion here!
Thank you, @verlanmar. cpdf is absolutely amazing!!!
In order to achieve all the modifications I need done to PDF files, I usually use Infix Pro, Acrobat X Pro, BeCyPDFMetaEdit, qpdf, Exif Tools, pdftk, and probably something else I cannot recall!
None of the above mentioned can modify the original File ID, and I’ve just discovered that cpdf can do this along with many other interesting things, and so this is very exciting!
But I’ve encountered a strange issue with one of my modified PDF files. I had used Infix Pro to modify some text in the PDF file, and that works great. Except that Infix Pro leaves a lot of traces. If I open my PDF file in Notepad, I can see all the object streams, one after the other, documenting all the Infix changes:
0 obj
<<
/AcroForm 3 0 R
/Infix <<
/Changes [ 4 0 R 5 0 R 6 0 R 7 0 R … etc
This is soon followed by an endless list of object streams that mention the date/time stamp of each modification and my name, that’s the user’s name, for example:
0 obj
<<
/ModDate (D:20181110085910)
/Pages (1)
/User (my name)
endobj4
My only solution to "sanitizing" and thus removing this information is to open my modified PDF file in Adobe Acrobat Reader and then simply Print as Adobe PDF. This creates a new PDF file that inherits zero object streams from my modified PDF, and also comes with a new File ID (DocumentID and InstanceID identical). The downside to this “Print as Adobe PDF” method is that sometimes the rendered quality is not good enough, even if I set all the possible printing quality options to the best possible, with no image compressions etc.
I think that I’ve tried all possible solutions through cpdf, but I’m unable to permanently remove the object streams that had been injected by Infix. I've tried many commands described in the cpdf manual, such as garbage collection, not preserving object streams, creating and not preserving object streams, removing metadata, copying File ID, creating new PDF through cpdf then merging with my modified PDF...
At one stage, I thought that some manipulation had worked, because I opened the cpdf output file in Notepad, and all I could see was some type of Chinese script; it was total gibberish, but at least it was totally unreadable! However, I then opened this output PDF file in BeCyPDFMetaEdit, entered all the metadata I needed there, such as Author, Creation Date, etc., and saved it. Then I opened it again in Notepad, and all the Infix object streams had resurfaced, and the Chinese script was totally gone!
Does anyone have an explanation for this, or a solution? I would like to continue using BeCyPDFMetaEdit as the very last step of the modification process, as it’s much faster to type all the metadata modifications into the little GUI (so more user-friendly). And even if I don't use the BeCy GUI, I would still like to be reassured that the object streams are gone for good and cannot be recovered as easily as by running the file through BeCy.
Thanks very much for your help!