@hubgit
Last active December 16, 2023 14:09
Remove metadata from a PDF file, using exiftool and qpdf. Note that embedded objects may still contain metadata.

Anonymising PDFs

PDF metadata

Metadata in PDF files can be stored in at least two places:

  • the Info Dictionary, a limited set of key/value pairs
  • XMP packets, which contain RDF statements expressed as XML

PDF files

A PDF file contains a) objects and b) pointers to those objects.

When information is added to a PDF file, it is appended to the end of the file and a pointer is added.

When information is removed from a PDF file, the pointer is removed, but the actual data may not be removed.

To remove previously-deleted data, the PDF file must be rebuilt.
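A rough way to see whether a file has had revisions appended is to count its end-of-file markers, since each incremental update normally ends with its own `%%EOF` line. The sketch below is only a heuristic and the function name is hypothetical: linearized files typically contain two markers as well, so a count above one is a hint, not proof.

```shell
# Hypothetical helper: count the %%EOF markers in a PDF.
# More than one marker suggests appended revisions (or a linearized file).
count_eof_markers() {
    # -a treats the binary file as text; -c prints the number of matching lines
    grep -a -c '%%EOF' -- "$1"
}
```

For example, `count_eof_markers example.pdf` prints a single number. Note that grep exits with status 1 when the count is zero.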

pdftk

pdftk can be used to update the Info Dictionary of a PDF file. See pdftk-unset-info-dictionary-values.php below for an example. As noted in the pdftk documentation, though, pdftk does not alter XMP metadata.

exiftool

exiftool can be used to read/write XMP metadata from/to PDF files.

  • exiftool -all:all => read all the tags.
  • exiftool -all:all= => remove all the tags.

exiftool -all:all= also removes the pointer to the Info Dictionary, but does not remove the dictionary itself from the file.

qpdf

qpdf can be used to linearize PDF files (qpdf --linearize $FILE $OUTPUT_FILE; qpdf requires a separate output file), which optimises them for fast web loading and removes any orphan data.
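Because qpdf writes its result to a separate file, replacing the original takes an extra step. A minimal sketch, assuming a hypothetical helper name and a temporary file name chosen here for illustration:

```shell
# Hypothetical helper: rebuild a PDF with qpdf and replace the original.
linearize_in_place() {
    local src=$1
    local tmp="${src}.linearized.tmp"   # temporary name is an assumption

    # qpdf needs distinct input and output names
    if qpdf --linearize "$src" "$tmp"; then
        mv -- "$tmp" "$src"
    else
        # qpdf also exits non-zero when it only reports warnings;
        # this sketch treats that as a failure and keeps the original
        rm -f -- "$tmp"
        return 1
    fi
}
```

Usage would be `linearize_in_place example.pdf`. Keeping the original on any non-zero exit is a cautious choice; a more lenient variant could accept the output when qpdf only reports warnings.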

Embedded objects

After running qpdf, there may be new XMP metadata, as it extracts metadata from any embedded objects. To read the XMP tags of embedded objects, use exiftool -extractEmbedded -all:all $FILE.

pdftk-unset-info-dictionary-values.php

<?php
$file = 'example.pdf';

// get the current metadata
$command = sprintf('pdftk %s dump_data', escapeshellarg($file));
$output = array();
$return = null;
exec($command, $output, $return);
//print_r($output);
if ($return) {
    throw new Exception('There was an error reading metadata from the PDF file');
}

// set any metadata values to null
foreach ($output as $index => $line) {
    if (strpos($line, 'InfoValue:') === 0) {
        $output[$index] = 'InfoValue:';
    }
}

// write the updated metadata to a file
$metadataFile = tempnam(sys_get_temp_dir(), 'pdf-meta-');
file_put_contents($metadataFile, implode("\n", $output));

// create a new PDF using the updated metadata
$tmpFile = tempnam(sys_get_temp_dir(), 'pdf-tmp-');
$command = sprintf('pdftk %s update_info %s output %s',
    escapeshellarg($file), escapeshellarg($metadataFile), escapeshellarg($tmpFile));
$output = array();
$return = null;
exec($command, $output, $return);
if ($return) {
    throw new Exception('There was an error writing metadata to the PDF file');
}

// clean up the temporary files
rename($tmpFile, $file);
unlink($metadataFile);

#!/bin/bash
FILE=example.pdf

# read tags from the original PDF
#exiftool -all:all "$FILE"

# remove tags (XMP + metadata) from the PDF
# (exiftool keeps a copy of the original as "${FILE}_original")
exiftool -all:all= "$FILE"

# linearize the file to remove orphan data
# (qpdf needs a separate output file, so write to a temporary name and move it back)
qpdf --linearize "$FILE" "$FILE.tmp" && mv "$FILE.tmp" "$FILE"

# read XMP from the modified PDF
#exiftool -all:all "$FILE"

# read all strings from the modified PDF
#strings "$FILE" > "$FILE.txt"

# read XMP from embedded objects in the modified PDF
#exiftool -extractEmbedded -all:all "$FILE"

@Nichtraucher

Could you possibly add functionality that makes it possible to a) remove metadata for files in a directory (and its subdirectories), and b) make it a Nautilus script (in order to edit metadata in selected files/directories)? That would make it a lot easier to use!
cheers!
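The directory part of this request can be sketched with find. The function name below is hypothetical, and the cleaning command is passed in by the caller; because find runs it directly, it must be an executable (a script or a tool), not a shell function.

```shell
# Hypothetical helper: run a command once for every PDF under a directory,
# including subdirectories. The command receives the file path as its
# last argument.
for_each_pdf() {
    local dir=$1
    shift
    # -iname matches .pdf and .PDF; {} is replaced by each file path,
    # so names containing spaces are passed through intact
    find "$dir" -type f -iname '*.pdf' -exec "$@" {} \;
}
```

A call such as `for_each_pdf ./papers ./clean-pdf.sh` would run an assumed script `clean-pdf.sh` on each file. If that script writes its output into the same tree, the new files may be picked up as well, so writing results elsewhere is safer.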

@m3nu

m3nu commented Aug 28, 2015

This is short enough to make it a shell function.

clean_pdf() {
 pdftk $1 dump_data | \
  sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
  pdftk $1 update_info - output clean-$1

 exiftool -all:all= clean-$1
 exiftool -all:all clean-$1
 exiftool -extractEmbedded -all:all clean-$1
 qpdf --linearize clean-$1 clean2-$1

 pdftk clean2-$1 dump_data
 exiftool clean2-$1
 pdfinfo -meta clean2-$1
}

via http://blog.snapdragon.cc/2015/08/28/shell-function-to-remove-all-metadata-from-pdf/

@naught101

These methods don't seem to remove EXIF data from images embedded within a PDF, for example the Adobe Photoshop editing history in a JPEG.

@Telekor

Telekor commented Sep 29, 2015

I have modified the script by manuelRiel: now the word "clean" is added at the end of the file name (without extension). The tricky line is this:

FILE="${FILE%%.*}"
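For comparison, here is a small sketch of how `%%.*` and `%.*` differ. The two function names are made up for illustration; the expansions themselves are standard shell parameter expansion.

```shell
# %%.* removes the longest suffix starting at a dot (everything after the FIRST dot)
strip_all_suffixes() { printf '%s\n' "${1%%.*}"; }

# %.* removes the shortest suffix starting at a dot (only the LAST extension)
strip_last_suffix() { printf '%s\n' "${1%.*}"; }

strip_all_suffixes "report.v2.pdf"      # prints: report
strip_last_suffix "report.v2.pdf"       # prints: report.v2
strip_all_suffixes "./docs/report.pdf"  # prints an empty line
strip_last_suffix "./docs/report.pdf"   # prints: ./docs/report
```

With `%%.*`, any dot earlier in the name or path, such as the leading `./`, cuts the result short, so `%.*` is usually the safer choice when only the extension should go.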

And this is the full script:

clean_pdf() {
    FILE=$1
    FILE="${FILE%%.*}"
    echo "#############"
    echo $1
    echo "#############"
    if [ -e $1 ]
        then
        pdftk $1 dump_data | \
        sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
        pdftk $1 update_info - output ${FILE}.clean.pdf
        exiftool -all:all= ${FILE}.clean.pdf
        exiftool -all:all ${FILE}.clean.pdf
        exiftool -extractEmbedded -all:all ${FILE}.clean.pdf
        qpdf --linearize ${FILE}.clean.pdf ${FILE}.clean2.pdf
        pdftk ${FILE}.clean2.pdf1 dump_data
        exiftool ${FILE}.clean2.pdf
        echo "#############"
        echo "Metadata of file "${FILE}.clean2.pdf
        pdfinfo -meta ${FILE}.clean2.pdf
        echo "#############"
    else
        echo "File not found!"

        fi
}

@rbretongmz

Sorry, there is a small mistake. This one works fully:

clean_pdf() {
    FILE=$1
    FILE="${FILE%%.*}"
    echo "#############"
    echo $1
    echo "#############"
    if [ -e $1 ]
        then
        pdftk $1 dump_data | \
        sed -e 's/\(InfoValue:\)\s.*/\1\ /g' | \
        pdftk $1 update_info - output ${FILE}.clean.pdf
        exiftool -all:all= ${FILE}.clean.pdf
        exiftool -all:all ${FILE}.clean.pdf
        exiftool -extractEmbedded -all:all ${FILE}.clean.pdf
        qpdf --linearize ${FILE}.clean.pdf ${FILE}.clean2.pdf
        pdftk ${FILE}.clean2.pdf dump_data
        exiftool ${FILE}.clean2.pdf
        echo "#############"
        echo "Metadata of file "${FILE}.clean2.pdf
        pdfinfo -meta ${FILE}.clean2.pdf
        echo "#############"
    else
        echo "File not found!"

        fi
}

@ande2101

The previous script doesn't work with files with spaces in the filename.
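A variant that quotes every expansion might look like the sketch below. It keeps the same tools and the same sed expression as the scripts above, but drops the inspection steps. The `-overwrite_original` option (which stops exiftool leaving a `_original` backup) and the use of `%.*` instead of `%%.*` are changes made here, not part of the earlier scripts.

```shell
# Sketch of a space-safe clean_pdf: every expansion is quoted.
clean_pdf() {
    local src=$1
    local base=${src%.*}              # strip only the final extension
    local step1="${base}.clean.pdf"
    local step2="${base}.clean2.pdf"

    if [ ! -e "$src" ]; then
        echo "File not found: $src" >&2
        return 1
    fi

    # blank the Info Dictionary values and write a new file
    pdftk "$src" dump_data \
        | sed -e 's/\(InfoValue:\)\s.*/\1\ /g' \
        | pdftk "$src" update_info - output "$step1" || return 1

    # remove the remaining tags, then rebuild the file
    exiftool -all:all= -overwrite_original "$step1" || return 1
    qpdf --linearize "$step1" "$step2"
}
```

Called as `clean_pdf "my file.pdf"`, this would produce `my file.clean.pdf` and `my file.clean2.pdf` next to the original.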

@Changaco

Other option: install pdf-redact-tools and run pdf-redact-tools -s $FILE

@danielneis

This pdf-redact-tools uses exiftool to remove some tags as you can see in https://github.com/firstlookmedia/pdf-redact-tools/blob/master/pdf-redact-tools#L115

@bluesceada

bluesceada commented Sep 24, 2016

Hi, I wonder whether exiftool is (or ever was) a valid approach.
If I run exiftool on my file, it warns me that tags are not really removed:

$ exiftool -all:all= myfile.pdf
Warning: [minor] ExifTool PDF edits are reversible. Deleted tags may be recovered! - myfile.pdf
    1 image files updated

My files also grow from 542.9 kB to 543.2 kB by exiftool and then from 543.2 kB to 544.6 kB by qpdf. So it seems there is actually more information added?

Let's see if pdf-redact-tools does anything more. However, I certainly do not want to follow one of its approaches: stacking PNG files and calling the result a PDF (that won't be searchable, has no vector graphics, and is probably larger in file size...)

//edit: OK, that is actually the only approach it supports, so it's not applicable for me (and shouldn't be for most people who don't want to give away very large or poor-quality PDFs)

@bertalanimre

I'm wondering if there is any enhancement for this script to remove the embedded metadata as well. For example, I have metadata about embedded Word documents. Unfortunately, after the cleaning some sensitive data remains, like the filename, DocumentID and InstanceID.
How can I delete this embedded metadata in the first place?

@9991212

9991212 commented Feb 18, 2017

bluesceada:
Unfortunately, exiftool was never a truly sanitizing approach because of a documented limitation: http://www.sno.phy.queensu.ca/%7Ephil/exiftool/ - “Writer Limitations: PDF - The original metadata is never actually removed.”
But
qpdf --pages myfile.pdf 1-z -- --empty clean-myfile.pdf
/* creates a new (empty) PDF document from scratch and adds all pages (1-z) from the original PDF file to it */
does the trick as far as the top-level (i.e. the file's own) metadata is concerned. It does not clean the metadata of embedded objects.
(Remark 1: It is also possible to use
pdftk myfile.pdf cat 1-end output clean-myfile.pdf
instead of the above.
Remark 2: On MS Windows, you can use BeCyPDFMetaEdit to obtain the same result, but for PDF versions above 1.6 the result is not guaranteed.)

bertalanimre:
It may perhaps be done by filtering the PDF file through an editor (sed, tr?) capable of deleting characters between (and including) "<x:xmpmeta" and "</x:xmpmeta>" strings. But I have never needed it so never tried it.
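Before deleting anything, it may help to look at what such a filter would match. The sketch below only prints the lines from `<x:xmpmeta` to `</x:xmpmeta>`. The function name is hypothetical, and it finds only packets stored uncompressed; XMP inside compressed streams will not show up.

```shell
# Hypothetical helper: print the lines between the XMP packet markers.
show_xmp_packets() {
    # LC_ALL=C makes sed treat the file as raw bytes rather than text
    # in the current locale; -n with the p command prints only the range
    LC_ALL=C sed -n '/<x:xmpmeta/,/<\/x:xmpmeta>/p' < "$1"
}
```

Actually removing those bytes shifts the offsets recorded in the file's cross-reference table, so a file edited this way would need to be rebuilt afterwards (for example with qpdf).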

@RootLUG

RootLUG commented Mar 23, 2017

Did you guys find some way to remove metadata from a file like this: https://publications.usa.gov/USAFileDnld.php?PubType=P&PubID=6099&httpGetPubID=0 ?

I also tried the approach suggested by @9991212, but there is still a lot of metadata left, like Creator, For, Create Date etc., which should be top-level PDF metadata.

@yeKcim

yeKcim commented Mar 13, 2018

@verlanmar

Coherent PDF Command Line Tools can remove metadata from PDF:

https://github.com/coherentgraphics/cpdf-binaries

cpdf -remove-metadata in.pdf -o out.pdf

@TiffanyNerd

Hello,

I’ve just discovered cpdf when I stumbled upon your discussion here!

Thank you @verlanmar Cpdf is absolutely amazing!!!

In order to achieve all the modifications I need done to PDF files, I usually use Infix Pro, Acrobat X Pro, BeCyPDFMetaEdit, qpdf, Exif Tools, pdftk, and probably something else I cannot recall!

None of the above mentioned can modify the original File ID, and I’ve just discovered that cpdf can do this along with many other interesting things, and so this is very exciting!

But I’ve encountered a strange issue with one of my modified PDF files. I had used Infix Pro to modify some text in the PDF file, and that works great. Except that Infix Pro leaves a lot of traces. If I open my PDF file in Notepad, I can see all the object streams, one after the other, documenting all the Infix changes:

0 obj
<<
/AcroForm 3 0 R
/Infix <<
/Changes [ 4 0 R 5 0 R 6 0 R 7 0 R … etc

This is soon followed by an endless list of object streams that mention the date/time stamp of each modification and my name, that’s the user’s name, for example:

0 obj
<<
/ModDate (D:20181110085910)
/Pages (1)
/User (my name)

endobj4

My only solution to "sanitizing" and thus removing this information is to open my modified PDF file in Adobe Acrobat Reader and then simply Print as Adobe PDF. This creates a new PDF file that inherits zero object streams from my modified PDF, and also comes with a new File ID (DocumentID and InstanceID identical). The downside to this “Print as Adobe PDF” method is that sometimes the rendered quality is not good enough, even if I set all the possible printing quality options to the best possible, with no image compressions etc.

I think that I’ve tried all possible solutions through cpdf, but I’m unable to permanently remove the object streams that had been injected by Infix. I've tried many commands described in the cpdf manual, such as garbage collection, not preserving object streams, creating and not preserving object streams, removing metadata, copying File ID, creating new PDF through cpdf then merging with my modified PDF...

At one stage, I thought that some manipulation had worked, because I opened the cpdf output file in Notepad, and all I could see is some type of Chinese script, it was total gibberish but at least it was totally unreadable! However, I then opened this output PDF file in BeCyPDFMetaEdit, entered all the meta data I needed on there, such as Author, Creation Date, etc, saved it. Then I opened it again in Notepad, and all the Infix object streams had resurfaced, and the Chinese script was totally gone!

Does anyone have an explanation for this, or a solution? I would like to continue using BeCyPDFMetaEdit as the very last step of the modification process, as it’s much faster to type all the metadata modifications into the little GUI (so more user-friendly). And even if I don't use the BeCy GUI, I would still like to be reassured that the object streams are gone for good and cannot be recovered as easily as by running the file through BeCy.

Thanks very much for your help!

@Moon1moon

Hi, do you know how good this tool is for removing metadata?
https://github.com/szTheory/exifcleaner

@Korb

Korb commented Feb 10, 2023

pdftk does not alter XMP metadata.

exiftool (...) does not completely remove it

qpdf (...) removes any orphan data

So is the author of README.md saying that none of the three tools can remove all the metadata in a PDF document? Or that only using them together can do it? Or something else?

@dpanic

dpanic commented May 10, 2023

These methods don't seem to remove EXIF data from images embedded within a PDF. For example, the adobe photoshop editing history in a JPEG.

@naught101 Can you please provide such a PDF file as an example? I want to implement that.

@jonluca

jonluca commented Jun 27, 2023

@dpanic what happened to apdf

@dpanic

dpanic commented Jun 27, 2023

@dpanic what happened to apdf

Had to take it down because it is part of a commercial project I am building ... the NDA won't allow me to do that, sorry

@thieu1995

If you have a problem submitting a PDF to arXiv, you don't need to do all of that hard work.
I just found a way to do it (worked on 27/07/2023) using Foxit Reader (free version) on Windows: open your PDF file, press Ctrl+P to print, select "Microsoft Print to PDF", choose where to save the new PDF file, and upload that file to arXiv.
