Drugoy/hand-modify-pdf.md

## hand-modify-pdf.md

      
### So you want to modify the text of a PDF by hand...

If you, like me, resent every dollar spent on commercial PDF tools,
you might want to know how to change the text content of a PDF without
having to pay for Adobe Acrobat or another PDF tool. I didn't see an
obvious open-source tool that lets you dig into PDF internals, but I
did discover a few useful facts about how PDFs are structured that
I think may prove useful to others (or myself) in the future. They
are recorded here. They are surely not universally applicable (the PDF
standard is truly Byzantine), but they worked for my case.
This guide is Mac-oriented, but the tools are all available in most
Linux distributions as well.

#### Viewing compressed text data

You can open a PDF in a text editor and see some stuff that looks kinda
readable, in a vague way, but find that none of it is the actual text
of the PDF. It turns out that many PDFs store the text data in a
compressed form. To view the data uncompressed, you can use a command line
tool called qpdf. For Macs, there's a Homebrew formula.
Here's a command that decompresses all compressed streams in a
given PDF (via this Stack Overflow post):
qpdf --qdf --object-streams=disable in.pdf out.pdf
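
To get a feel for what qpdf is undoing here, the following is a minimal Python sketch that inflates streams directly. It assumes the streams use the common FlateDecode (zlib) filter with no predictors or chained filters, and the helper name `inflate_streams` is made up for illustration. qpdf handles far more cases and remains the practical choice.

```python
import re
import zlib

# Everything between the "stream" and "endstream" keywords is the raw stream body.
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)


def inflate_streams(pdf_bytes: bytes) -> list[bytes]:
    """Return the inflated contents of every stream that zlib can decompress."""
    results = []
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates the end-of-line bytes that precede "endstream".
            results.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            # Assumption: anything zlib rejects is simply not Flate-compressed.
            continue
    return results


# Hypothetical usage:
# with open("in.pdf", "rb") as handle:
#     for body in inflate_streams(handle.read()):
#         print(body[:80])
```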

You can recompress the streams like so:
qpdf out-edited.pdf out-recompressed.pdf

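This second command generated some errors for me, but the resulting PDF
was readable using Preview. (Hand edits leave the stream lengths and
cross-reference offsets in the file out of date, which is a likely source
of such errors; qpdf ships a fix-qdf helper intended to repair them.)

#### Finding the text data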

Once you've decompressed the compressed text streams, you can open the
PDF in a text editor and view them! Except you have to find them. Here's
what they look like in a basic form:
BT
  /Font_0 12 Tf
  288 720 Td
  <002a004800570003003600480057> Tj
ET

The PDF Reference (Third Edition, p. 293) has this to say about the above:

> The five lines of this example perform the following steps:
>
> 1. Begin a text object.
> 2. Set the font and font size to use, installing them as parameters in the text state...
> 3. Specify a starting position on the page, setting parameters in the text object.
> 4. Paint the glyphs for a string of characters there.
> 5. End the text object.
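
Scrolling through a large file looking for these blocks gets old quickly, so here is a rough Python sketch that pulls them out of decompressed content. It assumes text is shown with the `Tj` operator and hex strings, as in the example above. Real PDFs also use `TJ` arrays and parenthesized literal strings, which this ignores. The name `find_text_objects` is hypothetical.

```python
import re

# A text object runs from a standalone BT token to a standalone ET token.
TEXT_OBJECT_RE = re.compile(r"(?<!\S)BT\s(.*?)\sET(?!\S)", re.DOTALL)
# "/Font_0 12 Tf" selects a font resource name and a size.
FONT_RE = re.compile(r"/(\S+)\s+[\d.]+\s+Tf")
# "<hex digits> Tj" paints the glyphs for one hex-encoded string.
HEX_SHOW_RE = re.compile(r"<([0-9A-Fa-f]+)>\s*Tj")


def find_text_objects(content: str) -> list[dict]:
    """List the font label and hex strings of every BT ... ET block."""
    found = []
    for match in TEXT_OBJECT_RE.finditer(content):
        body = match.group(1)
        font = FONT_RE.search(body)
        found.append({
            "font": font.group(1) if font else None,
            "strings": HEX_SHOW_RE.findall(body),
        })
    return found


sample = """BT
  /Font_0 12 Tf
  288 720 Td
  <002a004800570003003600480057> Tj
ET
"""
print(find_text_objects(sample))
# [{'font': 'Font_0', 'strings': ['002a004800570003003600480057']}]
```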


#### Actually reading the text

As you can see from the above example, we still can't read the text.
It is encoded. And if you thought to yourself "look at that hex string,
I bet it's a bunch of Unicode code points" -- well, I wish we lived in
a kinder world too. It seems there are a million ways to specify encodings
in PDFs, including custom encodings that are embedded in the file itself.
Those encodings do map to Unicode code points (most of the time?), so that's
good. Let's assume that the file you're working with does have embedded
encodings (because I have no idea how to handle other cases).
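
To see concretely why the "it's just Unicode" guess fails, here is a quick Python check that reads the hex string from the example above as big-endian UTF-16. The result is gibberish, which is the hint that a custom encoding is in play.

```python
hex_string = "002a004800570003003600480057"  # from the Tj example above

# Naive guess: treat every two bytes as one big-endian UTF-16 code unit.
naive = bytes.fromhex(hex_string).decode("utf-16-be")
print(repr(naive))  # '*HW\x036HW'
```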

#### Identifying fonts associated with embedded encodings

Text encodings in PDFs are linked to specific fonts. Information about those
encodings is embedded in the PDF in ways I don't understand, but there's an
existing command line tool that extracts it: pdffonts. Here's an example
of the output it generates:
$ pdffonts sample.pdf
name                                 type              emb sub uni prob object ID
------------------------------------ ----------------- --- --- --- ---- ---------
CLDQZB+TrebuchetMS,Bold              CID TrueType      yes yes yes           9  0
YQBAIZ+TrebuchetMS                   CID TrueType      yes yes yes          10  0

Here, the relevant fields are "emb" (meaning the font is embedded in
the PDF) and "uni" (meaning the PDF includes an explicit mapping from the
font's character codes to Unicode code points). Assuming both are set to
"yes," we're in luck.
In the text example above, you'll notice the /Font_0 descriptor. Not
all fonts in all PDFs will work this way, but in my case, those labels
lined up in a straightforward way with the listing of fonts above. (So
/Font_0 is referring to the font named CLDQZB+TrebuchetMS,Bold in the
above table.)
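
If you want to check those columns from a script, the sketch below parses the table in Python. It assumes the fixed-width layout shown above, with a dashed rule under the header. Other pdffonts versions may print different columns, and very long font names can push a row out of alignment. The function names are made up for illustration.

```python
import re


def parse_pdffonts(output: str) -> list[dict[str, str]]:
    """Split pdffonts' fixed-width table into one dict per font."""
    lines = output.splitlines()
    # The dashed line under the header marks where each column starts and ends.
    rule_index = next(i for i, line in enumerate(lines) if line.startswith("---"))
    spans = [m.span() for m in re.finditer(r"-+", lines[rule_index])]
    header = lines[rule_index - 1]
    names = [header[start:end].strip() for start, end in spans]
    fonts = []
    for line in lines[rule_index + 1:]:
        if line.strip():
            fonts.append({name: line[start:end].strip()
                          for name, (start, end) in zip(names, spans)})
    return fonts


def usable_fonts(fonts: list[dict[str, str]]) -> list[str]:
    """Names of fonts that are embedded and carry a Unicode mapping."""
    return [f["name"] for f in fonts
            if f.get("emb") == "yes" and f.get("uni") == "yes"]


# Hypothetical usage (requires pdffonts on your PATH):
# import subprocess
# text = subprocess.run(["pdffonts", "sample.pdf"],
#                       capture_output=True, text=True, check=True).stdout
# print(usable_fonts(parse_pdffonts(text)))
```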

#### Finding the embedded encoding table for the given font

Once you have determined the full name of your text's font (like
CLDQZB+TrebuchetMS,Bold) you can search for it. In my case it appeared
several times, but in one particular case, it appeared in a short
block of commands including one that looked like this:
/ToUnicode 19 0 R

This is an indirect reference: it points at object number 19 (generation 0),
which in this case holds the encoding table. If you then search for
19 0 obj, you'll find the table. (Or at least that's how it worked in my case!)
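
The same two-step search can be sketched in Python. This assumes a decompressed file in which the object that names the font also carries the /ToUnicode entry, as it did in my case. The function name `find_tounicode_table` is hypothetical, and the font name is matched as a plain substring, so treat the result as a starting point rather than a guarantee.

```python
import re
from typing import Optional

# "9 0 obj ... endobj" delimits one indirect object.
OBJECT_RE = re.compile(r"(\d+) (\d+) obj\b(.*?)\bendobj\b", re.DOTALL)
# "/ToUnicode 19 0 R" is a reference to object 19, generation 0.
TOUNICODE_RE = re.compile(r"/ToUnicode\s+(\d+)\s+(\d+)\s+R")


def find_tounicode_table(pdf_text: str, font_name: str) -> Optional[str]:
    """Return the body of the object that a font's /ToUnicode entry points at."""
    objects = {(int(num), int(gen)): body
               for num, gen, body in OBJECT_RE.findall(pdf_text)}
    for body in objects.values():
        ref = TOUNICODE_RE.search(body)
        if ref and f"/{font_name}" in body:
            return objects.get((int(ref.group(1)), int(ref.group(2))))
    return None


# Hypothetical usage; latin-1 keeps every byte of the file intact as text:
# with open("out.pdf", encoding="latin-1") as handle:
#     print(find_tounicode_table(handle.read(), "CLDQZB+TrebuchetMS,Bold"))
```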

#### The encoding table format

The salient part of the encoding table looks like this:
38 beginbfrange^M
<0036><0036><0053>^M
<0057><0057><0074>^M
<0044><0044><0061>^M
<0048><0048><0065>^M
<0050><0050><006D>^M
...

If yours looks different, check out the ToUnicode mapping file tutorial
which describes a bunch of possible variations. In this case, the table
maps ranges of custom code points to Unicode code points, except that each
range covers just one character. So here, the custom point 0036 maps to the
Unicode point 0053, that is, the capital letter S.
To perform this translation in an automated way, I used Python to convert the
table into a dictionary, and wrote some simple encoding and decoding functions.
This isn't a Python tutorial, sadly, but if you know Python or any other scripting
language, you can probably work out a few different ways to solve this part of
the problem.
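
For the curious, here is a minimal sketch of what such a script could look like. It is a reconstruction, not my original code. It assumes two-byte codes (four hex digits each) and the simple `<start><end><target>` row form shown above, with each row on its own line. The `^M` marks in the listing are carriage returns, which the pattern tolerates. The names `parse_bfrange`, `decode`, and `encode` are made up for illustration.

```python
import re

# One bfrange row: <first code><last code><Unicode value of the first code>.
BFRANGE_ROW = re.compile(
    r"<([0-9A-Fa-f]{4})>[ \t]*<([0-9A-Fa-f]{4})>[ \t]*<([0-9A-Fa-f]{4})>"
)


def parse_bfrange(table_text: str) -> dict[int, str]:
    """Map every custom code in the table to its Unicode character."""
    mapping = {}
    for first, last, target in BFRANGE_ROW.findall(table_text):
        start, end, base = int(first, 16), int(last, 16), int(target, 16)
        # A range of N codes maps onto N consecutive Unicode code points.
        for offset in range(end - start + 1):
            mapping[start + offset] = chr(base + offset)
    return mapping


def decode(hex_string: str, mapping: dict[int, str]) -> str:
    """Turn a hex string from a Tj operator into readable text."""
    codes = [int(hex_string[i:i + 4], 16) for i in range(0, len(hex_string), 4)]
    # Codes missing from the table come out as U+FFFD.
    return "".join(mapping.get(code, "\ufffd") for code in codes)


def encode(text: str, mapping: dict[int, str]) -> str:
    """Turn readable text into a hex string; raises KeyError for unmapped characters."""
    reverse = {char: code for code, char in mapping.items()}
    return "".join(f"{reverse[char]:04x}" for char in text)


# Only the five rows visible in the excerpt above.
table = "\r\n".join([
    "38 beginbfrange",
    "<0036><0036><0053>",
    "<0057><0057><0074>",
    "<0044><0044><0061>",
    "<0048><0048><0065>",
    "<0050><0050><006D>",
])
mapping = parse_bfrange(table)
print(decode("003600480057", mapping))  # Set
print(encode("Seat", mapping))          # 0036004800440057
```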
Equipped with my encoder and decoder, I determined the custom-encoded version of
the text I wanted to replace, wrote the replacement text and custom-encoded it,
and used find-and-replace to swap them out. The end!
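
The find-and-replace step itself can also be scripted. The sketch below assumes you already have a character-to-code dictionary built from the ToUnicode table, and that the text appears as a single hex string in the decompressed file. `replace_encoded_text` is a hypothetical name. Keep in mind that when pdffonts reports "sub" as "yes", the font only contains glyphs for characters the document already uses, so the replacement text is limited to characters present in the table.

```python
import re


def replace_encoded_text(pdf_text: str, old: str, new: str,
                         char_to_code: dict[str, int]) -> str:
    """Swap one custom-encoded string for another in decompressed PDF text."""
    def to_hex(text: str) -> str:
        # KeyError here means the font's table has no code for that character.
        return "".join(f"{char_to_code[ch]:04x}" for ch in text)

    # Hex digits may be written in either case, so match case-insensitively.
    pattern = re.compile(re.escape(f"<{to_hex(old)}>"), re.IGNORECASE)
    return pattern.sub(f"<{to_hex(new)}>", pdf_text)


# Hypothetical usage; latin-1 round-trips every byte of the file unchanged:
# with open("out.pdf", encoding="latin-1", newline="") as handle:
#     text = handle.read()
# text = replace_encoded_text(text, "Set", "Seat", char_to_code)
# with open("out-edited.pdf", "w", encoding="latin-1", newline="") as handle:
#     handle.write(text)
```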