@averagesecurityguy
Last active April 20, 2024 17:56
Decompress FlateDecode Objects in PDF
#!/usr/bin/env python3
# This script is designed to do one thing and one thing only. It will find each
# of the FlateDecode streams in a PDF document using a regular expression,
# unzip them, and print out the unzipped data. You can do the same in any
# programming language you choose.
#
# This is NOT a generic PDF decoder. If you need a generic PDF decoder, please
# take a look at pdf-parser by Didier Stevens, which is included in Kali Linux.
# https://tools.kali.org/forensics/pdf-parser.
#
# Any requests to decode a PDF will be ignored.
import re
import zlib

# Read the whole document as bytes; PDF streams are binary data.
with open("some_doc.pdf", "rb") as f:
    pdf = f.read()

# Capture everything between "stream" and "endstream" after a FlateDecode filter.
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    # Drop the end-of-line bytes that surround the stream data.
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Skip streams that zlib cannot decompress.
        pass
@averagesecurityguy
Author

Unfortunately no.

@Kungergely

@mikodham BTW pasting binary code in here like that is completely useless (the comment interface of GitHub is NOT binary safe!). It's impossible to reconstruct the original data stream that you had in your PDF file from what you've posted. This applies to @vipercommand as well BTW.
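
One way to share a stream without the comment box mangling it is to paste an ASCII encoding of the bytes instead of the bytes themselves. The following is a minimal sketch, assuming base64 is acceptable to whoever reads the comment; the helper names `stream_to_text` and `text_to_stream` are made up for illustration, and the sample stream is generated here rather than taken from a real PDF.

```python
import base64
import zlib


def stream_to_text(raw: bytes) -> str:
    """Encode raw stream bytes as ASCII-only base64 text that survives copy and paste."""
    return base64.b64encode(raw).decode("ascii")


def text_to_stream(text: str) -> bytes:
    """Recover the exact original bytes from the base64 text."""
    return base64.b64decode(text)


# Stand-in for a FlateDecode stream body cut out of a PDF (an assumption for the demo).
raw_stream = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")

pasted = stream_to_text(raw_stream)
# The round trip is lossless, unlike pasting the raw bytes as text.
assert text_to_stream(pasted) == raw_stream
print(pasted)
```

Anyone who copies the printed text can call `text_to_stream` on it and then `zlib.decompress` on the result.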

@LazyBoy95

I tried to use it and write the output as UTF-8 to a .txt file.
It turns out a couple of streams were empty or missing.
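
A stream can come out empty for two different reasons: zlib rejected it, or it inflated to nothing. The sketch below separates the two cases and decodes the rest as UTF-8 with replacement characters, since decompressed PDF streams are not guaranteed to be valid UTF-8. The function name `inflate_to_text` is hypothetical.

```python
import zlib
from typing import Optional


def inflate_to_text(raw: bytes) -> Optional[str]:
    """Return the inflated stream as text, or None if zlib rejects the data."""
    try:
        data = zlib.decompress(raw)
    except zlib.error:
        return None
    # Replace undecodable bytes instead of raising UnicodeDecodeError.
    return data.decode("utf-8", errors="replace")


samples = {
    "valid": zlib.compress(b"BT (Hello) Tj ET"),
    "inflates to nothing": zlib.compress(b""),
    "not zlib data": b"",
}

for label, raw in samples.items():
    text = inflate_to_text(raw)
    if text is None:
        print(label, "-> rejected by zlib")
    else:
        print(label, "->", repr(text))
```

Streams reported as rejected are the ones the original script silently skips in its `except` branch.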

@BYugandhar

My PDF contains an object with the following data:

20 0 obj
<</Filter/FlateDecode/Length 299>>stream
H‰\‘ËjÃ0�E÷úŠY&‹ ?ã�Œ!Í�¼èƒ:ý G�§‚Z�²²ðßW�…��àpï�I3üÐ��­�ð�;Š��ôJK‹Óx·�áŠ7¥Yœ€TÂ=ˆN1t†q�nçÉáÐè~dU�üÓ‹“³3¬ör¼âšñw+Ñ}ƒÕס]�oïÆüà€ÚA�u
�{èµ3oÝ€À)¶i¤×•›7>óç¸Ì�!!ŽÃcÄ(q2�@Ûé�²*ò«†êìWÍPË�zœ†Øµ�ß�%{êíQ”DõBqB”¥�Åñ‰h›“–eA;.”æ�Qº'g¾�Î] c ’¨Ø�:�zhg¢Ý�¨HˆÊ(P¸½ 5‹Œè%�”Ó·�ï>èç Ïµ¾q4,êØÒ+¥ñ9O3�ð©e³_�� xÃ�Í
endstream
endobj

I want to decode it. I tried the script above, but it didn't work for me. Can anyone suggest a way to decode it using Python or Java?

@Kungergely

Kungergely commented May 11, 2023

@BYugandhar My comment above still applies: the binary data you've posted here is completely useless. If the script above doesn't work, then the content is most likely encrypted. Look for an /Encrypt XX Y R entry in the PDF. The XX Y part refers to the object that holds the encryption dictionary (i.e. the encryption metadata). This object (defined as XX Y obj) will contain entries such as /U and /O; these are the (hashed) user and owner passwords, which you would need to crack. Good luck with them; just get a program that does the brute-force cracking for you instead.
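
The check for an /Encrypt reference can be scripted in the same regular-expression style as the gist. This is a rough sketch rather than a complete test: it assumes the reference appears as plain bytes in the form `/Encrypt XX Y R`, so it would miss an encryption dictionary written inline. The function name `find_encrypt_ref` is an invention for this example.

```python
import re

# Matches an indirect reference such as "/Encrypt 12 0 R".
ENCRYPT_REF = re.compile(rb"/Encrypt\s+(\d+)\s+(\d+)\s+R")


def find_encrypt_ref(pdf: bytes):
    """Return (object number, generation) of the encryption dictionary, or None."""
    match = ENCRYPT_REF.search(pdf)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Tiny hand-written trailers used only to exercise the function.
encrypted = b"trailer\n<</Size 30/Root 1 0 R/Encrypt 12 0 R>>"
plain = b"trailer\n<</Size 30/Root 1 0 R>>"

print(find_encrypt_ref(encrypted))
print(find_encrypt_ref(plain))
```

If a reference is found, the object to inspect is the one that starts with the returned numbers followed by `obj`.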

@BYugandhar

BYugandhar commented May 12, 2023

@Kungergely Thanks for the answer. This PDF is not encrypted; I checked for /U, /O, and /UE and found none of them. Could you please suggest a way to decode the stream back to the original text? (The text content in the PDF is "Hello", which I need to retrieve from the stream above.)
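
Once a page content stream has been inflated, text such as "Hello" may appear as a literal string in front of a `Tj` operator. The sketch below pulls out those strings under strong assumptions: it only handles simple `(…) Tj` operands without nested parentheses, ignores `TJ` arrays and hex strings, and leaves escape sequences undecoded. Fonts with custom or subset encodings store character codes that do not read as plain text, in which case this finds nothing useful. The name `shown_strings` is hypothetical.

```python
import re

# Literal string operand followed by the Tj (show text) operator, e.g. "(Hello) Tj".
SHOW_TEXT = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj", re.S)


def shown_strings(content: bytes) -> list:
    """Return the raw literal strings passed to Tj in an inflated content stream."""
    return SHOW_TEXT.findall(content)


# Hand-written content stream standing in for the output of zlib.decompress.
content = b"BT /F1 12 Tf 72 720 Td (Hello) Tj ET"
print(shown_strings(content))
```

For anything beyond this simple case, a real PDF library or pdf-parser is the safer route.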

@GitHubRulesOK

GitHubRulesOK commented Jun 18, 2023

@BYugandhar
You have posted this as text:

20 0 obj
<</Filter/FlateDecode/Length 299>>stream
H‰\‘ËjÃ0�E÷úŠY&‹ ?ã�Œ!Í�¼èƒ:ý G�

HOWEVER, in the PDF those are not text. In common with every computer file on this planet, a PDF is a BINARY bitSTREAM.
When we open such a file in a TEXT editor, we see the binary BYTES as characters like ABCDEFG, or as ������� when they are not A-Z or other normal ASCII text characters.
When you cut and paste such ANSI text, say from MS Notepad to MS Notepad in ANSI mode, most of the characters (EXCEPT the NUL byte, shown as [None]) will actually be uncorrupted and thus potentially usable. Here is an ANSI view of such text. NOTE there are very few ����
[image: ANSI view of the stream text]

HOWEVER, when you paste or save as plain text, that one missing NUL byte is critical and the saved data is usually corrupted, such that fonts and images that depend on that NUL character fall back to blank, leaving pages bare of data. Sadly, the equation for PDF is that 255 of the 256 byte values survive but <00> does not.

What often happens in such cases is that the decode fails, and what I often see returned is an empty stream. Rubbish in, rubbish out:

20 0 obj
<</Length 0>>
stream

endstream
endobj
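
The damage described above is easy to reproduce. The sketch below simulates one kind of text round trip, decoding the bytes as UTF-8 with replacement characters and encoding them again, which is an assumption about how an editor or comment box might treat the data. Other paths, such as dropping NUL bytes, corrupt the stream in a different way but with the same end result. The helper names are made up for this example.

```python
import zlib


def text_round_trip(raw: bytes) -> bytes:
    """Simulate pasting binary data through a UTF-8 text field."""
    return raw.decode("utf-8", errors="replace").encode("utf-8")


def can_inflate(raw: bytes) -> bool:
    """Report whether zlib accepts the data as a complete stream."""
    try:
        zlib.decompress(raw)
        return True
    except zlib.error:
        return False


original = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
damaged = text_round_trip(original)

print("original inflates:", can_inflate(original))
print("bytes unchanged after round trip:", damaged == original)
print("damaged inflates:", can_inflate(damaged))
```

Every byte that is not valid UTF-8 is replaced during the round trip, which is why the pasted stream in the comment cannot be turned back into the original data.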


@nerun

nerun commented Mar 24, 2024

Add import sys and replace "some_doc.pdf" with sys.argv[1] to get a generic command-line tool for decoding FlateDecode streams in a PDF.

#!/usr/bin/env python3
# This script is designed to do one thing and one thing only. It will find each
# of the FlateDecode streams in a PDF document using a regular expression,
# unzip them, and print out the unzipped data. You can do the same in any
# programming language you choose.
#
# This is NOT a generic PDF decoder, if you need a generic PDF decoder, please
# take a look at pdf-parser by Didier Stevens, which is included in Kali linux.
# https://tools.kali.org/forensics/pdf-parser.
#
# Any requests to decode a PDF will be ignored.
import re
import zlib
import sys

pdf = open(sys.argv[1], "rb").read()
stream = re.compile(rb'.*?FlateDecode.*?stream(.*?)endstream', re.S)

for s in stream.findall(pdf):
    s = s.strip(b'\r\n')
    try:
        print(zlib.decompress(s))
        print("")
    except zlib.error:
        # Skip streams that zlib cannot decompress.
        pass
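
A variation on the script above avoids `strip(b'\r\n')`, which can remove a genuine trailing data byte when the compressed stream happens to end in 0x0A or 0x0D. This sketch consumes only the end-of-line after the `stream` keyword and relies on `zlib.decompressobj`, which stops at the end of the zlib data and ignores what follows. It is still a regex approach, not a PDF parser: streams with chained filters or encryption will be skipped or come out as garbage. The names `inflate_streams` and `main` are assumptions for illustration.

```python
import re
import sys
import zlib

# "stream" must be followed by an end-of-line, which is not part of the data.
STREAM = re.compile(rb"FlateDecode.*?stream\r?\n(.*?)endstream", re.S)


def inflate_streams(pdf: bytes):
    """Yield the inflated bytes of each FlateDecode stream that zlib accepts."""
    for match in STREAM.finditer(pdf):
        # decompressobj stops at the end of the zlib data, so the end-of-line
        # bytes before "endstream" do not have to be stripped by hand.
        inflater = zlib.decompressobj()
        try:
            yield inflater.decompress(match.group(1)) + inflater.flush()
        except zlib.error:
            # Not valid zlib data (for example encrypted or chained filters).
            continue


def main(argv):
    if len(argv) != 2:
        print("usage: script.py some_doc.pdf", file=sys.stderr)
        return 2
    with open(argv[1], "rb") as f:
        for data in inflate_streams(f.read()):
            print(data)
            print("")
    return 0


if __name__ == "__main__":
    main(sys.argv)
```

Run it with the path to a PDF as the only argument; each stream is printed as a Python bytes literal, as in the original script.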

@vipercommand

vipercommand commented Apr 20, 2024 via email
