@kevinl95
Created July 8, 2018 20:25
Demonstration of how to extract attachments from PDF files using Python 3 and PyPDF2.
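import PyPDF2


def getAttachments(reader):
    """
    Retrieves the file attachments of the PDF as a dictionary of file names
    and the file data as a bytestring.

    :param reader: an open PyPDF2.PdfFileReader
    :return: dictionary of filenames and bytestrings
    """
    catalog = reader.trailer["/Root"]
    # /Names is a flat array of alternating entries:
    # [name1, filespec1, name2, filespec2, ...]
    # This assumes the PDF has embedded files and a flat (no /Kids) name tree;
    # otherwise the lookup below raises KeyError.
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    attachments = {}
    # Step through the array in pairs instead of calling fileNames.index(f),
    # which returns the wrong entry when two attachments share a name.
    for i in range(0, len(fileNames) - 1, 2):
        name = fileNames[i]
        if isinstance(name, str):
            fDict = fileNames[i + 1].getObject()
            fData = fDict['/EF']['/F'].getData()
            attachments[name] = fData
    return attachments


# This uses the PyPDF2 1.x/2.x names; PyPDF2 3.0 renamed them to
# PdfReader, get_object() and get_data().
with open('YOURPDFPATH', 'rb') as handler:
    reader = PyPDF2.PdfFileReader(handler)
    dictionary = getAttachments(reader)
print(dictionary)
# Write each attachment to the current directory under its embedded name.
for fName, fData in dictionary.items():
    with open(fName, 'wb') as outfile:
        outfile.write(fData)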
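
The final loop writes each attachment under the name stored inside the PDF. That name comes from the document itself, so it may contain directory components such as `../`. The following is a minimal sketch of one way to reduce such a name to a bare file name before opening it. `safe_attachment_path`, its `out_dir` parameter, and the `attachment.bin` fallback are hypothetical names chosen for illustration; they are not part of PyPDF2 or the original gist.

```python
import os


def safe_attachment_path(name, out_dir="."):
    """Return a path inside out_dir for an attachment name taken from a PDF.

    Hypothetical helper for illustration; not part of PyPDF2.
    """
    # Treat backslashes as separators too, so Windows-style names are
    # reduced the same way on every platform.
    base = os.path.basename(name.replace("\\", "/"))
    # Fall back to a placeholder when no usable file name is left.
    if base in ("", ".", ".."):
        base = "attachment.bin"
    return os.path.join(out_dir, base)
```

With this helper, the write loop could call `open(safe_attachment_path(fName), 'wb')` instead of `open(fName, 'wb')`. Two attachments that differ only in their directory part would then map to the same output file, so a real script may also want to de-duplicate names.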