find PDF font info with PyPDF2, example code
# Written against the PyPDF2 1.x API (PdfFileReader, getObject);
# later PyPDF2 releases renamed these to PdfReader and get_object.
from PyPDF2 import PdfFileReader
from PyPDF2.generic import ArrayObject
from pprint import pprint


def walk(obj, fnt, emb):
    '''
    If there is a key called 'BaseFont', that is a font that is used in the document.
    If there is a key called 'FontName' and another key in the same dictionary object
    that is called 'FontFilex' (where x is empty, 2, or 3), then that fontname is
    embedded.

    We create and add to two sets, fnt = fonts used and emb = fonts embedded.
    '''
    if not hasattr(obj, 'keys'):
        return None, None
    fontkeys = set(['/FontFile', '/FontFile2', '/FontFile3'])
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    if '/FontName' in obj:
        if [x for x in fontkeys if x in obj]:  # test to see if there is FontFile
            emb.add(obj['/FontName'])

    for k in obj.keys():
        walk(obj[k], fnt, emb)

    return fnt, emb  # return the sets for each page


if __name__ == '__main__':
    fname = 'myfile.pdf'
    pdf = PdfFileReader(fname)
    fonts = set()
    embedded = set()
    for page in pdf.pages:
        obj = page.getObject()
        # updated via this answer:
        # https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs/60895334#60895334
        # in order to handle lists inside objects. Thanks misingnoglic!
        # untested code since I don't have such a PDF to play with.
        if isinstance(obj, ArrayObject):  # You can also do ducktyping here
            for i in obj:
                if hasattr(i, 'keys'):
                    f, e = walk(i, fonts, embedded)
                    fonts = fonts.union(f)
                    embedded = embedded.union(e)
        else:
            f, e = walk(obj['/Resources'], fonts, embedded)
            fonts = fonts.union(f)
            embedded = embedded.union(e)

    unembedded = fonts - embedded
    # print() calls with a single argument work on both Python 2 and Python 3
    print('Font List')
    pprint(sorted(list(fonts)))
    if unembedded:
        print('\nUnembedded Fonts')
        pprint(unembedded)

@smblance smblance commented Jan 19, 2016

Thank you very much for the script - used it to get the fonts of a pdf that I couldn't extract by other means!


@LizaKoz LizaKoz commented Dec 7, 2018

That's very handy! Is it also possible to get the text which is written in bold?


@Shohreh Shohreh commented Mar 17, 2020

First, install PyPDF2 (pprint is part of the Python standard library, so it needs no separate install):

pip install PyPDF2

Next, if using Python 3, write the print statements as function calls:

print('Font List')
pprint(sorted(list(fonts)))
…
print('\nUnembedded Fonts')
pprint(unembedded)

@misingnoglic misingnoglic commented Mar 26, 2020

Hi, FYI: I tried to use this script to find unembedded fonts in PDFs. I downloaded a PDF from Google Docs that uses only Arial. Adobe Reader says the font is embedded, but this script says it is not.

Owner Author

@tiarno tiarno commented Mar 27, 2020

Thanks for the info. This works for the PDFs I've come across, but many different structures are possible inside a PDF, so I would definitely believe Adobe. If you want further confirmation, pdffonts is a command-line tool you might be interested in: https://www.xpdfreader.com/pdffonts-man.html


@misingnoglic misingnoglic commented Mar 27, 2020

Thanks - I'll try to reverse engineer what they have done. In the meantime I've asked on Stack Overflow: https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs


@misingnoglic misingnoglic commented Mar 27, 2020

Figured it out! You need to modify the script to handle lists as well. I put an example in the Stack Overflow answer:
https://stackoverflow.com/questions/60876103/use-pypdf2-to-detect-non-embedded-fonts-in-pdf-file-generated-by-google-docs/60895334#60895334
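
As a rough illustration of what "handle lists as well" could mean, here is a minimal sketch (not the code from the linked answer) of a walk() that also descends into arrays, for example the /DescendantFonts array that composite (Type0) fonts carry. It runs on plain Python dicts and lists; the `resources` structure is a hypothetical stand-in for a page's /Resources. With real PyPDF2 objects, array items may be indirect references that have to be resolved first, and that step is left out here.

```python
def walk(obj, fonts, embedded):
    """Collect used and embedded font names from nested dicts and lists."""
    font_file_keys = {'/FontFile', '/FontFile2', '/FontFile3'}
    if isinstance(obj, (list, tuple)):
        # Arrays can hold further font dictionaries, so visit every item.
        for item in obj:
            walk(item, fonts, embedded)
    elif hasattr(obj, 'keys'):
        if '/BaseFont' in obj:
            fonts.add(obj['/BaseFont'])
        # A /FontName next to any /FontFile* key marks an embedded font.
        if '/FontName' in obj and font_file_keys & set(obj.keys()):
            embedded.add(obj['/FontName'])
        for key in obj.keys():
            walk(obj[key], fonts, embedded)
    return fonts, embedded


# Hypothetical stand-in for a page's /Resources: the font descriptor
# is only reachable through a list.
resources = {
    '/Font': {
        '/F1': {
            '/BaseFont': '/ArialMT',
            '/DescendantFonts': [
                {
                    '/BaseFont': '/ArialMT',
                    '/FontDescriptor': {
                        '/FontName': '/ArialMT',
                        '/FontFile2': b'',
                    },
                },
            ],
        },
    },
}

fonts, embedded = walk(resources, set(), set())
print(sorted(fonts - embedded))  # fonts used but not embedded
```

A dict-only walk would stop at the list and report /ArialMT as unembedded, which matches the symptom described earlier in this thread.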


@tiarno tiarno commented Mar 28, 2020

I updated the code as best I could. Untested though. Thanks for the info!


@pranav1698 pranav1698 commented Apr 5, 2020

That's very handy! Is it also possible to get the text which is written in bold?

@tiarno any way we can do this as well?


@ilcaa72 ilcaa72 commented Apr 10, 2020

Hi, can someone help a rookie out? This is a function that returns the names of the various fonts within a PDF, correct?
So I should feed it a PDF (I assume this is the obj parameter of the function), but what are the other two parameters? It seems like it's asking me for two fonts.

A quick explanation would help. Thanks!
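
Going by the docstring in the script, the other two parameters are not fonts but two sets that walk() fills in: fnt collects every font name it sees and emb collects the embedded ones. The sketch below uses a condensed copy of the gist's walk() and two hypothetical dicts standing in for the /Resources of two pages, to show the same pair of sets being passed in for each page.

```python
def walk(obj, fnt, emb):
    # Condensed copy of the gist's walk(): handles dict-like objects only.
    if not hasattr(obj, 'keys'):
        return None, None
    if '/BaseFont' in obj:
        fnt.add(obj['/BaseFont'])
    has_font_file = any(k in obj for k in ('/FontFile', '/FontFile2', '/FontFile3'))
    if '/FontName' in obj and has_font_file:
        emb.add(obj['/FontName'])
    for k in obj.keys():
        walk(obj[k], fnt, emb)
    return fnt, emb


# Hypothetical stand-ins for the /Resources of two pages.
page1 = {'/Font': {'/F1': {'/BaseFont': '/Helvetica'}}}
page2 = {
    '/Font': {
        '/F2': {
            '/BaseFont': '/ABCDEF+Lato',
            '/FontDescriptor': {'/FontName': '/ABCDEF+Lato', '/FontFile2': b''},
        },
    },
}

fonts = set()     # passed as fnt: every font name found
embedded = set()  # passed as emb: font names that have a /FontFile* entry
for resources in (page1, page2):
    walk(resources, fonts, embedded)

print(sorted(fonts))             # ['/ABCDEF+Lato', '/Helvetica']
print(sorted(fonts - embedded))  # ['/Helvetica']
```

In the script itself, the first argument is each page's /Resources dictionary rather than the PDF file as a whole.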


@vagnit vagnit commented Aug 21, 2020

That's very handy! Is it also possible to get the text which is written in bold?

@tiarno any way we can do this as well?

Indeed, this would be really useful!
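
The script only collects font names, so it cannot say which text is bold. As a partial, heuristic sketch: a font can be guessed to be bold from its name or from its font descriptor, where the PDF format defines an optional /FontWeight entry and a ForceBold bit in /Flags. The helper name looks_bold and the sample values are hypothetical. Mapping bold fonts back to the actual text would additionally require parsing each page's content stream, which is not attempted here.

```python
FORCE_BOLD = 1 << 18  # bit 19 (ForceBold) of the font descriptor /Flags entry


def looks_bold(base_font, descriptor=None):
    """Heuristic guess: does this font name or descriptor suggest a bold face?"""
    # Names such as '/ABCDEF+Arial-BoldMT' usually carry the style in the name.
    if 'bold' in str(base_font).lower():
        return True
    if descriptor:
        # /FontWeight is optional; 700 is the conventional value for bold.
        if int(descriptor.get('/FontWeight', 0)) >= 700:
            return True
        if int(descriptor.get('/Flags', 0)) & FORCE_BOLD:
            return True
    return False


print(looks_bold('/ABCDEF+Arial-BoldMT'))            # True
print(looks_bold('/ArialMT'))                        # False
print(looks_bold('/ArialMT', {'/FontWeight': 700}))  # True
```

Names like "Semibold" also match the substring test, and text drawn in a regular font with a simulated bold effect would be missed, so treat the result as a hint only.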
