
@Samathy
Created January 5, 2018 18:50
Python script to extract highlighted text from PDFs. Uses python-poppler-qt4. Updated to Python 3 from the Stack Overflow answer at [1].
[1] https://stackoverflow.com/questions/21050551/extracting-text-from-higlighted-text-using-poppler-qt4-python-poppler-qt4
import sys

import popplerqt4
import PyQt4.QtCore  # import the submodule explicitly so PyQt4.QtCore.QRectF resolves


def main():
    # Usage: python <script> <file.pdf>
    doc = popplerqt4.Poppler.Document.load(sys.argv[1])
    total_annotations = 0
    for i in range(doc.numPages()):
        #print("========= PAGE {} =========".format(i+1))
        page = doc.page(i)
        annotations = page.annotations()
        (pwidth, pheight) = (page.pageSize().width(), page.pageSize().height())
        if len(annotations) > 0:
            for annotation in annotations:
                if isinstance(annotation, popplerqt4.Poppler.Annotation):
                    # Counts every annotation, not only highlights.
                    total_annotations += 1
                    if isinstance(annotation, popplerqt4.Poppler.HighlightAnnotation):
                        quads = annotation.highlightQuads()
                        txt = ""
                        for quad in quads:
                            # Scale the quad's opposite corners (points 0 and 2)
                            # by the page size to get page coordinates.
                            rect = (quad.points[0].x() * pwidth,
                                    quad.points[0].y() * pheight,
                                    quad.points[2].x() * pwidth,
                                    quad.points[2].y() * pheight)
                            bdy = PyQt4.QtCore.QRectF()
                            bdy.setCoords(*rect)
                            # Collect the page text that falls inside this rectangle.
                            txt = txt + str(page.text(bdy)) + ' '

                        #print("========= ANNOTATION =========")
                        print(txt)

    if total_annotations > 0:
        print(str(total_annotations) + " annotation(s) found")
    else:
        print("no annotations found")


if __name__ == "__main__":
    main()
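
The coordinate step inside the quad loop can be tried on its own, without Poppler installed. This is a minimal sketch, assuming quad points arrive as (x, y) pairs normalized to the 0 to 1 range, which is what the multiplication by page width and height in the script suggests. `quad_to_rect` is a hypothetical helper name, not part of python-poppler-qt4.

```python
def quad_to_rect(points, page_width, page_height):
    """Scale a highlight quad to page coordinates.

    points: four (x, y) pairs normalized to 0..1 (assumed layout, mirroring
    quad.points in the script). Points 0 and 2 are treated as opposite
    corners, as the script does.
    Returns (x1, y1, x2, y2), the argument order QRectF.setCoords() takes.
    """
    (x1, y1), (x2, y2) = points[0], points[2]
    return (x1 * page_width, y1 * page_height,
            x2 * page_width, y2 * page_height)


if __name__ == "__main__":
    # Hypothetical quad on a 600 x 800 point page.
    quad = [(0.25, 0.5), (0.75, 0.5), (0.75, 0.625), (0.25, 0.625)]
    print(quad_to_rect(quad, 600, 800))
```

In the script, the returned tuple would be unpacked into `bdy.setCoords(*rect)` exactly as the inline version is.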

ZycAlix commented Feb 20, 2018

Hi, may I ask a question? I couldn't install the popplerqt4 module on my Mac. Do you know how to fix it?


v-sukt commented Jul 21, 2019

This is really helpful. Can you point me to a resource on getting headings from a PDF based on their size, and on getting the notes made on a highlight?
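
On the notes question, a hedged sketch: Poppler's Qt4 `Annotation` class exposes a `contents()` accessor for the text attached to an annotation, so the note could be read next to the highlighted text. `annotation_note` is a hypothetical helper name, and `FakeAnnotation` is a stand-in used only so the snippet runs without Poppler. Whether a given PDF viewer stores its notes in that field is an assumption.

```python
def annotation_note(annotation):
    """Return the note text attached to an annotation, or '' if there is none.

    Assumes the object offers contents(), as Poppler's Qt4 Annotation does.
    """
    note = annotation.contents()
    return str(note).strip() if note else ""


class FakeAnnotation:
    """Stand-in for a Poppler annotation so the sketch runs without Poppler."""

    def __init__(self, text):
        self._text = text

    def contents(self):
        return self._text


if __name__ == "__main__":
    print(annotation_note(FakeAnnotation("  check this claim ")))
```

In the gist's loop, `annotation_note(annotation)` could be called on each highlight annotation right where `txt` is printed.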


v-sukt commented Sep 3, 2019

If you run into dependency issues, try the Docker image at https://cloud.docker.com/repository/docker/vsukt/extract_pdf_notes or the Vagrant config at https://github.com/v-sukt/extract-pdf-notes

@mahdibabaei

> Hi, may I ask a question? I couldn't install the popplerqt4 module on my Mac. Do you know how to fix it?

Same here.


v-sukt commented Aug 8, 2020

> Hi, may I ask a question? I couldn't install the popplerqt4 module on my Mac. Do you know how to fix it?

Can you try using the Docker image on your Mac, or compiling the source code with its dependencies?
