@jackrusher
Created July 26, 2012 14:49
Fetching PDF text with attributes in Clojure using PDFBox
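;; Uses the PDFBox 1.x API (org.apache.pdfbox.util, .getAllPages, .findResources).
(ns pdfbox.core
  (:import [org.apache.pdfbox.pdmodel PDDocument]
           [org.apache.pdfbox.util PDFMarkedContentExtractor TextPosition]
           [java.util ArrayList]))

(defn parse-pdf
  "Returns an ArrayList of TextPosition objects, one per glyph, for every
  page of the PDF at filename. The document is left open."
  [filename]
  (let [pages        (.getAllPages (.getDocumentCatalog (PDDocument/load filename)))
        textpool     (ArrayList.)
        ;; Override the per-glyph callback to collect each TextPosition.
        extract-text (proxy [PDFMarkedContentExtractor] []
                       (processTextPosition [text]
                         (.add textpool text)))]
    (doseq [page pages]
      ;; Guard on the contents themselves so pages without a content
      ;; stream are skipped instead of raising a NullPointerException.
      (when-let [contents (.getContents page)]
        (.processStream extract-text page (.findResources page) (.getStream contents))))
    textpool))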
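A possible way to consume the result, sketched under assumptions: the helper names `position->map` and `runs-by-font-size` are hypothetical and not part of the gist, and the accessors used (`getCharacter`, `getX`, `getY`, `getFontSize`, `getFont`, `getBaseFont`) are the PDFBox 1.x ones, so they may differ in later versions. The first helper copies a glyph's attributes into a plain map; the second joins consecutive glyphs that share a font size.

```clojure
;; Hypothetical helpers; assume PDFBox 1.x TextPosition accessors.
(defn position->map
  "Copy the attributes of one TextPosition into a plain Clojure map."
  [tp]
  {:text      (.getCharacter tp)
   :x         (.getX tp)
   :y         (.getY tp)
   :font-size (.getFontSize tp)
   ;; The font may be absent, so only ask for its name when present.
   :font      (when-let [font (.getFont tp)] (.getBaseFont font))})

(defn runs-by-font-size
  "Join consecutive glyph maps that share a :font-size into one string each."
  [glyphs]
  (for [run (partition-by :font-size glyphs)]
    {:font-size (:font-size (first run))
     :text      (apply str (map :text run))}))

(comment
  ;; "example.pdf" is a placeholder path.
  (->> (parse-pdf "example.pdf")
       (map position->map)
       runs-by-font-size))
```

Working with plain maps keeps the later steps independent of PDFBox, so grouping or filtering by attribute can be tried at the REPL with hand-built data.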