@jashmenn
Created December 1, 2010 20:56
Extract the text from a webpage using the Jericho HTML Parser in Clojure
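;; lein dep: [net.htmlparser.jericho/jericho-html "3.1"]
(ns foo.preprocess
  (:import
    [java.io File BufferedInputStream FileInputStream]
    [net.htmlparser.jericho Source TextExtractor]))

;; Path to the saved HTML page to process.
(def filename "data/raw-html/cosmetiquemedspa.com/index.html")
(def file (File. filename))

;; Source loads the content from the stream in its constructor,
;; so with-open can close the stream as soon as the Source is built.
(def source
  (with-open [in (BufferedInputStream. (FileInputStream. file))]
    (Source. in)))

;; TextExtractor renders the parsed document as plain text, without the markup.
(def tex (TextExtractor. source))
(.toString tex)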
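
The same steps can be wrapped in a function so they are reusable at the REPL. This is only a minimal sketch: the namespace `foo.extract-sketch` and the function name `extract-text` are made up for illustration and are not part of the original gist. It assumes the same Jericho dependency as above and relies on the `Source` constructor that accepts a `CharSequence`.

```clojure
;; Hypothetical namespace and function name, chosen for illustration only.
(ns foo.extract-sketch
  (:import
    [net.htmlparser.jericho Source TextExtractor]))

(defn extract-text
  "Returns the plain text of the HTML markup in `html`, with tags removed."
  [^CharSequence html]
  ;; Source parses the in-memory markup; TextExtractor renders it as text.
  (.toString (TextExtractor. (Source. html))))

;; Usage with an inline snippet of markup:
(extract-text "<p>Hello <b>world</b></p>")
```

For a saved page, something like `(extract-text (slurp filename))` should work, with the caveat that `slurp` decodes the file as UTF-8 by default, whereas handing Jericho the raw `InputStream` lets it detect the encoding from the document itself.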