@emlyn
Forked from jashmenn/extract-text.clj
Last active December 18, 2015 12:28
Extract the text from a web page using the Jericho HTML Parser in Clojure. Run with 'lein one-off extract-text.clj filename.html'.
#_(defdeps [[net.htmlparser.jericho/jericho-html "3.1"]])

(ns foo.preprocess
  (:import [java.io File BufferedInputStream FileInputStream]
           [net.htmlparser.jericho Source TextExtractor HTMLElementName]))

;; TextExtractor subclass that leaves out the contents of <pre> elements.
(defn my-text-extractor [source]
  (proxy [TextExtractor] [source]
    (excludeElement [tag]
      (= (.getName tag) HTMLElementName/PRE))))

;; Parse the HTML file named by fname and print its extracted text.
(defn -main [fname]
  (let [file   (File. fname)
        source (Source. (BufferedInputStream. (FileInputStream. file)))
        tex    (my-text-extractor source)]
    (println (str tex))))

(apply -main *command-line-args*)