@rcarmo
Last active August 29, 2015 14:11
Readability-like HTML extraction in Hy
; originally posted on https://github.com/rodricios/eatiht/issues/2#issuecomment-67769343
(import
  [collections [Counter]]
  [cookielib [CookieJar]]
  [lxml.etree [HTML tostring]]
  [urllib2 [build-opener HTTPCookieProcessor]])

; minimum length (in characters) for a text node to count as content
(def *min-length* 20)

; elements (other than scripts, styles and inline markup) that directly
; contain a text node longer than *min-length*
(def *text-xpath*
  (+ "//body//*[not(self::script or self::style or self::i or self::b or self::strong or self::span or self::a)]/text()[string-length(normalize-space()) > "
     (str *min-length*)
     "]/.."))

; fetch a page using a cookie jar and a browser-like User-agent
(defn get-page [url]
  (print "getting" url)
  (let [[jar (CookieJar)]
        [opener (build-opener (HTTPCookieProcessor jar))]]
    (setv (. opener addheaders) [(, "User-agent" "Mozilla/5.0")])
    (.read (.open opener url))))

; pair each matching element with its absolute XPath
(defn find-text [doc]
  (let [[root (.getroottree doc)]]
    (map (fn [e] (, (.getpath root e) e)) (.xpath doc *text-xpath*))))

; XPath of an element's parent (drop the last path step)
(defn parent-path [i]
  (.join "/" (slice (.split (get i 0) "/") 0 -1)))

; the parent path shared by the largest number of text elements
(defn top-node [seq]
  (let [[parents (map parent-path seq)]
        [distribution (Counter parents)]]
    (get (get (.most-common distribution) 0) 0)))

; serialize the children of the top node back to HTML
(defn extract [buffer]
  (let [[doc (HTML buffer)]
        [text-nodes (find-text doc)]
        [top (get (.xpath doc (top-node text-nodes)) 0)]
        [children []]]
    (for [child (.iterchildren top)]
      (.append children (tostring child)))
    (.join "" children)))

; usage: pass the URL as the first command-line argument
(defmain [&rest args]
  (print (-> (get args 1)
             (get-page)
             (extract))))

rcarmo commented Dec 21, 2014

Things you can do with this:

  • Set a "relevance" threshold for capturing more than the "best" text subtree (most interesting articles in my feed have more than one body of text)
  • Tweak the XPath filter a bit for more accuracy
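
The first idea above could look roughly like the following. This is only a sketch, written in plain Python rather than Hy (the `Counter` calls are the same either way). The function name `candidate_paths` and the `threshold` ratio are hypothetical and not part of the gist, and it takes bare XPath strings where the gist's `parent-path` takes `(path, element)` tuples.

```python
from collections import Counter


def parent_path(xpath):
    """Drop the last step of an XPath such as '/html/body/div/p[2]'."""
    return "/".join(xpath.split("/")[:-1])


def candidate_paths(text_paths, threshold=0.5):
    """Return every parent path whose number of text elements is at least
    `threshold` times that of the most common parent, best first.

    `threshold` is an assumed tuning knob, not something the gist defines.
    """
    distribution = Counter(parent_path(p) for p in text_paths)
    if not distribution:
        return []
    best = distribution.most_common(1)[0][1]
    return [path for path, count in distribution.most_common()
            if count >= threshold * best]


# Made-up paths: two bodies of text plus a stray footer paragraph.
paths = [
    "/html/body/div[1]/p[1]",
    "/html/body/div[1]/p[2]",
    "/html/body/div[1]/p[3]",
    "/html/body/div[1]/p[4]",
    "/html/body/div[2]/p[1]",
    "/html/body/div[2]/p[2]",
    "/html/body/div[2]/p[3]",
    "/html/body/footer/p",
]
print(candidate_paths(paths))
# ['/html/body/div[1]', '/html/body/div[2]']
```

Each returned path could then be passed to `doc.xpath(...)` and serialized the same way `extract` handles the single top node.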