Last active August 29, 2015 13:56
Scraping via XPath
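;; Incanter pulls xalan 2.6 onto the classpath, which causes issues with
;; clj-xpath, so DON'T load Incanter when using clj-xpath.
;; To check whether a xalan jar is on the classpath (.getURLs assumes the
;; system class loader is a URLClassLoader, i.e. Java 8 or earlier):
(->> (ClassLoader/getSystemClassLoader)
     .getURLs
     (map #(.toString %))
     (clojure.string/join "\n")
     (re-seq #".*xalan.*"))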
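;; A minimal alternative sketch for the classpath check, assuming a JDK 9+
;; runtime, where the system class loader is no longer a URLClassLoader:
;; read the java.class.path system property instead. The helper name
;; `classpath-entries-matching` is hypothetical, and jars added to the
;; classpath dynamically after JVM startup will not show up here.
(require '[clojure.string :as str])

(defn classpath-entries-matching
  "Returns the classpath entries whose path matches the regex `re`."
  [re]
  (->> (str/split (System/getProperty "java.class.path")
                  (re-pattern (java.util.regex.Pattern/quote
                               java.io.File/pathSeparator)))
       (filter #(re-find re %))))

;; Usage: an empty result means no xalan jar was found.
(classpath-entries-matching #"xalan")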
;; Load the HTTP client and the XPath helpers
(require '[clj-http.client :as c])
(require '[clj-xpath.core :refer :all])
;; Memoize so the feed is fetched only once
(def my-rss-xml
  (memoize (fn [] (-> (c/get "") :body))))
;; Extract some awesomeness!
(let [titles ($x:text* "//entry/title" (my-rss-xml))
      urls ($x:text* "//entry/id" (my-rss-xml))
      updated ($x:text* "//entry/updated" (my-rss-xml))]
  (->> (map vector titles urls updated)
       (map (fn [[t u utime]] {:title t :url u :updated utime}))))
; Outputs..
; ({:title "Sierpinski Triangle Fractal",
; :url
; "",
; :updated "2014-02-15T22:36:06-05:00"}
; {:title "Approximating the Golden Ratio",
; :url
; "",
; :updated "2014-02-09T12:47:20-05:00"}
; {:title "Dynamic Binding and Being Meta",
; :url
; "",
; :updated "2014-02-07T17:47:38-05:00"}
; {:title "Predicting Algorithm Running Times",
; :url
; "",
; :updated "2014-01-26T19:47:29-05:00"}
; {:title "First-class Functions",
; :url
; "",
; :updated "2014-01-07T19:25:44-05:00"}
; {:title "Scala and Clojure List Operations",
; :url
; "",
; :updated "2013-12-18T19:32:00-05:00"}
; {:title "A Tale of Two Languages",
; :url
; "",
; :updated "2013-10-20T10:51:00-04:00"}
; {:title "Recursion in Scala",
; :url "",
; :updated "2013-04-08T19:33:00-04:00"}
; {:title "Multi-threaded Socket Server",
; :url "",
; :updated "2012-12-03T11:34:00-05:00"})
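;; A minimal sketch of the same title/id/updated extraction, done with the
;; JDK's javax.xml.xpath through interop rather than clj-xpath. It parses the
;; document once and queries each <entry> relative to itself, so the fields
;; stay paired even if an entry lacks one (a missing field yields "").
;; `sample-feed`, `parse-xml`, and `entries` are hypothetical names, and the
;; feed is assumed to declare no XML namespace.
(import '(javax.xml.parsers DocumentBuilderFactory)
        '(javax.xml.xpath XPath XPathFactory XPathConstants)
        '(org.w3c.dom Document NodeList)
        '(java.io ByteArrayInputStream))

(def sample-feed
  (str "<feed>"
       "<entry><title>A</title><id>urn:a</id><updated>2014-02-15</updated></entry>"
       "<entry><title>B</title><id>urn:b</id><updated>2014-02-09</updated></entry>"
       "</feed>"))

(defn parse-xml
  "Parses an XML string into a DOM Document."
  [^String s]
  (-> (DocumentBuilderFactory/newInstance)
      (.newDocumentBuilder)
      (.parse (ByteArrayInputStream. (.getBytes s "UTF-8")))))

(defn entries
  "Returns one map per <entry> element in `doc`."
  [^Document doc]
  (let [^XPath xp (.newXPath (XPathFactory/newInstance))
        ^NodeList nodes (.evaluate xp "//entry" doc XPathConstants/NODESET)]
    (for [i (range (.getLength nodes))
          :let [n (.item nodes i)]]
      ;; Relative paths are evaluated against the current <entry> node
      {:title (.evaluate xp "title" n)
       :url (.evaluate xp "id" n)
       :updated (.evaluate xp "updated" n)})))

;; Usage: yields a seq of {:title ... :url ... :updated ...} maps
(entries (parse-xml sample-feed))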
;; Similar approach for HackerNews
(def hackernews-xml
  (memoize (fn [] (-> (c/get "") :body))))
;; Same extraction, this time over the RSS <item> elements
(let [titles ($x:text* "rss//item/title" (hackernews-xml))
      links ($x:text* "rss//item/link" (hackernews-xml))
      comments ($x:text* "rss//item/comments" (hackernews-xml))]
  (->> (map vector titles links comments)
       (map (fn [[t l c]] {:title t :url l :comments c}))))
; Outputs..
; ({:title "The future of Fiber",
; :url "",
; :comments ""}
; {:title "WebGL Water",
; :url "",
; :comments ""}
; {:title "How Microryza Acquired the Domain",
; :url
; "",
; :comments ""}
; {:title "How I was able to track the location of any Tinder user",
; :url
; "",
; :comments ""}
; {:title
; "Canonical announces first partners to ship Ubuntu phones around the globe",
; :url
; "",
; :comments ""}
; {:title "This App Trains You to See Farther",
; :url
; "",
; :comments ""}
; {:title
; "Heap's new interface for analytics: clicking around your site",
; :url "",
; :comments ""})