@borkdude
Created October 5, 2020 08:21
Extract HTML tables with babashka and bootleg
(ns scrape
  (:require [babashka.pods :as pods]
            [clojure.walk :as walk]))

(pods/load-pod "bootleg") ;; installed on path, use "./bootleg" for local binary

(require '[babashka.curl :as curl])

(def clojure-html (:body (curl/get "https://en.wikipedia.org/wiki/Clojure")))

(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]])

(def hiccup (convert-to clojure-html :hiccup))

(def tables (atom []))

(walk/postwalk (fn [node]
                 (when (and (vector? node)
                            (= :table (first node)))
                   (swap! tables conj node))
                 node)
               hiccup)

(count @tables) ;; 15
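
The walk above collects matches by side effect into an atom. As a rough alternative sketch, the same collection can be done purely with `tree-seq` from clojure.core. The helper name `find-tables` and the sample data are made up for illustration and are not part of the gist. Note that `tree-seq` yields nodes in pre-order, so nested tables may come out in a different order than with `postwalk`.

```clojure
;; Hypothetical helper (not from the gist): collect every [:table ...] node
;; from a hiccup tree without using a mutable atom.
(defn find-tables
  [hiccup]
  (->> (tree-seq sequential? seq hiccup)   ; lazily visit every nested vector
       (filter #(and (vector? %) (= :table (first %))))
       vec))

;; Made-up sample data standing in for the converted Wikipedia page.
(def sample-hiccup
  [:html
   [:body
    [:table {:class "a"} [:tr [:td "1"]]]
    [:div [:table [:tr [:td "2"]]]]]])

(count (find-tables sample-hiccup)) ;; 2
```

With the gist's own data this would presumably be called as `(find-tables hiccup)`.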
@retrogradeorbit

Enlive is geared more toward transforming than extracting, but we can hack it:

(ns scrape
  (:require [babashka.pods :as pods]))

(pods/load-pod "bootleg") ;; installed on path, use "./bootleg" for local binary

(require '[babashka.curl :as curl])

(def clojure-html (:body (curl/get "https://en.wikipedia.org/wiki/Clojure")))

(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]]
         '[pod.retrogradeorbit.bootleg.enlive :as enlive])

(def hiccup (convert-to clojure-html :hiccup))

(def tables (atom []))

(enlive/at hiccup [:table] ;; [:table] is a "css like" selector
           #(do
              ;; The function is called with hickory forms,
              ;; so convert them to hiccup as we conj.
              (swap! tables conj (convert-to % :hiccup))

              ;; Enlive expects a transformed form to be returned;
              ;; return the same form to leave the markup unchanged.
              %))

(count @tables) ;; 15

Hickory is elegant at selection and extraction:

(ns scrape
  (:require [babashka.pods :as pods]))

(pods/load-pod "bootleg") ;; installed on path, use "./bootleg" for local binary

(require '[babashka.curl :as curl])

(def clojure-html (:body (curl/get "https://en.wikipedia.org/wiki/Clojure")))

(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]]
         '[pod.retrogradeorbit.hickory.select :as s])

(def hickory (convert-to clojure-html :hickory))

;; select all table tags from markup.
;; will return a vector of hickory structures
(def tables-hickory (s/select (s/tag :table) hickory))

;; but if you want them as hiccup, you can just convert them
(def tables-hiccup (convert-to tables-hickory :hiccup-seq))

(count tables-hickory) ;; 15
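
Once a table is available as hiccup, its cell text can be pulled out with plain Clojure. The following is only a sketch: `node-text`, `tagged?`, `table->rows` and the sample table are hypothetical names and data, not part of bootleg or the gist. It assumes cells are direct children of their `:tr`, and a table nested inside a cell would contribute its rows to the outer table as well.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: concatenate all string leaves below a hiccup node.
;; Attribute maps are skipped because maps are not sequential.
(defn node-text [node]
  (->> (tree-seq sequential? seq node)
       (filter string?)
       (str/join "")
       str/trim))

;; Hypothetical helper: true when node is a hiccup vector with the given tag.
(defn tagged? [tag node]
  (and (vector? node) (= tag (first node))))

;; Hypothetical helper: turn one hiccup table into a vector of rows,
;; each row being a vector of cell strings.
(defn table->rows [table]
  (->> (tree-seq sequential? seq table)
       (filter #(tagged? :tr %))
       (mapv (fn [tr]
               (->> (rest tr)
                    (filter #(or (tagged? :th %) (tagged? :td %)))
                    (mapv node-text))))))

;; Made-up sample table for illustration.
(def sample-table
  [:table {}
   [:tbody
    [:tr [:th "Name"] [:th "Year"]]
    [:tr [:td [:a {:href "/wiki/Clojure"} "Clojure"]] [:td " 2007\n"]]]])

(table->rows sample-table) ;; => [["Name" "Year"] ["Clojure" "2007"]]
```

Applied to the data above, this would presumably look like `(map table->rows tables-hiccup)`.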
