@usametov
Forked from borkdude/scrape_tables.clj
Last active July 7, 2022 11:16
get list of pubmed baseline files
(ns scrape
  (:require [babashka.curl :as curl]
            [babashka.pods :as pods]
            [clojure.string :as str]
            [clojure.walk :as walk]))

;; The bootleg pod must be loaded before its namespaces can be required.
(pods/load-pod 'retrogradeorbit/bootleg "0.1.9")
(require '[pod.retrogradeorbit.bootleg.utils :refer [convert-to]])

;; Fetch the baseline directory listing and parse it into hiccup.
(def pubmed-html (:body (curl/get "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/")))
(def hiccup (convert-to pubmed-html :hiccup))

;; Collect every [:a {:href "...xml.gz"} ...] node found while walking the tree.
(def links (atom []))
(walk/postwalk (fn [node]
                 (when (and (vector? node)
                            (= :a (first node))
                            (string? (:href (second node)))
                            (str/ends-with? (:href (second node)) "xml.gz"))
                   (swap! links conj node))
                 node)
               hiccup)

;; Keep only the href of each matching link and save the list as EDN.
(def pubmed-links (->> @links (map second) (map :href)))
(spit "pubmed-links.edn" (into [] pubmed-links))
#!/usr/bin/env bb
;; Downloads every file listed in pubmed-links.edn into ./pubmed using wget.
(require '[clojure.edn :as edn])
(require '[clojure.java.shell :refer [sh]])

(defn wget-file
  "Downloads url into output-dir by shelling out to wget."
  [url output-dir]
  (sh "wget" "-P" output-dir url))

;; The EDN file is expected to hold relative hrefs, so prepend the baseline URL.
;; The path is resolved relative to the directory the script is run from.
(def links (map #(str "https://ftp.ncbi.nlm.nih.gov/pubmed/baseline/" %)
                (edn/read-string (slurp "bb-get-pubmed/pubmed-links.edn"))))

(println (str "scraped " (count links) " links"))

;; To restart after an interruption, raise the 0 in (drop 0 links)
;; to the number of files that were already downloaded.
(doseq [l (drop 0 links)]
  (wget-file l "./pubmed"))
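
The `(drop 0 links)` restart trick requires counting the finished downloads by hand. As a possible alternative, here is a minimal sketch that skips any link whose file already exists in the output directory. The helper names `file-name` and `pending-links` are hypothetical and not part of the original gist, and the sketch assumes each URL ends in the file name that wget writes to disk.

```clojure
(require '[clojure.java.io :as io]
         '[clojure.string :as str])

(defn file-name
  "Returns the last path segment of url (hypothetical helper)."
  [url]
  (last (str/split url #"/")))

(defn pending-links
  "Keeps only the links whose file is not yet present in output-dir
  (hypothetical helper; assumes wget saves each file under its URL name)."
  [links output-dir]
  (remove #(.exists (io/file output-dir (file-name %))) links))
```

With these in place, the download loop could read `(doseq [l (pending-links links "./pubmed")] (wget-file l "./pubmed"))`. Note that a file cut off mid-download still exists on disk and would be skipped, so an interrupted file may need to be deleted before restarting.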