@caryfitzhugh
Created October 26, 2015 18:13
(defn sitemap-article-urls
  [sitemap-url syndicate]
  (try
    (let [;; Look back N days (however long the publication lifespan is on the syndicate).
          scrape-date (tf/unparse (:date-time tf/formatters)
                                  (t/minus (t/now) (t/days (:plife syndicate))))
          _ (logger/info "Scraping: " scrape-date (pr-str sitemap-url))
          ;; Scrape the sitemaps and collect the set of unique URLs.
          article-urls (set (first (sitemap/urls sitemap-url scrape-date)))
          _ (logger/info "Article URLs: " sitemap-url " -> " (pr-str article-urls))
          ;; Keep only the URLs that data/spugnable-url? accepts.
          valid-article-urls (filter data/spugnable-url? article-urls)
          _ (logger/info "VALID Article URLs: " sitemap-url " -> " (pr-str valid-article-urls))]
      valid-article-urls)
    ;; The original snippet ends before the catch clause. Logging the failure
    ;; and returning an empty vector is an assumption, not the author's code.
    (catch Exception e
      (logger/info "Sitemap scrape failed: " (pr-str sitemap-url) (.getMessage e))
      [])))