@lucassouza1
Created November 23, 2012 12:57
Site crawler
(ns receitacrawler.sites.tudogostoso.crawler
  (:require [cheshire.core :refer [generate-string]]
            [clojure.string] ; loaded explicitly; clojure.string/trim is used below
            [net.cgrand.enlive-html :as html]))

;; Site root and the Enlive selectors for each piece of a recipe page.
(def base-url "http://tudogostoso.uol.com.br")
(def title-selector [:.page-title :h1])
(def ingredients-selector [:.ingredients :.recipelist :li :span])
(def image-url-selector [:.photo])
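
(defn all-from-selector
  "Returns every node in `res` that matches `selector`."
  [res selector]
  (html/select res selector))

(defn first-from-selector
  "Returns the first node in `res` that matches `selector`, or nil."
  [res selector]
  (first (all-from-selector res selector)))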

(defn content-from-selector
  "Returns the trimmed first content item of the first node matching `selector`."
  [res selector]
  (-> (first-from-selector res selector) :content first clojure.string/trim))

(defn parse-ingredients
  "Returns the trimmed text of each ingredient node in `res`."
  [res]
  (let [ingredients (all-from-selector res ingredients-selector)]
    (map #(-> % :content first clojure.string/trim) ingredients)))
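
(defn recipe
  "Fetches the page at `path` (relative to `base-url`) and returns a map
  with its :title, :ingredients and :image-url."
  [path]
  (let [url (str base-url path)
        ;; The page is decoded explicitly as ISO-8859-1 before parsing.
        ;; .openStream always yields an InputStream, unlike .getContent,
        ;; whose return type depends on the response's content type.
        res (-> (java.net.URL. url)
                .openStream
                (java.io.InputStreamReader. "ISO-8859-1")
                html/html-resource)
        title (content-from-selector res title-selector)
        ingredients (parse-ingredients res)
        image-url (-> (first-from-selector res image-url-selector) :attrs :src)]
    {:title title :ingredients ingredients :image-url image-url}))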
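
;; Usage sketch, not part of the original gist. The namespace refers
;; `generate-string` from Cheshire but never calls it, so serialising the
;; recipe map to JSON is presumably the intended next step.
;; Assumption: the path below is a hypothetical placeholder, not a real
;; recipe URL; substitute a path that exists on the site.
(comment
  ;; Fetch one recipe and inspect the resulting map.
  (recipe "/path/to/recipe.html")

  ;; Serialise the same map to a JSON string.
  (generate-string (recipe "/path/to/recipe.html")))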