@evandonovan
Created August 1, 2013 21:33
DMOZ scraper
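module DMOZ
  # Output files mapped to their column definitions. Each field is an
  # anonymous object that only needs to respond to #name.
  FIELDS = {
    "dmoz.csv" => [
      Class.new(Object) do
        def name
          :title
        end
      end.new,
      Class.new(Object) do
        def name
          :url
        end
      end.new
    ]
  }

  # Jobs that seed the crawl. BaseJob is not defined in this file; it is
  # expected to come from the surrounding scraper framework.
  START_JOBS = [
    Class.new(BaseJob) do
      def url
        'http://www.dmoz.org/Arts/Television/Networks/PBS/'
      end

      # doc is the parsed page and must respond to #css.
      def execute(doc, data_store, fields)
        # First link in the directory listing; nil if the page has none.
        link = doc.css(".directory-url li a")[0]
        return if link.nil?

        # NOTE: three values are written here, but FIELDS declares only
        # :title and :url for "dmoz.csv".
        data_store.add_item("dmoz.csv", [
          self.url,
          link.text,
          link['href']
        ])
      end
    end.new
  ]
end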
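
A rough way to exercise the start job without the scraper framework or a network call is sketched below. Everything except the job body is a hypothetical stand-in: the empty `BaseJob`, `MemoryStore`, `FakeLink`, and `FakeDoc` are assumptions made for illustration and are not part of this gist or of whatever framework it targets. The real `doc` is presumably a parsed HTML document; the fake only answers `#css`.

```ruby
# Hypothetical stand-in for the framework's base class.
class BaseJob; end

# Hypothetical data store: collects rows in memory, keyed by file name.
class MemoryStore
  attr_reader :items

  def initialize
    @items = Hash.new { |hash, key| hash[key] = [] }
  end

  def add_item(file, row)
    @items[file] << row
  end
end

# Hypothetical link node. Struct#[] accepts a member name as a String,
# so link['href'] works the same way it does in the job.
FakeLink = Struct.new(:text, :href)

# Hypothetical parsed page: #css returns the canned links for any selector.
class FakeDoc
  def initialize(links)
    @links = links
  end

  def css(_selector)
    @links
  end
end

# Same shape as the start job in the gist.
job = Class.new(BaseJob) do
  def url
    'http://www.dmoz.org/Arts/Television/Networks/PBS/'
  end

  def execute(doc, data_store, fields)
    link = doc.css(".directory-url li a")[0]
    data_store.add_item("dmoz.csv", [url, link.text, link['href']])
  end
end.new

store = MemoryStore.new
doc = FakeDoc.new([FakeLink.new("PBS", "http://www.pbs.org/")])
job.execute(doc, store, nil)
p store.items["dmoz.csv"]
```

The stored row holds three values (page URL, link text, link href), which makes it easy to see how it lines up against the two columns declared in `FIELDS`.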