@barraponto
Created September 30, 2012 18:29
This is a re-implementation of the Scrapy spider tutorial using HtmlCSSSelector.
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlCSSSelector


class DmozSpiderCSS(BaseSpider):
    name = "pyquery"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hcs = HtmlCSSSelector(response)
        sites = hcs.select('ul li')
        for site in sites:
            links = site.select('a')
            if len(links):
                title = links[0].text_content()
                link = links[0].get('href')
                desc = site.text_content()
                print title, link, desc

ghost commented Jun 4, 2013

I think methods with the same names as jQuery's might be better, e.g. links[0].text() and links[0].attr('href').
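For illustration, the jQuery-style naming suggested here could be layered over the existing methods with a thin wrapper. This is only a sketch: the JQueryStyle class is hypothetical and not part of Scrapy or HtmlCSSSelector, and it assumes the selected elements behave like lxml elements (text_content() and get(), as used in the gist). The sample markup is made up, and the standard-library xml.etree.ElementTree stands in for the real selector so the example runs without Scrapy.

```python
import xml.etree.ElementTree as ET


class JQueryStyle(object):
    """Hypothetical wrapper exposing jQuery-like text() and attr() names."""

    def __init__(self, element):
        self._element = element

    def text(self):
        # lxml.html elements provide text_content(); ElementTree elements
        # do not, so fall back to joining the pieces from itertext().
        if hasattr(self._element, "text_content"):
            return self._element.text_content()
        return "".join(self._element.itertext())

    def attr(self, name):
        # Both lxml and ElementTree elements support get(name).
        return self._element.get(name)


# Stand-in markup (invented for this example), parsed with the standard library.
item = ET.fromstring(
    '<li><a href="http://example.com/">Example</a> - a sample entry</li>'
)

link = JQueryStyle(item.find("a"))
title = link.text()
href = link.attr("href")
desc = JQueryStyle(item).text()
print("%s %s %s" % (title, href, desc))
```

With a wrapper like this, the spider's loop body would read links[0].text() and links[0].attr('href') while still delegating to the underlying element methods.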
