@redapple
Last active January 3, 2016 03:59

from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re


class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']

    rules = (
        Rule(
            SgmlLinkExtractor(
                allow=(re.escape("index.php")),
                restrict_xpaths=("//div[@class='paginador']")),
            callback="parse_item",
            follow=True),
    )

    def parse_item(self, response):
        self.log("parse_item")
        sel = Selector(response)
        i = BcncatItem()
        #i['domain_id'] = sel.xpath('//input[@id="sid"]/@value').extract()
        #i['name'] = sel.xpath('//div[@id="name"]').extract()
        #i['description'] = sel.xpath('//div[@id="description"]').extract()
        return i
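
The bcncat.items module is not included in this gist; a minimal sketch of what BcncatItem could look like is below. The field names are only assumptions taken from the commented-out assignments in parse_item above.

from scrapy.item import Item, Field


class BcncatItem(Item):
    # Hypothetical fields, inferred from the commented-out lines in parse_item.
    domain_id = Field()
    name = Field()
    description = Field()
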
paul@wheezy:~/tmp/bcn/bcncat$ scrapy crawl bcn
2014-01-13 19:52:51+0100 [scrapy] INFO: Scrapy 0.21.0 started (bot: bcncat)
2014-01-13 19:52:51+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-01-13 19:52:51+0100 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'bcncat.spiders', 'SPIDER_MODULES': ['bcncat.spiders'], 'BOT_NAME': 'bcncat'}
2014-01-13 19:52:51+0100 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-01-13 19:52:51+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-01-13 19:52:51+0100 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-01-13 19:52:51+0100 [scrapy] INFO: Enabled item pipelines:
2014-01-13 19:52:51+0100 [bcn] INFO: Spider opened
2014-01-13 19:52:51+0100 [bcn] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2014-01-13 19:52:51+0100 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-01-13 19:52:51+0100 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-01-13 19:52:52+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&q=*:*> (referer: None)
2014-01-13 19:52:57+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=30&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&q=*:*)
2014-01-13 19:52:57+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:57+0100 [bcn] DEBUG: Filtered duplicate request: <GET http://guia.bcn.cat/index.php?pg=search&from=20&q=*:*&nr=10> - no more duplicates will be shown (see DUPEFILTER_CLASS)
2014-01-13 19:52:58+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=10&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&q=*:*)
2014-01-13 19:52:58+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:58+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=80&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=30&q=*:*&nr=10)
2014-01-13 19:52:58+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:58+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=0&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=30&q=*:*&nr=10)
2014-01-13 19:52:58+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:58+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=60&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=30&q=*:*&nr=10)
2014-01-13 19:52:58+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=70&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=30&q=*:*&nr=10)
2014-01-13 19:52:58+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:58+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=40&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&q=*:*)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=50&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&q=*:*)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=120&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=80&q=*:*&nr=10)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=130&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=80&q=*:*&nr=10)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=110&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=80&q=*:*&nr=10)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=100&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=80&q=*:*&nr=10)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=90&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=80&q=*:*&nr=10)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=20&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&q=*:*)
2014-01-13 19:52:59+0100 [bcn] DEBUG: parse_item
2014-01-13 19:52:59+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=150&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=120&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=170&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=120&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=160&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=120&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=180&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=130&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=140&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=120&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=190&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=150&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=220&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=170&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=210&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=170&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=230&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=180&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=200&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=150&q=*:*&nr=10)
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:00+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:01+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=260&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=220&q=*:*&nr=10)
2014-01-13 19:53:01+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:01+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=250&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=220&q=*:*&nr=10)
2014-01-13 19:53:01+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:01+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=240&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=190&q=*:*&nr=10)
2014-01-13 19:53:01+0100 [bcn] DEBUG: parse_item
2014-01-13 19:53:01+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=270&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=220&q=*:*&nr=10)
2014-01-13 19:53:01+0100 [bcn] DEBUG: parse_item
^C2014-01-13 19:53:01+0100 [scrapy] INFO: Received SIGINT, shutting down gracefully. Send again to force
2014-01-13 19:53:01+0100 [bcn] INFO: Closing spider (shutdown)
2014-01-13 19:53:02+0100 [bcn] DEBUG: Crawled (200) <GET http://guia.bcn.cat/index.php?pg=search&from=300&q=*:*&nr=10> (referer: http://guia.bcn.cat/index.php?pg=search&from=260&q=*:*&nr=10)
2014-01-13 19:53:02+0100 [bcn] DEBUG: parse_item
^C2014-01-13 19:53:02+0100 [scrapy] INFO: Received SIGINT twice, forcing unclean shutdown
2014-01-13 19:53:02+0100 [bcn] DEBUG: Retrying <GET http://guia.bcn.cat/index.php?pg=search&from=320&q=*:*&nr=10> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>, <twisted.python.failure.Failure <class 'twisted.web.http._DataLoss'>>]
2014-01-13 19:53:02+0100 [bcn] DEBUG: Retrying <GET http://guia.bcn.cat/index.php?pg=search&from=310&q=*:*&nr=10> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>, <twisted.python.failure.Failure <class 'twisted.web.http._DataLoss'>>]
2014-01-13 19:53:02+0100 [bcn] DEBUG: Retrying <GET http://guia.bcn.cat/index.php?pg=search&from=290&q=*:*&nr=10> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>, <twisted.python.failure.Failure <class 'twisted.web.http._DataLoss'>>]
2014-01-13 19:53:02+0100 [bcn] DEBUG: Retrying <GET http://guia.bcn.cat/index.php?pg=search&from=280&q=*:*&nr=10> (failed 1 times): [<twisted.python.failure.Failure <class 'twisted.internet.error.ConnectionLost'>>, <twisted.python.failure.Failure <class 'twisted.web.http._DataLoss'>>]
2014-01-13 19:53:02+0100 [bcn] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 4,
'downloader/exception_type_count/twisted.web._newclient.ResponseFailed': 4,
'downloader/request_bytes': 24602,
'downloader/request_count': 34,
'downloader/request_method_count/GET': 34,
'downloader/response_bytes': 1781320,
'downloader/response_count': 30,
'downloader/response_status_count/200': 30,
'finish_reason': 'shutdown',
'finish_time': datetime.datetime(2014, 1, 13, 18, 53, 2, 86554),
'log_count/DEBUG': 66,
'log_count/INFO': 9,
'request_depth_max': 8,
'response_received_count': 30,
'scheduler/dequeued': 34,
'scheduler/dequeued/memory': 34,
'scheduler/enqueued': 41,
'scheduler/enqueued/memory': 41,
'start_time': datetime.datetime(2014, 1, 13, 18, 52, 51, 141656)}
2014-01-13 19:53:02+0100 [bcn] INFO: Spider closed (shutdown)
^C
paul@wheezy:~/tmp/bcn/bcncat$ ^C