import lxml.etree
import parslepy
import urllib2
import pprint
xml_parser = lxml.etree.XMLParser()
url = 'https://itunes.apple.com/us/rss/topalbums/limit=10/explicit=true/xml'
req = urllib2.Request(url)
root = lxml.etree.parse(urllib2.urlopen(req), parser=xml_parser).getroot()
paul@wheezy:~/tmp$ scrapy startproject custompipeline
paul@wheezy:~/tmp$ cd custompipeline/
paul@wheezy:~/tmp/custompipeline$ scrapy crawl custompipeline
2013-07-01 13:09:01+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: custompipeline)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'custompipeline.spiders', 'SPIDER_MODULES': ['custompipeline.spiders'], 'ITEM_PIPELINES': ['custompipeline.pipelines.CustomPipeline'], 'BOT_NAME': 'custompipeline'}
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware,
paul@wheezy:~/tmp/ProjetVinNicolas1$ scrapy crawl vino
2013-07-10 12:40:31+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: ProjetVinNicolas1)
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'ProjetVinNicolas1.spiders', 'SPIDER_MODULES': ['ProjetVinNicolas1.spiders'], 'BOT_NAME': 'ProjetVinNicolas1'}
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLe
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re

class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
from scrapy.selector import Selector
selector = Selector(text="""
<div itemscope itemtype ="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>""", type="html")
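The snippet above stops after building the selector, so here is a rough sketch of what pulling the `itemprop` values out of that same markup could look like. It deliberately does not use Scrapy: it swaps in Python 3's standard-library `html.parser` so it runs without extra dependencies. The `ItempropCollector` class and the `SAMPLE` name are hypothetical helpers introduced for illustration, not part of the original gist.

```python
from html.parser import HTMLParser

# Same microdata markup as in the snippet above.
SAMPLE = """
<div itemscope itemtype ="http://schema.org/Movie">
<h1 itemprop="name">Avatar</h1>
<span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
<span itemprop="genre">Science fiction</span>
<a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>"""


class ItempropCollector(HTMLParser):
    """Hypothetical helper: maps each itemprop name to its element's direct text."""

    def __init__(self):
        super().__init__()
        self.props = {}
        # One entry per currently open element: its itemprop name, or None.
        # Assumption: the markup has no void elements (<br>, <img>, ...),
        # which would never be popped and would unbalance this stack.
        self._stack = []

    def handle_starttag(self, tag, attrs):
        self._stack.append(dict(attrs).get("itemprop"))

    def handle_endtag(self, tag):
        if self._stack:
            self._stack.pop()

    def handle_data(self, data):
        # Only keep text whose innermost open element carries an itemprop.
        if self._stack and self._stack[-1]:
            name = self._stack[-1]
            self.props[name] = self.props.get(name, "") + data


collector = ItempropCollector()
collector.feed(SAMPLE)
collector.close()
props = collector.props
print(props)
```

With Scrapy itself, the equivalent lookup would more likely be an XPath or CSS query on `selector` for elements that have an `itemprop` attribute; the stack-based parser here is only a dependency-free stand-in.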
paul@paul-SATELLITE-R830:~/dev/scrapy$ scrapy shell "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1"
2014-06-19 12:18:12+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-19 12:18:12+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-06-19 12:18:12+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-06-19 12:18:12+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-19 12:18:
http://instagram.com/barackobama
https://www.facebook.com/ScrapingHub
$ scrapy shell "https://www.youtube.com/watch?v=1EFnX1UkXVU"
/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL tosupport it, Twisted can perform only rudimentary TLS client hostnameverification. Many valid certificate/hostname mappings may be rejected.
  verifyHostname, VerificationError = _selectVerifyImplementation()
2014-12-30 15:18:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-12-30 15:18:08+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-12-30 15:18:08+0100 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-12-30 15:18:08+0100 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-30 15:18:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddlewa
$ scrapy shell
2016-01-28 18:21:43 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: scrapybot)
2016-01-28 18:21:43 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-01-28 18:21:43 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-01-28 18:21:44 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
$ scrapy shell
2016-02-01 12:41:35 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: scrapybot)
2016-02-01 12:41:35 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-02-01 12:41:35 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2016-02-01 12:41:35 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',