@redapple
redapple / parslepy_xml.py
Last active December 19, 2015 03:39
Parsing XML with parslepy
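import lxml.etree
import parslepy
import urllib2  # Python 2 only; on Python 3 the equivalent module is urllib.request
import pprint

# Fetch the iTunes "top albums" RSS feed and parse it into an lxml element tree
xml_parser = lxml.etree.XMLParser()
url = 'https://itunes.apple.com/us/rss/topalbums/limit=10/explicit=true/xml'
req = urllib2.Request(url)
root = lxml.etree.parse(urllib2.urlopen(req), parser=xml_parser).getroot()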
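The snippet above stops once the feed root is parsed. As a rough sketch of what a namespace-aware query on such a root can look like, the example below uses only the standard library's xml.etree.ElementTree rather than parslepy, and runs against a small inline document instead of the live feed. The sample feed, its titles, the "atom" prefix and the entry_titles helper are all assumptions made for illustration; only the Atom namespace URI is standard.

```python
import xml.etree.ElementTree as ET

# Hypothetical, trimmed-down Atom document standing in for the real feed.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Sample feed</title>
  <entry><title>First album</title></entry>
  <entry><title>Second album</title></entry>
</feed>"""

# Prefix-to-URI map; the prefix name "atom" is our own choice,
# only the URI has to match the document.
NAMESPACES = {"atom": "http://www.w3.org/2005/Atom"}


def entry_titles(root):
    """Return the text of every entry title under an Atom feed root."""
    return [
        title.text
        for title in root.findall("atom:entry/atom:title", NAMESPACES)
    ]


root = ET.fromstring(SAMPLE_FEED)
titles = entry_titles(root)
print(titles)

# Without the namespace map the same path matches nothing, because every
# element in the feed lives in the Atom namespace.
unqualified = root.findall("entry/title")
print(unqualified)
```

lxml.etree elements accept the same findall(path, namespaces=...) call, so the helper should carry over to the root built in the gist, provided the real feed uses the Atom namespace as assumed here.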
redapple / console.log
Created July 1, 2013 11:12
Scrapy custom MediaPipeline - access Crawler in process_item()
paul@wheezy:~/tmp$ scrapy startproject custompipeline
paul@wheezy:~/tmp$ cd custompipeline/
paul@wheezy:~/tmp/custompipeline$ scrapy crawl custompipeline
2013-07-01 13:09:01+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: custompipeline)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'custompipeline.spiders', 'SPIDER_MODULES': ['custompipeline.spiders'], 'ITEM_PIPELINES': ['custompipeline.pipelines.CustomPipeline'], 'BOT_NAME': 'custompipeline'}
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware,
paul@wheezy:~/tmp/ProjetVinNicolas1$ scrapy crawl vino
2013-07-10 12:40:31+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: ProjetVinNicolas1)
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'ProjetVinNicolas1.spiders', 'SPIDER_MODULES': ['ProjetVinNicolas1.spiders'], 'BOT_NAME': 'ProjetVinNicolas1'}
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLe
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re
class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
redapple / schema_org.py
Created June 19, 2014 09:05
schema.org with XPath blog post
from scrapy.selector import Selector
selector = Selector(text="""
<div itemscope itemtype ="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>""", type="html")
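One way to pull the microdata properties out of markup like this is to select every element carrying an itemprop attribute. The sketch below is not the blog post's code: it swaps Scrapy's Selector for the standard library's xml.etree.ElementTree so it runs without dependencies, and it therefore assumes the markup is rewritten as well-formed XML (itemscope=""). The microdata_properties helper is a hypothetical name. With the gist's selector, a comparable XPath such as //*[@itemprop] would be passed to selector.xpath().

```python
import xml.etree.ElementTree as ET

# Same movie markup as above, made well-formed XML for ElementTree
# (itemscope="" instead of the bare HTML attribute): an assumption of this sketch.
MARKUP = """
<div itemscope="" itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>"""


def microdata_properties(scope):
    """Map each itemprop name under `scope` to its value.

    Links contribute their href; every other element contributes its text.
    """
    props = {}
    for element in scope.findall(".//*[@itemprop]"):
        if element.tag == "a":
            value = element.get("href")
        else:
            value = "".join(element.itertext()).strip()
        props[element.get("itemprop")] = value
    return props


scope = ET.fromstring(MARKUP)
item_type = scope.get("itemtype")
properties = microdata_properties(scope)
print(item_type)
print(properties)
```

The helper does not handle nested itemscope elements or repeated property names; it only illustrates the attribute-based selection.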
paul@paul-SATELLITE-R830:~/dev/scrapy$ scrapy shell "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1"
2014-06-19 12:18:12+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-19 12:18:12+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-06-19 12:18:12+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-06-19 12:18:12+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-19 12:18:
http://instagram.com/barackobama
https://www.facebook.com/ScrapingHub
redapple / scrapyshell
Created December 30, 2014 14:25
YouTube js2xml
$ scrapy shell "https://www.youtube.com/watch?v=1EFnX1UkXVU"
/usr/local/lib/python2.7/dist-packages/twisted/internet/_sslverify.py:184: UserWarning: You do not have the service_identity module installed. Please install it from <https://pypi.python.org/pypi/service_identity>. Without the service_identity module and a recent enough pyOpenSSL tosupport it, Twisted can perform only rudimentary TLS client hostnameverification. Many valid certificate/hostname mappings may be rejected.
verifyHostname, VerificationError = _selectVerifyImplementation()
2014-12-30 15:18:08+0100 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-12-30 15:18:08+0100 [scrapy] INFO: Optional features available: ssl, http11, boto, django
2014-12-30 15:18:08+0100 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-12-30 15:18:08+0100 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-12-30 15:18:08+0100 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddlewa
$ scrapy shell
2016-01-28 18:21:43 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: scrapybot)
2016-01-28 18:21:43 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-01-28 18:21:43 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-01-28 18:21:44 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
$ scrapy shell
2016-02-01 12:41:35 [scrapy] INFO: Scrapy 1.1.0dev1 started (bot: scrapybot)
2016-02-01 12:41:35 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0, 'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter'}
2016-02-01 12:41:35 [scrapy] INFO: Enabled extensions:
['scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.corestats.CoreStats']
2016-02-01 12:41:35 [scrapy] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',