http://instagram.com/barackobama
https://www.facebook.com/ScrapingHub
paul@paul-SATELLITE-R830:~/dev/scrapy$ scrapy shell "https://careers-meridianhealth.icims.com/jobs/search?hashed=0&searchCategory=&searchLocation=&ss=1"
2014-06-19 12:18:12+0200 [scrapy] INFO: Scrapy 0.22.2 started (bot: scrapybot)
2014-06-19 12:18:12+0200 [scrapy] INFO: Optional features available: ssl, http11, boto
2014-06-19 12:18:12+0200 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-06-19 12:18:12+0200 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-06-19 12:18:13+0200 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-06-19 12:18:
@redapple
redapple / schema_org.py
Created June 19, 2014 09:05
schema.org with XPath blog post
from scrapy.selector import Selector
selector = Selector(text="""
<div itemscope itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>""", type="html")
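As a rough illustration of what the Movie markup above is meant for, here is a minimal sketch that collects every `itemprop` value with an XPath-style query. It is not the gist's own code: it swaps Scrapy's `Selector` for the standard library's `xml.etree.ElementTree` so it runs without Scrapy installed, and it assumes `itemscope` is written as `itemscope=""` so the snippet parses as well-formed XML. The names `MARKUP` and `extract_itemprops` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Same Movie markup as above. Assumption of this sketch: itemscope is given
# an explicit empty value so the stdlib XML parser accepts the snippet.
MARKUP = """
<div itemscope="" itemtype="http://schema.org/Movie">
  <h1 itemprop="name">Avatar</h1>
  <span>Director: <span itemprop="director">James Cameron</span> (born August 16, 1954)</span>
  <span itemprop="genre">Science fiction</span>
  <a href="../movies/avatar-theatrical-trailer.html" itemprop="trailer">Trailer</a>
</div>"""


def extract_itemprops(markup):
    """Map each itemprop name to the text of the element that carries it."""
    root = ET.fromstring(markup)
    props = {}
    # ".//*[@itemprop]" selects every descendant element with an itemprop attribute.
    for element in root.iterfind(".//*[@itemprop]"):
        props[element.get("itemprop")] = "".join(element.itertext()).strip()
    return props


movie = extract_itemprops(MARKUP)
print(movie)
```

With Scrapy available, the same `.//*[@itemprop]` expression could presumably be passed to the selector's `xpath()` method instead; that variant is not shown here.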
from scrapy.selector import Selector
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from bcncat.items import BcncatItem
import re


class BcnSpider(CrawlSpider):
    name = 'bcn'
    allowed_domains = ['guia.bcn.cat']
    start_urls = ['http://guia.bcn.cat/index.php?pg=search&q=*:*']
paul@wheezy:~/tmp/ProjetVinNicolas1$ scrapy crawl vino
2013-07-10 12:40:31+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: ProjetVinNicolas1)
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'ProjetVinNicolas1.spiders', 'SPIDER_MODULES': ['ProjetVinNicolas1.spiders'], 'BOT_NAME': 'ProjetVinNicolas1'}
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-10 12:40:31+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLe
redapple / console.log
Created July 1, 2013 11:12
Scrapy custom MediaPipeline - access Crawler in process_item()
paul@wheezy:~/tmp$ scrapy startproject custompipeline
paul@wheezy:~/tmp$ cd custompipeline/
paul@wheezy:~/tmp/custompipeline$ scrapy crawl custompipeline
2013-07-01 13:09:01+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: custompipeline)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'custompipeline.spiders', 'SPIDER_MODULES': ['custompipeline.spiders'], 'ITEM_PIPELINES': ['custompipeline.pipelines.CustomPipeline'], 'BOT_NAME': 'custompipeline'}
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware,
redapple / parslepy_xml.py
Last active December 19, 2015 03:39
Parsing XML with parslepy
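import lxml.etree
import parslepy
import urllib2  # Python 2 standard library
import pprint

# Use lxml's XML parser (not the HTML one) for the RSS/Atom feed
xml_parser = lxml.etree.XMLParser()
url = 'https://itunes.apple.com/us/rss/topalbums/limit=10/explicit=true/xml'
req = urllib2.Request(url)
# Fetch the feed and keep the root element of the parsed tree
root = lxml.etree.parse(urllib2.urlopen(req), parser=xml_parser).getroot()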