@redapple
Created July 1, 2013 11:12
Scrapy custom MediaPipeline - access Crawler in process_item()
paul@wheezy:~/tmp$ scrapy startproject custompipeline
paul@wheezy:~/tmp$ cd custompipeline/
paul@wheezy:~/tmp/custompipeline$ scrapy crawl custompipeline
2013-07-01 13:09:01+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: custompipeline)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'custompipeline.spiders', 'SPIDER_MODULES': ['custompipeline.spiders'], 'ITEM_PIPELINES': ['custompipeline.pipelines.CustomPipeline'], 'BOT_NAME': 'custompipeline'}
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled item pipelines: CustomPipeline
2013-07-01 13:09:01+0200 [custompipeline] INFO: Spider opened
2013-07-01 13:09:01+0200 [custompipeline] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n '], 'link': [u'/'], 'title': [u'Top']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [], 'link': [u'/Computers/'], 'title': [u'Computers']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [], 'link': [u'/Computers/Programming/'], 'title': [u'Programming']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [],
'link': [u'/Computers/Programming/Languages/'],
'title': [u'Languages']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [],
'link': [u'/Computers/Programming/Languages/Python/'],
'title': [u'Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n \t', u'\xa0', u'\n '],
'link': [],
'title': []}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n ',
u' \n ',
u'\n '],
'link': [u'/Computers/Programming/Languages/Python/Resources/'],
'title': [u'Computers: Programming: Languages: Python: Resources']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n ',
u' \n ',
u'\n '],
'link': [u'/Computers/Programming/Languages/Ruby/Books/'],
'title': [u'Computers: Programming: Languages: Ruby: Books']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n \t',
u'\n ',
u'\n\t\t\t\t\t'],
'link': [u'/World/Deutsch/Computer/Programmieren/Sprachen/Python/B%C3%BCcher/'],
'title': [u'German']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Guido van Rossum, Fred L. Drake, Jr.; Network Theory Ltd., 2003, ISBN 0954161769. Printed edition of official tutorial, for v2.x, from Python.org. [Network Theory, online]\n \n '],
'link': [u'http://www.network-theory.co.uk/python/intro/'],
'title': [u'An Introduction to Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall]\n \n '],
'link': [u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html'],
'title': [u'Core Python Programming']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\n \n '],
'link': [u'http://www.brpreiss.com/books/opus7/html/book.html'],
'title': [u'Data Structures and Algorithms with Object-Oriented Design Patterns in Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Mark Pilgrim, Guide to Python 3 and its differences from Python 2. Each chapter starts with a real code sample and explains it fully. Has a comprehensive appendix of all the syntactic and semantic changes in Python 3\r\n\r\n\n \n '],
'link': [u'http://www.diveintopython.net/'],
'title': [u'Dive Into Python 3']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - This book covers a wide range of topics. From raw TCP and UDP to encryption with TSL, and then to HTTP, SMTP, POP, IMAP, and ssh. It gives you a good understanding of each field and how to do everything on the network with Python.\n \n '],
'link': [u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/'],
'title': [u'Foundations of Python Network Programming']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - Free Python books and tutorials.\n \n '],
'link': [u'http://www.techbooksforfree.com/perlpython.shtml'],
'title': [u'Free Python books']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - Annotated list of free online books on Python scripting language. Topics range from beginner to advanced.\n \n '],
'link': [u'http://www.freetechbooks.com/python-f6.html'],
'title': [u'FreeTechBooks: Python Scripting Language']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Allen B. Downey, Jeffrey Elkner, Chris Meyers; Green Tea Press, 2002, ISBN 0971677506. Teaches general principles of programming, via Python as subject language. Thorough, in-depth approach to many basic and intermediate programming topics. Full text online and downloads: HTML, PDF, PS, LaTeX. [Free, Green Tea Press]\n \n '],
'link': [u'http://greenteapress.com/thinkpython/'],
'title': [u'How to Think Like a Computer Scientist: Learning with Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - Book by Alan Gauld with full text online. Introduction for those learning programming basics: terminology, concepts, methods to write code. Assumes no prior knowledge but basic computer skills.\n \n '],
'link': [u'http://www.freenetpages.co.uk/hp/alan.gauld/'],
'title': [u'Learn to Program Using Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Rashi Gupta; John Wiley and Sons, 2002, ISBN 0471219754. Covers language basics, use for CGI scripting, GUI development, network programming; shows why it is one of more sophisticated of popular scripting languages. [Wiley]\n \n '],
'link': [u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html'],
'title': [u'Making Use of Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Magnus Lie Hetland; Apress LP, 2002, ISBN 1590590066. Readable guide to ideas most vital to new users, from basics common to high level languages, to more specific aspects, to a series of 10 ever more complex programs. [Apress]\n \n '],
'link': [u'http://hetland.org/writing/practical-python/'],
'title': [u'Practical Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Rytis Sileika, ISBN13: 978-1-4302-2605-5, Uses real-world system administration examples like manage devices with SNMP and SOAP, build a distributed monitoring system, manage web applications and parse complex log files, monitor and manage MySQL databases.\r\n\n \n '],
'link': [u'http://www.sysadminpy.com/'],
'title': [u'Pro Python System Administration']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - A Complete Introduction to the Python 3.\n \n '],
'link': [u'http://www.qtrac.eu/py3book.html'],
'title': [u'Programming in Python 3 (Second Edition)']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Dave Brueck, Stephen Tanner; John Wiley and Sons, 2001, ISBN 0764548077. Full coverage, clear explanations, hands-on examples, full language reference; shows step by step how to use components, assemble them, form full-featured programs. [John Wiley and Sons]\n \n '],
'link': [u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html'],
'title': [u'Python 2.1 Bible']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - A step-by-step tutorial for OOP in Python 3, including discussion and examples of abstraction, encapsulation, information hiding, and raise, handle, define, and manipulate exceptions.\n \n '],
'link': [u'https://www.packtpub.com/python-3-object-oriented-programming/book'],
'title': [u'Python 3 Object Oriented Programming']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Guido van Rossum, Fred L. Drake, Jr.; Network Theory Ltd., 2003, ISBN 0954161785. Printed edition of official language reference, for v2.x, from Python.org, describes syntax, built-in datatypes. [Network Theory, online]\n \n '],
'link': [u'http://www.network-theory.co.uk/python/language/'],
'title': [u'Python Language Reference Manual']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Thomas W. Christopher; Prentice Hall PTR, 2002, ISBN 0130409561. Shows how to write large programs, introduces powerful design patterns that deliver high levels of robustness, scalability, reuse.\n \n '],
'link': [u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00%2Ben-USS_01DBC.html'],
'title': [u'Python Programming Patterns']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u" \n\t\t\t\n\t\t\t\t\t\n - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website]\n \n "],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1'],
'title': [u'Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall]\n \n '],
'link': [u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html'],
'title': [u'Python: Visual QuickStart Guide']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing]\n \n '],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0672317354'],
'title': [u'Sams Teach Yourself Python in 24 Hours']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n \n '],
'link': [u'http://gnosis.cx/TPiP/'],
'title': [u'Text Processing in Python']}
CustomPipeline::process_item()
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0>
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0>
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/>
{'desc': [u'\n\t\t\t\n \t',
u' \n\t\t\t\n\t\t\t\t\t\n - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n \n '],
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'],
'title': [u'XML Processing with Python']}
2013-07-01 13:09:02+0200 [custompipeline] INFO: Closing spider (finished)
2013-07-01 13:09:02+0200 [custompipeline] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 263,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 8006,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2013, 7, 1, 11, 9, 2, 460078),
'item_scraped_count': 31,
'log_count/DEBUG': 40,
'log_count/INFO': 4,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2013, 7, 1, 11, 9, 1, 882702)}
2013-07-01 13:09:02+0200 [custompipeline] INFO: Spider closed (finished)
paul@wheezy:~/tmp/custompipeline$
from scrapy.spider import BaseSpider
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class MyItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class CustomPipelineSpider(BaseSpider):
    name = "custompipeline"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        #"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = MyItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.contrib.pipeline.media import MediaPipeline


class CustomPipeline(MediaPipeline):
    def process_item(self, item, spider):
        print "%s::process_item()" % self.__class__.__name__
        print self.crawler
        print self.crawler.stats
        return item
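The trick above works because MediaPipeline's `from_crawler` classmethod stores the crawler on the instance as `self.crawler`; a plain pipeline that doesn't need MediaPipeline's machinery can do the same wiring itself. Below is a minimal sketch of that pattern. It uses hypothetical stand-in classes (`StubStats`, `StubCrawler`) so it runs without Scrapy installed; in a real project `from_crawler` would be called by Scrapy with the actual `scrapy.crawler.Crawler`.

```python
class StubStats:
    """Stand-in for Scrapy's stats collector (e.g. MemoryStatsCollector)."""
    def __init__(self):
        self._stats = {}

    def inc_value(self, key, count=1):
        self._stats[key] = self._stats.get(key, 0) + count

    def get_value(self, key, default=None):
        return self._stats.get(key, default)


class StubCrawler:
    """Stand-in for scrapy.crawler.Crawler: exposes a .stats attribute."""
    def __init__(self):
        self.stats = StubStats()


class CrawlerAwarePipeline:
    """Pipeline that receives the crawler via the from_crawler hook,
    mirroring what MediaPipeline does internally."""
    def __init__(self, crawler):
        self.crawler = crawler

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this classmethod when instantiating the pipeline.
        return cls(crawler)

    def process_item(self, item, spider):
        # With crawler access, the pipeline can e.g. update crawl stats.
        self.crawler.stats.inc_value('custompipeline/items_seen')
        return item


crawler = StubCrawler()
pipeline = CrawlerAwarePipeline.from_crawler(crawler)
for item in [{'title': [u'Python']}, {'title': [u'Ruby']}]:
    pipeline.process_item(item, spider=None)
print(crawler.stats.get_value('custompipeline/items_seen'))  # 2
```

The advantage over subclassing MediaPipeline is that `process_item` keeps full control: MediaPipeline's own `process_item` normally schedules media requests, which this custom pipeline doesn't need.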
# Scrapy settings for custompipeline project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'custompipeline'
SPIDER_MODULES = ['custompipeline.spiders']
NEWSPIDER_MODULE = 'custompipeline.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'custompipeline (+http://www.yourdomain.com)'
ITEM_PIPELINES = ['custompipeline.pipelines.CustomPipeline']
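For reference, later Scrapy releases changed `ITEM_PIPELINES` from a list to a dict mapping each pipeline's import path to an order value (lower runs first, conventionally 0-1000). The equivalent of the setting above there would look like:

```python
# Dict form used by later Scrapy versions; 300 is an arbitrary middle priority.
ITEM_PIPELINES = {
    'custompipeline.pipelines.CustomPipeline': 300,
}
```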