Created
July 1, 2013 11:12
Scrapy custom MediaPipeline - access Crawler in process_item()
paul@wheezy:~/tmp$ scrapy startproject custompipeline
paul@wheezy:~/tmp$ cd custompipeline/
paul@wheezy:~/tmp/custompipeline$ scrapy crawl custompipeline
2013-07-01 13:09:01+0200 [scrapy] INFO: Scrapy 0.17.0 started (bot: custompipeline)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Optional features available: ssl, django, http11, boto, libxml2
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Overridden settings: {'NEWSPIDER_MODULE': 'custompipeline.spiders', 'SPIDER_MODULES': ['custompipeline.spiders'], 'ITEM_PIPELINES': ['custompipeline.pipelines.CustomPipeline'], 'BOT_NAME': 'custompipeline'}
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Enabled item pipelines: CustomPipeline
2013-07-01 13:09:01+0200 [custompipeline] INFO: Spider opened
2013-07-01 13:09:01+0200 [custompipeline] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2013-07-01 13:09:01+0200 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n '], 'link': [u'/'], 'title': [u'Top']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [], 'link': [u'/Computers/'], 'title': [u'Computers']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [], 'link': [u'/Computers/Programming/'], 'title': [u'Programming']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [], | |
'link': [u'/Computers/Programming/Languages/'], | |
'title': [u'Languages']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [], | |
'link': [u'/Computers/Programming/Languages/Python/'], | |
'title': [u'Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n \t', u'\xa0', u'\n '], | |
'link': [], | |
'title': []} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n ', | |
u' \n ', | |
u'\n '], | |
'link': [u'/Computers/Programming/Languages/Python/Resources/'], | |
'title': [u'Computers: Programming: Languages: Python: Resources']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n ', | |
u' \n ', | |
u'\n '], | |
'link': [u'/Computers/Programming/Languages/Ruby/Books/'], | |
'title': [u'Computers: Programming: Languages: Ruby: Books']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n \t', | |
u'\n ', | |
u'\n\t\t\t\t\t'], | |
'link': [u'/World/Deutsch/Computer/Programmieren/Sprachen/Python/B%C3%BCcher/'], | |
'title': [u'German']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Guido van Rossum, Fred L. Drake, Jr.; Network Theory Ltd., 2003, ISBN 0954161769. Printed edition of official tutorial, for v2.x, from Python.org. [Network Theory, online]\n \n '], | |
'link': [u'http://www.network-theory.co.uk/python/intro/'], | |
'title': [u'An Introduction to Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Wesley J. Chun; Prentice Hall PTR, 2001, ISBN 0130260363. For experienced developers to improve extant skills; professional level examples. Starts by introducing syntax, objects, error handling, functions, classes, built-ins. [Prentice Hall]\n \n '], | |
'link': [u'http://www.pearsonhighered.com/educator/academic/product/0,,0130260363,00%2Ben-USS_01DBC.html'], | |
'title': [u'Core Python Programming']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - The primary goal of this book is to promote object-oriented design using Python and to illustrate the use of the emerging object-oriented design patterns.\r\nA secondary goal of the book is to present mathematical tools just in time. Analysis techniques and proofs are presented as needed and in the proper context.\n \n '], | |
'link': [u'http://www.brpreiss.com/books/opus7/html/book.html'], | |
'title': [u'Data Structures and Algorithms with Object-Oriented Design Patterns in Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Mark Pilgrim, Guide to Python 3 and its differences from Python 2. Each chapter starts with a real code sample and explains it fully. Has a comprehensive appendix of all the syntactic and semantic changes in Python 3\r\n\r\n\n \n '], | |
'link': [u'http://www.diveintopython.net/'], | |
'title': [u'Dive Into Python 3']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - This book covers a wide range of topics. From raw TCP and UDP to encryption with TSL, and then to HTTP, SMTP, POP, IMAP, and ssh. It gives you a good understanding of each field and how to do everything on the network with Python.\n \n '], | |
'link': [u'http://rhodesmill.org/brandon/2011/foundations-of-python-network-programming/'], | |
'title': [u'Foundations of Python Network Programming']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - Free Python books and tutorials.\n \n '], | |
'link': [u'http://www.techbooksforfree.com/perlpython.shtml'], | |
'title': [u'Free Python books']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - Annotated list of free online books on Python scripting language. Topics range from beginner to advanced.\n \n '], | |
'link': [u'http://www.freetechbooks.com/python-f6.html'], | |
'title': [u'FreeTechBooks: Python Scripting Language']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Allen B. Downey, Jeffrey Elkner, Chris Meyers; Green Tea Press, 2002, ISBN 0971677506. Teaches general principles of programming, via Python as subject language. Thorough, in-depth approach to many basic and intermediate programming topics. Full text online and downloads: HTML, PDF, PS, LaTeX. [Free, Green Tea Press]\n \n '], | |
'link': [u'http://greenteapress.com/thinkpython/'], | |
'title': [u'How to Think Like a Computer Scientist: Learning with Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - Book by Alan Gauld with full text online. Introduction for those learning programming basics: terminology, concepts, methods to write code. Assumes no prior knowledge but basic computer skills.\n \n '], | |
'link': [u'http://www.freenetpages.co.uk/hp/alan.gauld/'], | |
'title': [u'Learn to Program Using Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Rashi Gupta; John Wiley and Sons, 2002, ISBN 0471219754. Covers language basics, use for CGI scripting, GUI development, network programming; shows why it is one of more sophisticated of popular scripting languages. [Wiley]\n \n '], | |
'link': [u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0471219754.html'], | |
'title': [u'Making Use of Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Magnus Lie Hetland; Apress LP, 2002, ISBN 1590590066. Readable guide to ideas most vital to new users, from basics common to high level languages, to more specific aspects, to a series of 10 ever more complex programs. [Apress]\n \n '], | |
'link': [u'http://hetland.org/writing/practical-python/'], | |
'title': [u'Practical Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Rytis Sileika, ISBN13: 978-1-4302-2605-5, Uses real-world system administration examples like manage devices with SNMP and SOAP, build a distributed monitoring system, manage web applications and parse complex log files, monitor and manage MySQL databases.\r\n\n \n '], | |
'link': [u'http://www.sysadminpy.com/'], | |
'title': [u'Pro Python System Administration']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - A Complete Introduction to the Python 3.\n \n '], | |
'link': [u'http://www.qtrac.eu/py3book.html'], | |
'title': [u'Programming in Python 3 (Second Edition)']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Dave Brueck, Stephen Tanner; John Wiley and Sons, 2001, ISBN 0764548077. Full coverage, clear explanations, hands-on examples, full language reference; shows step by step how to use components, assemble them, form full-featured programs. [John Wiley and Sons]\n \n '], | |
'link': [u'http://www.wiley.com/WileyCDA/WileyTitle/productCd-0764548077.html'], | |
'title': [u'Python 2.1 Bible']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - A step-by-step tutorial for OOP in Python 3, including discussion and examples of abstraction, encapsulation, information hiding, and raise, handle, define, and manipulate exceptions.\n \n '], | |
'link': [u'https://www.packtpub.com/python-3-object-oriented-programming/book'], | |
'title': [u'Python 3 Object Oriented Programming']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Guido van Rossum, Fred L. Drake, Jr.; Network Theory Ltd., 2003, ISBN 0954161785. Printed edition of official language reference, for v2.x, from Python.org, describes syntax, built-in datatypes. [Network Theory, online]\n \n '], | |
'link': [u'http://www.network-theory.co.uk/python/language/'], | |
'title': [u'Python Language Reference Manual']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Thomas W. Christopher; Prentice Hall PTR, 2002, ISBN 0130409561. Shows how to write large programs, introduces powerful design patterns that deliver high levels of robustness, scalability, reuse.\n \n '], | |
'link': [u'http://www.pearsonhighered.com/educator/academic/product/0,,0130409561,00%2Ben-USS_01DBC.html'], | |
'title': [u'Python Programming Patterns']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u" \n\t\t\t\n\t\t\t\t\t\n - By Richard Hightower; Addison-Wesley, 2002, 0201616165. Begins with Python basics, many exercises, interactive sessions. Shows programming novices concepts and practical methods. Shows programming experts Python's abilities and ways to interface with Java APIs. [publisher website]\n \n "], | |
'link': [u'http://www.informit.com/store/product.aspx?isbn=0201616165&redir=1'], | |
'title': [u'Python Programming with the Java Class Libraries: A Tutorial for Building Web and Enterprise Applications with Jython']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Chris Fehily; Peachpit Press, 2002, ISBN 0201748843. Task-based, step-by-step visual reference guide, many screen shots, for courses in digital graphics; Web design, scripting, development; multimedia, page layout, office tools, operating systems. [Prentice Hall]\n \n '], | |
'link': [u'http://www.pearsonhighered.com/educator/academic/product/0,,0201748843,00%2Ben-USS_01DBC.html'], | |
'title': [u'Python: Visual QuickStart Guide']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Ivan Van Laningham; Sams Publishing, 2000, ISBN 0672317354. Split into 24 hands-on, 1 hour lessons; steps needed to learn topic: syntax, language features, OO design and programming, GUIs (Tkinter), system administration, CGI. [Sams Publishing]\n \n '], | |
'link': [u'http://www.informit.com/store/product.aspx?isbn=0672317354'], | |
'title': [u'Sams Teach Yourself Python in 24 Hours']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By David Mertz; Addison Wesley. Book in progress, full text, ASCII format. Asks for feedback. [author website, Gnosis Software, Inc.]\n \n '], | |
'link': [u'http://gnosis.cx/TPiP/'], | |
'title': [u'Text Processing in Python']} | |
CustomPipeline::process_item() | |
<scrapy.crawler.CrawlerProcess object at 0x1eb2bd0> | |
<scrapy.statscol.MemoryStatsCollector object at 0x1eb2ed0> | |
2013-07-01 13:09:02+0200 [custompipeline] DEBUG: Scraped from <200 http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> | |
{'desc': [u'\n\t\t\t\n \t', | |
u' \n\t\t\t\n\t\t\t\t\t\n - By Sean McGrath; Prentice Hall PTR, 2000, ISBN 0130211192, has CD-ROM. Methods to build XML applications fast, Python tutorial, DOM and SAX, new Pyxie open source XML processing library. [Prentice Hall PTR]\n \n '], | |
'link': [u'http://www.informit.com/store/product.aspx?isbn=0130211192'], | |
'title': [u'XML Processing with Python']} | |
2013-07-01 13:09:02+0200 [custompipeline] INFO: Closing spider (finished)
2013-07-01 13:09:02+0200 [custompipeline] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 263,
 'downloader/request_count': 1,
 'downloader/request_method_count/GET': 1,
 'downloader/response_bytes': 8006,
 'downloader/response_count': 1,
 'downloader/response_status_count/200': 1,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2013, 7, 1, 11, 9, 2, 460078),
 'item_scraped_count': 31,
 'log_count/DEBUG': 40,
 'log_count/INFO': 4,
 'response_received_count': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'start_time': datetime.datetime(2013, 7, 1, 11, 9, 1, 882702)}
2013-07-01 13:09:02+0200 [custompipeline] INFO: Spider closed (finished)
paul@wheezy:~/tmp/custompipeline$
# Spider for the custompipeline project (lives in the custompipeline.spiders package)
from scrapy.spider import BaseSpider
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector


class MyItem(Item):
    title = Field()
    link = Field()
    desc = Field()


class CustomPipelineSpider(BaseSpider):
    name = "custompipeline"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        #"http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Build one item per <li> found inside any <ul> on the page
        hxs = HtmlXPathSelector(response)
        sites = hxs.select('//ul/li')
        items = []
        for site in sites:
            item = MyItem()
            item['title'] = site.select('a/text()').extract()
            item['link'] = site.select('a/@href').extract()
            item['desc'] = site.select('text()').extract()
            items.append(item)
        return items
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.contrib.pipeline.media import MediaPipeline


class CustomPipeline(MediaPipeline):

    def process_item(self, item, spider):
        # MediaPipeline.from_crawler() stores the crawler on the instance,
        # so it is reachable here as self.crawler
        print "%s::process_item()" % self.__class__.__name__
        print self.crawler
        print self.crawler.stats
        return item
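The pipeline above can print `self.crawler` because Scrapy builds a pipeline through its `from_crawler` class method when one is defined, and `MediaPipeline` uses that hook to keep a reference to the crawler. The following is only a minimal sketch of that pattern in plain Python, without Scrapy installed: `FakeCrawler`, `CrawlerAwarePipeline` and the `custom/items_seen` key are hypothetical names made up for illustration, and the real stats collector is an object with its own methods rather than a plain dict.

```python
class FakeCrawler(object):
    """Hypothetical stand-in for Scrapy's crawler; it only carries stats."""

    def __init__(self):
        # Assumption: a plain dict replaces the real stats collector here.
        self.stats = {}


class CrawlerAwarePipeline(object):
    """Pipeline that remembers the crawler it was built from."""

    @classmethod
    def from_crawler(cls, crawler):
        # Same idea as MediaPipeline: create the instance, then attach
        # the crawler so later method calls can reach it.
        pipe = cls()
        pipe.crawler = crawler
        return pipe

    def process_item(self, item, spider):
        # Count every item that passes through, using the shared stats.
        key = "custom/items_seen"  # hypothetical stat name
        self.crawler.stats[key] = self.crawler.stats.get(key, 0) + 1
        return item


crawler = FakeCrawler()
pipeline = CrawlerAwarePipeline.from_crawler(crawler)
for item in ({"title": "Top"}, {"title": "Computers"}):
    pipeline.process_item(item, spider=None)
print(crawler.stats)
```

Inside a real project the same `process_item` body would go through the stats collector's own API instead of dict access, but the way the crawler reaches the pipeline is the part this sketch is meant to show.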
# Scrapy settings for custompipeline project
#
# For simplicity, this file contains only the most important settings by
# default. All the other settings are documented here:
#
# http://doc.scrapy.org/en/latest/topics/settings.html
#
BOT_NAME = 'custompipeline'
SPIDER_MODULES = ['custompipeline.spiders']
NEWSPIDER_MODULE = 'custompipeline.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'custompipeline (+http://www.yourdomain.com)'
ITEM_PIPELINES = ['custompipeline.pipelines.CustomPipeline']
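The list form of `ITEM_PIPELINES` matches the Scrapy 0.17.0 release shown in the log. Later Scrapy releases expect a dict that maps each pipeline path to an integer order, with lower values running first. A possible equivalent for such a release is sketched below; the value 300 is an arbitrary assumption, not something taken from this project.

```python
# Hypothetical settings.py fragment for a newer Scrapy release.
# The integer sets the pipeline order; 300 is an arbitrary choice here.
ITEM_PIPELINES = {
    'custompipeline.pipelines.CustomPipeline': 300,
}
```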