ScrapingHub - Scrapy
- Name: Harshavardhan Srinivas
- Website: harshasrinivas.me
- Email: harshsrinivas@gmail.com
- GitHub: @harshasrinivas
- Resume: Link to my resume
- Time Zone: UTC+5:30
- #2682 Update Makefile to open webbrowser in MacOS
- #2643 Add feature to set RETRY_TIMES per request
- #2668 Add sphinx-rtd-theme to docs setup - README
- #2683 Remove __nonzero__ from SelectorList docs
- #2631 Add note for processing escaped URLs in scrapy shell
- #2685 Runspider - warn in case of multiple spiders
- #2661 Scrapy docs: 'make htmlview' does not open the webbrowser
- #2624 Scrapy shell - bug in processing escaped URLs
This project intends to make the development of Scrapy spiders easier by creating a user-friendly programming environment on IPython (Jupyter) Notebook. This work involves two important tasks:
- Integration of the Scrapy-Twisted event loop with the Jupyter event loop.
- Development of XPath and CSS Selector helpers for IPython.
-
Motivation: Inability to use Scrapy as a library inside IPython notebook.
-
Problem at hand:
The Twisted event loop is not restartable, which limits Scrapy's functionality in IPython. The sample code snippet below demonstrates the issue to be fixed.
Running the snippet once does not cause any problems. However, running it again gives rise to the error ReactorNotRestartable.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess


class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        link = 'https://scrapy.org/'
        yield scrapy.Request(url=link, callback=self.parse)

    def parse(self, response):
        pass


process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
process.start()
scrapy.crawler.CrawlerProcess currently works as follows:
1. Starts a Twisted reactor
2. Runs the spider
3. Stops the Twisted reactor
Since the Twisted event loop is not restartable, running the above code snippet more than once gives rise to the error ReactorNotRestartable.
Since scrapy.crawler.CrawlerProcess is the class used by all the Scrapy commands, this requires a major fix.
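The three-step lifecycle above can be modelled with a few lines of plain Python. This is only a toy sketch: ToyReactor and ToyCrawlerProcess are hypothetical stand-ins written for illustration, not Twisted's or Scrapy's actual classes, and the exception merely mirrors the name of twisted.internet.error.ReactorNotRestartable. It shows why a second run in the same kernel fails when one reactor is shared by the whole Python process.

```python
class ReactorNotRestartable(Exception):
    """Stand-in for twisted.internet.error.ReactorNotRestartable."""


class ToyReactor:
    """Hypothetical model of a reactor that may be run only once."""

    def __init__(self):
        self._started_once = False
        self.running = False

    def run(self):
        if self._started_once:
            raise ReactorNotRestartable()
        self._started_once = True
        self.running = True

    def stop(self):
        self.running = False


class ToyCrawlerProcess:
    """Hypothetical model of the start/run/stop cycle described above."""

    def __init__(self, reactor):
        self.reactor = reactor
        self.crawled = []

    def crawl(self, spider_name):
        self.crawled.append(spider_name)

    def start(self):
        self.reactor.run()   # 1. start the reactor
        # 2. the scheduled spiders would run here
        self.reactor.stop()  # 3. stop the reactor


# One reactor is shared by the whole process (here: the notebook kernel).
shared_reactor = ToyReactor()


def run_cell():
    """Model of executing the notebook cell once."""
    process = ToyCrawlerProcess(shared_reactor)
    process.crawl('scrapy')
    process.start()


run_cell()  # first execution succeeds
try:
    run_cell()  # second execution hits the already-used reactor
except ReactorNotRestartable:
    second_run_failed = True
else:
    second_run_failed = False
```

In this model, creating a new ToyCrawlerProcess for each cell does not help, because the restriction lives in the shared reactor rather than in the process object.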
-
Proposal: Use scrapy.crawler.CrawlerRunner to integrate the Scrapy-Twisted event loop with the IPython-Jupyter event loop, because CrawlerRunner won't start or interfere with existing reactors. The code sample below performs exactly the same job as ReactorNotRestartable.ipynb. However, it uses CrawlerRunner instead of CrawlerProcess, so we have control over starting and stopping the Twisted reactor.
# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from scrapy.utils.log import configure_logging


class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        link = 'https://scrapy.org/'
        yield scrapy.Request(url=link, callback=self.parse)

    def parse(self, response):
        pass


configure_logging()
runner = CrawlerRunner()
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
-
Implementation:
-
Motivation: Retrieving data using Selector functions could be made easier. Picking a valid selector argument should not be time-consuming.
-
Problem at hand: Currently, a valid selector path for getting a set of data from a website is obtained using:
- Analysis of underlying HTML
- Using browser extensions etc. (Example)
-
Proposal: A visualization function is proposed. IPython's ability to display an HTML web page using _repr_html_ can be used to display and highlight the elements corresponding to the XPath/CSS Selector argument of the function. (Highlighting would be done similar to how this looks)
-
Usage:
- User runs a helper function (show_xpath(), show_css() etc.)
- Example: show_xpath('https://scrapy.org', '//li')
- Web page is visualized - data to be extracted is highlighted
- User can cross-check the validity of the selector path.
-
Proof of Concept: IPython Proof of Concept
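Separately from the linked proof of concept, the highlighting idea can be sketched with the standard library alone. This is a minimal illustration under stated assumptions, not the proposed implementation: show_xpath_sketch and HighlightedPage are hypothetical names, the function takes a well-formed markup string instead of fetching a URL, and xml.etree.ElementTree supports only a limited XPath subset (hence './/li' rather than '//li'). A real helper would more likely build on Parsel selectors.

```python
import xml.etree.ElementTree as ET

# Inline CSS applied to every element matched by the selector (assumed styling).
HIGHLIGHT_STYLE = "outline: 2px solid red; background-color: yellow;"


class HighlightedPage:
    """Holds highlighted markup; IPython renders it through _repr_html_."""

    def __init__(self, html, match_count):
        self.html = html
        self.match_count = match_count

    def _repr_html_(self):
        return self.html


def show_xpath_sketch(markup, xpath):
    """Highlight the elements of `markup` matched by `xpath`.

    `markup` must be well-formed (XML-parsable) and `xpath` must stay
    within the subset understood by ElementTree.
    """
    root = ET.fromstring(markup)
    matches = root.findall(xpath)
    for element in matches:
        # Keep any existing inline style and append the highlight.
        element.set("style", element.get("style", "") + HIGHLIGHT_STYLE)
    html = ET.tostring(root, encoding="unicode", method="html")
    return HighlightedPage(html, len(matches))


sample = (
    "<html><body>"
    "<ul><li>Docs</li><li>Blog</li></ul>"
    "<p>Footer</p>"
    "</body></html>"
)
page = show_xpath_sketch(sample, ".//li")
```

When `page` is the last expression in a notebook cell, IPython calls `_repr_html_` and displays the markup with the matched list items highlighted, so the user can check the selector visually.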
- Read and understand the code to get familiar with all of the Scrapy components. Fix bugs, submit patches, and understand the structure of the code from the lowest level possible.
- Fix issues in the Parsel library and understand its implementation as well.
- Obtain a deeper understanding of Twisted and submit patches to fix existing issues relating to Scrapy-Twisted. Experiment with its event loop integration tasks.
- Quickly respond to Scrapy issues reported by developers around the globe. Remain active with the Scrapy community on Stack Overflow. This would be the most effective way to enhance my understanding of Scrapy.
- Regular discussions with Mentors and Community members regarding the project.
- Create a detailed day-to-day schedule after consulting with the mentors about the internals of the project.
- Respond to Scrapy issues on GitHub and Stack Overflow promptly.
- Twisted-Jupyter event loop integration
- Get code reviewed by mentors and improve the code as suggested.
- Submit code for official review on May 26.
- Begin working on the XPath and CSS Selector helpers.
- Implement modules in Parsel, improve code coverage.
- Get code reviewed by mentors and improve the code as suggested.
- Submit code for official review on July 24.
- Get code reviewed by the community and implement all the changes requested.
- Improve the tests, document the code.
- Buffer period to fix additional bugs and implement changes wherever necessary.
- No
- No
- I look forward to completing this project well ahead of schedule. If additional time is left over before the end of the project, I shall proceed with the implementation of additional features for Scrapy. Moreover, the scope for using IPython and Scrapy together is huge. Hence, I will be working on additional functions/modules to improve the user experience and hopefully contribute for years to come.
- University: National Institute of Technology, Tiruchirappalli, India
- Major: Electrical and Electronics Engineering
- Year: Senior
- Graduation: Aug 2017 (Expected)
- Skype: Username - harshsrinivas
- Homepage: harshasrinivas.me
- Twitter: @harshasrinivas