IPython IDE for Scrapy - GSOC Proposal

ScrapingHub - IPython IDE for Scrapy

Sub-org Info:

ScrapingHub - Scrapy

Student Info

Contributions to Scrapy

Pull Requests:

  1. #2682 Update Makefile to open webbrowser in MacOS
  2. #2643 Add feature to set RETRY_TIMES per request
  3. #2668 Add sphinx-rtd-theme to docs setup - README
  4. #2683 Remove __nonzero__ from SelectorList docs
  5. #2631 Add note for processing escaped URLs in scrapy shell
  6. #2685 Runspider - warn in case of multiple spiders

Issues reported:

  1. #2661 Scrapy docs: 'make htmlview' does not open the webbrowser
  2. #2624 Scrapy shell - bug in processing escaped URLs

GSoC Proposal Abstract

This project intends to make the development of Scrapy spiders easier by creating a user-friendly programming environment in the IPython (Jupyter) Notebook. This work involves two important tasks:

  1. Integrate the Scrapy-Twisted event loop with the Jupyter event loop.
  2. Develop XPath and CSS selector helpers for IPython.

Proposed Work

1. Integration of the Twisted-Jupyter event loop:

  • Motivation: Inability to use Scrapy as a library inside an IPython notebook.

  • Problem at hand:

    The Twisted event loop is not restartable, which limits Scrapy's functionality in IPython. Take a look at the sample code snippet below, which demonstrates the issue to be fixed.

    Running the code snippet below once does not cause any problems. However, running it again raises the error ReactorNotRestartable.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        link = 'https://scrapy.org/'
        yield scrapy.Request(url=link, callback=self.parse)

    def parse(self, response):
        pass

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
# start() starts the Twisted reactor, runs the spider and then stops the
# reactor; running this snippet a second time raises ReactorNotRestartable.
process.start()

scrapy.crawler.CrawlerProcess currently works as follows:

1. Starts a Twisted reactor
2. Runs the spider
3. Stops the Twisted reactor

And since the Twisted event loop is not restartable, running the above code snippet more than once raises the error ReactorNotRestartable.

Since scrapy.crawler.CrawlerProcess is the class used by all the Scrapy commands, fixing this requires a substantial change.
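
For reference, this limitation comes from Twisted itself and can be reproduced without Scrapy at all. The snippet below is a minimal sketch (plain Twisted, nothing Scrapy-specific) showing that the stopped global reactor refuses a second run():

from twisted.internet import reactor
from twisted.internet.error import ReactorNotRestartable

reactor.callLater(0, reactor.stop)
reactor.run()          # first run: starts and then cleanly stops the reactor

try:
    reactor.run()      # second run: the stopped global reactor cannot restart
except ReactorNotRestartable:
    print('The global Twisted reactor cannot be restarted.')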

  • Proposal: Use scrapy.crawler.CrawlerRunner to integrate the Scrapy-Twisted event loop with the IPython-Jupyter event loop. Unlike CrawlerProcess, CrawlerRunner does not start or interfere with existing reactors - take a look at the code sample below.

    This code snippet performs exactly the same job as ReactorNotRestartable.ipynb, but uses CrawlerRunner instead of CrawlerProcess, so we have control over starting and stopping the Twisted reactor.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        link = 'https://scrapy.org/'
        yield scrapy.Request(url=link, callback=self.parse)

    def parse(self, response):
        pass

configure_logging()
runner = CrawlerRunner()
# crawl() schedules the spider and returns a Deferred; the reactor is started
# and stopped explicitly, so the caller stays in control of it.
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
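
One possible direction for making the same pattern freely re-runnable inside notebook cells (an assumption of this proposal, not existing Scrapy behaviour) is to keep the reactor alive in a background thread, for example with the third-party crochet library. A minimal sketch:

import crochet
crochet.setup()  # start the Twisted reactor once, in a background thread

import scrapy
from scrapy.crawler import CrawlerRunner


class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        yield scrapy.Request(url='https://scrapy.org/', callback=self.parse)

    def parse(self, response):
        pass


@crochet.wait_for(timeout=60.0)
def run_spider():
    # CrawlerRunner.crawl() returns a Deferred; wait_for blocks the notebook
    # cell until it fires, without ever stopping the shared reactor.
    runner = CrawlerRunner()
    return runner.crawl(MySpider)


run_spider()  # can be called again from another cell without errors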

2. XPath and CSS Selector helpers/visualizers:

  • Motivation: Retrieving data using Selector functions could be made easier. Picking a valid selector argument should not be time-consuming.

  • Problem at hand: Currently, a valid selector path for extracting a set of data from a website is obtained by:

    1. Analyzing the underlying HTML
    2. Using browser extensions, etc. (Example)
  • Proposal: A visualization function is proposed. IPython's ability to display an HTML web page using _repr_html_ can be used to render the page and highlight the elements corresponding to the XPath/CSS selector argument of the function. (Highlighting would be done similar to how this looks.) A hypothetical sketch of such a helper follows this list.

  • Usage:

    1. The user runs a helper function, e.g. show_xpath('https://scrapy.org', '//li')
      • Examples: show_xpath(), show_css(), etc.
    2. The web page is visualized and the data to be extracted is highlighted
    3. The user can cross-check the validity of the selector path.
  • Proof of Concept: IPython Proof of Concept
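
The sketch below illustrates the idea. The show_xpath() name and signature come from this proposal; the implementation details (requests + lxml for fetching and highlighting, IPython.display for rendering) are only assumptions for illustration:

import requests
from lxml import etree, html
from IPython.display import HTML, display


def show_xpath(url, xpath, color='yellow'):
    """Render `url` inline in the notebook with all `xpath` matches highlighted."""
    tree = html.fromstring(requests.get(url).text)
    for node in tree.xpath(xpath):
        # Only element nodes can be styled; text/attribute results are skipped.
        if hasattr(node, 'set'):
            node.set('style', 'background-color: %s;' % color)
    # HTML objects implement _repr_html_, so the page is rendered in the cell.
    display(HTML(etree.tostring(tree, encoding='unicode')))


# Example usage: highlight every list item on the Scrapy homepage.
# show_xpath('https://scrapy.org', '//li')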

Proposed Timeline:

Until May 4: Accepted student proposals are announced

  • Read and understand the code to get familiar with all of the Scrapy components. Fix bugs, submit patches, and understand the structure of the code from the lowest level possible.
  • Improve and fix issues in the Parsel library and understand its implementation as well.
  • Obtain a deeper understanding of Twisted - submit patches to fix existing issues relating to Scrapy-Twisted. Experiment with its event loop integration tasks.
  • Respond quickly to Scrapy issues reported by developers around the globe. Remain active with the Scrapy community on Stack Overflow. This would be the most effective way to enhance my understanding of Scrapy.

May 5 - May 30: Community Bonding Period

  • Regular discussions with Mentors and Community members regarding the project.
  • Create a detailed day-to-day schedule after consulting with the mentors about the internals of the project.
  • Respond promptly to Scrapy issues on GitHub and Stack Overflow.

May 30 (Coding begins) - June 18:

  • Twisted-Jupyter event loop integration

June 19 - June 26 (Code Review):

  • Get the code reviewed by mentors and improve it as suggested.
  • Submit code for official review on June 26.

July 1 - July 16:

  • Begin working on the XPath and CSS Selector helpers.
  • Implement modules in Parsel and improve code coverage.

July 17 - July 24 (Code Review):

  • Get the code reviewed by mentors and improve it as suggested.
  • Submit code for official review on July 24.

July 28 - August 6:

  • Incorporate code reviews from the community and implement all the requested changes.
  • Improve the tests and document the code.

August 6 - August 21:

  • Buffer period to fix additional bugs and implement changes wherever necessary.

Other Commitments:

Do you have any other commitments during the main GSoC time period (Classes/Internships etc.)?

  • No

Have you applied with any other organizations?

  • No

NOTE:

  • I look forward to completing this project well ahead of schedule. In case additional time is left over before the end of the project, I shall proceed with implementing additional features for Scrapy. Moreover, the scope for using IPython and Scrapy together is huge; hence, I will work on additional functions/modules to improve the user experience and hopefully contribute for years to come.

Extra Information:

University Info:

  • University: National Institute of Technology, Tiruchirappalli, India
  • Major: Electrical and Electronics Engineering
  • Year: Senior
  • Graduation: Aug 2017 (Expected)

Other Contact Info:
