IPython IDE for Scrapy - GSOC Proposal

ScrapingHub - IPython IDE for Scrapy

Sub-org Info:

ScrapingHub - Scrapy

Student Info

Contributions to Scrapy

Pull Requests:

  1. #2682 Update Makefile to open webbrowser in MacOS
  2. #2643 Add feature to set RETRY_TIMES per request
  3. #2668 Add sphinx-rtd-theme to docs setup - README
  4. #2683 Remove __nonzero__ from SelectorList docs
  5. #2631 Add note for processing escaped URLs in scrapy shell
  6. #2685 Runspider - warn in case of multiple spiders

Issues reported:

  1. #2661 Scrapy docs: 'make htmlview' does not open the webbrowser
  2. #2624 Scrapy shell - bug in processing escaped URLs

GSoC Proposal Abstract

This project intends to make the development of Scrapy spiders easier by creating a user-friendly programming environment in the IPython (Jupyter) Notebook. This work involves two important tasks:

  1. Integrate the Scrapy-Twisted event loop with the Jupyter event loop.
  2. Develop XPath and CSS selector helpers for IPython.

Proposed Work

1. Integration of the Twisted-Jupyter event loop:

  • Motivation: Inability to use Scrapy as a library inside an IPython notebook.

  • Problem at hand:

    The Twisted event loop is not restartable, which limits Scrapy's functionality in IPython. Take a look at the sample code snippet below, which demonstrates the issue to be fixed.

    Running the code snippet below once does not cause any problems. However, running it again raises the error ReactorNotRestartable.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        link = 'https://scrapy.org/'
        yield scrapy.Request(url=link, callback=self.parse)

    def parse(self, response):
        pass

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})
process.crawl(MySpider)
# start() starts the Twisted reactor, runs the spider and then stops the
# reactor; running this snippet a second time raises ReactorNotRestartable.
process.start()

scrapy.crawler.CrawlerProcess currently works as follows:

1. Starts a Twisted reactor
2. Runs the spider
3. Stops the Twisted reactor

And since the Twisted event loop is not restartable, running the above code snippet more than once raises the error ReactorNotRestartable.

Since scrapy.crawler.CrawlerProcess is the class used by all the Scrapy commands, fixing this requires a substantial change.
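
For reference, this limitation comes from Twisted itself and can be reproduced without Scrapy at all. The snippet below is a minimal sketch (plain Twisted, nothing Scrapy-specific) showing that the stopped global reactor refuses a second run():

from twisted.internet import reactor
from twisted.internet.error import ReactorNotRestartable

reactor.callLater(0, reactor.stop)
reactor.run()          # first run: starts and then cleanly stops the reactor

try:
    reactor.run()      # second run: the stopped global reactor cannot restart
except ReactorNotRestartable:
    print('The global Twisted reactor cannot be restarted.')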

  • Proposal: Use scrapy.crawler.CrawlerRunner to integrate the Scrapy-Twisted event loop with the IPython-Jupyter event loop. Unlike CrawlerProcess, CrawlerRunner does not start or interfere with existing reactors - take a look at the code sample below.

    This code snippet performs exactly the same job as ReactorNotRestartable.ipynb, but uses CrawlerRunner instead of CrawlerProcess, so we have control over starting and stopping the Twisted reactor.

# -*- coding: utf-8 -*-
import scrapy
from scrapy.crawler import CrawlerRunner
from twisted.internet import reactor
from scrapy.utils.log import configure_logging

class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        link = 'https://scrapy.org/'
        yield scrapy.Request(url=link, callback=self.parse)

    def parse(self, response):
        pass

configure_logging()
runner = CrawlerRunner()
# crawl() schedules the spider and returns a Deferred; the reactor is started
# and stopped explicitly, so the caller stays in control of it.
d = runner.crawl(MySpider)
d.addBoth(lambda _: reactor.stop())
reactor.run()
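
One possible direction for making the same pattern freely re-runnable inside notebook cells (an assumption of this proposal, not existing Scrapy behaviour) is to keep the reactor alive in a background thread, for example with the third-party crochet library. A minimal sketch:

import crochet
crochet.setup()  # start the Twisted reactor once, in a background thread

import scrapy
from scrapy.crawler import CrawlerRunner


class MySpider(scrapy.Spider):
    name = 'scrapy'
    allowed_domains = ['scrapy.org']

    def start_requests(self):
        yield scrapy.Request(url='https://scrapy.org/', callback=self.parse)

    def parse(self, response):
        pass


@crochet.wait_for(timeout=60.0)
def run_spider():
    # CrawlerRunner.crawl() returns a Deferred; wait_for blocks the notebook
    # cell until it fires, without ever stopping the shared reactor.
    runner = CrawlerRunner()
    return runner.crawl(MySpider)


run_spider()  # can be called again from another cell without errors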

2. XPath and CSS Selector helpers/visualizers:

  • Motivation: Retrieving data using Selector functions could be made easier. Picking a valid selector argument should not be time-consuming.

  • Problem at hand: Currently, a valid selector path for extracting a set of data from a website is obtained by:

    1. Analyzing the underlying HTML
    2. Using browser extensions, etc. (Example)
  • Proposal: A visualization function is proposed. IPython's ability to display an HTML web page using _repr_html_ can be used to render the page and highlight the elements corresponding to the XPath/CSS selector argument of the function. (Highlighting would be done similar to how this looks.) A hypothetical sketch of such a helper follows this list.

  • Usage:

    1. The user runs a helper function, e.g. show_xpath('https://scrapy.org', '//li')
      • Examples: show_xpath(), show_css(), etc.
    2. The web page is visualized and the data to be extracted is highlighted
    3. The user can cross-check the validity of the selector path.
  • Proof of Concept: IPython Proof of Concept
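
The sketch below illustrates the idea. The show_xpath() name and signature come from this proposal; the implementation details (requests + lxml for fetching and highlighting, IPython.display for rendering) are only assumptions for illustration:

import requests
from lxml import etree, html
from IPython.display import HTML, display


def show_xpath(url, xpath, color='yellow'):
    """Render `url` inline in the notebook with all `xpath` matches highlighted."""
    tree = html.fromstring(requests.get(url).text)
    for node in tree.xpath(xpath):
        # Only element nodes can be styled; text/attribute results are skipped.
        if hasattr(node, 'set'):
            node.set('style', 'background-color: %s;' % color)
    # HTML objects implement _repr_html_, so the page is rendered in the cell.
    display(HTML(etree.tostring(tree, encoding='unicode')))


# Example usage: highlight every list item on the Scrapy homepage.
# show_xpath('https://scrapy.org', '//li')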

Proposed Timeline:

Until May 4: Accepted student proposals are announced

  • Read and understand the code to get familiar with all of the Scrapy components. Fix bugs, submit patches, and understand the structure of the code from the lowest level possible.
  • Improve and fix issues in the Parsel library and understand its implementation as well.
  • Obtain a deeper understanding of Twisted - submit patches to fix existing issues relating to Scrapy-Twisted. Experiment with its event loop integration tasks.
  • Respond quickly to Scrapy issues reported by developers around the globe. Remain active with the Scrapy community on Stack Overflow. This would be the most effective way to enhance my understanding of Scrapy.

May 5 - May 30: Community Bonding Period

  • Regular discussions with Mentors and Community members regarding the project.
  • Create a detailed day-to-day schedule after consulting with the mentors about the internals of the project.
  • Respond promptly to Scrapy issues on GitHub and Stack Overflow.

May 30 (Coding begins) - June 18:

  • Twisted-Jupyter event loop integration

June 19 - June 26 (Code Review):

  • Get the code reviewed by mentors and improve it as suggested.
  • Submit code for official review on June 26.

July 1 - July 16:

  • Begin working on the XPath and CSS Selector helpers.
  • Implement modules in Parsel and improve code coverage.

July 17 - July 24 (Code Review):

  • Get the code reviewed by mentors and improve it as suggested.
  • Submit code for official review on July 24.

July 28 - August 6:

  • Incorporate code reviews from the community and implement all the requested changes.
  • Improve the tests and document the code.

August 6 - August 21:

  • Buffer period to fix additional bugs and implement changes wherever necessary.

Other Commitments:

Do you have any other commitments during the main GSoC time period (Classes/Internships etc.)?

  • No

Have you applied with any other organizations?

  • No

NOTE:

  • I look forward to completing this project well ahead of schedule. In case additional time is left over before the end of the project, I shall proceed with implementing additional features for Scrapy. Moreover, the scope for using IPython and Scrapy together is huge; hence, I will work on additional functions/modules to improve the user experience and hopefully contribute for years to come.

Extra Information:

University Info:

  • University: National Institute of Technology, Tiruchirappalli, India
  • Major: Electrical and Electronics Engineering
  • Year: Senior
  • Graduation: Aug 2017 (Expected)

Other Contact Info:
