
Final report for GSoC 2018

Chau Tung Lam Nguyen

Scrapy project - Python Software Foundation

Summary

Code Repository: https://github.com/scrapy/scurl

PR: scrapy/w3lib#110 , scrapy/scrapy#3332

The project I worked on over the summer is Scurl, a URL-parsing library for Scrapy. With Scurl installed, Scrapy runs roughly 10-15% faster than it does without Scurl.

The library is activated when Scurl is installed in the same environment as a Scrapy project.
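
As a rough illustration, the activation boils down to a conditional import. This is a sketch of the idea only; the exact import path and the place where Scrapy or w3lib performs the check are assumptions, not Scurl's actual code:

```python
# Sketch of the opt-in mechanism, assuming Scurl exposes drop-in
# replacements for the urllib.parse functions. Import path is an
# assumption for illustration only.
try:
    from scurl import urlparse, urlsplit, urljoin  # GURL-backed fast versions
    HAS_SCURL = True
except ImportError:
    # Scurl is not installed in this environment, so fall back to the stdlib.
    from urllib.parse import urlparse, urlsplit, urljoin
    HAS_SCURL = False
```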

Work that is done:

  • The library is now stable: it passes nearly all of the tests from urllib.parse
  • It passes all the tests from Scrapy and w3lib, which is the most important task.
  • The library supports four functions: urlparse, urlsplit, urljoin and canonicalize_url
  • These functions are 2-5 times faster than the original parsing functions (see the micro-benchmark sketch after this list)
  • Scrapy spiders using Scurl are 10-15% faster
  • The Chromium source that Scurl uses has been updated, and there is documentation on how to update the source
  • Scurl has been transferred to the Scrapy organization!
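
To make the "2-5 times faster" claim concrete, a micro-benchmark could look roughly like the following. The scurl import path is an assumption, and the absolute numbers depend on the machine and the URLs used:

```python
# Minimal micro-benchmark sketch; not the benchmark used for the report.
import timeit

URL = "http://example.com/some/path?b=2&a=1#frag"

stdlib_time = timeit.timeit(
    f"urlparse({URL!r})",
    setup="from urllib.parse import urlparse",
    number=100_000,
)
scurl_time = timeit.timeit(
    f"urlparse({URL!r})",
    setup="from scurl import urlparse",  # assumed import path
    number=100_000,
)
print(f"urllib.parse: {stdlib_time:.3f}s  Scurl: {scurl_time:.3f}s  "
      f"ratio: {stdlib_time / scurl_time:.1f}x")
```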

Work that is not done:

  • The PR to Scrapy has not been merged as it requires some further testing
  • The PR to w3lib has also not been merged as it requires some further testing
  • Some small issues listed in the Scurl repository
  • Making Scurl fully compatible with Scrapy by passing all of the urllib.parse tests
  • Windows support for the library
  • Wheel support for the library

In Detail

Overview

The Scurl project is built to improve the performance of Scrapy. It focuses on Scrapy's bottleneck components: urlparse, urljoin, urlsplit and canonicalize_url. Therefore, Scurl currently supports only those functions.
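
As a rough illustration of why these functions matter, one can profile a crawl-like batch of URL operations and see how much of the work lands in them. This sketch is not the profiling setup used for the report; it uses the stock urllib.parse and w3lib implementations (w3lib must be installed):

```python
# Hedged sketch: profile a batch of URL operations with cProfile to see
# how much time Scrapy-style URL handling spends in these functions.
import cProfile
from urllib.parse import urljoin, urlparse

from w3lib.url import canonicalize_url

URLS = [f"/page/{i}?b=2&a=1#frag" for i in range(10_000)]

def url_workload():
    for path in URLS:
        absolute = urljoin("http://example.com/", path)
        urlparse(absolute)           # split into scheme/netloc/path/query
        canonicalize_url(absolute)   # sort query args, strip fragment, ...

cProfile.run("url_workload()", sort="cumulative")
```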

Profiling Result

Running scrapy-bench bookworm on Python 3:

  With Scurl:    72.71 items/second
  Without Scurl: 62.18 items/second

Running scrapy bench on Python 3:

  With Scurl:    2880 pages/min
  Without Scurl: 2400 pages/min

More information on the profiling results can be found in this gist. One image shows the CPU % of parse in a Scrapy spider without Scurl installed, and the other shows the CPU % with Scurl installed.

Fallback

Since the GURL container still behaves somewhat differently from the urllib.parse functions (one is a browser URL component, the other is part of the CPython standard library), I have also implemented a fallback to the urllib.parse functions when it is necessary. However, Scurl's performance could still be improved by resolving these incompatibilities (probably by adding more code to the Cython wrapper).
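
The pattern is roughly the following. This is a sketch of the idea, not Scurl's actual Cython code; `_gurl_urlsplit` and the exception it raises are hypothetical stand-ins so the example is self-contained:

```python
# Sketch of the fallback pattern only (assumptions, not Scurl internals).
from urllib import parse as stdlib_parse

class GurlIncompatible(Exception):
    """Assumed failure mode: GURL cannot mirror urllib.parse for this input."""

def _gurl_urlsplit(url):
    # Placeholder for the GURL-backed fast path. Pretend it rejects
    # scheme-less input, which a browser URL component treats differently.
    if "://" not in url:
        raise GurlIncompatible(url)
    return stdlib_parse.urlsplit(url)  # stand-in result for this sketch

def urlsplit(url):
    """Fast path through GURL, falling back to urllib.parse when needed."""
    try:
        return _gurl_urlsplit(url)
    except GurlIncompatible:
        return stdlib_parse.urlsplit(url)

print(urlsplit("mailto:someone@example.com"))  # exercises the fallback path
```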

Testing

The tests that Scurl runs are based on the tests from urllib.parse, since Scrapy spiders currently use the parsing functions from that library.

Currently, a few of the urllib.parse tests fail; they can be found under the /tests directory in the Scurl repository. The failing tests are marked as xfail (expected to fail), and I hope they can be resolved soon, as they may cause some incompatibilities for Scrapy spiders.
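
For reference, marking such a case as expected-to-fail looks roughly like this with pytest. The test name, URL and reason string are illustrative rather than actual entries in /tests, and the scurl import path is assumed:

```python
import pytest
from scurl import urlparse  # assumed import path

@pytest.mark.xfail(reason="GURL normalizes this path differently from urllib.parse")
def test_relative_dot_segments():
    result = urlparse("http://example.com/a/../b")
    # urllib.parse leaves dot segments untouched; a browser URL library
    # may resolve them, which is why this case is expected to fail for now.
    assert result.path == "/a/../b"
```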

In addition, Scurl also runs the tests from Scrapy's w3lib repository, since Scurl supports the canonicalize_url function. Scurl passes all the tests from both Scrapy and w3lib!

More information can be found in these two PRs: scrapy/w3lib#110 , scrapy/scrapy#3332

Platform Support

Currently, Scurl supports macOS and Linux. In the future, Scurl will need help from contributors who are familiar with the Windows environment to work on this issue!

Extra issues

The GURL component in Chromium uses ICU (more information can be found at http://site.icu-project.org/) to parse internationalized domain names. However, it was really challenging to get ICU installed and supported by Scurl, so Scurl will also need some help with this issue, as it is a potential performance improvement for the library!
