Chau Tung Lam Nguyen
Scrapy project - Python Software Foundation
Code Repository: https://github.com/scrapy/scurl
PR: scrapy/w3lib#110 , scrapy/scrapy#3332
The project I worked on over the summer is Scurl, a URL-parsing library for Scrapy. With Scurl installed, Scrapy runs roughly 10-15% faster than it does without it.
The library is activated when Scurl is installed in the same environment as a Scrapy project.
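The activation can be pictured as a conditional import: when Scurl is present in the environment its functions are used, and otherwise the standard library is. This is only a sketch of that pattern — the `scurl` import path here is an assumption for illustration, not necessarily the actual package layout:

```python
# Sketch of import-time activation: prefer Scurl if it is installed,
# otherwise fall back to the standard library.
try:
    from scurl import urlparse, urljoin  # hypothetical import path
except ImportError:
    from urllib.parse import urlparse, urljoin

# Either implementation exposes the same call signature:
parts = urlparse("https://scrapy.org/download/?ref=home#top")
print(parts.netloc, parts.path)
```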
Work that is done:
- The library is now in a stable state: it passes nearly all the tests from urllib.parse.
- It passes all the tests from Scrapy and w3lib, which was the most important task.
- The library supports 4 functions: `urlparse`, `urlsplit`, `urljoin`, and `canonicalize_url`
- The parsing functions are 2-5 times faster than the original ones
- Scrapy spiders using Scurl are 10-15% faster
- The Chromium source that Scurl uses has been updated, and there is documentation on how to update the source
- Scurl has been transferred to the Scrapy organization!
Work that is not done:
- The PR to Scrapy has not been merged as it requires some further testing
- The PR to w3lib has also not been merged as it requires some further testing
- Some small issues listed in the Scurl repository
- Make the Scurl library more compatible with Scrapy by passing all the tests in urllib.parse
- Windows support for the library
- Wheel support for the library
Overview
The project Scurl was built to improve the performance of Scrapy. It focuses on Scrapy's bottleneck components: `urlparse`, `urljoin`, `urlsplit`, and `canonicalize_url`. Therefore, Scurl currently supports only those functions.
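To make the scope concrete, here are three of the four functions as they exist in the standard library that Scurl replaces (the fourth, `canonicalize_url`, lives in w3lib and normalizes things like query-argument order and percent-encoding):

```python
from urllib.parse import urlparse, urlsplit, urljoin

# urlparse: split a URL into six components (includes `params`).
p = urlparse("http://example.com/path;params?q=1#frag")
print(p.scheme, p.netloc, p.path, p.params, p.query, p.fragment)

# urlsplit: like urlparse, but without the `params` component.
s = urlsplit("http://example.com/path?q=1#frag")
print(s.path, s.query)

# urljoin: resolve a relative reference against a base URL.
joined = urljoin("http://example.com/a/b.html", "../c.html")
print(joined)  # http://example.com/c.html
```

Scrapy spiders call these functions on every request and response, which is why speeding them up translates directly into crawl throughput.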
Profiling Result
Running `scrapy-bench bookworm` on Python 3:

| | Python 3 |
|---|---|
| With Scurl | 72.71 items/second |
| Without Scurl | 62.18 items/second |
Running `scrapy bench` on Python 3:

| | Python 3 |
|---|---|
| With Scurl | 2880 pages/min |
| Without Scurl | 2400 pages/min |
More information on the profiling results can be found in this gist. One image shows the CPU % of `parse` in a Scrapy spider without Scurl installed, and the other shows the CPU % with Scurl installed.
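A micro-benchmark of the parsing functions themselves can be run with `timeit`; this is only a sketch of the methodology (absolute numbers depend on the machine), timing the stdlib `urlparse` that Scurl replaces — repeating the measurement with Scurl's version substituted in gives the comparison:

```python
import timeit
from urllib.parse import urlparse

URL = "http://example.com/some/path?foo=bar&baz=1#frag"

# Time 100,000 parses of a representative URL. To compare against
# Scurl, rerun with its urlparse substituted for the stdlib one.
elapsed = timeit.timeit(lambda: urlparse(URL), number=100_000)
print(f"urllib.parse.urlparse: {elapsed:.3f}s for 100k calls")
```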
Fallback
Since the GURL container still behaves somewhat differently from the urllib.parse functions (one is a browser URL component and the other is part of the CPython standard library), I have also implemented a fallback to the urllib.parse functions when it is necessary. However, we can still improve Scurl's performance by resolving this incompatibility (probably by adding more code to the Cython wrapper).
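The fallback idea can be sketched as: try the fast GURL-backed path first, and if it cannot handle a given URL, delegate that call to `urllib.parse`. The `fast_urlparse` function below is a stand-in — the actual internal API is not shown here:

```python
from urllib.parse import urlparse as stdlib_urlparse, ParseResult

def fast_urlparse(url: str) -> ParseResult:
    """Stand-in for the Cython/GURL-backed parser; assumed to raise
    ValueError on inputs the GURL component cannot represent."""
    raise ValueError("unsupported by GURL")  # simulate an incompatible URL

def urlparse_with_fallback(url: str) -> ParseResult:
    # Try the fast path first; fall back to the standard library
    # when the GURL-backed parser cannot handle the input.
    try:
        return fast_urlparse(url)
    except ValueError:
        return stdlib_urlparse(url)

result = urlparse_with_fallback("http://example.com/a?b=c")
print(result.netloc, result.query)
```

The cost of the fallback is that incompatible URLs pay for both a failed fast parse and a stdlib parse, which is why shrinking the incompatible set also helps performance.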
Testing
The tests that Scurl runs are based on the tests from urllib.parse. I use the urllib.parse tests because Scrapy spiders currently use the parsing functions from that library.
Currently, a few of the urllib.parse tests fail; they can be found under the /tests directory in the Scurl repository. The failing tests are marked as xfail (expected to fail), and I hope they can be resolved soon, since these incompatibilities may affect Scrapy spiders.
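An xfail marker in pytest looks like this — the test still runs, but a failure is reported as "expected" rather than breaking the suite. The test case itself is a hypothetical illustration, not one of Scurl's actual tests:

```python
import pytest
from urllib.parse import urlparse

@pytest.mark.xfail(reason="known incompatibility with the GURL-backed parser")
def test_hypothetical_edge_case():
    # A hypothetical edge case the fast parser does not yet handle.
    assert urlparse("http:///path").netloc == "nonempty"
```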
In addition, Scurl also runs the tests from Scrapy's w3lib repository, since Scurl supports the canonicalize_url function. Scurl passes all the tests from both Scrapy and w3lib!
More information can be found on these 2 PRs: scrapy/w3lib#110 , scrapy/scrapy#3332
Platform Support
Currently, Scurl supports macOS and Linux. Going forward, Scurl will need help from contributors who are familiar with the Windows environment to work on this issue!
Extra issues
The GURL component in Chromium uses ICU (more information can be found at http://site.icu-project.org/) to parse international domain names. However, getting ICU installed and supported by Scurl proved really challenging. Scurl will need some help on this issue as well, as it is a potential performance improvement for the library!
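For context, international domain names are converted to an ASCII wire form via IDNA, which is one of the things ICU implements. Python's standard library ships a basic version of this encoding (the older IDNA 2003 rules), which shows what the conversion looks like:

```python
# IDNA encoding turns an internationalized domain name into the
# ASCII "punycode" form used on the wire. Python's built-in codec
# implements the older IDNA 2003 rules; ICU provides fuller support.
domain = "münchen.de"
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--mnchen-3ya.de
```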