Chau Tung Lam Nguyen
Scrapy project - Python Software Foundation
Code Repository: https://github.com/scrapy/scurl
PR: scrapy/w3lib#110 , scrapy/scrapy#3332
The project I worked on over the summer is Scurl, a URL-parsing library for Scrapy. With Scurl installed, Scrapy runs roughly 10-15% faster than it does without it.
The library is activated when Scurl is installed in the same environment as a Scrapy project.
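The activation can be pictured as a conditional import: when Scurl is present in the environment its functions are used, and otherwise the standard library is. This is only a sketch of that pattern — the `scurl` import path here is an assumption for illustration, not necessarily the actual package layout:

```python
# Sketch of import-time activation: prefer Scurl if it is installed,
# otherwise fall back to the standard library.
try:
    from scurl import urlparse, urljoin  # hypothetical import path
except ImportError:
    from urllib.parse import urlparse, urljoin

# Either implementation exposes the same call signature:
parts = urlparse("https://scrapy.org/download/?ref=home#top")
print(parts.netloc, parts.path)
```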
Work that is done:
- The library is now in a stable state: it passes nearly all the tests from urllib.parse.
- It passes all the tests from Scrapy and w3lib, which was the most important task.
- The library supports 4 functions: `urlparse`, `urlsplit`, `urljoin`, and `canonicalize_url`
- The parsing functions are 2-5 times faster than the original ones
- Scrapy spiders using Scurl are 10-15% faster
- The Chromium source that Scurl uses has been updated, and there is documentation on how to update the source
- Scurl has been transferred to the Scrapy organization!
Work that is not done:
- The PR to Scrapy has not been merged as it requires some further testing
- The PR to w3lib has also not been merged as it requires some further testing
- Some small issues listed in the Scurl repository
- Make the Scurl library more compatible with Scrapy by passing all the tests in urllib.parse
- Windows support for the library
- Wheel support for the library
Overview
The project Scurl was built to improve the performance of Scrapy. It focuses on Scrapy's bottleneck components: `urlparse`, `urljoin`, `urlsplit`, and `canonicalize_url`. Therefore, Scurl currently supports only those functions.
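To make the scope concrete, here are three of the four functions as they exist in the standard library that Scurl replaces (the fourth, `canonicalize_url`, lives in w3lib and normalizes things like query-argument order and percent-encoding):

```python
from urllib.parse import urlparse, urlsplit, urljoin

# urlparse: split a URL into six components (includes `params`).
p = urlparse("http://example.com/path;params?q=1#frag")
print(p.scheme, p.netloc, p.path, p.params, p.query, p.fragment)

# urlsplit: like urlparse, but without the `params` component.
s = urlsplit("http://example.com/path?q=1#frag")
print(s.path, s.query)

# urljoin: resolve a relative reference against a base URL.
joined = urljoin("http://example.com/a/b.html", "../c.html")
print(joined)  # http://example.com/c.html
```

Scrapy spiders call these functions on every request and response, which is why speeding them up translates directly into crawl throughput.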
Profiling Result
Running `scrapy-bench bookworm` on Python 3:

| | Python 3 |
|---|---|
| With Scurl | 72.71 items/second |
| Without Scurl | 62.18 items/second |
Running `scrapy bench` on Python 3:

| | Python 3 |
|---|---|
| With Scurl | 2880 pages/min |
| Without Scurl | 2400 pages/min |
More information on the profiling results can be found in this gist. One image shows the CPU % of `parse` in a Scrapy spider without Scurl installed, and the other shows the CPU % with Scurl installed.
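A micro-benchmark of the parsing functions themselves can be run with `timeit`; this is only a sketch of the methodology (absolute numbers depend on the machine), timing the stdlib `urlparse` that Scurl replaces — repeating the measurement with Scurl's version substituted in gives the comparison:

```python
import timeit
from urllib.parse import urlparse

URL = "http://example.com/some/path?foo=bar&baz=1#frag"

# Time 100,000 parses of a representative URL. To compare against
# Scurl, rerun with its urlparse substituted for the stdlib one.
elapsed = timeit.timeit(lambda: urlparse(URL), number=100_000)
print(f"urllib.parse.urlparse: {elapsed:.3f}s for 100k calls")
```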
Fallback
Since the GURL container still behaves somewhat differently from the urllib.parse functions (one is a browser URL component and the other is part of the CPython standard library), I have also implemented a fallback to the urllib.parse functions when it is necessary. However, we can still improve Scurl's performance by resolving this incompatibility (probably by adding more code to the Cython wrapper).
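The fallback idea can be sketched as: try the fast GURL-backed path first, and if it cannot handle a given URL, delegate that call to `urllib.parse`. The `fast_urlparse` function below is a stand-in — the actual internal API is not shown here:

```python
from urllib.parse import urlparse as stdlib_urlparse, ParseResult

def fast_urlparse(url: str) -> ParseResult:
    """Stand-in for the Cython/GURL-backed parser; assumed to raise
    ValueError on inputs the GURL component cannot represent."""
    raise ValueError("unsupported by GURL")  # simulate an incompatible URL

def urlparse_with_fallback(url: str) -> ParseResult:
    # Try the fast path first; fall back to the standard library
    # when the GURL-backed parser cannot handle the input.
    try:
        return fast_urlparse(url)
    except ValueError:
        return stdlib_urlparse(url)

result = urlparse_with_fallback("http://example.com/a?b=c")
print(result.netloc, result.query)
```

The cost of the fallback is that incompatible URLs pay for both a failed fast parse and a stdlib parse, which is why shrinking the incompatible set also helps performance.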
Testing
The tests that Scurl runs are based on the tests from urllib.parse. I use the urllib.parse tests because Scrapy spiders currently use the parsing functions from that library.
Currently, a few of the urllib.parse tests fail; they can be found under the /tests directory in the Scurl repository. The failing tests are marked as xfail (expected to fail), and I hope they can be resolved soon, since these incompatibilities may affect Scrapy spiders.
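An xfail marker in pytest looks like this — the test still runs, but a failure is reported as "expected" rather than breaking the suite. The test case itself is a hypothetical illustration, not one of Scurl's actual tests:

```python
import pytest
from urllib.parse import urlparse

@pytest.mark.xfail(reason="known incompatibility with the GURL-backed parser")
def test_hypothetical_edge_case():
    # A hypothetical edge case the fast parser does not yet handle.
    assert urlparse("http:///path").netloc == "nonempty"
```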
In addition, Scurl also runs the tests from Scrapy's w3lib repository, since Scurl supports the canonicalize_url function. Scurl passes all the tests from both Scrapy and w3lib!
More information can be found on these 2 PRs: scrapy/w3lib#110 , scrapy/scrapy#3332
Platform Support
Currently, Scurl supports macOS and Linux. Going forward, Scurl will need help from contributors who are familiar with the Windows environment to work on this issue!
Extra issues
The GURL component in Chromium uses ICU (more information can be found at http://site.icu-project.org/) to parse international domain names. However, getting ICU installed and supported by Scurl proved really challenging. Scurl will need some help on this issue as well, as it is a potential performance improvement for the library!
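For context, international domain names are converted to an ASCII wire form via IDNA, which is one of the things ICU implements. Python's standard library ships a basic version of this encoding (the older IDNA 2003 rules), which shows what the conversion looks like:

```python
# IDNA encoding turns an internationalized domain name into the
# ASCII "punycode" form used on the wire. Python's built-in codec
# implements the older IDNA 2003 rules; ICU provides fuller support.
domain = "münchen.de"
ascii_form = domain.encode("idna").decode("ascii")
print(ascii_form)  # xn--mnchen-3ya.de
```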