@rootAvish
Created August 23, 2016 03:24
GSoC 2016 Final Report

Faster signaling in Scrapy

Introduction

This document serves both as a review artifact and as an introduction for anyone interested in the work I did on Scrapy as a GSoC 2016 student developer. I will pull back the curtain on how signals in Scrapy worked previously, the bottlenecks that our approach overcomes, and the ones it doesn't. I'll also go over the benchmarks and share the insights I gained from line_profiler into why each benchmark's result is what it is.

How Scrapy uses signals

Scrapy made use of a library called pyDispatcher for its signalling mechanism. Before we go into the details of pyDispatcher, let us first understand the structure of Scrapy a bit more. Scrapy is written on top of the Twisted asynchronous programming library, so many of its tasks, such as fetching responses from web pages, are I/O intensive and involve waiting. To avoid blocking the crawler, the processing of those parts is deferred. Once the background waiting is over, it is necessary to signal all functions that might be waiting for that event so that they can become active and execute. There is also a ton of bookkeeping involved in crawler inspection, since doing the same on the fly would disrupt the crawling process. An example of this is the following piece of code:

File: scrapy/core/scraper.py

self.signals.send_catch_log(
    signal=signals.spider_error,
    failure=_failure, response=response,
    spider=spider
)

The above code is used to log an error in the running spider: since stopping the spider to report an exception does not make sense, exceptions are caught and passed along to all concerned receivers. The alternative is to call each of the receiver methods individually, which incurs a lot of bookkeeping overhead.
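The catch-and-report contract of send_catch_log can be sketched in plain Python (a simplified illustration of the idea, not Scrapy's actual implementation; the function and handler names here are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def send_catch_log(receivers, **named):
    """Call every receiver, catching exceptions so that one failing
    receiver cannot break the crawl (simplified sketch)."""
    responses = []
    for receiver in receivers:
        try:
            response = receiver(**named)
        except Exception as exc:
            # Log and record the failure instead of propagating it.
            logger.error("Error caught on signal handler: %r", receiver)
            response = exc
        responses.append((receiver, response))
    return responses

def ok_handler(**kwargs):
    return "handled"

def bad_handler(**kwargs):
    raise ValueError("boom")

# Each receiver is paired with its return value or the caught exception.
results = send_catch_log([ok_handler, bad_handler], spider="demo")
```

The crawl keeps going even though bad_handler raised; the failure is simply one of the returned (receiver, response) pairs.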

Here is another, similar use case, this time using an asynchronous method that can return Twisted Deferred instances:

"""Start the execution engine"""
assert not self.running, "Engine already running"
self.start_time = time()
yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
self.running = True
self._closewait = defer.Deferred()
yield self._closewait

How Signals worked previously

  • Now that we know what signals were used for, let's see what some of the problems with pyDispatcher were and how we overcame them. Here is a look at the code of dispatcher.py from the pyDispatcher package:

    connections = {}
    senders = {}
    sendersBack = {}
    
    ...
    
    # remove any current references to this receiver in the set, including back-references
    if signal in signals:
        receivers = signals[signal]
        _removeOldBackRefs(senderkey, signal, receiver, receivers)
    else:
        receivers = signals[signal] = []

    I think the problem is pretty clear just by looking at these constructs, but to spell it out: signals and receivers are all maintained in a centralized "pool". This might not seem like a big deal, but it is one of the key design issues that was changed with an eye towards speed. How, and what was affected, is covered below.

  • Another such issue was robustApply. Although we still could not get rid of this method (because we needed to ensure backward compatibility), it is a nuisance. To avoid pasting a ton of code snippets here: what this method does is filter arguments and call the receiver with only those arguments that are part of its signature. This functionality is arguably not worth the overhead it introduces, since a lot of Python code is written using keyword arguments anyway. We leveraged this too in our quest for faster signals.

  • Signals were sent regardless of whether or not something was listening.

  • Every emit/send required searching for any matching sender and matching signal receivers, and additionally required looking up Any. The set of listeners only changes when a connection is added or goes out of scope, so doing this work on every send is inefficient.

  • Signals were nothing more than constants, since all of the logic was centralized in the dispatcher.

  • Because a signal was just a constant, there was no natural place to make signals faster by calculating up front the receivers to notify.

Solutions to these problems

Before we proceed any further, I would be lying if I said I discovered most of this: for the most part I was working off the intuition gained by the Django developers when they migrated away from pyDispatcher. In fact, scrapy.dispatch builds on django.dispatch and, as it currently stands, keeps much of its design. Here's an overview of some of the steps taken:

  • Refactoring

Instead of signals being mere constants with all logic shoved into the dispatcher, signals are now instances of a custom Signal class. The advantage is that each signal individually keeps track of its own receivers. This is a major speedup because the presence of receivers for another signal no longer affects the dispatch of this signal's events.
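The core of this design change can be sketched in a few lines of pure Python (a minimal illustration of the idea, not the real scrapy.dispatch code; the class and method names are assumptions):

```python
class Signal:
    """Each signal instance owns its receiver list, so dispatch never
    touches receivers registered on other signals (sketch)."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send(self, sender, **named):
        # Only this signal's receivers are iterated; no global pool.
        return [(r, r(signal=self, sender=sender, **named))
                for r in self.receivers]

spider_opened = Signal()
spider_closed = Signal()

spider_opened.connect(lambda signal, sender, **kw: "opened")
# Receivers on spider_closed are invisible to spider_opened.send().
responses = spider_opened.send(sender="engine")
```

Contrast this with the centralized connections/senders dictionaries shown earlier, where every send had to filter the shared pool.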

  • Getting rid of Any and Anonymous

Now that all signals keep track of their own receivers, we no longer require a centralized key-value store to match signals to their receivers. This led to a significant speedup, in that the following piece of code:

FILE: scrapy/utils/signal.py

for receiver in liveReceivers(getAllReceivers(sender, signal)):
    try:
        response = robustApply(receiver, signal=signal, sender=sender,
            *arguments, **named)
        if isinstance(response, Deferred):
            logger.error("Cannot return deferreds from signal handler: %(receiver)s",
                         {'receiver': receiver}, extra={'spider': spider})

was replaced by:

for receiver in self._live_receivers(sender):
    try:
        if self.receiver_accepts_kwargs[_make_id(receiver)]:
            response = receiver(signal=self, sender=sender, **named)
        else:
            response = robust_apply(receiver, signal=self,
                                    sender=sender, **named)
        if isinstance(response, Deferred):
            logger.error("Cannot return deferreds from signal"
                         " handler: %(receiver)s",
                         {'receiver': receiver},
                         extra={'spider': spider})

and the send_catch_log method was refactored into the Signal class. This seemingly simple step alone is responsible for the majority of the reduction in signal overhead.

At this point I'd like to note that even though Any no longer exists, the functionality for Any is still made available in the API and is achieved by passing in sender=None to connect.
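The sender=None behaviour can be illustrated with a small sketch (hypothetical names again; the real scrapy.dispatch implementation additionally caches these lookups per sender):

```python
class Signal:
    """Sketch: receivers registered with sender=None fire for every
    sender, reproducing pyDispatcher's Any behaviour."""

    def __init__(self):
        self.receivers = []  # list of (sender_filter, receiver) pairs

    def connect(self, receiver, sender=None):
        self.receivers.append((sender, receiver))

    def send(self, sender, **named):
        # A None filter matches any sender; otherwise require identity.
        return [r(signal=self, sender=sender, **named)
                for s, r in self.receivers
                if s is None or s is sender]

sig = Signal()
sig.connect(lambda signal, sender, **kw: sender, sender=None)  # "Any"
first = sig.send(sender="spider-1")
second = sig.send(sender="spider-2")
```

The one receiver fires for both sends, just as an Any-registered receiver did under pyDispatcher.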

  • Partial move away from robustApply

Once we moved to the new signals, all new code is required to connect to the dispatcher only those receivers that accept a variable keyword arguments parameter. However, unlike Django, we decided not to force this change on our users and break backward compatibility; instead we set up a backward compatibility mechanism so that legacy extensions and middleware need not migrate right away. We did keep an implementation of robust_apply inside the Scrapy code for now, which is the same as the one provided by pyDispatcher.
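The argument filtering that robust_apply performs can be sketched with the inspect module (a simplified illustration of the idea; pyDispatcher's real implementation also handles bound methods and callable objects):

```python
import inspect

def robust_apply(receiver, **named):
    """Call receiver with only the keyword arguments that appear in
    its signature, unless it declares **kwargs (simplified sketch)."""
    spec = inspect.getfullargspec(receiver)
    if spec.varkw is not None:
        # Receiver accepts **kwargs: no filtering needed.
        return receiver(**named)
    # Drop every argument the receiver's signature doesn't name.
    accepted = {k: v for k, v in named.items() if k in spec.args}
    return receiver(**accepted)

def legacy_receiver(spider):       # old-style: no **kwargs
    return spider

def modern_receiver(**kwargs):     # new-style: takes everything
    return sorted(kwargs)

old_result = robust_apply(legacy_receiver, spider="demo", response=None)
new_result = robust_apply(modern_receiver, spider="demo", response=None)
```

The signature inspection on every call is exactly the overhead the new code path avoids for **kwargs-style receivers.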

  • Changing how we deal with dead receivers

Instead of dealing with receivers only when a connection or disconnection happens, we now check for dead receivers that need to be discarded every time a send call happens. This lets us keep a cache that is looked up directly any time the signal is emitted, speeding up the lookup process. For example, when a signal has no active listeners, the NO_RECEIVERS property is set and the signal is not fired at all. We used a WeakKeyDictionary-based cache to keep track of our weak receivers, so this does not disrupt the weak=True functionality.
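The lazy purge of dead receivers at send time can be sketched with weakref (an illustration of the idea only; the real implementation uses a WeakKeyDictionary cache and a NO_RECEIVERS sentinel rather than a plain list):

```python
import gc
import weakref

class Signal:
    """Sketch: receivers are held as weak references and purged lazily
    on send, so a garbage-collected receiver never lingers."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(weakref.ref(receiver))

    def send(self, sender, **named):
        live, responses = [], []
        for ref in self.receivers:
            receiver = ref()
            if receiver is None:   # dead reference: drop it
                continue
            live.append(ref)
            responses.append(receiver(signal=self, sender=sender, **named))
        self.receivers = live      # keep only live receivers cached
        return responses

sig = Signal()

def handler(signal, sender, **kw):
    return "alive"

sig.connect(handler)
before = sig.send(sender=None)     # handler still referenced
del handler                        # receiver goes out of scope
gc.collect()
after = sig.send(sender=None)      # dead receiver purged on this send
```

After the second send, the receiver list itself is empty again, which is what makes the no-listeners fast path possible.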

  • What is not affected

Signals still preserve their dispatch order, which is critical to some code.

  • Maintaining a cache of whether listeners receive kwargs

This is not something that was an issue with pyDispatcher; rather, it is one we created for ourselves when we decided to ensure backward compatibility for receivers that do not accept **kwargs. This resulted in the following piece of code:

if self.receiver_accepts_kwargs[_make_id(receiver)]:
    response = receiver(signal=self, sender=sender, **named)
else:
    response = robust_apply(receiver, signal=self,
                            sender=sender, **named)

This might seem standard, but it is miles better than my over-engineered first attempt at a solution, which introduced significant method call overhead. Like Signal.receivers, this store too is sanitised every time a signal is sent, as well as on connect and disconnect.
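The cache itself can be sketched as a plain dict keyed by receiver id (hypothetical helper names; scrapy.dispatch uses _make_id and also raises a deprecation warning at connect time):

```python
import inspect

def accepts_kwargs(receiver):
    """True if the receiver declares a **kwargs parameter."""
    return inspect.getfullargspec(receiver).varkw is not None

class Signal:
    """Sketch: the signature inspection runs once at connect time and
    is cached, so send() only does a dict lookup per receiver."""

    def __init__(self):
        self.receivers = []
        self.receiver_accepts_kwargs = {}

    def connect(self, receiver):
        self.receivers.append(receiver)
        # Inspecting the signature is the expensive part; do it once
        # here instead of on every send.
        self.receiver_accepts_kwargs[id(receiver)] = accepts_kwargs(receiver)

sig = Signal()

def old_style(spider):
    return spider

def new_style(**kwargs):
    return kwargs

sig.connect(old_style)
sig.connect(new_style)
```

This is also why the connect benchmarks below come out slower than pyDispatcher's: the cost of the signature check moved from send time to connect time.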

Benchmarks and their explanation

To compare the performance of the older dispatcher with the new one, I wrote this benchmarking suite using some of the utilities provided by the Djangobench project. Here's a sample of the output of running all the benchmarks in that suite:

Running all benchmarks
Running 'no_kwargs_receiver' benchmark ...
Min: 0.000011 -> 0.000007: 1.5862x faster
Avg: 0.000012 -> 0.000009: 1.4045x faster
Significant (t=7.277109)
Stddev: 0.00000 -> 0.00000: 1.0950x smaller (N = 100)

Running 'connect_no_kwargs' benchmark ...
Min: 0.000007 -> 0.000013: 1.8621x slower
Avg: 0.000009 -> 0.000014: 1.6524x slower
Significant (t=-8.083380)
Stddev: 0.00000 -> 0.00001: 2.6941x larger (N = 100)

Running 'no_compatability_used' benchmark ...
Min: 0.000009 -> 0.000004: 2.1765x faster
Avg: 0.000011 -> 0.000005: 2.0548x faster
Significant (t=12.793539)
Stddev: 0.00000 -> 0.00000: 2.5308x smaller (N = 100)

Running 'dispatcher' benchmark ...
Min: 0.000009 -> 0.000004: 2.3125x faster
Avg: 0.000010 -> 0.000005: 2.1571x faster
Significant (t=15.301444)
Stddev: 0.00000 -> 0.00000: 2.6563x smaller (N = 100)

Running 'connect_accepts_kwargs' benchmark ...
Min: 0.000007 -> 0.000019: 2.7241x slower
Avg: 0.000009 -> 0.000022: 2.4746x slower
Significant (t=-7.760665)
Stddev: 0.00000 -> 0.00002: 4.3374x larger (N = 100)

Running 'proxied_signal' benchmark ...
Min: 0.000010 -> 0.000008: 1.2424x faster
Avg: 0.000011 -> 0.000009: 1.2551x faster
Significant (t=2.739335)
Stddev: 0.00001 -> 0.00001: 1.1157x smaller (N = 100)

Explanation

Before we delve into each individual benchmark, let's see what some of the above metrics are. The Significant parameter is calculated using a Student's two-sample, two-tailed t-test with alpha=0.95.

The other parameter we look at is the standard deviation of the results, which is a measure of the consistency of our benchmarks.
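The t value reported on each Significant line can be reproduced with a plain two-sample statistic (a sketch assuming a pooled-variance Student's t-test like the one Djangobench's utilities perform; the sample timings below are made up for illustration):

```python
from math import sqrt

def two_sample_t(sample1, sample2):
    """Pooled-variance two-sample Student's t statistic (sketch)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # Unbiased sample variances.
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))

# Old timings clearly larger than new ones give a large positive t,
# which the harness then compares against the critical value for the
# chosen alpha to print "Significant".
old_times = [10.0, 11.0, 10.5, 10.2, 10.8]
new_times = [5.0, 5.2, 4.9, 5.1, 5.3]
t = two_sample_t(old_times, new_times)
```

A negative t, as in the connect benchmarks above, simply means the second sample's mean was larger, i.e. the new code was slower.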

  • no_kwargs_receiver

    This benchmark exercises the robust_apply path, calling receivers that do not accept a variable keyword arguments parameter. As expected, even with the refactor this benchmark is not a major improvement over the previous times, because we do not eliminate all of our bottlenecks here. It consistently gives us about 20% - 50% faster signals.

  • proxied signal

    This benchmark considers the case where we need to proxy signals that are defined as plain Python objects, i.e. are mere constants. As expected, here too the benchmark score is not that much faster than pyDispatcher, giving us about 25% - 30% faster signals on average.

  • dispatcher

    This benchmark is a raw, apples-to-apples comparison of the new Scrapy dispatcher measured against pyDispatcher. From it we can observe that the new dispatcher is about 90% - 110% faster than the previous implementation!

  • no compatability

    This benchmark was meant as a sanity check on the previous dispatcher test, since signals in Scrapy are not registered directly using the Signal class methods but are instead passed through the SignalManager. The SignalManager ensures compatibility with older-style signals and provides a standard wrapper that would let us completely rewrite the backend if needed (hint: exactly what happened here!).

  • connect_accepts_kwargs / connect_no_kwargs

    This benchmark compares the time it takes to connect a receiver to a signal in Scrapy. As you might observe, the performance here is actually slower than what we got with pyDispatcher. The caveat is that if one were to run the connect method through line_profiler to see where the majority of this time is spent, it is actually spent in verifying whether the receiver being connected has a variable keyword arguments parameter, and even more so in raising the deprecation warning for receivers that don't! However, if we go by raw timings rather than relative ones, this overhead is insignificant enough to ignore when compared to the massive savings in the actual send call.

Further scope

  • I'm currently working on writing better documentation for these parts than I had previously written. Even though I'd like to think I did a decent job documenting during the last two weeks of the coding period, my unfamiliarity with reStructuredText and my less-than-perfect grammar leave plenty of room for improvement in that area.

  • We can experiment more with caching, as that is one area where the dispatcher can definitely improve. Currently, the caching only works well when we are dealing with signals that have a fixed sender. I'll look to extend this to receivers that listen to all emitters in my search for other ways to speed up the library.

  • Squashing commits :), this is a big one since in all likelihood I'll end up stuck in rebase hell.
