@rootAvish
Created August 23, 2016 03:24
GSoC 2016 Final Report

Faster signaling in Scrapy

Introduction

This document serves both as a review artifact and as an introduction for anyone interested in the work I did on Scrapy as a GSoC 2016 student developer. I will pull back the curtain on how signals in Scrapy worked previously, the bottlenecks that our approach overcomes, and the ones it doesn't. I'll also go over the benchmarks and share the insights I gained from line_profiler into why each benchmark's result is what it is.

How Scrapy uses signals

Scrapy made use of a library called pyDispatcher for its signalling mechanism. Before we go into the details of pyDispatcher, let us first understand the structure of Scrapy a bit more. Scrapy is written on top of the Twisted asynchronous programming library, so many of its tasks, such as fetching responses from web pages, are I/O intensive and involve waiting. To avoid blocking the crawler, the processing of those parts is deferred. Once the background waiting is over, it is necessary to signal all functions that might be waiting for that event so that they can become active and execute. There is also a ton of bookkeeping involved in crawler inspection, since doing the same on the fly would disrupt the crawling process. An example of this is the following piece of code:

File: scrapy/core/scraper.py

self.signals.send_catch_log(
    signal=signals.spider_error,
    failure=_failure, response=response,
    spider=spider
)

The above code is used to log an error in the running spider: since stopping the spider to report an exception does not make sense, exceptions are caught and passed along to all concerned receivers. The alternative is to call each of the receiver methods individually, which incurs a lot of bookkeeping overhead.
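The catch-and-report contract of send_catch_log can be sketched in plain Python (a simplified illustration of the idea, not Scrapy's actual implementation; the function and handler names here are hypothetical):

```python
import logging

logger = logging.getLogger(__name__)

def send_catch_log(receivers, **named):
    """Call every receiver, catching exceptions so that one failing
    receiver cannot break the crawl (simplified sketch)."""
    responses = []
    for receiver in receivers:
        try:
            response = receiver(**named)
        except Exception as exc:
            # Log and record the failure instead of propagating it.
            logger.error("Error caught on signal handler: %r", receiver)
            response = exc
        responses.append((receiver, response))
    return responses

def ok_handler(**kwargs):
    return "handled"

def bad_handler(**kwargs):
    raise ValueError("boom")

# Each receiver is paired with its return value or the caught exception.
results = send_catch_log([ok_handler, bad_handler], spider="demo")
```

The crawl keeps going even though bad_handler raised; the failure is simply one of the returned (receiver, response) pairs.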

Here is another, similar use case, this time using an asynchronous method that can return Twisted Deferred instances:

"""Start the execution engine"""
assert not self.running, "Engine already running"
self.start_time = time()
yield self.signals.send_catch_log_deferred(signal=signals.engine_started)
self.running = True
self._closewait = defer.Deferred()
yield self._closewait

How Signals worked previously

  • Now that we know what signals were used for, let's see what some of the problems with pyDispatcher were and how we overcame them. Here is a look at the code of dispatcher.py from the pyDispatcher package:

    connections = {}
    senders = {}
    sendersBack = {}
    
    ...
    
    # remove any current references to this receiver in the set, including back-references
    if signal in signals:
        receivers = signals[signal]
        _removeOldBackRefs(senderkey, signal, receiver, receivers)
    else:
        receivers = signals[signal] = []

    I think the problem is pretty clear just by looking at these constructs, but to spell it out: signals and receivers are all maintained in a centralized "pool". This might not seem like a big deal, but it is one of the key design issues that was changed with an eye towards speed. How, and what was affected, is covered below.

  • Another such issue was robustApply. Although we still could not get rid of this method (because we needed to ensure backward compatibility), it is a nuisance. To avoid pasting a ton of code snippets here: what this method does is filter arguments and call the receiver with only those arguments that are part of its signature. This functionality is arguably not worth the overhead it introduces, since a lot of Python code is written using keyword arguments anyway. We leveraged this too in our quest for faster signals.

  • Signals were sent regardless of whether or not something was listening.

  • Every emit/send required searching for any matching sender and matching signal receivers, and additionally required looking up Any. The set of listeners only changes when a connection is added or goes out of scope, so doing this work on every send is inefficient.

  • Signals were nothing more than constants, since all of the logic was centralized in the dispatcher.

  • Because a signal was just a constant, there was no natural place to make signals faster by calculating up front the receivers to notify.

Solutions to these problems

Before we proceed any further, I would be lying if I said I discovered most of this: for the most part I was working off the intuition gained by the Django developers when they migrated away from pyDispatcher. In fact, scrapy.dispatch builds on django.dispatch and, as it currently stands, keeps much of its design. Here's an overview of some of the steps taken:

  • Refactoring

Instead of signals being mere constants with all logic shoved into the dispatcher, signals are now instances of a custom Signal class. The advantage is that each signal individually keeps track of its own receivers. This is a major speedup because the presence of receivers for another signal no longer affects the dispatch of this signal's events.
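The core of this design change can be sketched in a few lines of pure Python (a minimal illustration of the idea, not the real scrapy.dispatch code; the class and method names are assumptions):

```python
class Signal:
    """Each signal instance owns its receiver list, so dispatch never
    touches receivers registered on other signals (sketch)."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(receiver)

    def send(self, sender, **named):
        # Only this signal's receivers are iterated; no global pool.
        return [(r, r(signal=self, sender=sender, **named))
                for r in self.receivers]

spider_opened = Signal()
spider_closed = Signal()

spider_opened.connect(lambda signal, sender, **kw: "opened")
# Receivers on spider_closed are invisible to spider_opened.send().
responses = spider_opened.send(sender="engine")
```

Contrast this with the centralized connections/senders dictionaries shown earlier, where every send had to filter the shared pool.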

  • Getting rid of Any and Anonymous

Now that all signals keep track of their own receivers, we no longer require a centralized key-value store to match signals to their receivers. This led to a significant speedup, in that the following piece of code:

FILE: scrapy/utils/signal.py

for receiver in liveReceivers(getAllReceivers(sender, signal)):
    try:
        response = robustApply(receiver, signal=signal, sender=sender,
            *arguments, **named)
        if isinstance(response, Deferred):
            logger.error("Cannot return deferreds from signal handler: %(receiver)s",
                         {'receiver': receiver}, extra={'spider': spider})

was replaced by:

for receiver in self._live_receivers(sender):
    try:
        if self.receiver_accepts_kwargs[_make_id(receiver)]:
            response = receiver(signal=self, sender=sender, **named)
        else:
            response = robust_apply(receiver, signal=self,
                                    sender=sender, **named)
        if isinstance(response, Deferred):
            logger.error("Cannot return deferreds from signal"
                         " handler: %(receiver)s",
                         {'receiver': receiver},
                         extra={'spider': spider})

and the send_catch_log method was refactored into the Signal class. This seemingly simple step alone is responsible for the majority of the reduction in signal overhead.

At this point I'd like to note that even though Any no longer exists, the functionality for Any is still made available in the API and is achieved by passing in sender=None to connect.
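The sender=None behaviour can be illustrated with a small sketch (hypothetical names again; the real scrapy.dispatch implementation additionally caches these lookups per sender):

```python
class Signal:
    """Sketch: receivers registered with sender=None fire for every
    sender, reproducing pyDispatcher's Any behaviour."""

    def __init__(self):
        self.receivers = []  # list of (sender_filter, receiver) pairs

    def connect(self, receiver, sender=None):
        self.receivers.append((sender, receiver))

    def send(self, sender, **named):
        # A None filter matches any sender; otherwise require identity.
        return [r(signal=self, sender=sender, **named)
                for s, r in self.receivers
                if s is None or s is sender]

sig = Signal()
sig.connect(lambda signal, sender, **kw: sender, sender=None)  # "Any"
first = sig.send(sender="spider-1")
second = sig.send(sender="spider-2")
```

The one receiver fires for both sends, just as an Any-registered receiver did under pyDispatcher.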

  • Partial move away from robustApply

Once we moved to the new signals, all new code is required to connect to the dispatcher only those receivers that accept a variable keyword arguments parameter. However, unlike Django, we decided not to force this change on our users and break backward compatibility; instead we set up a backward compatibility mechanism so that legacy extensions and middleware need not migrate right away. We did keep an implementation of robust_apply inside the Scrapy code for now, which is the same as the one provided by pyDispatcher.
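The argument filtering that robust_apply performs can be sketched with the inspect module (a simplified illustration of the idea; pyDispatcher's real implementation also handles bound methods and callable objects):

```python
import inspect

def robust_apply(receiver, **named):
    """Call receiver with only the keyword arguments that appear in
    its signature, unless it declares **kwargs (simplified sketch)."""
    spec = inspect.getfullargspec(receiver)
    if spec.varkw is not None:
        # Receiver accepts **kwargs: no filtering needed.
        return receiver(**named)
    # Drop every argument the receiver's signature doesn't name.
    accepted = {k: v for k, v in named.items() if k in spec.args}
    return receiver(**accepted)

def legacy_receiver(spider):       # old-style: no **kwargs
    return spider

def modern_receiver(**kwargs):     # new-style: takes everything
    return sorted(kwargs)

old_result = robust_apply(legacy_receiver, spider="demo", response=None)
new_result = robust_apply(modern_receiver, spider="demo", response=None)
```

The signature inspection on every call is exactly the overhead the new code path avoids for **kwargs-style receivers.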

  • Changing how we deal with dead receivers

Instead of dealing with receivers only when a connection or disconnection happens, we now check for dead receivers that need to be discarded every time a send call happens. This lets us keep a cache that is looked up directly any time the signal is emitted, speeding up the lookup process. For example, when a signal has no active listeners, the NO_RECEIVERS property is set and the signal is not fired at all. We used a WeakKeyDictionary-based cache to keep track of our weak receivers, so this does not disrupt the weak=True functionality.
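The lazy purge of dead receivers at send time can be sketched with weakref (an illustration of the idea only; the real implementation uses a WeakKeyDictionary cache and a NO_RECEIVERS sentinel rather than a plain list):

```python
import gc
import weakref

class Signal:
    """Sketch: receivers are held as weak references and purged lazily
    on send, so a garbage-collected receiver never lingers."""

    def __init__(self):
        self.receivers = []

    def connect(self, receiver):
        self.receivers.append(weakref.ref(receiver))

    def send(self, sender, **named):
        live, responses = [], []
        for ref in self.receivers:
            receiver = ref()
            if receiver is None:   # dead reference: drop it
                continue
            live.append(ref)
            responses.append(receiver(signal=self, sender=sender, **named))
        self.receivers = live      # keep only live receivers cached
        return responses

sig = Signal()

def handler(signal, sender, **kw):
    return "alive"

sig.connect(handler)
before = sig.send(sender=None)     # handler still referenced
del handler                        # receiver goes out of scope
gc.collect()
after = sig.send(sender=None)      # dead receiver purged on this send
```

After the second send, the receiver list itself is empty again, which is what makes the no-listeners fast path possible.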

  • What is not affected

Signals still preserve their dispatch order, which is critical to some code.

  • Maintaining a cache of whether listeners receive kwargs

This is not something that was an issue with pyDispatcher; rather, it is one we created for ourselves when we decided to ensure backward compatibility for receivers that do not accept **kwargs. This resulted in the following piece of code:

if self.receiver_accepts_kwargs[_make_id(receiver)]:
    response = receiver(signal=self, sender=sender, **named)
else:
    response = robust_apply(receiver, signal=self,
                            sender=sender, **named)

This might seem standard, but it is miles better than my over-engineered first attempt at a solution, which introduced significant method call overhead. Like Signal.receivers, this store too is sanitised every time a signal is sent, as well as on connect and disconnect.
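The cache itself can be sketched as a plain dict keyed by receiver id (hypothetical helper names; scrapy.dispatch uses _make_id and also raises a deprecation warning at connect time):

```python
import inspect

def accepts_kwargs(receiver):
    """True if the receiver declares a **kwargs parameter."""
    return inspect.getfullargspec(receiver).varkw is not None

class Signal:
    """Sketch: the signature inspection runs once at connect time and
    is cached, so send() only does a dict lookup per receiver."""

    def __init__(self):
        self.receivers = []
        self.receiver_accepts_kwargs = {}

    def connect(self, receiver):
        self.receivers.append(receiver)
        # Inspecting the signature is the expensive part; do it once
        # here instead of on every send.
        self.receiver_accepts_kwargs[id(receiver)] = accepts_kwargs(receiver)

sig = Signal()

def old_style(spider):
    return spider

def new_style(**kwargs):
    return kwargs

sig.connect(old_style)
sig.connect(new_style)
```

This is also why the connect benchmarks below come out slower than pyDispatcher's: the cost of the signature check moved from send time to connect time.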

Benchmarks and their explanation

To compare the performance of the older dispatcher with the new one, I wrote this benchmarking suite using some of the utilities provided by the Djangobench project. Here's a sample of the output of running all the benchmarks in that suite:

Running all benchmarks
Running 'no_kwargs_receiver' benchmark ...
Min: 0.000011 -> 0.000007: 1.5862x faster
Avg: 0.000012 -> 0.000009: 1.4045x faster
Significant (t=7.277109)
Stddev: 0.00000 -> 0.00000: 1.0950x smaller (N = 100)

Running 'connect_no_kwargs' benchmark ...
Min: 0.000007 -> 0.000013: 1.8621x slower
Avg: 0.000009 -> 0.000014: 1.6524x slower
Significant (t=-8.083380)
Stddev: 0.00000 -> 0.00001: 2.6941x larger (N = 100)

Running 'no_compatability_used' benchmark ...
Min: 0.000009 -> 0.000004: 2.1765x faster
Avg: 0.000011 -> 0.000005: 2.0548x faster
Significant (t=12.793539)
Stddev: 0.00000 -> 0.00000: 2.5308x smaller (N = 100)

Running 'dispatcher' benchmark ...
Min: 0.000009 -> 0.000004: 2.3125x faster
Avg: 0.000010 -> 0.000005: 2.1571x faster
Significant (t=15.301444)
Stddev: 0.00000 -> 0.00000: 2.6563x smaller (N = 100)

Running 'connect_accepts_kwargs' benchmark ...
Min: 0.000007 -> 0.000019: 2.7241x slower
Avg: 0.000009 -> 0.000022: 2.4746x slower
Significant (t=-7.760665)
Stddev: 0.00000 -> 0.00002: 4.3374x larger (N = 100)

Running 'proxied_signal' benchmark ...
Min: 0.000010 -> 0.000008: 1.2424x faster
Avg: 0.000011 -> 0.000009: 1.2551x faster
Significant (t=2.739335)
Stddev: 0.00001 -> 0.00001: 1.1157x smaller (N = 100)

Explanation

Before we delve into each individual benchmark, let's see what some of the above metrics are. The Significant parameter is calculated using a Student's two-sample, two-tailed t-test with alpha=0.95.

The other parameter we look at is the standard deviation of the results, which is a measure of the consistency of our benchmarks.
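The t value reported on each Significant line can be reproduced with a plain two-sample statistic (a sketch assuming a pooled-variance Student's t-test like the one Djangobench's utilities perform; the sample timings below are made up for illustration):

```python
from math import sqrt

def two_sample_t(sample1, sample2):
    """Pooled-variance two-sample Student's t statistic (sketch)."""
    n1, n2 = len(sample1), len(sample2)
    m1 = sum(sample1) / n1
    m2 = sum(sample2) / n2
    # Unbiased sample variances.
    v1 = sum((x - m1) ** 2 for x in sample1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in sample2) / (n2 - 1)
    pooled = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    return (m1 - m2) / sqrt(pooled * (1 / n1 + 1 / n2))

# Old timings clearly larger than new ones give a large positive t,
# which the harness then compares against the critical value for the
# chosen alpha to print "Significant".
old_times = [10.0, 11.0, 10.5, 10.2, 10.8]
new_times = [5.0, 5.2, 4.9, 5.1, 5.3]
t = two_sample_t(old_times, new_times)
```

A negative t, as in the connect benchmarks above, simply means the second sample's mean was larger, i.e. the new code was slower.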

  • no_kwargs_receiver

    This benchmark exercises the robust_apply path, calling receivers that do not accept a variable keyword arguments parameter. As expected, even with the refactor this benchmark is not a major improvement over the previous times, because we do not eliminate all of our bottlenecks here. It consistently gives us about 20% - 50% faster signals.

  • proxied signal

    This benchmark considers the case where we need to proxy signals that are defined as plain Python objects, i.e. are mere constants. As expected, here too the benchmark score is not that much faster than pyDispatcher, giving us about 25% - 30% faster signals on average.

  • dispatcher

    This benchmark is a raw, apples-to-apples comparison of the new Scrapy dispatcher measured against pyDispatcher. From it we can observe that the new dispatcher is about 90% - 110% faster than the previous implementation!

  • no compatability

    This benchmark was meant as a sanity check on the previous dispatcher test, since signals in Scrapy are not registered directly using the Signal class methods but are instead passed through the SignalManager. The SignalManager ensures compatibility with older-style signals and provides a standard wrapper that would let us completely rewrite the backend if needed (hint: exactly what happened here!).

  • connect_accepts_kwargs / connect_no_kwargs

    This benchmark compares the time it takes to connect a receiver to a signal in Scrapy. As you might observe, the performance here is actually slower than what we got with pyDispatcher. The caveat is that if one were to run the connect method through line_profiler to see where the majority of this time is spent, it is actually spent in verifying whether the receiver being connected has a variable keyword arguments parameter, and even more so in raising the deprecation warning for receivers that don't! However, if we go by raw timings rather than relative ones, this overhead is insignificant enough to ignore when compared to the massive savings in the actual send call.

Further scope

  • I'm currently working on writing better documentation for these parts than I had previously written. Even though I'd like to think I did a decent job documenting during the last two weeks of the coding period, my unfamiliarity with reStructuredText and my less-than-perfect grammar leave plenty of room for improvement in that area.

  • We can experiment more with caching, as that is one area where the dispatcher can definitely improve. Currently, the caching only works well when we are dealing with signals that have a fixed sender. I'll look to extend this to receivers that listen to all emitters in my search for other ways to speed up the library.

  • Squashing commits :), this is a big one since in all likelihood I'll end up stuck in rebase hell.
