@jdemaeyer jdemaeyer/ Secret
Created May 23, 2015

Simplified Scrapy Add-ons

This proposal aims to improve both user and extension developer experience of the Scrapy web scraping framework by implementing a simplified interface for managing extensions (middlewares, handlers, pipelines, exporters, etc.). It is based on SEP-021 and the accompanying discussion.

When implemented, extension management will be closer to "plug and play". Users will be able to enable and configure extensions in a simple and intuitive manner at a single entry point, while developers will gain easier control over configuration and dependency management.


Scrapy provides a broad variety of hooks that allow customising and enhancing its functionality: from simple settings to item pipelines to spider/downloader middlewares to drop-in replacements of core components, and many others. These hooks are accessible by editing their respective settings in the project's settings module. There, extensions as well as core components are enabled/disabled and configured.

The current extension management has several shortfalls, in particular regarding user-friendliness:

  • Enabling and disabling extensions is a different process for included extensions (scrapy.contrib) versus custom extensions
  • Larger extensions often use several of the above hooks, requiring the user to edit multiple settings in a coordinated fashion. This is error-prone and counterintuitive, especially for users with little experience in Python or the Scrapy internals
  • Extension developers have little control over ensuring their library dependencies and configuration requirements are met, especially since most extensions never 'see' a fully-configured crawler before it starts running
  • The user is burdened with supervising potential interplay of extensions, especially non-included ones, ranging from setting name clashes to mutually exclusive dependencies/configuration requirements

Add-ons seek to remedy these shortcomings by refactoring Scrapy's extension management, making it easy to use and transparent for users while giving more configuration control to developers.


This proposal aims at making extension management in Scrapy "plug and play" for users by:

  • providing a single, simple-to-use entry point, such as scrapy.cfg or the settings module, to enable/disable and configure extensions,
  • freeing the user from having to configure all settings required by an extension (while conserving her possibility to do so if she wishes), instead allowing her to simply activate the extension with a single line, and
  • automating the task of resolving dependencies and preventing unwanted extension interplay as far as possible.

At the same time, developers are granted better control over proper operation of their extension by providing them with mechanisms to:

  • impose configuration settings,
  • check and resolve dependencies, and
  • perform post-initialization tests.

All of these goals should be met while preserving full backwards-compatibility with the current way of extension management (through editing numerous settings in the settings module).

Summary of Current State

Extensions are enabled, disabled, and ordered, where appropriate, by directly editing (multiple) settings:

  • Custom spider and downloader middlewares, item pipelines, and generic extensions are enabled and ordered by adding an entry to the SPIDER_MIDDLEWARES, DOWNLOADER_MIDDLEWARES, ITEM_PIPELINES, or EXTENSIONS dictionary, where the key is the path to the extension class and the integer value determines the ordering:
    'myproject.pipelines.DummyPipeline': 0,
  • Built-in middlewares and generic extensions (as well as item pipelines, if there were any), on the other hand, can be enabled by setting a corresponding settings variable to True.
  • Built-in middlewares and generic extensions that are enabled by default can be disabled by setting their dictionary value to None:
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': None,

or, if available, by setting the corresponding settings variable to False.

  • Download handlers, feed exporters and storages are configured in the dictionaries DOWNLOAD_HANDLERS, FEED_EXPORTERS, and FEED_STORAGES. Here, the key corresponds to the protocol/file format being handled, and the value points to the handler/exporter/storage class. Again, disabling of built-in components is achieved by setting the dictionary value to None, or by overwriting it with the path of a custom class
  • Some core components can be replaced by placing the path of a drop-in replacement class in the corresponding settings variable (e.g. DUPEFILTER_CLASS, DOWNLOADER, or SCHEDULER)
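
To make the mechanisms listed above concrete, here is a small sketch of what a project's settings might contain today. The pipeline and cookies middleware paths are the ones quoted above; HTTPCACHE_ENABLED and COOKIES_ENABLED are picked here purely as examples of on/off switches and are not named in the list itself.

```python
# Illustrative excerpt of a project's settings under the current approach.

# Enable a custom item pipeline; the integer determines the ordering.
ITEM_PIPELINES = {
    'myproject.pipelines.DummyPipeline': 0,
}

# Enable a built-in middleware through its dedicated switch
# (example switch chosen for illustration).
HTTPCACHE_ENABLED = True

# Disable a built-in middleware that is enabled by default ...
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.cookies.CookiesMiddleware': None,
}

# ... or, where one exists, use the corresponding switch instead
# (example switch chosen for illustration).
COOKIES_ENABLED = False
```

Even this tiny example touches four different settings, which is the kind of coordination the proposal wants to take off the user's hands.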

Some extensions require multiple changes, such as those using more than one of the hooks provided by Scrapy (e.g. the popular scrapy-redis), or those that require disabling built-in middlewares (e.g. scrapy-multifeedexporter or scrapy-random-useragent).

When users wish to enable extensions, they manually edit the settings mentioned above in their project's settings module. More popular custom extensions typically come with an installation manual that documents the settings which need to be set. Developers have to rely on the user to resolve all dependencies and meet all configuration requirements.

Summary of Proposed Changes

Besides a minor, backwards-compatible change described below (introducing per-key priorities for dictionary-like configuration settings), there is no need to change the settings structure underlying Scrapy extensions. It is easy to understand once you spend some time on it, and it provides the flexibility required by Scrapy's versatile nature. However, users who simply want to enable a pipeline they've downloaded from PyPI should not be required to read up on Scrapy internals.

The user will be able to enable and configure add-ons in scrapy.cfg, with one section per add-on:

dir = /tmp/example/path

_source = /path/to/
some_setting = some_value

A new add-on manager class built into Scrapy will read the sections from scrapy.cfg. Each section name is the name of an add-on. Add-ons are .py files or Python modules containing a certain set of variables (with general information like name, version, etc.) and implementing certain callbacks described below. The add-on manager searches for the module with the given add-on name in a project subfolder addons, then in scrapy.contrib.addons, then in the Python search path. Additionally, the path of the add-on may be directly given by providing a certain setting name (e.g. _source) in the add-on section.
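
As a rough sketch of the first step described above, the snippet below reads add-on sections from a scrapy.cfg-style text using Python's standard `configparser`. Everything specific in it is an assumption made for illustration: the section names `myfilter` and `mypipeline`, their options, the example `_source` path, the helper name `read_addon_sections`, and the idea that the existing `[settings]` and `[deploy]` sections would simply be skipped.

```python
from configparser import ConfigParser

# Hypothetical scrapy.cfg content; section and option names are made up.
EXAMPLE_CFG = """
[settings]
default = myproject.settings

[myfilter]
min = 1
max = 10

[mypipeline]
_source = /tmp/example/addons/mypipeline.py
some_setting = some_value
"""

# Sections scrapy.cfg already uses for other purposes (assumption: the
# add-on manager would have to ignore these).
RESERVED_SECTIONS = {"settings", "deploy"}


def read_addon_sections(cfg_text):
    """Return {addon_name: {"source": path or None, "config": {...}}}."""
    parser = ConfigParser()
    parser.read_string(cfg_text)
    addons = {}
    for section in parser.sections():
        if section in RESERVED_SECTIONS:
            continue
        options = dict(parser.items(section))
        # The special _source option overrides the normal module search.
        source = options.pop("_source", None)
        addons[section] = {"source": source, "config": options}
    return addons


addons = read_addon_sections(EXAMPLE_CFG)
```

Add-ons without a `_source` entry would then be looked up in the project's addons subfolder, scrapy.contrib.addons, and the Python search path, in that order.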

The add-on manager is responsible for calling the add-on callbacks at an appropriate time in Scrapy's startup process and collecting information about all enabled add-ons, making it available as an attribute of the Crawler instance. It should also implement basic checks for dependency and setting clashes. However, a full-fledged dependency tree generation with automatic ordering of extensions, as suggested by nyov, is beyond the scope of this proposal.

Details of Proposed Changes

Per-key priorities for dictionary-like settings

Currently, there are two (related) issues with updating the dictionary-like variables which provide access to Scrapy's hooks, such as DOWNLOADER_MIDDLEWARES or EXTENSIONS, and both hinder implementing the proposed add-on management:

  1. As the dictionaries are completely overwritten when reading, it makes a large difference whether add-ons are granted access to the settings before or after reading in the settings module. This dependency on the order of updating settings has already been removed for non-dictionary-like settings by using priorities.
  2. However, the (relatively new) settings priorities only assign a single priority to a complete dictionary. This forbids updating a key that is still at its default setting (lowest priority) with a medium priority as soon as any other key has been updated with a higher priority.

This section briefly discusses both of these issues and proposes a simple change to solve both of them while retaining full backwards-compatibility.

Add-ons before or after the settings module?

If we allow the add-ons to change the dict-like settings before reading, all the changes they made will be lost when the same dictionary is also given in the settings file (since the setting is going to be overwritten). If we first read and then allow the add-ons to fiddle with the settings, it becomes very hard for the user to overwrite changes made by the add-ons if wished.

In fact, this problem is already present in Scrapy's current release: the dictionary-like settings given in the settings module should not overwrite, but merely update, the default settings. Therefore, for most dictionary-like configuration settings, e.g. DOWNLOADER_MIDDLEWARES, the default settings (scrapy.settings.default_settings) contain an additional setting with the same name and _BASE appended, e.g. DOWNLOADER_MIDDLEWARES_BASE, which contains the defaults for this setting. The middleware managers then construct the final ordered list (using the dictionary values) by calling the helper function scrapy.utils.conf.build_component_list, which as a first step updates a copy of the _BASE dictionary with the user-given dictionary.

In principle, we could follow the same approach for add-ons, forcing them to only work on additional dictionaries, this time with the suffix _ADDONS. The middleware managers would construct the final dictionary by first updating the _BASE dictionary with the _ADDONS one, and then updating the result with the user-given dictionary. However, this seems tedious and unnecessarily counterintuitive.

Single priority for complete dictionary

To release developers from having to update settings in a specific order, the concept of setting priorities was introduced in the course of last year's Summer of Code. Every configuration value now has a numeric priority value associated with it, and can only be overwritten when a call to the Settings.set() method provides an equal or higher priority.

Currently, there is only a single priority associated with a complete dict-like setting. As this contains no information on which keys exactly were changed with what priority, the dictionary priority is not very meaningful. It furthermore forbids overwriting any key of the dictionary, even if still at its default value, with a given priority as soon as some other key has been updated with a higher priority. To resolve this, every key should have its own priority associated with it.

Proposed solution: Replace dict with Settings

We can solve both of the above issues by letting the affected configuration settings be an instance of Settings, instead of an instance of dict. As the Settings class already provides a __getitem__() method, this will introduce no API change to reading these settings.

There are currently three places where the dict-like settings are written to:

  1. When defined in scrapy/settings/
  2. When reading from the settings module in the Settings.setmodule() method
  3. When combining the dictionaries with their _BASE in scrapy.utils.conf.build_component_list()

Scrapy's code could be updated in the following fashion with full backwards compatibility, even for non-intended uses (such as users working directly on a _BASE dictionary):

  1. Complete Settings dictionary-like interface by implementing:
  • __setitem__(self, k, v) method that will use some default priority (maybe 'project')
  • __iter__(self) method which returns an iterator over Settings.attributes
  • update(self, custom, priority = 'project') method that behaves like dict.update() while respecting mutability and priorities. If custom is a dict object, the given priority will be used for all keys. If it is a Settings object, the already existing per-key priority values will be used. The setdict() method should become a proxy to this (more general) method
  2. Deprecate _BASE dictionaries by replacing them with empty ones (for backwards-compatibility) and moving the default settings into the 'real' dictionary, i.e.

        'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
        # ...


SPIDER_MIDDLEWARES = Settings( values = {
    'scrapy.contrib.spidermiddleware.httperror.HttpErrorMiddleware': 50,
    # ...
    }, priority = 'default' )

3. Configuration in the settings module should be no different for the user. The `Settings.setmodule()` method should therefore recognise which of its attributes are themselves `Settings` instances, and call their respective `update()` methods instead of replacing them. Alternatively, this check could be done in the `SettingsAttribute.set()` method.

4. Introduce a small change to the `build_component_list()` helper function such that it works on `Settings` instances instead of on `dict`:
   def build_component_list(base, custom):
       # ...
       # OLD:  compdict = base.copy()
       compdict = Settings(base, priority = 'default')
       # As before:
       # ...

5. For the `1.0` release, the settings, middleware managers and `build_component_list()` helper function could be tidied up by removing support for the deprecated `_BASE` settings

These changes will solve the problem of having to update dictionary-like settings in a specific order, as already achieved for other types of settings. They will also clean up the default settings by deprecating the `_BASE` dictionaries, as we can now directly use the dictionaries without suffix for default settings and need not fear that they will be overwritten.
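
The per-key priority idea can be sketched with a toy container. This is not Scrapy's `Settings` class: the class name `PrioritySettings`, the numeric priority values, and the `'addon'` priority level are assumptions made only to illustrate how per-key priorities remove the dependence on update order.

```python
# Illustrative priority levels. 'default' and 'project' are named in the
# text; the numeric values and the 'addon' level are assumptions.
PRIORITIES = {"default": 0, "addon": 15, "project": 20}


class PrioritySettings:
    """Toy dict-like container that stores a priority next to every value."""

    def __init__(self, values=None, priority="default"):
        self._data = {}  # key -> (value, numeric priority)
        self.update(values or {}, priority)

    def _store(self, key, value, prio):
        # Only overwrite when the new priority is equal or higher.
        if key not in self._data or prio >= self._data[key][1]:
            self._data[key] = (value, prio)

    def set(self, key, value, priority="project"):
        self._store(key, value, PRIORITIES[priority])

    def __setitem__(self, key, value):
        self.set(key, value, "project")

    def __getitem__(self, key):
        return self._data[key][0]

    def __iter__(self):
        return iter(self._data)

    def update(self, custom, priority="project"):
        if isinstance(custom, PrioritySettings):
            # Keep the per-key priorities of the other instance.
            for key, (value, prio) in custom._data.items():
                self._store(key, value, prio)
        else:
            # Plain dict: apply the given priority to every key.
            for key, value in custom.items():
                self.set(key, value, priority)


middlewares = PrioritySettings({"a.A": 100, "b.B": 200}, priority="default")
middlewares.update({"a.A": None}, priority="project")          # user disables a.A
middlewares.update({"a.A": 50, "b.B": 250}, priority="addon")  # add-on runs later
```

Although the hypothetical add-on writes after the user, `a.A` stays disabled because the user's priority is higher, while `b.B`, still at its default, accepts the add-on's value. With a single priority for the whole dictionary, the second of these would not be possible.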

### Writing Add-ons

Add-ons are either `.py` files or modules that provide a certain set of variables and callback functions. Custom add-ons are proposed to be saved in an `addons` subfolder of the project, bundling the previously often used folders `pipelines`, `middlewares`, etc. (if wished).

It is *not* sufficient to simply require extensions (middlewares, pipelines, ...) to subclass a base `Addon` class. This would introduce the restriction that an add-on can only use a single Scrapy extension hook, again forcing the user to make multiple changes (i.e. enable multiple add-ons) for larger extensions.

Each add-on module or file must contain at least the following variables:

- `NAME`: string with human-readable add-on name
- `VERSION`: tuple containing major/minor/patchlevel version of the add-on

Additionally, it may (and should) provide one or more variables that can be used for automated detection of possible dependency clashes:

- `REQUIRES`: list of built-in or custom components needed by this add-on, as strings
- `MODIFIES`: list of built-in components whose functionality is affected or replaced by this add-on (a custom HTTP cache should list `httpcache` here)
- `PROVIDES`: list of components provided by this add-on (e.g. `mongodb` for an extension that provides generic read/write access to a MongoDB database, releasing other components from having to provide their own database access methods)

and one or more variables that can be used to check for possible configuration name clashes and provide information on what settings explicitly have to be set by the user (e.g. database passwords):

- `EXPOSED_SETTINGS`: list of (new) setting names that will be exposed into the global settings namespace
- `MINIMUM_CONFIGURATION_SETTINGS`: list of setting names that are *required* to be set by the user
- `MAPPED_SETTING_NAMES`: dictionary where keys represent setting names used in `scrapy.cfg`, and values correspond to the setting name used in the global settings namespace, e.g. `{ 'max': 'MYFILTER_FOOBAR' }`. If a setting name found in `scrapy.cfg` does not have an entry here, its exposed setting name will automatically be generated as the uppercase configuration name preceded by the section name, i.e. a `min` setting within the `[myfilter]` section will be exposed as `MYFILTER_MIN`
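
The name-mapping rule for `MAPPED_SETTING_NAMES` can be sketched as a small helper. The function name `exposed_name` is hypothetical; the two calls reuse the `[myfilter]` examples from the list above.

```python
def exposed_name(section, option, mapped_names=None):
    """Return the global setting name for an option from an add-on section."""
    mapped_names = mapped_names or {}
    if option in mapped_names:
        # An explicit mapping provided by the add-on wins.
        return mapped_names[option]
    # Fallback: uppercase option name preceded by the section name.
    return "{}_{}".format(section, option).upper()


mapping = {"max": "MYFILTER_FOOBAR"}
mapped = exposed_name("myfilter", "max", mapping)     # 'MYFILTER_FOOBAR'
generated = exposed_name("myfilter", "min", mapping)  # 'MYFILTER_MIN'
```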

Furthermore, it must provide two callbacks that developers can use to configure settings and check dependencies and configuration requirements:

- `update_settings(settings, addon_path)`: Sets configuration (such as default values for this add-on and required settings for other extensions) and enables needed components. Receiving the add-on path (i.e. the path in which Scrapy found this callback) will allow easily enabling components defined in the same module without having to worry about where it is to be found relative to the Scrapy project
- `check_configuration(crawler)`: Receives the fully-initialized `Crawler` instance before it starts running, performs additional dependency and configuration requirement checks

Should an exception be raised in one of these callbacks, Scrapy should print it and exit.
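
Putting the variables and callbacks together, an add-on module might look roughly like the sketch below. The add-on `myfilter`, its settings, the `MyFilterPipeline` component and the ordering value 500 are all invented for illustration. For the demonstration at the end, a plain `dict` and a tiny `FakeCrawler` class stand in for Scrapy's `Settings` and `Crawler` objects, so calls such as `setdefault()` are a property of the stand-in and not of the proposed API.

```python
# Hypothetical add-on module, e.g. saved as addons/myfilter.py.
NAME = "myfilter"
VERSION = (0, 1, 0)

REQUIRES = []
MODIFIES = []
PROVIDES = ["myfilter"]

EXPOSED_SETTINGS = ["MYFILTER_MIN", "MYFILTER_FOOBAR"]
MINIMUM_CONFIGURATION_SETTINGS = ["MYFILTER_FOOBAR"]
MAPPED_SETTING_NAMES = {"max": "MYFILTER_FOOBAR"}


def update_settings(settings, addon_path):
    """Set defaults and enable the component defined next to this callback."""
    settings.setdefault("MYFILTER_MIN", 0)
    pipelines = settings.setdefault("ITEM_PIPELINES", {})
    # addon_path lets us reference the component without knowing where the
    # add-on lives relative to the project.
    pipelines[addon_path + ".MyFilterPipeline"] = 500


def check_configuration(crawler):
    """Raise if a required setting is still missing once the crawler is ready."""
    missing = [name for name in MINIMUM_CONFIGURATION_SETTINGS
               if name not in crawler.settings]
    if missing:
        raise ValueError("myfilter: missing settings: " + ", ".join(missing))


# Stand-in for a Crawler object, for demonstration only.
class FakeCrawler:
    def __init__(self, settings):
        self.settings = settings


demo_settings = {"MYFILTER_FOOBAR": 10}
update_settings(demo_settings, "myproject.addons.myfilter")
check_configuration(FakeCrawler(demo_settings))  # passes silently
```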

### Add-on Manager

The add-on manager is a new core component which facilitates loading add-ons, gathering and providing information on them, calling their callbacks at appropriate times, and performing basic checks for dependency and configuration clashes.

#### Layout

A new `AddonManager` class is introduced, providing the following methods:

- `loadaddons(self, cfgfile)`:
  Parses and stores `cfgfile`; searches for, imports, and stores add-on modules; reads and collects add-on variables (`NAME`, `VERSION`, ...) and bundles them in a class attribute; calls `check_dependency_clashes()` and `check_name_clashes()`
- `check_dependency_clashes(self)`:
  Checks for possible dependency incompatibilities by inspecting the collected `REQUIRES`, `MODIFIES` and `PROVIDES` add-on variables
- `check_name_clashes(self)`:
  Checks for name clashes using the collected `EXPOSED_SETTINGS` variables
- **`update_settings(self, settings)`**:
  Calls `export_cfg()`, then `update_settings()` method of every enabled add-on
- `export_cfg(self, settings)`:
  Figures out the correct global setting names for settings read from `scrapy.cfg` and adds them to the `Settings` object
- **`check_configuration(self, crawler)`**:
  Saves add-on information collected to `Crawler.addons` attribute, calls `check_configuration()` method of every enabled add-on
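
The two basic checks could work along the lines of the following sketch. It operates on a plain dictionary of collected add-on variables, not on an actual `AddonManager`. The function names, the add-on names, the `DB_URI` setting and the `builtin` parameter (components Scrapy itself is assumed to provide) are all hypothetical.

```python
from collections import defaultdict


def find_name_clashes(addons):
    """Return {setting_name: [addon names]} for settings exposed more than once."""
    owners = defaultdict(list)
    for name, info in addons.items():
        for setting in info.get("EXPOSED_SETTINGS", []):
            owners[setting].append(name)
    return {s: sorted(names) for s, names in owners.items() if len(names) > 1}


def find_missing_requirements(addons, builtin=()):
    """Return {addon name: [required components that nobody provides]}."""
    provided = set(builtin)
    for info in addons.values():
        provided.update(info.get("PROVIDES", []))
    missing = {}
    for name, info in addons.items():
        unmet = [r for r in info.get("REQUIRES", []) if r not in provided]
        if unmet:
            missing[name] = unmet
    return missing


# Made-up metadata, as loadaddons() might have collected it.
collected = {
    "mongopipeline": {"REQUIRES": ["mongodb"], "EXPOSED_SETTINGS": ["DB_URI"]},
    "sqlpipeline": {"REQUIRES": ["httpcache"], "EXPOSED_SETTINGS": ["DB_URI"]},
}

name_clashes = find_name_clashes(collected)
unmet = find_missing_requirements(collected, builtin=["httpcache"])
```

Here both add-ons expose `DB_URI`, and no enabled add-on provides `mongodb`, so both checks would report a problem. Checks involving `MODIFIES` are left out of this sketch.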

#### Integration into existing start-up process

The question where the add-on manager should be instantiated and when the add-on callbacks should be called is not as obvious as it may seem at first glance. Outlined below are three options and their rationale:

##### Outside of crawler

Apart from the different priority and level of direct user involvement, settings made by the add-on callbacks are no different from settings made in the settings module. The callbacks should therefore be called around the same time that the settings module is loaded. This puts the instantiation of an `AddonManager` instance outside of the `Crawler` instance.

Given that the project settings are read in the `get_project_settings()` function in `scrapy.utils.project`, this seems a reasonable place to call `AddonManager.update_settings()`. However, we cannot instantiate the add-on manager within this function, as the function returns (and the manager would therefore be lost) long before the crawler becomes ready (when we wish to call the second round of add-on callbacks).

There are two possible approaches to instantiating the add-on manager outside of `get_project_settings()`:

1. In a normal Scrapy start via the command-line tool, calling `get_project_settings()` is embedded into the `execute()` function in `scrapy.cmdline`. In summary, the changes to this function would be (with analogous changes in `scrapy/`, where necessary, for backwards-compatibility):

  1. Instantiate add-on manager before calling `get_project_settings()`
  2. Pass add-on manager to `get_project_settings()` when calling it (the function then calls `update_settings()`).
  3. Connect the manager's `check_configuration` method to the `engine_started` signal (this could also be done in the add-on manager's `__init__()` method)

2. Alternatively, we could (grudgingly) introduce another singleton to Scrapy (e.g. `scrapy.addonmanager.addonmanager`). This would allow moving the above code related to add-on management into the more appropriate `get_project_settings()` function.

Integrating add-on management outside of the crawler ensures that settings management, except for spider-specific settings, remains within a rather small part of the start-up process (before instantiating a `Crawler`): the helper function `get_project_settings()` indeed keeps returning the full set of project (i.e. non-spider) settings. The downside is that it either introduces a new singleton or clutters up a function (`execute()`) that should be more about parsing command line options and starting a crawler process than about loading add-ons and settings.

##### Inside of crawler

The settings used to crawl are not complete until the spider-specific settings have been loaded in `Crawler.__init__()` (proposed by PR #1128, currently still in `CrawlerRunner._create_crawler()`). Add-on management could follow this approach and only start loading add-ons when the crawler is initialised.

Instantiation and the call to `AddonManager.update_settings()` would happen in `Crawler.__init__()`. The final checks (i.e. the callback to `AddonManager.check_configuration()`) could then again either be tied to the `engine_started` signal, or coded into the `Crawler.crawl()` method after creating the engine.

Integrating add-on management inside of the crawler avoids introducing a new singleton or cluttering up the `execute()` function, but rips apart the process of compiling the complete configuration settings. This is especially critical since many of the settings previously made in the settings module will move to `scrapy.cfg` with the implementation of this proposal, and may prove backwards-incompatible since the `get_project_settings()` helper function no longer works as expected (and as its name suggests).

### Updating Existing Extensions

Add-on modules will be written for all built-in extensions and placed in `scrapy.contrib.addons`.

In principle, the `contrib` module could be further cleaned up by moving a lot of the existing extension code into the add-on modules, e.g. the `OffsiteMiddleware` class definition would move from `scrapy.contrib.spidermiddleware.offsite` into its corresponding add-on module in `scrapy.contrib.addons.offsite`, next to the add-on callback functions. For backwards compatibility, an import from the new location would remain at the original place of the extension.

For those extensions that are disabled by default but still included in the dictionary-like settings, moving to the add-on structure would deprecate the corresponding `_ENABLED` setting. Instead, they are no longer included in the dict-like settings by default, but activated through a section in `scrapy.cfg` just like custom add-ons.

## Stretch Goal: Command Line Interface to Add-On Management

*By my planning (see proposed timeline below), this proposal should allow some extra time to also implement the add-on management convenience CLI introduced below. However, I have marked it as a stretch goal to leave a buffer for unplanned complications or different implementation ideas of the Scrapy maintainers and community, e.g. more robust dependency checks. This section is therefore not as detailed in the proposed implementation as the previous ones.*

Currently, activating a pre-made add-on requires downloading its files, inserting them into the Scrapy project folders at appropriate places (except when downloaded from PyPI) and editing the settings module. This stretch goal proposes implementing new convenience Scrapy commands to ease and automate most of these tasks.

The commands would be bundled under an `addons` command, i.e. coded in `scrapy/commands/` and aid the user in downloading, updating, and configuring add-ons, taking advantage of some of the add-on attributes outlined above. The `addons` command provides various sub-commands:

- `scrapy addons download [addon]`:
  where `[addon]` is either the URL of a Python add-on (a `.py` file or a module), a GitHub repository, or a generic string (not representing a URL or repository). In the first two cases, the file(s) will be downloaded and saved into the `addons` subfolder. In the latter case, the command will try downloading first `scrapy-[addon]`, then `[addon]` from PyPI using `pip` and (as a fallback) `easy_install`.
- `scrapy addons update [addon]`:
  Similar to `download`. Will first use the add-on manager to load the local add-on (from the `addons` subfolder, then `scrapy.contrib.addons`, then the Python search path) and read out its version. If the version available online is higher, will delete the local files and re-download or (in case of a PyPI package) upgrade using `pip` or `easy_install`.
- `scrapy addons enable [addon]`:
  Loads local `[addon]` using the add-on manager and reads out its `MINIMUM_CONFIGURATION_SETTINGS` variable. If the add-on is not found, asks the user whether to download it. Guides the user through setting the basic configuration (again using `MINIMUM_CONFIGURATION_SETTINGS`) and updates `scrapy.cfg` (without overwriting any settings already present that are not in `MINIMUM_CONFIGURATION_SETTINGS`).

- `scrapy addons disable [addon]`:
  Either deletes or comments out add-on section in `scrapy.cfg` (commenting it out bears the advantage of not losing the configuration for re-enabling it later).
- `scrapy addons configure [addon]`:
  Guides the user through configuring the `MINIMUM_CONFIGURATION_SETTINGS` again.

- `scrapy addons reset [addon]`:
  Removes all configuration for `[addon]` from `scrapy.cfg`, then guides the user through configuring the `MINIMUM_CONFIGURATION_SETTINGS` again.

This will allow the user to download, enable, and be guided through configuring a new add-on by calling a single command.
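
As one example of what these sub-commands would have to do, the sketch below shows the "comment out" variant of `scrapy addons disable`. It works on the text of a configuration file and is only an illustration: the helper name `comment_out_section` and the sample sections are made up, and a real implementation would also need to handle details such as comments that already exist in the file.

```python
def comment_out_section(cfg_text, addon):
    """Comment out the [addon] section, leaving all other lines untouched."""
    header = "[{}]".format(addon)
    result = []
    inside = False
    for line in cfg_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("[") and stripped.endswith("]"):
            # Every section header starts or ends the region to be disabled.
            inside = (stripped == header)
        if inside and stripped:
            line = "# " + line
        result.append(line)
    return "\n".join(result)


cfg = (
    "[settings]\n"
    "default = myproject.settings\n"
    "\n"
    "[myfilter]\n"
    "min = 1\n"
    "\n"
    "[other]\n"
    "x = 2"
)
disabled = comment_out_section(cfg, "myfilter")
```

Because the section is only commented out, re-enabling the add-on later amounts to removing the leading `# ` again, and the previous configuration is preserved.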

## Proposed Timeline

#### **Now - 27 April 2015**: Before Accepted Student Proposals Announcement

- Deepen understanding of Scrapy internals, submitting pull requests for smaller issues and further participating in issue discussions, especially those related to this proposal. This will also help me familiarize myself with common procedures in open source development.
- Set up a blog on my homepage to write about progress

#### **27 April - 25 May 2015**: Community Bonding Period

- Discuss details of approaches with mentor(s) and the community
- Draft a more extensive version of SEP-021
- Agree on how to report, review, and get feedback on my progress during the summer

#### **25 May - 3 June 2015** (1.5 weeks): Replace `dict` settings with `Settings` instances
- Complete `dict`-like interface of `Settings` by implementing `__setitem__()`, `__iter__()`, and `update()` methods
- Update the default settings to use `Settings` instances instead of dictionaries
- Update either `Settings.setmodule()` or `SettingsAttribute.set()` method to identify which settings are themselves instances of `Settings`, and call the `update()` method on these instead of replacing them
- Update `build_component_list()` helper function

#### **4 June - 10 June 2015** (1 week): Test, document and submit new settings
- Write corresponding unit tests and documentation for changes above
- As this is independent from further changes, submit pull request to allow feedback from a greater community

#### **11 June - 21 June 2015** (1.5 weeks): Implement sample add-on and part of add-on manager

- Write a dummy add-on module that can be used for testing and developing the add-on manager
- Write `loadaddons()`, `update_settings()`, `export_cfg()`, and `check_configuration()` methods of add-on manager
- In parallel, write tests for these methods

#### **22 June - 1 July 2015 (mid-term evaluation)** (1.5 weeks): Complete add-on manager

- Write `check_dependency_clashes()` and `check_name_clashes()` methods of add-on manager
- In parallel, write tests for these methods
- Write complete add-on manager documentation and tidy up code to make it all shiny for the mid-term evaluation

#### **2 July 2015 - 8 July 2015** (1 week): Integrate into existing start-up process

- Update existing code base so it uses the add-on manager
- Write and update tests where necessary

#### **9 July - 19 July 2015** (1.5 weeks): Write add-ons for existing extensions

- Write add-on modules to be placed in `scrapy/contrib/addons`
- Move code from existing locations (spread out) into add-on modules if appropriate (i.e. if only used by this add-on), provide imports from new location at old location for backwards compatibility
- Submit pull requests for add-on management to allow feedback from a greater community

#### **20 July - 29 July 2015** (1.5 weeks): Write download/update parts for add-ons CLI

- Write basic structure of `addons` command class
- Write code to download and update add-ons from URLs, PyPI and GitHub, where possible using the `pkgtools` interface to PyPI and reusing code from `pip`

#### **30 July - 9 August 2015** (1.5 weeks): Write management/configuration parts of add-ons CLI

- Write code to update `scrapy.cfg` for enable/disable/configuring commands

#### **10 August - 16 August 2015 ('pencils down')** (1 week): Test and document add-ons CLI

- Write tests for `addons` command class
- Document new command

#### **17 August 2015 - 21 August 2015 (final evaluation)**: Polish all the things!

- Clean up code, tests, and documentation
- Submit pull request for add-ons CLI to allow feedback from a greater community

## About Myself

Hi! I am a 25-year-old Master's student in Physics hailing from Göttingen, at the very heart of Germany.

Coming from a C background, I switched to Python as my primary programming language about four years ago and have used it on a daily basis ever since. While usage in my studies focused primarily on numerical simulations, data analysis and image processing, I have used Scrapy and Django (among many other libraries and frameworks) in a couple of private projects.

While I have worked on code collaboratively before, I sadly have yet to make my first large contribution to an open source project. Time to change that! :)

## Code Samples

- Pull Request for per-key priorities in settings
- Pull Request for a small convenience function added to Scrapy's `Settings` class
- Partial proof of concept for refactoring Scrapy's signaling backend (see issue 8)