Google Summer of Code 2014 Proposal

Scrapy Project's Proposal

This proposal intends to add support for a new Scrapy feature, per-spider settings, which will require a significant core API cleanup. It's based on a careful revision of the Scrapy Enhancement Proposal Sep019 draft regarding this project.

Motivation

  • Decouple major Scrapy components and make the API easier to use and develop.
  • Add a needed feature that will extend configurability and could allow Scrapy to be used as a library, without a preexisting project.

Summary of changes

  • A new custom_settings class method will be added to spiders, to give them a chance to override settings.
  • Settings class will be split into two classes: SettingsLoader and SettingsReader, and a new concept of "setting priority" will be added.
  • SPIDER_MODULES and SPIDER_MANAGER_CLASS settings will be removed and replaced by entries on scrapy.cfg. Thus spider managers won't need project settings to configure themselves.
  • Spider managers will keep the functionality for loading spider classes (with a new load method that will return a spider class given its name), but spider initialization will be delegated to crawlers (with a new from_crawler class method in spiders, which will give them direct access to crawlers).
  • The spider manager will be stripped out of the Crawler class, as the crawler will no longer need it.
  • CrawlerProcess will be removed, since crawlers will be created independently with a required spider class and an optional SettingsReader instance.

Settings

The Settings class will be split into two classes, SettingsLoader and SettingsReader. This will avoid the current possible misconception that you can change settings after they have been populated. There'll be a new concept of settings priorities, and settings.overrides will be deprecated in favor of explicitly loaded settings with priorities, which makes settings overriding independent of the order in which settings are set.

Because of this, CrawlerSettings (with its overrides, settings_module and defaults) will be removed, though its interface could be maintained for backward compatibility in SettingsReader (it can't live in SettingsLoader, since an overrides dictionary and settings with priorities don't get along with a consistent implementation). Even that is not advisable, since it breaks the read-only nature of the class.

With the new per-spider settings, there's a need for a helper function that takes a spider and returns a SettingsReader instance populated with the defaults, the project settings and the given spider's settings. The motive behind this is that get_project_settings can't continue to be used for getting the settings instance to pass to the crawler when using the API directly (instead of running from the command line). get_project_settings will become an internal function because of that.

SettingsLoader

SettingsLoader is going to populate settings at startup, then it'll be converted to a SettingsReader instance and discarded afterwards.

It is supposed to be write-only, but many previously loaded settings need to be accessed before freezing them. For example, the COMMANDS_MODULE setting allows loading additional command default settings. Another example is that we need to read the LOG_* settings early, because we must be able to log errors during the settings-loading process. ScrapyCommands may be configured based upon the current settings, as users can plug in custom commands. These are some of the reasons that suggest we need read-write access for this class.

  • Will have a method set(name, value, priority) to register a setting with a given priority. A setdict(dict, priority) method may come handy for loading project and per-spider settings.
  • Will have current Settings getter functions (get, getint, getfloat, getdict, etc.) (See above for reasons behind this).
  • Will have a freeze method that returns an instance of SettingsReader, with a copy of the current state of settings (already prioritized).

SettingsReader

It's intended to be the one used by the core, extensions, and all components that read settings without modifying them. Because there are objects that do legitimately change settings, such as ScrapyCommands, the use cases of each settings class will be comprehensively explained.

New crawlers will be created with an instance of this class (the one returned by the freeze method on the already populated SettingsLoader), because they are not expected to alter the settings.

It'll be read-only, keeping the same getter methods of the current Settings class (get, getint, getfloat, getdict, etc.). There could be a set method that throws a descriptive error for debugging compatibility, avoiding its inadvertent usage.
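A minimal sketch of how these two classes could fit together, assuming a simple dictionary-backed implementation (the class and method names follow this proposal; the internals are only illustrative, not the final design):

class SettingsReader(object):
    """Read-only view of the settings, created by SettingsLoader.freeze()."""

    def __init__(self, values):
        self._values = dict(values)

    def get(self, name, default=None):
        return self._values.get(name, default)

    def getint(self, name, default=0):
        return int(self.get(name, default))

    def set(self, *args, **kwargs):
        # Kept only to fail loudly when legacy code tries to modify settings
        # after they have been frozen.
        raise AttributeError("Settings are frozen; use SettingsLoader to modify them")


class SettingsLoader(object):
    """Write-mostly container used while settings are being populated."""

    def __init__(self):
        self._values = {}      # setting name -> value
        self._priorities = {}  # setting name -> priority it was set with

    def set(self, name, value, priority):
        # A value only takes effect if its priority is at least as high as the
        # stored one, so the load order doesn't matter across priorities
        # (equal priorities let the later call win).
        if priority >= self._priorities.get(name, -1):
            self._values[name] = value
            self._priorities[name] = priority

    def setdict(self, values, priority):
        for name, value in values.items():
            self.set(name, value, priority)

    def get(self, name, default=None):
        # Read access is still needed before freezing (e.g. COMMANDS_MODULE
        # and the LOG_* settings), hence the read-write nature of this class.
        return self._values.get(name, default)

    def freeze(self):
        # Return a read-only copy with priorities already resolved.
        return SettingsReader(self._values)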

Setting priorities

There will be 5 setting priorities used by default:

  • 0: global defaults (those in scrapy.settings.default_settings)
  • 10: per-command defaults (for example, shell runs with KEEP_ALIVE=True)
  • 20: project settings (those in settings.py)
  • 30: per-spider settings (those returned by Spider.custom_settings class method)
  • 40: command line arguments (those passed in the command line)
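As a concrete illustration (continuing the SettingsLoader sketch above, with made-up values), a setting registered at a higher priority wins regardless of the order of the set calls:

settings = SettingsLoader()
settings.set("DOWNLOAD_DELAY", 0.5, priority=20)   # project settings.py
settings.set("DOWNLOAD_DELAY", 5.0, priority=30)   # per-spider settings
settings.set("DOWNLOAD_DELAY", 0.25, priority=10)  # per-command default, ignored
frozen = settings.freeze()
assert frozen.get("DOWNLOAD_DELAY") == 5.0  # the per-spider value wins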

There are a couple of issues here:

  • The SCRAPY_PICKLED_SETTINGS_TO_OVERRIDE and SCRAPY_{settings} environment variables, which need to be deprecated: they can be kept, with a new or an existing priority.
  • We could have different priorities for settings passed with the -s option and other named options in the command line (for example, -s LOG_ENABLED=False --loglevel=ERROR will set LOG_ENABLED to True, because named options are applied later in the current implementation), but since the processing of command line options is done in one place, we could leave them with the same priority and depend on the order of the set calls just for this case.

Deprecated code

The scrapy.conf.settings singleton is a deprecated implementation of settings loading. It could be maintained as it is, but the singleton should implement the new SettingsReader interface in order to keep working.

Spider manager

Currently, the spider manager is part of the crawler, which creates a cyclic loop between settings and spiders, and it shouldn't belong there. Spiders should be loaded outside and passed to the crawler object, which will require a spider class to be instantiated.

This new spider manager will not have access to the settings (they won't be loaded yet) so it will use scrapy.cfg to configure itself.

The scrapy.cfg would look like this:

[settings]
default = myproject.settings

[spiders]
manager = scrapy.spidermanager.SpiderManager
modules = myproject.spiders
  • manager replaces the SPIDER_MANAGER_CLASS setting and, if omitted, will default to scrapy.spidermanager.SpiderManager
  • modules replaces SPIDER_MODULES setting and will be required

These ideas translate to the following changes:

  • __init__(spider_modules) -> __init__(). spider_modules will be looked up in scrapy.cfg.
  • create('spider_name', **spider_kwargs) -> load('spider_name'). This will return a spider class, not an instance. It's basically a lookup in self._spiders.
  • All remaining functions should be deprecated or removed accordingly, since a crawler reference is no longer needed.
  • New helper get_spider_manager_class_from_scrapycfg in scrapy/utils/spidermanager.py (a sketch of it follows below).
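A rough sketch of how that helper could look, assuming scrapy.cfg is read with the standard library's ConfigParser (path lookup and error handling are simplified here, and the implementation details are only illustrative):

import ConfigParser  # Python 2 standard library, matching Scrapy at the time
from importlib import import_module

def get_spider_manager_class_from_scrapycfg(path='scrapy.cfg'):
    """Return the spider manager class declared in the [spiders] section."""
    cfg = ConfigParser.SafeConfigParser()
    cfg.read([path])
    if cfg.has_section('spiders') and cfg.has_option('spiders', 'manager'):
        dotted_path = cfg.get('spiders', 'manager')
    else:
        # Mirrors the default stated above when the manager option is omitted.
        dotted_path = 'scrapy.spidermanager.SpiderManager'
    module_path, _, class_name = dotted_path.rpartition('.')
    return getattr(import_module(module_path), class_name)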

Spiders

A new class method custom_settings is proposed, that could be used to override project and default settings before they're used to instantiate the crawler:

class MySpider(BaseSpider):

    @classmethod
    def custom_settings(cls):
        return {
            "DOWNLOAD_DELAY": 5.0,
            "RETRY_ENABLED": False,
        }

This will only involve a set call with the corresponding priority when populating SettingsLoader.

Regarding the API cleanup, a new from_crawler class method will be added to spiders, to give them a chance to access settings, stats, or the crawler core components themselves. This should become the standard way to create a spider from now on (instead of instantiating it directly, as is currently done).
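A sketch of what that hook could look like on the base spider (the exact signature and attribute assignments are assumptions of this proposal, not settled API):

class BaseSpider(object):

    def __init__(self, name=None, **kwargs):
        self.name = name
        self.__dict__.update(kwargs)

    @classmethod
    def from_crawler(cls, crawler, **spider_kwargs):
        spider = cls(**spider_kwargs)
        # The spider keeps a reference to the crawler and, through it, gains
        # access to settings, stats, signals and the other core components.
        spider.crawler = crawler
        spider.settings = crawler.settings
        return spider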

Scrapy commands

As already stated, ScrapyCommands modify the settings, so they need a reference to the SettingsLoader instance in order to do that.

The present process_options implementations on the base and other commands read and override settings. These overrides should be changed to set calls with the allocated priority.
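For instance, a command could translate its options into prioritized set calls roughly like this (a sketch only; the option attributes and the priority value of 40 follow the table above, but are not a final design):

class Command(object):  # i.e. ScrapyCommand

    def process_options(self, args, opts):
        # Previously: self.settings.overrides['LOG_LEVEL'] = opts.loglevel
        if opts.loglevel:
            self.settings.set('LOG_LEVEL', opts.loglevel, priority=40)
        # -s NAME=VALUE options end up with the same command-line priority
        for name, _, value in (s.partition('=') for s in opts.set or []):
            self.settings.set(name, value, priority=40)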

Each command with a custom run method should be modified to reflect the new refactored API (particularly the crawl command).

CrawlerProcess

CrawlerProcess should be removed because the Scrapy crawl command no longer supports running multiple spiders. The preferred way of doing that is using the API manually, instantiating a separate Crawler for each spider, so CrawlerProcess has lost its usefulness.

This change is not directly related to the project (it's not focused on settings, but it fits in the API clean-up task), and it's a great opportunity to make it since we're changing the crawling startup flow.

This class will be deleted and its attributes and methods will be merged into Crawler. To that effect, these are the specific merges and removals:

  • self.crawlers doesn't make sense in this new setup; each reference will be replaced with self.
  • create_crawler will become Crawler.__init__
  • _start_crawler will be merged into Crawler.start
  • start will be merged into Crawler.crawl, but the latter will need an extra parameter start_reactor (default: True) to crawl with or without starting the Twisted reactor (this is needed by commands.shell in order to start the reactor in another thread). A sketch of this follows below.
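A condensed sketch of how the merged class could end up looking (the reactor handling and the configure step are heavily simplified and only meant to show where start_reactor fits):

from twisted.internet import reactor

class Crawler(object):

    def __init__(self, spidercls, settings=None):
        self.spidercls = spidercls
        self.settings = settings
        self.configure()  # what used to happen around CrawlerProcess.create_crawler

    def configure(self):
        pass  # build the engine, extensions, etc. from self.settings

    def crawl(self, start_reactor=True, **spider_kwargs):
        self.spider = self.spidercls.from_crawler(self, **spider_kwargs)
        # ... schedule the spider's start requests on the engine ...
        if start_reactor:
            # commands.shell would pass start_reactor=False and start the
            # reactor in another thread by itself.
            reactor.run()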

Startup process

This summarizes the current and the newly proposed mechanisms for starting up a Scrapy crawler. Imports and non-representative functions are omitted for brevity.

Current (old) startup process

# execute in cmdline

# loads settings.py, returns CrawlerSettings(settings_module)
settings = get_project_settings()
settings.defaults.update(cmd.default_settings)

cmd.crawler_process = CrawlerProcess(settings)
cmd.run # (In a _run_print_help call)

    # Command.run in commands/crawl.py

    crawler = self.crawler_process.create_crawler()
    spider = crawler.spiders.create(spider_name, **spider_kwargs)
    crawler.crawl(spider)
    self.crawler_process.start() # starts crawling spider

        # CrawlerProcess._start_crawler in crawler.py

        crawler.configure()

Proposed (new) startup process

# execute in cmdline

smcls = get_spider_manager_class_from_scrapycfg()
sm = smcls() # loads spiders from module defined in scrapy.cfg
spidercls = sm.load(spider_name) # returns spider class, not instance

settings = get_project_settings() # loads settings.py
settings.setdict(cmd.default_settings, priority=40)

settings.setdict(spidercls.custom_settings(), priority=30)

settings = settings.freeze()
cmd.crawler = Crawler(spidercls, settings=settings)

    # Crawler.__init__ in crawler.py

    self.configure()

cmd.run # (In a _run_print_help call)

    # Command.run in commands/crawl.py

    self.crawler.crawl(**spider_kwargs)

        # Crawler.crawl in crawler.py

        spider = self.spidercls.from_crawler(self, **spider_kwargs)
        # starts crawling spider

Proposed Timeline

Until April 21: Accepted student proposals announced

  • Get confident with Scrapy internals, as the suggested changes will require a deep understanding of all core components and will touch a significant number of the project's code files.
  • Submit more pull requests fixing bugs and improving code coverage while I'm reviewing Scrapy's code. These contributions will help me get more involved in the development process, and could make my future tasks easier.
  • Respond to further questions from the organization regarding my proposal, if any.

April 21 - May 18: Community Bonding Period

  • Wrap up the needed changes and their implementation plan for the project in a design document, with support from mentors and the community. Special effort should be put into analyzing the trade-offs of backward-incompatible changes.
  • Agree with my mentor on how to approach and develop each task, and how to report and review my progress.
  • Keep submitting pull requests to the repository.

May 19 (Official Coding Start) - June 1: Settings Changes

  • Agree on and code the interface of the new Settings classes.
  • Code logic of methods and auxiliary helpers.
  • Make the settings unit tests consistent with the new usage.
  • Add needed tests to ensure that the older load settings order is preserved (and any additional tests to maintain test coverage).
  • Adjust current documentation and extend it if needed.

June 2 - June 22: Crawler Changes

  • Agree on the crawler changes needed to remove CrawlerProcess and to separate the crawler from SpiderManager and Spider creation.
  • Write down the decided modifications, adapt current tests and documentation, and extend both as required. Because the crawl process will only make sense once all components follow the proposed new API, integration tests are expected to fail at this stage.

June 23 (Midterm evaluation on 27) - July 6: Other Components Changes

  • Implement the already decided crawler decoupling on both Spider and SpiderManager (which will greatly simplify them under the present proposal), adjusting tests and documentation as needed.
  • Patch the command line's execute with the new settings loading and crawler instantiation.
  • Rewrite the crawl Scrapy command to conform with the new API (other commands will be modified later).

July 7 - July 13: Integration Tests

  • The internal restructuring will be tested by crawling from the command line. The expected deliverable of this milestone is the first clean run with the refactored API, fixing bugs in the work already done.

July 14 - August 3: Consistency Checks

  • Make usage of new API consistent across all Scrapy code.
  • Document the engineered interface exhaustively, and provide examples of how to use it in the Scrapy common practices topic.

August 4 - August 10: Per-spider Settings Implementation

  • Code per-spider settings as the design document dictates. At this point this should be easy: just a call from the command line's execute when populating settings, and a helper function to merge them with the default and project settings.
  • Document new settings policies and how to override settings with spiders.

August 11 - August 18: 'Pencils down' date

As Google suggested, this week is scheduled for scrubbing code, writing missing tests and improving documentation.

August 18 - August 22: Final evaluation
