@alexcepoi
Last active April 14, 2023 05:17
Scrapy Contracts Evolution

This gist presents some possible improvements to Scrapy contracts. All of them could potentially be implemented, but I'm curious which are good or bad ideas. Any of them could break existing custom contracts.

1. One callback can have multiple @url contracts

... or multiple contracts which generate requests in general

def parse_response(self, response):
    """
    @url http://example.org/foo
    @url http://example.org/bar
    @returns items 1 1
    """
    return MyItem(url=response.url)

@url is a contract that generates one request; @returns is a contract with a post-hook that checks that the callback returns exactly 1 item. @returns then applies to each request generated by @url individually, so each of the two requests must return 1 item.
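To make the pairing concrete, here is a minimal sketch in plain Python of how every request-generating contract could be matched with all the checking contracts. It does not use Scrapy's real contract classes; `parse_contracts` and `plan_tests` are hypothetical names made up for illustration, and treating only `url` as request-generating is an assumption.

```python
import re

# Hypothetical helper, not part of Scrapy: matches "@name arg1 arg2 ..." lines.
CONTRACT_RE = re.compile(r"^\s*@(\w+)\s*(.*)$")


def parse_contracts(docstring):
    """Return a list of (name, args) tuples, one per contract line."""
    contracts = []
    for line in docstring.splitlines():
        match = CONTRACT_RE.match(line)
        if match:
            name, args = match.groups()
            contracts.append((name, args.split()))
    return contracts


def plan_tests(docstring, request_contracts=("url",)):
    """Pair each request-generating contract with every checking contract."""
    contracts = parse_contracts(docstring)
    generators = [c for c in contracts if c[0] in request_contracts]
    checks = [c for c in contracts if c[0] not in request_contracts]
    # Each generator gets its own copy of the full list of checks.
    return [(generator, list(checks)) for generator in generators]


DOC = """
@url http://example.org/foo
@url http://example.org/bar
@returns items 1 1
"""

for (name, args), checks in plan_tests(DOC):
    print(name, args, "->", checks)
```

Under this sketch the docstring above yields two independent test cases, one per @url, and each carries the same @returns check.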

2. One contract can generate multiple requests which are handled in batches

def search(self, keywords):
    """
    @custom kw1 kw2
    @url http://example.org/bar
    @returns items 2 2
    """
    for kw in keywords:
        yield Request('http://example.org/%s' % kw, callback=self.parse_response)

In this case, @custom is a contract that returns a list of requests, which is treated as a batch. @returns then applies to each batch: the batch of requests returned by @custom (http://example.org/kw1 and http://example.org/kw2) and the batch returned by @url (just one request: http://example.org/bar) must each return 2 items.
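The batch semantics can be sketched in plain Python as follows. Everything here is an assumption for illustration: `custom_batch`, `url_batch`, `check_batches` and `fake_callback` are hypothetical names, URLs stand in for real requests, and the stand-in callback simply returns one item per request.

```python
def custom_batch(args):
    # Assumed behaviour of the hypothetical @custom contract:
    # one request (represented by its URL) per keyword.
    return ["http://example.org/%s" % kw for kw in args]


def url_batch(args):
    # @url produces a batch that contains a single request.
    return [args[0]]


def check_batches(batches, callback, min_items, max_items):
    """Run the callback over each batch and check the item count per batch."""
    results = {}
    for name, urls in batches.items():
        items = [item for url in urls for item in callback(url)]
        results[name] = min_items <= len(items) <= max_items
    return results


def fake_callback(url):
    # Stand-in for a spider callback: returns exactly one item per request.
    return [{"url": url}]


batches = {
    "custom": custom_batch(["kw1", "kw2"]),
    "url": url_batch(["http://example.org/bar"]),
}
print(check_batches(batches, fake_callback, 2, 2))
```

With a callback that returns one item per request, the @custom batch satisfies `@returns items 2 2` while the single-request @url batch does not. That shows the count is checked per batch rather than per request.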

3. Multiple sets of contracts in one callback

This is useful when one method should be tested against multiple scenarios.

def parse_response(self, response):
    """
    @url http://example.org/foo
    @returns items 1 1

    <!-- some sort of separator -->

    @url http://example.org/bar
    @returns items 0 0
    """
    pass

Here the first @returns contract applies only to http://example.org/foo, and the second applies only to http://example.org/bar. For the separator there are a few options:

  1. blank line
  2. @@
  3. other ideas?
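
As a rough comparison of the first two separator options, the following plain-Python sketch splits a docstring into independent sets of contract lines. `split_contract_sets` is a hypothetical helper, not an existing Scrapy function, and its `separator` parameter is made up for illustration.

```python
def split_contract_sets(docstring, separator="blank"):
    """Split a docstring into groups of contract lines.

    separator="blank" starts a new group at every blank line;
    separator="@@" starts a new group at every line that is exactly "@@".
    Lines that do not start with "@" are ignored.
    """
    groups, current = [], []
    for raw in docstring.splitlines():
        line = raw.strip()
        is_separator = (line == "") if separator == "blank" else (line == "@@")
        if is_separator:
            if current:
                groups.append(current)
                current = []
        elif line.startswith("@"):
            current.append(line)
    if current:
        groups.append(current)
    return groups


BLANK_DOC = """
@url http://example.org/foo
@returns items 1 1

@url http://example.org/bar
@returns items 0 0
"""

AT_DOC = """
@url http://example.org/foo
@returns items 1 1
@@
@url http://example.org/bar
@returns items 0 0
"""

print(split_contract_sets(BLANK_DOC, separator="blank"))
print(split_contract_sets(AT_DOC, separator="@@"))
```

Both calls produce the same two groups. One difference worth weighing: with the blank-line option, an existing docstring that already has blank lines between its contracts would silently be split into several sets, whereas an explicit `@@` marker only takes effect where it is written.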