Scraping 101 (with Scrapy)

Scrapy takes care of the bulk of the work involved in the common steps of web scraping, so you only need to concentrate on where and how to retrieve the information you want. It can also cache HTTP requests (via its HttpCacheMiddleware), so you only need to hit the target site once and subsequent re-runs are quick.
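Note that the HTTP cache is opt-in - a minimal sketch of the relevant settings.py entries (the values shown for HTTPCACHE_DIR and HTTPCACHE_EXPIRATION_SECS are the defaults):

# settings.py - enable Scrapy's HTTP cache so re-runs replay cached
# responses instead of hitting the site again
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'      # cache location, relative to the project directory
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire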

Target Site

Tesco's Store Locator

Target URL

http://www.tesco.com/storeLocator/

Determining Store IDs

The target page uses XMLHttpRequest to fetch new content via JavaScript.

Tracing what's happening in Firebug's Console, it looks like the Rad (radius) parameter is increased every time "See More Results" is clicked:

  • storeLocator/sf.asp?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=0.50&storeType=all
  • storeLocator/sf.asp?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=0.58&storeType=all

Manually increase this value to something massive and you get back a 704KB JSON response, from which you can easily extract the ids of all their stores.
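For example, a quick standalone sketch of pulling the ids out (assuming, as the spider code below relies on, that the JSON is a list of store objects each carrying a bID field):

import json
import urllib2

# The store search URL with the radius cranked up to cover everything
SEARCH_URL = ('http://www.tesco.com/storeLocator/sf.asp'
              '?Lat=55.83518069499999&Lng=-3.21980894099998'
              '&Rad=999999&storeType=all')

stores = json.load(urllib2.urlopen(SEARCH_URL))
store_ids = [store['bID'] for store in stores]
print '%d stores found' % len(store_ids)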

Determining Content URLs

When you click on a result, a URL like the following is requested via XMLHttpRequest:

  • storeLocator/get.store.belowMap.asp?bID=3004

This returns partial HTML containing all the information we need, so we just need to construct these URLs using the bID values from the JSON data, as sketched below.
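A minimal sketch of that construction (3004 is the Penicuik store's id, which also appears in the markup in the next section):

STORE_URL = 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=%s'

# Build the content URL for a given store id
url = STORE_URL % 3004
# -> 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=3004'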

Extracting Content

The partial page content is nicely marked up with semantic tags and descriptive ids, so it's easy enough to identify and extract the information we're after:

...
<h2 id="storename" title="3004"><span class="logo"></span>Penicuik Superstore</h2>
<div id="mainContent">
  <div id="blockLeft">
    <address id="rAddr">9 EDINBURGH ROAD,<br />PENICUIK,<br />MIDLOTHIAN,<br />EH26 8NP.</address>
    <div id="rType">Store type: Superstore</div>
    ...

For example, the XPath expression to pull out the store's name is //h2[@id="storename"]/text() and the address is //address[@id="rAddr"]/text() - other content can be grabbed in the same way.

Firebug has an $x() function in its Console which you can use to test these XPath expressions before you put them in your Scrapy spider, e.g.:

>>> $x('//h2[@id="storename"]/text()')
[<TextNode textContent="Penicuik Superstore">]
>>> $x('//address[@id="rAddr"]/text()')
[<TextNode textContent="9 EDINBURGH ROAD,">,
 <TextNode textContent="PENICUIK,">,
 <TextNode textContent="MIDLOTHIAN,">,
 <TextNode textContent="EH26 8NP.">]
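You can do the same sort of testing from Scrapy's interactive shell, which fetches a URL and constructs an HtmlXPathSelector (hxs) for the response - a session against one of the store content URLs might look like this:

$ scrapy shell 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=3004'
...
>>> hxs.select('//h2[@id="storename"]/text()').extract()
[u'Penicuik Superstore']
>>> hxs.select('//address[@id="rAddr"]/text()').extract()
[u'9 EDINBURGH ROAD,', u'PENICUIK,', u'MIDLOTHIAN,', u'EH26 8NP.']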

Code

This is the guts of the code you'd need to do all this in Scrapy (to be tested):

import json

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class TescoStore(Item):
    name = Field()
    address = Field()
    # ...

STORE_URL = 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=%s'

class TescoSpider(BaseSpider):
    name = 'tesco_store_locator'
    allowed_domains = ['tesco.com']
    # The store search URL, with the radius cranked up to cover everything
    start_urls = ['http://www.tesco.com/storeLocator/sf.asp?Lat=55.614681359&Lng=-2.804570662&Rad=999999&storeType=all']

    def parse(self, response):
        # The search response is JSON - a list of store objects, each with a bID
        stores = json.loads(response.body)
        for store in stores:
            # Request the partial HTML page for each store
            yield Request(url=STORE_URL % store['bID'], callback=self.parse_store)

    def parse_store(self, response):
        hxs = HtmlXPathSelector(response)
        store = TescoStore()
        store['name'] = hxs.select('//h2[@id="storename"]/text()').extract()[0]
        store['address'] = '\n'.join(s.strip() for s in hxs.select('//address[@id="rAddr"]/text()').extract())
        # ...
        yield store

Running & Exporting Results

scrapy crawl tesco_store_locator --set FEED_URI=tesco_stores.csv --set FEED_FORMAT=csv
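The same feed export mechanism handles other formats too - e.g. for JSON output:

scrapy crawl tesco_store_locator --set FEED_URI=tesco_stores.json --set FEED_FORMAT=json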