Scraping 101 (with Scrapy)

Scrapy takes care of the bulk of the work involved in the common steps of web scraping, so you only need to concentrate on where and how to retrieve the information you want. It can also cache HTTP requests (via its HttpCacheMiddleware), so you only need to hit the target site once and subsequent re-runs are quick.
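Note that the HTTP cache is opt-in - a minimal sketch of the relevant settings.py entries (the values shown for HTTPCACHE_DIR and HTTPCACHE_EXPIRATION_SECS are the defaults):

# settings.py - enable Scrapy's HTTP cache so re-runs replay cached
# responses instead of hitting the site again
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'      # cache location, relative to the project directory
HTTPCACHE_EXPIRATION_SECS = 0    # 0 means cached responses never expire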

Target Site

Tesco's Store Locator

Target URL

http://www.tesco.com/storeLocator/

Determining Store IDs

The target page uses XMLHttpRequest to fetch new content via JavaScript.

Tracing what's happening in Firebug's Console, it looks like the Rad (radius) parameter is increased every time "See More Results" is clicked:

  • storeLocator/sf.asp?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=0.50&storeType=all
  • storeLocator/sf.asp?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=0.58&storeType=all

Manually increase this value to something massive and you get back a 704KB JSON response, from which you can easily extract the ids of all their stores.
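For example, a quick standalone sketch of pulling the ids out (assuming, as the spider code below relies on, that the JSON is a list of store objects each carrying a bID field):

import json
import urllib2

# The store search URL with the radius cranked up to cover everything
SEARCH_URL = ('http://www.tesco.com/storeLocator/sf.asp'
              '?Lat=55.83518069499999&Lng=-3.21980894099998'
              '&Rad=999999&storeType=all')

stores = json.load(urllib2.urlopen(SEARCH_URL))
store_ids = [store['bID'] for store in stores]
print '%d stores found' % len(store_ids)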

Determining Content URLs

When you click on a result, a URL like the following is requested via XMLHttpRequest:

  • storeLocator/get.store.belowMap.asp?bID=3004

This returns partial HTML containing all the information we need, so we just need to construct these URLs using the bID values from the JSON data, as sketched below.
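A minimal sketch of that construction (3004 is the Penicuik store's id, which also appears in the markup in the next section):

STORE_URL = 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=%s'

# Build the content URL for a given store id
url = STORE_URL % 3004
# -> 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=3004'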

Extracting Content

The partial page content is nicely marked up with semantic tags and descriptive ids, so it's easy enough to identify and extract the information we're after:

...
<h2 id="storename" title="3004"><span class="logo"></span>Penicuik Superstore</h2>
<div id="mainContent">
  <div id="blockLeft">
    <address id="rAddr">9 EDINBURGH ROAD,<br />PENICUIK,<br />MIDLOTHIAN,<br />EH26 8NP.</address>
    <div id="rType">Store type: Superstore</div>
    ...

For example, the XPath expression to pull out the store's name is //h2[@id="storename"]/text() and the address is //address[@id="rAddr"]/text() - other content can be grabbed in the same way.

Firebug has an $x() function in its Console which you can use to test these XPath expressions before you put them in your Scrapy spider, e.g.:

>>> $x('//h2[@id="storename"]/text()')
[<TextNode textContent="Penicuik Superstore">]
>>> $x('//address[@id="rAddr"]/text()')
[<TextNode textContent="9 EDINBURGH ROAD,">,
 <TextNode textContent="PENICUIK,">,
 <TextNode textContent="MIDLOTHIAN,">,
 <TextNode textContent="EH26 8NP.">]
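You can do the same sort of testing from Scrapy's interactive shell, which fetches a URL and constructs an HtmlXPathSelector (hxs) for the response - a session against one of the store content URLs might look like this:

$ scrapy shell 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=3004'
...
>>> hxs.select('//h2[@id="storename"]/text()').extract()
[u'Penicuik Superstore']
>>> hxs.select('//address[@id="rAddr"]/text()').extract()
[u'9 EDINBURGH ROAD,', u'PENICUIK,', u'MIDLOTHIAN,', u'EH26 8NP.']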

Code

This is the guts of the code you'd need to do all this in Scrapy (to be tested):

import json

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

class TescoStore(Item):
    name = Field()
    address = Field()
    # ...

STORE_URL = 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=%s'

class TescoSpider(BaseSpider):
    name = 'tesco_store_locator'
    allowed_domains = ['tesco.com']
    # The store search URL, with the radius cranked up to cover everything
    start_urls = ['http://www.tesco.com/storeLocator/sf.asp?Lat=55.614681359&Lng=-2.804570662&Rad=999999&storeType=all']

    def parse(self, response):
        # The search response is JSON - a list of store objects, each with a bID
        stores = json.loads(response.body)
        for store in stores:
            # Request the partial HTML page for each store
            yield Request(url=STORE_URL % store['bID'], callback=self.parse_store)

    def parse_store(self, response):
        hxs = HtmlXPathSelector(response)
        store = TescoStore()
        store['name'] = hxs.select('//h2[@id="storename"]/text()').extract()[0]
        store['address'] = '\n'.join(s.strip() for s in hxs.select('//address[@id="rAddr"]/text()').extract())
        # ...
        yield store

Running & Exporting Results

scrapy crawl tesco_store_locator --set FEED_URI=tesco_stores.csv --set FEED_FORMAT=csv
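The same feed export mechanism handles other formats too - e.g. for JSON output:

scrapy crawl tesco_store_locator --set FEED_URI=tesco_stores.json --set FEED_FORMAT=json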