Scrapy takes care of the bulk of the work involved in the common steps of web scraping, so you only need to concentrate on the where and how of retrieving the information you want - it can also cache HTTP requests, so you only need to hit the target site once and subsequent re-runs are quick.
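Caching isn't enabled out of the box, though; here's a minimal sketch of switching it on in your project's settings.py (setting names as in recent Scrapy releases - older versions enabled the cache slightly differently):

# settings.py - cache responses on disk so repeated crawls during
# development don't hit the live site again
HTTPCACHE_ENABLED = True
HTTPCACHE_DIR = 'httpcache'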
- Target Site: Tesco's Store Locator
- Target URL:
The target page uses XMLHttpRequest to fetch new content via JavaScript.
Tracing the requests in Firebug's console, it looks like a radius variable is increased every time "See More Results" is clicked, like so:
- storeLocator/sf.asp?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=0.50&storeType=all
- storeLocator/sf.asp?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=0.58&storeType=all
Manually increasing this value to something massive, you end up with a 704KB JSON response, from which you can easily extract the IDs of all their stores.
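As a quick sanity check outside Scrapy, you could grab the whole list with a throwaway script along these lines (the bID field and the list-of-objects structure are assumptions based on the spider code further down):

import json
import urllib2

# Hypothetical one-off check: an enormous radius returns every store
url = ('http://www.tesco.com/storeLocator/sf.asp'
       '?Lat=55.83518069499999&Lng=-3.21980894099998&Rad=999999&storeType=all')
stores = json.loads(urllib2.urlopen(url).read())
print [store['bID'] for store in stores]  # one id per store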
When you click on a result, the following sort of URL is requested via XMLHttpRequest:
- storeLocator/get.store.belowMap.asp?bID=3004
This returns the partial HTML which contains all the information we need, so we just need to construct these sorts of URLs using the JSON data.
The partial page content is nicely marked up with semantic tags and descriptive ids, so it's easy enough to identify and extract the information:
...
<h2 id="storename" title="3004"><span class="logo"></span>Penicuik Superstore</h2>
<div id="mainContent">
  <div id="blockLeft">
    <address id="rAddr">9 EDINBURGH ROAD,<br />PENICUIK,<br />MIDLOTHIAN,<br />EH26 8NP.</address>
    <div id="rType">Store type: Superstore</div>
...
For example, the XPath to pull out the store's name is //h2[@id="storename"]/text() and the address is //address[@id="rAddr"]/text() - other content can be grabbed in the same way.
Firebug has an $x() function in its Console which you can use to test these XPath expressions before you put them in your Scrapy spider, e.g.:
>>> $x('//h2[@id="storename"]/text()')
[<TextNode textContent="Penicuik Superstore">]
>>> $x('//address[@id="rAddr"]/text()')
[<TextNode textContent="9 EDINBURGH ROAD,">,
<TextNode textContent="PENICUIK,">,
<TextNode textContent="MIDLOTHIAN,">,
<TextNode textContent="EH26 8NP.">]
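You can do the same sort of interactive testing with Scrapy's own shell, which exposes an hxs selector for the fetched page - something like this, using the bID from the example markup above:

$ scrapy shell 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=3004'
>>> hxs.select('//h2[@id="storename"]/text()').extract()
[u'Penicuik Superstore']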
This is the guts of the code you'd need to do all this in Scrapy (to be tested):
from scrapy.item import Item, Field

class TescoStore(Item):
    name = Field()
    address = Field()
    # ...
import json

from scrapy.http import Request
from scrapy.selector import HtmlXPathSelector
from scrapy.spider import BaseSpider

STORE_URL = 'http://www.tesco.com/storeLocator/get.store.belowMap.asp?bID=%s'

class TescoSpider(BaseSpider):
    name = 'tesco_store_locator'
    allowed_domains = ['tesco.com']
    # A single request with a huge radius returns every store as JSON
    start_urls = ['http://www.tesco.com/storeLocator/sf.asp?Lat=55.614681359&Lng=-2.804570662&Rad=999999&storeType=all']

    def parse(self, response):
        stores = json.loads(response.body)
        for store in stores:
            # Request the partial-HTML detail page for each store id
            yield Request(url=STORE_URL % store['bID'], callback=self.parse_store)

    def parse_store(self, response):
        hxs = HtmlXPathSelector(response)
        store = TescoStore()
        store['name'] = hxs.select('//h2[@id="storename"]/text()').extract()[0]
        # Join the address text nodes (split by <br />) into one multi-line string
        store['address'] = '\n'.join(s.strip() for s in hxs.select('//address[@id="rAddr"]/text()').extract())
        # ...
        yield store
Finally, run the spider, telling Scrapy to export the scraped items as CSV:
scrapy crawl tesco_store_locator --set FEED_URI=tesco_stores.csv --set FEED_FORMAT=csv