monkeini/items.py

## gistfile1.md

      
Simple Spider Task

Pre-requisites


python 2.7 (or 2.6 may suffice)
scrapy 0.16.3 and its dependencies
pyquery

Aims


crawl online retailer oxygenboutique.com for appropriate product pages
return items representing products
output in json format

Tests


coverage: the set of urls deemed to be appropriate product pages
crawl efficiency: ratio of items scraped to requests made
item particulars: the exact item contents for a subset of products
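
The first two tests can be approximated locally before submitting. The sketch below is only an illustration: the function names are hypothetical, and it assumes you have already collected the item and request counts and the two URL lists yourself.

```python
def crawl_efficiency(items_scraped, requests_made):
    """Return items scraped per request made (0.0 when no requests were made)."""
    if requests_made <= 0:
        return 0.0
    return float(items_scraped) / requests_made


def coverage_diff(found_urls, expected_urls):
    """Compare crawled product URLs against an expected set.

    Returns a (missing, unexpected) pair of sets: URLs that should have
    been scraped but were not, and URLs scraped that were not expected.
    """
    found, expected = set(found_urls), set(expected_urls)
    return expected - found, found - expected
```

A ratio close to 1.0 would mean almost every request produced an item; listing pages and the start page will always pull it below that.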

Instructions


run "scrapy startproject oxygendemo"
update items.py in oxygendemo/oxygendemo to match items.py (below)
create oxygen.py in oxygendemo/oxygendemo/spiders and populate with skeleton class (below)
write crawling rules (these few lines are a big part of the task - you want to crawl the site in an efficient way)

to find appropriate category listing pages
to identify individual product pages (this rule should have a callback='parse_item')


fill out parse_item method to populate the item's fields (one method per field)
(import from standard python libraries where required, but nothing external other than what's already imported)
run "scrapy crawl oxygenboutique.com -o items.json -t json"
when satisfied, upload scrapy project to github, or oxygen.py to gist
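
As a starting point for the product-page rule, the two example URLs in this document share a "p-<number>-<slug>.aspx" shape. The sketch below tests URLs against that shape using only the standard library. The pattern is an assumption drawn from those two examples, not a verified description of the whole site, and the helper name is hypothetical. A similar expression could plausibly be passed as the `allow` argument of the link extractor in the rule that carries callback='parse_item'.

```python
import re

# Assumption: product pages look like ".../p-<digits>-<slug>.aspx",
# as in the two example URLs. Other shapes may exist on the real site.
PRODUCT_URL_RE = re.compile(r'/p-\d+-[^/]+\.aspx$')


def is_product_url(url):
    """Return True when the URL matches the assumed product-page shape."""
    return PRODUCT_URL_RE.search(url) is not None
```

Checking candidate patterns like this against a handful of known URLs is cheaper than running a full crawl to find out that a rule is too broad or too narrow.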

Examples

This url: http://www.oxygenboutique.com/p-1022-the-looker-skinny-jeans-in-cream-for-a-day.aspx
Could yield an item dictionary:
    {'code': 'p-1022-the-looker-skinny-jeans-in-cream-for-a-day',
    'description': "The Looker Skinny Cream for a Day by Mother Denim. Skinny jeans with a little stretch! These dreamy cream jeans are a snug fit and have the 'M' embroidered onto the back pockets. There is also a touch of edge with ripped detail on the front and back pockets, left thigh and bottom of the jeans. 98% cotton, 2% elastane. Machine wash cold with like colours. Do not bleach, tumble dry low. Iron on medium heat if necessary or dry clean.",
    'designer': 'Mother Denim',
    'gbp_price': '205.0',
    'gender': 'F',
    'image_urls': ['http://oxygenboutique.com/images/PRODUCT/large/The Looker Skinny Jeans in Cream for a Day_1_.jpg',
                   'http://oxygenboutique.com/images/PRODUCT/large/The Looker Skinny Jeans in Cream for a Day_2_.jpg'],
    'name': u'The Looker Skinny Jeans in Cream for a Day',
    'raw_color': u'cream',
    'sale_discount': 50.0,
    'source_url': 'http://www.oxygenboutique.com/p-1022-the-looker-skinny-jeans-in-cream-for-a-day.aspx',
    'stock_status': {'24': 3},
    'type': 'A'}
Or this url: http://www.oxygenboutique.com/p-1587-lexi-tee.aspx
Could yield an item dictionary:
    {'code': 'p-1587-lexi-tee',
    'description': 'Lexi Tee by Gryphon NY. This simple tee shape features an artistic and abstract embroidered detail. A bright blue and white contrast against the navy cotton in a striped pattern. This tee is a cute light layer for your looks this season. 100% cotton. Specialty dry clean only, low heat.',
    'designer': 'Gryphon NY',
    'gbp_price': '285.0',
    'gender': 'F',
    'image_urls': ['http://oxygenboutique.com/images/PRODUCT/large/Lexi Tee_1_.jpg',
                   'http://oxygenboutique.com/images/PRODUCT/large/Lexi Tee_2_.jpg',
                   'http://oxygenboutique.com/images/PRODUCT/large/Lexi Tee_3_.jpg'],
    'name': u'Lexi Tee',
    'raw_color': u'navy',
    'sale_discount': 0.0,
    'source_url': 'http://www.oxygenboutique.com/p-1587-lexi-tee.aspx',
    'stock_status': {'XS': 3, 'S': 3, 'M': 3, 'L': 3},
    'type': 'A'}
Note that some fields are clear-cut (gbp_price must equal 285.0), whereas others are open to interpretation (e.g. description, where arguably we could have included either more or less than in the sample here, and code, which should just be an identifier unique to this retailer).
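
In both examples the code field is the last path segment of the product URL with the .aspx extension removed. One possible way to derive it is sketched below. The function name is hypothetical, and any other identifier unique to the retailer would be equally acceptable.

```python
import posixpath

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse  # Python 2, as used by this task


def code_from_url(url):
    """Return the last path segment of the URL without its file extension."""
    path = urlparse(url).path
    return posixpath.splitext(posixpath.basename(path))[0]
```

Parsing the URL first means a query string or fragment, if one ever appears, does not leak into the code.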
Field details


type, try and make a best guess, one of:

'A' apparel
'S' shoes
'B' bags
'J' jewelry
'R' accessories


gender, one of:

'F' female
'M' male


designer - manufacturer of the item
code - unique identifier from a retailer perspective
name - short summary of the item
description - fuller description and details of the item
raw_color - best guess of what colour the item is (can be blank if unidentifiable)
image_urls - list of urls of large images representing the item
gbp_price - full (non-discounted) price of the item
sale_discount - percentage discount for sale items where applicable
stock_status - dictionary of sizes to stock status

1 - out of stock
3 - in stock


source_url - url of product page
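
The two numeric conventions above (a percentage for sale_discount, and 1 or 3 for stock_status) can be captured in small helpers. This is a minimal sketch. The helper names and their inputs (a separately extracted sale price, and a mapping of size to an availability flag) are assumptions about what your parsing code produces, not something this task prescribes.

```python
OUT_OF_STOCK = 1
IN_STOCK = 3


def sale_discount(full_price, sale_price=None):
    """Return the percentage discount off the full price.

    Returns 0.0 when no sale price is given, matching the non-sale example.
    """
    if sale_price is None or full_price <= 0:
        return 0.0
    return round((full_price - sale_price) / float(full_price) * 100, 2)


def stock_status(size_availability):
    """Map {size: available?} to {size: 3 (in stock) or 1 (out of stock)}."""
    return dict(
        (size, IN_STOCK if available else OUT_OF_STOCK)
        for size, available in size_availability.items()
    )
```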


## items.py
from scrapy.item import Item, Field

class OxygendemoItem(Item):
    type = Field()
    gender = Field()
    designer = Field()
    code = Field()
    name = Field()
    description = Field()
    raw_color = Field()
    image_urls = Field()
    gbp_price = Field()
    sale_discount = Field()
    stock_status = Field()
    source_url = Field()

## oxygen.py
from scrapy.contrib.spiders import CrawlSpider
from scrapy.contrib.spiders import Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

from oxygendemo.items import OxygendemoItem

import pyquery

class OxygenSpider(CrawlSpider):
    name = "oxygenboutique.com"
    allowed_domains = ["oxygenboutique.com"]
    start_urls = ['http://www.oxygenboutique.com']

    rules = (
        # insert rules here
    )

    def parse_item(self, response):
        self.pq = pyquery.PyQuery(response.body)
        item = OxygendemoItem()
        # populate item fields here
        return item