@monkeini
Last active July 7, 2020 17:51
Simple scraping task using Scrapy and PyQuery

Simple Spider Task

Pre-requisites

  • python 2.7 (or 2.6 may suffice)
  • scrapy 0.16.3 +dependencies
  • pyquery

Aims

  • crawl online retailer oxygenboutique.com for appropriate product pages
  • return items representing products
  • output in json format

Tests

  • coverage: the set of urls deemed to be appropriate product pages
  • crawl efficiency: ratio of items scraped to requests made
  • item particulars: the exact item contents for a subset of products
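
As a rough illustration of the crawl efficiency measure above, the ratio can be computed from two counts taken from the crawl log. This is a minimal sketch: the function name `crawl_efficiency` and its arguments are hypothetical, not part of Scrapy or of the task's test harness.

```python
def crawl_efficiency(items_scraped, requests_made):
    """Return the ratio of items scraped to requests made.

    Both arguments are plain counts (assumed to be read off the crawl log).
    A value close to 1.0 means almost every request produced an item.
    """
    if requests_made <= 0:
        raise ValueError("requests_made must be a positive count")
    # float() keeps the division exact-ish on Python 2 as well as Python 3
    return float(items_scraped) / requests_made
```

For example, 50 items from 200 requests gives 0.25, so three out of four requests were spent on pages that produced no item.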

Instructions

  • run "scrapy startproject oxygendemo"
  • update items.py in oxygendemo/oxygendemo to match items.py (below)
  • create oxygen.py in oxygendemo/oxygendemo/spiders and populate with skeleton class (below)
  • write crawling rules (these few lines are a big part of the task - you want to crawl the site in an efficient way)
    • to find appropriate category listing pages
    • to identify individual product pages (this rule should have a callback='parse_item')
  • fill out parse_item method to populate the item's fields (one method per field)
  • (import from standard python libraries where required, but nothing external other than what's already imported)
  • run "scrapy crawl oxygenboutique.com -o items.json -t json"
  • when satisfied, upload scrapy project to github, or oxygen.py to gist
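
One way to think about the product-page rule is as a regular expression over URLs. The sketch below uses only the standard library and assumes that product pages look like the two example URLs in the Examples section (`/p-<digits>-<slug>.aspx`); that pattern is inferred from those examples alone and may not cover every product on the site. `PRODUCT_URL_RE` and `is_product_url` are hypothetical names.

```python
import re

# Assumed shape of a product URL, inferred from the two example URLs only:
#   http://www.oxygenboutique.com/p-1587-lexi-tee.aspx
PRODUCT_URL_RE = re.compile(r'/p-\d+-[^/?#]+\.aspx$')


def is_product_url(url):
    """Return True if the URL looks like an individual product page."""
    return PRODUCT_URL_RE.search(url) is not None
```

A pattern of this kind could presumably be supplied as the `allow` argument of the link extractor in the product-page rule, with a second, broader rule for category listing pages.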

Examples

This url: http://www.oxygenboutique.com/p-1022-the-looker-skinny-jeans-in-cream-for-a-day.aspx Could yield an item dictionary:

    {'code': 'p-1022-the-looker-skinny-jeans-in-cream-for-a-day',
    'description': "The Looker Skinny Cream for a Day by Mother Denim. Skinny jeans with a little stretch! These dreamy cream jeans are a snug fit and have the 'M' embroidered onto the back pockets. There is also a touch of edge with ripped detail on the front and back pockets, left thigh and bottom of the jeans. 98% cotton, 2% elastane. Machine wash cold with like colours. Do not bleach, tumble dry low. Iron on medium heat if necessary or dry clean.",
    'designer': 'Mother Denim',
    'gbp_price': '205.0',
    'gender': 'F',
    'image_urls': ['http://oxygenboutique.com/images/PRODUCT/large/The Looker Skinny Jeans in Cream for a Day_1_.jpg',
                   'http://oxygenboutique.com/images/PRODUCT/large/The Looker Skinny Jeans in Cream for a Day_2_.jpg'],
    'name': u'The Looker Skinny Jeans in Cream for a Day',
    'raw_color': u'cream',
    'sale_discount': 50.0,
    'source_url': 'http://www.oxygenboutique.com/p-1022-the-looker-skinny-jeans-in-cream-for-a-day.aspx',
    'stock_status': {'24': 3},
    'type': 'A'}

Or this url: http://www.oxygenboutique.com/p-1587-lexi-tee.aspx Could yield an item dictionary:

    {'code': 'p-1587-lexi-tee',
    'description': 'Lexi Tee by Gryphon NY. This simple tee shape features an artistic and abstract embroidered detail. A bright blue and white contrast against the navy cotton in a striped pattern. This tee is a cute light layer for your looks this season. 100% cotton. Specialty dry clean only, low heat.',
    'designer': 'Gryphon NY',
    'gbp_price': '285.0',
    'gender': 'F',
    'image_urls': ['http://oxygenboutique.com/images/PRODUCT/large/Lexi Tee_1_.jpg',
                   'http://oxygenboutique.com/images/PRODUCT/large/Lexi Tee_2_.jpg',
                   'http://oxygenboutique.com/images/PRODUCT/large/Lexi Tee_3_.jpg'],
    'name': u'Lexi Tee',
    'raw_color': u'navy',
    'sale_discount': 0.0,
    'source_url': 'http://www.oxygenboutique.com/p-1587-lexi-tee.aspx',
    'stock_status': {'XS': 3, 'S': 3, 'M': 3, 'L': 3},
    'type': 'A'}

Note that some fields are clear-cut (gbp_price must equal 285.0), whereas others are open to interpretation: description, for example, could arguably include more or less text than the sample here, and code only needs to be an identifier unique to this retailer.
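
In both examples the code is simply the last path segment of the product URL with the `.aspx` extension removed. A minimal sketch of that derivation follows; `code_from_url` is a hypothetical helper name, and any other retailer-unique identifier would satisfy the field just as well.

```python
def code_from_url(url):
    """Derive an item code from a product URL, as in the examples above."""
    # Drop any query string or fragment before looking at the path
    path = url.split('?', 1)[0].split('#', 1)[0]
    # Keep only the last path segment, e.g. 'p-1587-lexi-tee.aspx'
    last_segment = path.rstrip('/').rsplit('/', 1)[-1]
    # Strip the page extension if present
    if last_segment.endswith('.aspx'):
        last_segment = last_segment[:-len('.aspx')]
    return last_segment
```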

Field details

  • type - best guess at the product category, one of:
    • 'A' apparel
    • 'S' shoes
    • 'B' bags
    • 'J' jewelry
    • 'R' accessories
  • gender, one of:
    • 'F' female
    • 'M' male
  • designer - manufacturer of the item
  • code - unique identifier from a retailer perspective
  • name - short summary of the item
  • description - fuller description and details of the item
  • raw_color - best guess of what colour the item is (can be blank if unidentifiable)
  • image_urls - list of urls of large images representing the item
  • gbp_price - full (non-discounted) price of the item
  • sale_discount - percentage discount for sale items where applicable
  • stock_status - dictionary of sizes to stock status
    • 1 - out of stock
    • 3 - in stock
  • source_url - url of product page
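
The two numeric conventions above (sale_discount as a percentage of the full price, stock_status as 1 or 3 per size) can be sketched as small helpers. This assumes the full and sale prices have already been parsed to floats and that availability per size is known as a boolean; the helper names and the constants `IN_STOCK` and `OUT_OF_STOCK` are hypothetical.

```python
IN_STOCK = 3       # stock_status value for an available size
OUT_OF_STOCK = 1   # stock_status value for an unavailable size


def sale_discount(full_price, sale_price=None):
    """Return the percentage discount, or 0.0 when the item is not on sale."""
    if sale_price is None or sale_price >= full_price:
        return 0.0
    return round(100.0 * (full_price - sale_price) / full_price, 2)


def stock_status(size_availability):
    """Map {size: available?} to {size: 3 or 1}."""
    return dict(
        (size, IN_STOCK if available else OUT_OF_STOCK)
        for size, available in size_availability.items()
    )
```

With a full price of 205.0 and an assumed sale price of 102.5, `sale_discount` returns 50.0, which matches the first example item; the sale price itself is illustrative and does not appear in the examples.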

items.py

    from scrapy.item import Item, Field


    class OxygendemoItem(Item):
        type = Field()
        gender = Field()
        designer = Field()
        code = Field()
        name = Field()
        description = Field()
        raw_color = Field()
        image_urls = Field()
        gbp_price = Field()
        sale_discount = Field()
        stock_status = Field()
        source_url = Field()

oxygen.py

    from scrapy.contrib.spiders import CrawlSpider
    from scrapy.contrib.spiders import Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from oxygendemo.items import OxygendemoItem
    import pyquery


    class OxygenSpider(CrawlSpider):
        name = "oxygenboutique.com"
        allowed_domains = ["oxygenboutique.com"]
        start_urls = ['http://www.oxygenboutique.com']

        rules = (
            # insert rules here
        )

        def parse_item(self, response):
            self.pq = pyquery.PyQuery(response.body)
            item = OxygendemoItem()
            # populate item fields here
            return item