@mounarajan
Last active December 30, 2015 21:39
Homedepot Practice wrappers and spiders
# YAML
setup:
meta:
site: homedepot.com
description: Products
type: Products
browser:
proxy: 0
storeHTML: 0
seeds:
seedUrls:
- http://www.homedepot.com
# Extractors
content:
- name: products
key:
- sku
urlFilter:
- \/p\/
entities:
- name: sku
regex:
- CURRENT_URL
transform:
- s/.*\/p\/.*?(\d+)\D*$/$1/s
- name: name
regex:
- <h1[^>]*>(.*?)<\/h1
- name: description
regex:
- <span itemprop="description"[^>]*>(.*?)<\/span>
- name: model
regex:
- Model\s\#\s(.*?)<\/h2>
- name: mpn
regex:
- <li>MFG PART\s\#\s*\:\s(\d*)<
- name: manufacturer
regex:
- var\sCI_ItemMfr\=\'(\w*)\'\;
- name: features
splitType: pair
PairSegmentRegex:
- SPECIFICATIONS<\/h\d\>\s*<table\s*[^>]*>(.*?)<\/table>
PairRegex:
- <tr>\s*<td>(\w+\s\w*\s\(\w+\.\))\&\w+\;<\/td>
- <td[^>]*>([^<]+)<\/td>\s*<td>([^<]+)<\/td>\s*<\/tr>
- name: offers
entities:
- name: price
regex:
- itemprop\="price">\s\$(\d+\.\d+)<\/span>
- name: seller
set: homedepot.com
- name: currency
set: USD
# SPIDERING
links:
urlExtract:
- linkType: content
regex:
- \/p\/
priority: 101
urlFilter:
- \/b\/
urlExtract:
- linkType: category
regex:
- \/b\/
priority: 102
urlFilter:
- \/b\/
urlExtract:
- linkType: pagination
regex:
- \/b\/
priority: 103
urlFilter:
- \/b\/.*?\/.*?\=\d+
filters:
goodDomainNames:
- homedepot\.com
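
The gist does not name the crawler that consumes this config, so the extraction rules cannot be run as-is here. As a rough way to sanity-check a few of the patterns, the sketch below replays the sku, name, and price rules with Python's `re` module. The sample URL and HTML snippet are invented for illustration and are not taken from homedepot.com; the helper names `extract_sku` and `extract_first` are hypothetical. For the sku, the sketch uses a lazy `.*?` anchored to the end of the URL so that the whole trailing digit run is captured; a greedy `.*` placed directly before `(\d+)` would leave only the final digit.

```python
import re

# Hypothetical inputs, made up for this sketch (not real site data).
SAMPLE_URL = "http://www.homedepot.com/p/Example-Tool-Set-185-Piece/203483003"
SAMPLE_HTML = (
    '<h1 class="title">Example Tool Set</h1>'
    '<span itemprop="price"> $19.97</span>'
)

# Python spellings of the name and price patterns from the config.
NAME_RE = r"<h1[^>]*>(.*?)</h1"
PRICE_RE = r'itemprop="price">\s\$(\d+\.\d+)</span>'


def extract_sku(url):
    """Return the last run of digits in a /p/ product URL, or None."""
    # Lazy .*? plus the \D*$ anchor forces (\d+) onto the final digit run.
    match = re.search(r"/p/.*?(\d+)\D*$", url, flags=re.S)
    return match.group(1) if match else None


def extract_first(pattern, html):
    """Return capture group 1 of the first match of pattern, or None."""
    match = re.search(pattern, html, flags=re.S)
    return match.group(1) if match else None


if __name__ == "__main__":
    print(extract_sku(SAMPLE_URL))               # 203483003
    print(extract_first(NAME_RE, SAMPLE_HTML))   # Example Tool Set
    print(extract_first(PRICE_RE, SAMPLE_HTML))  # 19.97
```

A URL without a `/p/` segment, such as a `/b/` category page, makes `extract_sku` return `None`, which loosely mirrors the `urlFilter` on the products extractor. Real product pages may differ from the sample markup, so the patterns should be checked against live HTML before relying on them.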