@erikhansen
Last active April 17, 2023 11:42
Crawl Magento 1 / Magento 2 site for pricing

Overview

This basic script crawls a Magento 1 or Magento 2 website and logs the prices, SKUs, and product URLs to a CSV file. It was put together for a company that had consent from the website(s) being scraped. Please use it responsibly.

This script uses https://scrapy.org/

This script was tested on macOS Mojave, but it should run on any *NIX system.

You can tweak the body.css queries to match the CSS selectors used on the site you're crawling. See the Scrapy selectors documentation. When you're testing this script, refer to the comment above the def parse_item line to learn how to run the code for only a single product.

Todo

  • Add support for grouped/configurable products

Usage

  1. Ensure pip is installed on your system.

  2. Run this command:

    pip install scrapy
    
  3. Create a crawl.py file with the contents from the file in this Gist that matches your version of Magento.

  4. Run this command:

    scrapy runspider crawl.py --output=crawled_urls.csv
    cat crawled_urls.csv
    
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainSpider(CrawlSpider):
    name = 'roguefitness'
    allowed_domains = ["www.roguefitness.com"]
    # Start the crawl with a known product detail page so that you can tweak the `yield` queries below before crawling the entire site
    start_urls = ['https://www.roguefitness.com/rogue-barrel-bag']

    # If you only want to crawl a subfolder, then change the `allow=r'/'` string to something like `allow=r'/en'`
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    # Rename this from `parse_item` to `parse` and comment out the `rules` above to crawl just a single url
    def parse_item(self, response):
        for body in response.css('body'):
            # Don't log non-product urls
            if not body.css('body.catalog-product-view .price-box .price::text').get():
                continue
            yield {
                # TODO: Update CSS selector to match SKU, if the site you're crawling outputs the SKU
                #'sku': body.css('[itemprop="sku"]::text').get(),
                'price': body.css('.price-box .price::text').get().strip(),
                'name': body.css('.product-title::text').get(),
                'url': response.url
            }

import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class DomainSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ["www.example.com"]
    # Start the crawl with a known product detail page so that you can tweak the `yield` queries below before crawling the entire site
    start_urls = ['https://www.example.com/example-product']

    # If you only want to crawl a subfolder, then change the `allow=r'/'` string to something like `allow=r'/en'`
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    # Rename this from `parse_item` to `parse` and comment out the `rules` above to crawl just a single url
    def parse_item(self, response):
        for body in response.css('body'):
            # Don't log non-product urls
            if not body.css('[itemprop="sku"]::text').get():
                continue
            yield {
                'sku': body.css('[itemprop="sku"]::text').get(),
                'price': body.css('[itemprop="price"] .price::text').get(),
                'name': body.css('[itemprop="name"]::text').get(),
                'url': response.url
            }

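A note on the allow pattern used in the rules: as far as I understand, LinkExtractor treats it as a regular expression searched for anywhere in the absolute URL, so a short pattern like /en can match more than the /en subfolder. The sketch below approximates that check with the standard re module only; the allowed helper and the URLS list are hypothetical and exist just to illustrate the behavior.

```python
import re


def allowed(url, pattern):
    """Approximate the allow check: keep a URL if the pattern is found anywhere in it."""
    return re.search(pattern, url) is not None


# Hypothetical URLs for illustration only.
URLS = [
    'https://www.example.com/en/example-product',
    'https://www.example.com/energy-bars',
    'https://www.example.com/de/example-product',
]

if __name__ == '__main__':
    # r'/' keeps everything, r'/en' also keeps /energy-bars,
    # and the anchored pattern keeps only the /en/ subfolder.
    for pattern in (r'/', r'/en', r'^https://www\.example\.com/en/'):
        print(pattern, [u for u in URLS if allowed(u, pattern)])
```

If the looser match is a problem for your site, an anchored pattern along the lines of the last one may be a safer choice.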
price,name,url
$19.50,Rogue SR-1S Short Handle Bearing Speed Rope,https://www.roguefitness.com/sr-1s-short-handle-bearing-speed-rope-color-series
$55.00,Rogue Crop Pants - Women's,https://www.roguefitness.com/rogue-crop-pants-womens-urban-blue-camo
$22.25,OSO Mighty Collars,https://www.roguefitness.com/oso-mighty-collars-multi-color
$123.00,Strongman Throw Bag,https://www.roguefitness.com/rogue-strongman-throwbag
$265.00,Rogue Bella Bar 2.0 - Cerakote - Red Bushings,https://www.roguefitness.com/bella-bar-cerakote-app-excl
$725.00,Rogue Echo Bike,https://www.roguefitness.com/rogue-echo-bike
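
Prices in the output are display strings such as $19.50, so they need converting before you can sort or compare them. Here is a minimal post-processing sketch, assuming US-style prices with a leading dollar sign and the price, name, and url columns shown in the sample output; load_prices is a hypothetical helper, not part of the crawl script.

```python
import csv
from decimal import Decimal


def load_prices(csv_file):
    """Read rows from the CSV export and parse each price into a Decimal."""
    rows = []
    for row in csv.DictReader(csv_file):
        # Assumes prices like "$1,234.50": drop the symbol and thousands separators
        raw = (row.get('price') or '').strip().replace('$', '').replace(',', '')
        if not raw:
            continue  # skip rows without a price
        row['price'] = Decimal(raw)
        rows.append(row)
    return rows


if __name__ == '__main__':
    with open('crawled_urls.csv', newline='') as f:
        # Print the crawled products from cheapest to most expensive
        for row in sorted(load_prices(f), key=lambda r: r['price']):
            print(row['price'], row['name'], row['url'])
```

Decimal is used rather than float so that currency values are not subject to binary rounding.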