erikhansen/_readme.md

## _readme.md

      
Overview

- This basic script crawls a Magento 1 or Magento 2 website and logs the prices, SKUs, and product URLs to a CSV file. It was put together for a company that had consent from the website(s) being scraped. Please use it responsibly.
- This script uses https://scrapy.org/
- This script was tested on macOS Mojave, but it should run on any *NIX system.
- You can tweak the `body.css` code to match the specific CSS selectors on the site you're crawling. See this documentation. When you're testing this script, refer to the comment above the `def parse_item` line to learn how to run the code for only a single product.
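
While you are adjusting selectors, keep in mind that Scrapy's `.get()` returns `None` when a selector matches nothing, so chaining `.strip()` directly onto it raises an `AttributeError`. The following is a minimal sketch of one way to guard against that. The helper names `clean_text` and `build_item` are hypothetical and are not part of the scripts in this Gist; the sketch takes plain strings so it can be tried without running a crawl.

```python
def clean_text(value):
    """Strip whitespace from a scraped string.

    Returns None when the selector matched nothing (value is None)
    or when the matched text is only whitespace.
    """
    if value is None:
        return None
    value = value.strip()
    return value or None


def build_item(price, name, url):
    """Build one CSV row as a dict, or return None for non-product pages.

    `price` and `name` are the raw results of calls such as
    body.css('.price-box .price::text').get(), which may be None.
    """
    price = clean_text(price)
    if price is None:
        # No price found: treat the page as a non-product URL and skip it
        return None
    return {'price': price, 'name': clean_text(name), 'url': url}


# Hypothetical raw values, shaped like what a product page might return
print(build_item('\n  $19.50  ', ' Example Product ', 'https://www.example.com/example-product'))
# A page where the price selector matched nothing
print(build_item(None, None, 'https://www.example.com/about'))
```

Inside `parse_item`, a helper like this could wrap each `body.css(...).get()` call, so that a selector which does not match yet yields an empty field instead of stopping the callback.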

Todo

- Add support for grouped/configurable products

Usage


1. Ensure pip is installed on your system.
2. Run this command:

        pip install scrapy

3. Create a `crawl.py` file with the contents from the file in this Gist that matches your version of Magento.
4. Run this command:

        scrapy runspider crawl.py --output=crawled_urls.csv
        cat crawled_urls.csv
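
Once the crawl has finished, the CSV can also be inspected from Python rather than with `cat`. The sketch below assumes the column layout shown in `example_crawled_urls.csv` (`price`, `name`, `url`) and prices formatted like `$19.50`. The function names `parse_price` and `load_rows` are hypothetical, and the sample rows are copied from the example output so the snippet runs without a crawl.

```python
import csv
import io
from decimal import Decimal


def parse_price(text):
    """Convert a scraped price string such as '$1,265.00' to a Decimal."""
    cleaned = text.strip().lstrip('$').replace(',', '')
    return Decimal(cleaned)


def load_rows(csv_file):
    """Read crawler output rows and add a numeric `price_value` field."""
    rows = []
    for row in csv.DictReader(csv_file):
        row['price_value'] = parse_price(row['price'])
        rows.append(row)
    return rows


# Sample data in the same shape as example_crawled_urls.csv.
# To read real output, use open('crawled_urls.csv', newline='') instead.
sample = io.StringIO(
    "price,name,url\n"
    "$19.50,Rogue SR-1S Short Handle Bearing Speed Rope,"
    "https://www.roguefitness.com/sr-1s-short-handle-bearing-speed-rope-color-series\n"
    "$725.00,Rogue Echo Bike,https://www.roguefitness.com/rogue-echo-bike\n"
)

rows = load_rows(sample)
cheapest = min(rows, key=lambda r: r['price_value'])
print(cheapest['name'], cheapest['price_value'])
```

`Decimal` is used instead of `float` here so that prices compare and sum without rounding surprises.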


## crawl_m1.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DomainSpider(CrawlSpider):
    name = 'roguefitness'
    allowed_domains = ["www.roguefitness.com"]
    # Start the crawl with a known product detail page so that you can tweak the `yield` queries below before crawling the entire site
    start_urls = ['https://www.roguefitness.com/rogue-barrel-bag']

    # If you only want to crawl a subfolder, then change the `allow=r'/'` string to something like `allow=r'/en'`
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    # Rename this from `parse_item` to `parse` and comment out the `rules` above to crawl just a single url
    def parse_item(self, response):
        for body in response.css('body'):
            # Don't log non-product urls
            if not body.css('body.catalog-product-view .price-box .price::text').get():
                continue
            yield {
                # TODO: Update CSS selector to match SKU, if the site you're crawling outputs the SKU
                #'sku': body.css('[itemprop="sku"]::text').get(),
                'price': body.css('.price-box .price::text').get().strip(),
                'name': body.css('.product-title::text').get(),
                'url': response.url
            }

## crawl_m2.py
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

class DomainSpider(CrawlSpider):
    name = 'example'
    allowed_domains = ["www.example.com"]
    # Start the crawl with a known product detail page so that you can tweak the `yield` queries below before crawling the entire site
    start_urls = ['https://www.example.com/example-product']

    # If you only want to crawl a subfolder, then change the `allow=r'/'` string to something like `allow=r'/en'`
    rules = (
        Rule(LinkExtractor(allow=r'/'), callback='parse_item', follow=True),
    )

    # Rename this from `parse_item` to `parse` and comment out the `rules` above to crawl just a single url
    def parse_item(self, response):
        for body in response.css('body'):
            # Don't log non-product urls
            if not body.css('[itemprop="sku"]::text').get():
                continue
            yield {
                'sku': body.css('[itemprop="sku"]::text').get(),
                'price': body.css('[itemprop="price"] .price::text').get(),
                'name': body.css('[itemprop="name"]::text').get(),
                'url': response.url
            }

## example_crawled_urls.csv

          
            price
            name
            url

            
              $19.50
              Rogue SR-1S Short Handle Bearing Speed Rope
              https://www.roguefitness.com/sr-1s-short-handle-bearing-speed-rope-color-series

            
              $55.00
              Rogue Crop Pants - Women's
              https://www.roguefitness.com/rogue-crop-pants-womens-urban-blue-camo

            
              $22.25
              OSO Mighty Collars
              https://www.roguefitness.com/oso-mighty-collars-multi-color

            
              $123.00
              Strongman Throw Bag
              https://www.roguefitness.com/rogue-strongman-throwbag

            
              $265.00
              Rogue Bella Bar 2.0 - Cerakote - Red Bushings
              https://www.roguefitness.com/bella-bar-cerakote-app-excl

            
              $725.00
              Rogue Echo Bike
              https://www.roguefitness.com/rogue-echo-bike