@LeeMeng2020
Forked from scrapehero/amazon-reviews.json
Last active October 11, 2020 20:05
Amazon reviews scraper updated for 2020. This is a sitemap to extract review listings for a single product on Amazon.com using the Web Scraper Chrome Extension. It handles pagination and can now limit the number of pages. Please read the instructions and changelog in the comments section below.
{
"_id": "amazon-reviews-scraper-2020",
"startUrl": ["https://www.amazon.com/Ovente-Dual-Sided-Magnification-Electrical-MPWD3185BZ1X7X/product-reviews/B074GCRS9D",
"https://www.amazon.com/Columbia-Redmond-Waterproof-Cordovan-Regular/product-reviews/B07JH35P96",
"https://www.amazon.com/Merrell-Mens-Moab-Waterproof-Hiking/product-reviews/B01HF9ZN7I",
"https://www.amazon.com/Screen-Protector-SPARIN-Tempered-Glass/product-reviews/B013JZCAZK"
],
"selectors": [{
"id": "Product name",
"type": "SelectorText",
"parentSelectors": ["_root"],
"selector": "div[class*='product-title']",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "Review wrappers",
"type": "SelectorElement",
"parentSelectors": ["_root", "Click Next"],
"selector": "div.a-section.review",
"multiple": true,
"delay": 0
}, {
"id": "author",
"type": "SelectorText",
"parentSelectors": ["Review wrappers"],
"selector": "span.a-profile-name",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "title",
"type": "SelectorText",
"parentSelectors": ["Review wrappers"],
"selector": "a.a-size-base.review-title",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "date",
"type": "SelectorText",
"parentSelectors": ["Review wrappers"],
"selector": "span.a-size-base.a-color-secondary",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "content",
"type": "SelectorText",
"parentSelectors": ["Review wrappers"],
"selector": "div.a-row.review-data span.a-size-base",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "rating",
"type": "SelectorText",
"parentSelectors": ["Review wrappers"],
"selector": "span.a-icon-alt",
"multiple": false,
"regex": "",
"delay": 0
}, {
"id": "Click Next",
"type": "SelectorElementClick",
"parentSelectors": ["_root"],
"selector": "div.review-views",
"multiple": false,
"delay": "4500",
"clickElementSelector": "div.a-col-left:not(\":contains('Showing 51-60 of')\") ul .a-last a",
"clickType": "clickMore",
"discardInitialElements": "discard",
"clickElementUniquenessType": "uniqueText"
}]
}
@LeeMeng2020

Amazon US reviews scraper updated for 2020. This sitemap extracts review listings for a single product on Amazon.com using the Web Scraper Chrome Extension. It handles pagination and now includes the ability to limit the number of pages. Please read the instructions and changelog below.

INSTRUCTIONS
This sitemap extracts review listings for individual products on Amazon US. I have added a pagination limiter which makes it stop at page 6.
The limiter works by searching for the pagination text, which looks like “Showing 1-10 of 1,766 reviews”.
In this example, the paginator will click Next until it finds “Showing 51-60 of”, which indicates page 6 (the Amazon US site has 10 reviews per page). You need to do some testing and perhaps a bit of math to figure out what text will appear on the page you want to stop at.
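The "bit of math" can be sketched in Python. This is only an illustration: the function name `stop_marker` and its parameters are hypothetical, and it assumes every page before the stop page is full (10 reviews each on Amazon US) and that the page prints the range with plain digits. Ranges of 1,000 or more may be formatted differently on the live page, so check the actual text before relying on it.

```python
def stop_marker(stop_page: int, per_page: int = 10) -> str:
    """Return the pagination text expected on the page where scraping should stop.

    Hypothetical helper: assumes each page holds `per_page` reviews and that
    the site prints the range as "Showing <first>-<last> of ...".
    """
    if stop_page < 1 or per_page < 1:
        raise ValueError("stop_page and per_page must be positive")
    first = (stop_page - 1) * per_page + 1  # first review shown on the stop page
    last = stop_page * per_page             # last review shown on the stop page
    return f"Showing {first}-{last} of"


if __name__ == "__main__":
    print(stop_marker(6))  # Showing 51-60 of
```

If the product has fewer reviews than the end of that range, the last page will show a shorter range and the marker will never match, so the scraper simply runs to the final page.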
The limiter can also be removed by deleting the :not(...) part of the click selector, leaving only
div.a-col-left ul .a-last a
I have tested this sitemap on 4 different URLs, which are included in the startUrl array.
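For anyone editing the limiter by hand, here is a rough Python sketch that builds the `clickElementSelector` value with or without the limiter. The function name `build_click_selector` is hypothetical; the selector strings are copied from the sitemap. `json.dumps` is used only to show how the inner double quotes end up escaped in the JSON file.

```python
import json
from typing import Optional


def build_click_selector(marker: Optional[str] = None) -> str:
    """Build the Next-button selector, optionally limited by a stop marker.

    Hypothetical helper: `marker` is the pagination text of the page to stop
    at, for example "Showing 51-60 of". With no marker the limiter is omitted.
    """
    if marker is None:
        return "div.a-col-left ul .a-last a"
    # Same shape as the selector in the sitemap: the Next link is only
    # matched while the pagination block does not contain the marker text.
    return f"div.a-col-left:not(\":contains('{marker}')\") ul .a-last a"


if __name__ == "__main__":
    # Value as it should appear (escaped) in the sitemap JSON.
    print(json.dumps(build_click_selector("Showing 51-60 of")))
    # Limiter removed.
    print(json.dumps(build_click_selector()))
```

Paste the printed value, including the surrounding quotes, as the value of "clickElementSelector" in the "Click Next" selector.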

CHANGELOG
This sitemap was forked from scrapehero’s sitemap from Jan 2019. Pagination in that sitemap no longer works, so I have reworked it for 2020.
For pagination, Amazon has switched from HTML links to JS links, so Type: HTML no longer works here. These are the main changes from scrapehero’s sitemap:

  1. Changed the paginator to Type: Element Click, Click Type: Click More.
  2. The paginator no longer needs to be a child of itself (recursive). The Click More option handles this.
  3. Added a method to limit the number of pages. It is based on the :not() CSS selector. This limiter can easily be removed (see the Instructions section).
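Changes 1 and 2 can be checked against an exported sitemap with a few lines of Python. This is a sketch, not part of the Web Scraper extension: the helper `describe_paginator` is hypothetical, and the embedded JSON is a trimmed copy of two selectors from the sitemap above, keeping only the fields the check needs.

```python
import json

# Trimmed copy of the wrapper and paginator entries from the sitemap above.
SITEMAP = json.loads("""
{
  "selectors": [
    {"id": "Review wrappers", "type": "SelectorElement",
     "parentSelectors": ["_root", "Click Next"]},
    {"id": "Click Next", "type": "SelectorElementClick",
     "parentSelectors": ["_root"], "clickType": "clickMore"}
  ]
}
""")


def describe_paginator(sitemap: dict, paginator_id: str = "Click Next") -> dict:
    """Summarise how the paginator selector is wired into the sitemap."""
    selectors = sitemap["selectors"]
    paginator = next(s for s in selectors if s["id"] == paginator_id)
    return {
        "type": paginator["type"],
        "click_type": paginator.get("clickType"),
        # A recursive paginator would list itself among its own parents.
        "recursive": paginator_id in paginator["parentSelectors"],
        # Selectors that run on every page the paginator reveals.
        "children": [s["id"] for s in selectors
                     if paginator_id in s["parentSelectors"]],
    }


if __name__ == "__main__":
    print(describe_paginator(SITEMAP))
```

For this sitemap the paginator is an Element Click selector with Click More, it is not its own parent, and "Review wrappers" is the only selector that lists it as a parent.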
