@jescalan
Created January 6, 2012 20:35
Ruby Amazon Scraper
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'colored'

# this is just a preview of what's to come - a proof of concept.
# it will be converted to an API-style library, gemified, and put in its own repo
# for now, a cool way to experiment with amazon's data

query = 'ruby'
page = '2'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://www.amazon.com/s/field-keywords=#{query}?page=#{page}"))

puts "Amazon search for '#{query}', page ##{page}\n".red.underline

doc.css('div.product').each do |el|
  # grab the title
  title = el.css('a.title').first.content

  # grab the author (can be linked or not, hence the logic)
  # gsub rather than gsub!, because gsub! returns nil when nothing is replaced
  author = el.css('.ptBrand a').empty? ? el.css('.ptBrand').first.content.gsub(/by /, '') : el.css('.ptBrand a').first.content

  # grab the image
  image = el.css('.productImage').attribute 'src'

  # grab the product link
  link = el.css('a.title').attribute 'href'

  puts "#{title} by #{author}".green
  puts "image url:".yellow + " #{image}"
  puts "amazon link:".yellow + " #{link}"
  puts ""
end
@jonbarlo

jonbarlo commented Aug 9, 2019

This will throw an OpenURI::HTTPError (503 Service Unavailable) exception. Looks like Amazon is behind Cloudflare DNS to prevent attacks.
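For anyone hitting the same error, here is a minimal sketch of catching it instead of letting the script crash. The helper name `fetch_or_nil` and its `opener:` parameter are made up for illustration (they are not part of the gist or of open-uri); only `URI.open` and `OpenURI::HTTPError` come from the standard library.

```ruby
require 'open-uri'

# Hypothetical helper (not part of the gist): returns the response body,
# or nil when the server answers with an HTTP error such as 503.
# The opener: parameter is an assumption added so the fetch step can be
# swapped out; by default it is open-uri's URI.open.
def fetch_or_nil(url, opener: URI.method(:open))
  opener.call(url).read
rescue OpenURI::HTTPError => e
  # e.message holds the status line, e.g. "503 Service Unavailable"
  warn "request failed: #{e.message}"
  nil
end

# Usage (makes a real request, so it is left commented out):
# html = fetch_or_nil("http://www.amazon.com/s/field-keywords=ruby?page=2")
# doc = Nokogiri::HTML(html) if html
```

This only makes the failure easier to handle; it does not get around whatever is blocking the request.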

@jescalan
Author

Yeah this gist was created 8 years ago, not surprised

@yudechen0820

@jonbarlo, have you found the solution?

@jonbarlo

jonbarlo commented Apr 23, 2020

@codemicky yeah, but it involves paying a third-party service for proxying. Amazon is super strict and doesn't like headless browsers.

Another solution is running Capybara with a non-headless browser. If you create a dummy Amazon account and log in before visiting the Amazon URL, I think you won't have issues, but I might be wrong (I have done the same for another platform using this approach).

One last thing: you might try a gem called Kimurai:

https://github.com/vifreefly/kimuraframework

I'm wondering what the result would be (I have used this as well, so it's another approach).

@yudechen0820

I see. Thank you 👍

@jonbarlo

@codemicky try Kimurai and see if it works; otherwise try Nokogiri behind a proxy, something like this: https://scrapinghub.com/crawlera
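A rough sketch of what "behind a proxy" could look like with open-uri, which accepts a `:proxy` option and treats string keys as request headers. The helper names, the proxy address, and the User-Agent value are placeholders for illustration, not settings from any particular proxy service, and there is no guarantee Amazon will accept requests made this way.

```ruby
require 'open-uri'

# Hypothetical helper: builds the options hash that open-uri understands.
# :proxy routes the request through the given proxy URL; string keys such
# as 'User-Agent' are sent as HTTP request headers.
def proxied_open_options(proxy_url, user_agent: 'Mozilla/5.0')
  {
    proxy: proxy_url,
    'User-Agent' => user_agent
  }
end

# Hypothetical helper: fetches a page through the proxy and returns the body.
def fetch_via_proxy(url, proxy_url)
  URI.open(url, proxied_open_options(proxy_url)).read
end

# Usage (placeholder proxy address, left commented out because it makes
# a real request):
# html = fetch_via_proxy(search_url, 'http://proxy.example.com:8080')
# doc = Nokogiri::HTML(html)
```

If the proxy needs credentials, open-uri also has a `:proxy_http_basic_authentication` option that takes the proxy URL, user name, and password as an array.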
