@atduskgreg
Created February 9, 2012 17:24
Scrape the text of the first page of Amazon product reviews
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'fileutils'

product_ids = ["B004UETB20", "B005JK01GO"]

# Make sure the output directory exists before writing any files.
FileUtils.mkdir_p("results")

product_ids.each do |product_id|
  puts
  puts "getting product: #{product_id}"

  puts "downloading comments page with 10 highest rated comments"
  # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
  page = URI.open("http://www.amazon.com/product-reviews/#{product_id}").read

  puts "parsing page, taking only divs longer than 1000 characters"
  doc = Nokogiri::HTML(page)
  results = []
  doc.css("#productReviews tr td div").each do |i|
    if i.inner_text.length > 1000
      results << i.inner_text
    end
  end
  puts "found #{results.length} results"

  chopped_results = []
  puts "cleaning up results"
  results.each do |full_result|
    paragraphs = full_result.split(/\n/)
    # Keep lines 16 through (length - 24), i.e. drop 16 leading and 23 trailing lines.
    # Fall back to an empty list when the text has fewer than 16 lines.
    trimmed = paragraphs[16..paragraphs.length - 24] || []
    chopped_results << trimmed.join(' ')
  end

  File.open("results/#{product_id}.txt", "w") { |f| f << chopped_results.join("\n\n\n\n\n") }
end
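
The "cleaning up results" step depends on two fixed offsets (16 lines cut from the top, 23 from the bottom), which is easy to get wrong when a div has fewer lines than expected. A minimal sketch of that step as a standalone function could look like the following. The name `trim_review` and its `head` and `tail` keyword arguments are hypothetical and not part of the original script; the defaults only mirror the offsets used above.

```ruby
# Hypothetical helper (not in the original gist): drop a fixed number of
# leading and trailing lines from a block of text and join the rest with spaces.
def trim_review(text, head: 16, tail: 23)
  lines = text.split(/\n/)
  # If the head and tail regions cover every line, nothing is left to keep.
  return "" if lines.length <= head + tail
  # Exclusive range: keeps indices head .. (lines.length - tail - 1).
  lines[head...(lines.length - tail)].join(' ')
end

puts trim_review("a\nb\nc\nd", head: 1, tail: 1)  # prints "b c"
```

Pulling the offsets out into parameters makes it possible to check the trimming on short sample strings without downloading a page, and to adjust the numbers if the page layout changes.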