@D3MZ
Created September 10, 2013 15:40
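# Load the CSV reader, scraper, pretty-printer, MongoDB driver and parallel helpers.
%w{csv mechanize pp mongo peach parallel}.each { |x| require x }
include Mongo

# Collection that stores one document per scraped shop page.
@coll = MongoClient.new("localhost", 27017)['etsy']['stores']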
# Returns the shop's review score as a Float, or nil when the page has none.
def review_score(page)
  score = page.search('//*[@id="shop-info"]/ul/li[3]/a/span/div/input[2]/@value').first
  score && score.value.to_f
end

# Scrapes one shop listing page and returns a document keyed by the page URL.
def extracted(url)
  # A fresh agent per call resets cookies, which prevents some tracking.
  agent = Mechanize.new { |a| a.user_agent_alias = 'Mac Safari' }
  page = agent.get(url)

  # Turns one listing card into a product hash.
  product_hash = lambda do |listing_card|
    {
      name:  listing_card.search('.title').text.strip,
      price: listing_card.search('.currency-value').text.to_f,
      uri:   "http://www.etsy.com" + listing_card.at('.listing-thumb')['href'],
      img:   listing_card.search('.listing-thumb//img/@src').to_s
    }
  end

  {
    _id: url,
    sales: page.search('//*[@id="shop-info"]/ul/li[4]/a').text.to_i,
    reviews: {
      # Strip parentheses and thousands separators before converting the count.
      count: page.search('.review-rating-count').text.gsub(/\(|\)|\,/, '').to_i,
      score: review_score(page)
    },
    products: page.search('.listing-card').collect(&product_hash)
  }
end
#pp extracted "http://www.etsy.com/ca/shop/luckykaerufabric?order=price_asc&page=1"
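# Load the merchant list; the CSV needs a "shop_uri" column.
path = "/Users/ZAIR2/Google Drive/etsy_merchants.csv"
merchants = CSV.read(path, headers: true)

# Skip shops whose first listing page is already stored (the page URL is the document _id).
already_scraped_merchants = @coll.find({ _id: /order/ }, fields: { _id: 1 }).collect { |doc| doc["_id"] }
merchants = merchants.reject { |merchant| already_scraped_merchants.include? "#{merchant["shop_uri"]}?order=price_asc&page=1" }

# Scrape the remaining shops in parallel and print the result of each insert.
Parallel.map(merchants) do |merchant|
  p @coll.insert(extracted("#{merchant["shop_uri"]}?order=price_asc&page=1"))
end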
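
The script skips shops that are already stored by checking each first-page URL against an array of stored `_id` values. Below is a minimal sketch of that same filter using a `Set`, which may help once the collection grows. The helper names `first_page_url` and `pending_shops` and the sample URLs are hypothetical and not part of the original script.

```ruby
require 'set'

# Hypothetical helper: builds the first-page listing URL that the script uses as _id.
def first_page_url(shop_uri)
  "#{shop_uri}?order=price_asc&page=1"
end

# Hypothetical helper: returns the shop URIs whose first-page URL is not stored yet.
def pending_shops(shop_uris, scraped_ids)
  seen = Set.new(scraped_ids) # hash-based lookup instead of scanning an array
  shop_uris.reject { |uri| seen.include?(first_page_url(uri)) }
end

# Made-up sample data for illustration only.
shops   = ["http://www.etsy.com/shop/a", "http://www.etsy.com/shop/b"]
scraped = ["http://www.etsy.com/shop/a?order=price_asc&page=1"]

p pending_shops(shops, scraped)
```

In the script itself, `scraped_ids` would correspond to `already_scraped_merchants` and `shop_uris` to the `shop_uri` column of the CSV.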