@zerothabhishek
Created June 29, 2012 04:08
Scrapes the top Alexa sites. For http://hackerstreet.in/item?id=17578
##
# How to run:
# ruby scrape.alexa.rb
#
# Output:
# 20 files each with the data in csv format
#
# Dependencies:
# ruby version > 1.9 # "ruby -v" to check
# nokogiri gem # "gem install nokogiri" to install
##
require 'nokogiri'
require 'open-uri'
require 'csv'
# Extracts [rank, name, link] from one ".site-listing" node.
def bar(li)
  site_name = li.css(".desc-container h2 a")[0].content
  site_link = li.css(".desc-container .topsites-label")[0].content
  site_rank = li.css(".count")[0].content
  [site_rank, site_name, site_link]
rescue
  warn "failed to parse a listing: #{$!.message}"
  []
end
# Fetches one listing page and returns an array of [rank, name, link] rows.
def foo1(url)
  # NOTE: on Ruby >= 3.0, open-uri no longer patches Kernel#open;
  # use Nokogiri::HTML(URI.open(url)) there.
  doc = Nokogiri::HTML(open(url))
  doc.css(".site-listing").collect { |li| bar(li) }
end
# Writes one page's rows to a CSV file named "alexa-<i>".
def foo2(data, i)
  CSV.open("alexa-#{i}", "w") do |csv|
    data.each { |row| csv << row }
  end
end
urls = (1..20).map { |i| "http://www.alexa.com/topsites/countries;#{i}/IN" }
urls.each_with_index do |url, i|
  data = foo1(url)
  foo2(data, i)
end
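The CSV-writing step in foo2 can be tried in isolation with only the standard library. The sample rows and the filename alexa-sample.csv below are illustrative, not part of the gist:

```ruby
require 'csv'

# Illustrative rows shaped like bar's [rank, name, link] output.
rows = [
  ["1", "Google.co.in", "google.co.in"],
  ["2", "Facebook", "facebook.com"]
]

# Same pattern as foo2: open a file and append one array per row.
CSV.open("alexa-sample.csv", "w") do |csv|
  rows.each { |row| csv << row }
end

puts File.read("alexa-sample.csv")  # prints one comma-separated line per row
```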
ghost commented Mar 15, 2015

Hey there @zerothabhishek, currently this script is spitting out 20 empty files. Any ideas how to get it working?
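A likely (unverified) explanation: Alexa's markup has changed since 2012, so the CSS selectors in bar match nothing, and the bare rescue silently turns every failure into an empty row. A minimal sketch of that failure mode, using a hypothetical extract helper standing in for bar:

```ruby
# Hypothetical helper mirroring bar's pattern: index into a node set,
# call .content, and swallow any error with a bare rescue.
def extract(node_set)
  node_set[0].content
rescue
  # node_set[0] was nil, so .content raised NoMethodError;
  # the bare rescue hides it and returns an empty row instead.
  []
end

# An empty array stands in for a CSS selector that matched nothing.
puts extract([]).inspect  # => []
```

Replacing the bare rescue with a warn (or letting the error propagate) would make a selector mismatch visible instead of producing empty output files.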
