@3dd13
Created October 19, 2010 08:36
OpenRice.com is a well-known online restaurant directory in Hong Kong. Here is a Ruby script that scrapes the contact information of every shop listed there. At the time of writing there are around 26K records on the website, and the program took about 15 minutes to finish. and
require 'rubygems'
require 'mechanize'
require 'fastercsv' # Ruby 1.8 gem; on Ruby 1.9+ the standard 'csv' library (CSV) replaces it
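# Returns [value, label] pairs for every <option> under the form field named attr_name.
def get_all_option_values(page, attr_name)
  page.search("[name=#{attr_name}]").search("option").map do |opt|
    option_value = opt["value"] # nil when the attribute is missing
    # strip doubled non-breaking spaces (UTF-8 bytes \302\240) from the label
    [option_value, opt.text.strip.gsub(/\302\240\302\240/, '')] if option_value && option_value.length > 0
  end.compact
end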
# Returns [value, title] pairs for every checkbox named attr_name.
def get_all_checkbox_values(page, attr_name)
  page.search("input[type=checkbox][name=#{attr_name}]").map do |opt|
    option_value = opt["value"] # nil when the attribute is missing
    [option_value, opt["title"]] if option_value && option_value.length > 0
  end.compact
end
# Extracts one row per shop from a single search-result page.
def parse_single_page(page, district_name, cuisine_name)
  single_page_rows = []
  page.search(".sr1_list").each do |shop|
    shop_name = shop.search(".restname").children.first.text
    # map x, y coordinate
    shop_tel = shop.search(".tel").text
    shop_expenditure = shop.search(".price").text
    shop_address = shop.search(".add").text
    shop_type = shop.search(".type").children.text
    shop_rating = shop.search(".sr1score span").map { |score| score.text }.join('/')
    single_page_rows << [shop_name, shop_tel, district_name, shop_address, shop_type, cuisine_name, shop_expenditure, shop_rating]
  end
  single_page_rows
end
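# Returns the last page number (as a string) from the pager, or nil when there is no pager.
def max_paging_count(page)
  field = page.search('.pagination form div')
  return nil unless field.any?
  match = field.first.text.match(/Jump to page\(1-(.*)\)/)
  match[1] if match # avoid NoMethodError when the pager text does not match
end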
start_time = Time.now
agent = Mechanize.new
page = agent.get('http://www.openrice.com/english/restaurant/advancesearch.htm?tc=top2')
cuisines = get_all_checkbox_values(page, "cuisine_id")
cuisines.reject!{ |cuisine| cuisine[0] =~ /999$/ }
#27048
districts = get_all_option_values(page,"district_id")
districts.reject!{ |district| district[0] =~ /999$/ }
# 27350
# dishes = get_all_option_values(page, "dishes_id")
# #17613
# amenities= get_all_option_values(page, "amenity_id")
# #19218
# themes = get_all_option_values(page, "theme_id")
# #2039
total_rows = []
# record_num = 0
p "Number of cuisines: #{cuisines.count}, Number of districts: #{districts.count}"
district_count = 0
districts.each do |district|
  district_count += 1
  district_id = district[0]
  district_name = district[1]
  p "Processing: #{district_count}, #{district_name}"
  cuisines.each do |cuisine|
    cuisine_id = cuisine[0]
    cuisine_name = cuisine[1]
    url = "http://www.openrice.com/english/restaurant/sr1.htm?district_id=#{district_id}&cuisine_id=#{cuisine_id}"
    agent = Mechanize.new
    page = agent.get(url)
    total_rows += parse_single_page(page, district_name, cuisine_name)
    page_count = max_paging_count(page)
    # paging_desc = page.search(".paginationinfo").text.match(/Showing (.*) of (.*) Restaurants/)
    # record_num += paging_desc[2].to_i if paging_desc
    if page_count
      # page 1 is already parsed above; fetch the remaining pages
      (2..page_count.to_i).each do |page_index|
        page = agent.get("#{url}&page=#{page_index}")
        total_rows += parse_single_page(page, district_name, cuisine_name)
      end
    end
  end
end
FasterCSV.open("open_rice_export.csv", 'w') do |csv|
  csv << ["shop_name", "tel", "district_name", "shop_address", "shop_type", "cuisine_name", "shop_expenditure", "rating(Good/Bad)"]
  total_rows.each do |row|
    csv << row
  end
end
p Time.now - start_time
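
The paging and export steps of the script can be tried without hitting the website. The following is a minimal sketch, not part of the original gist: the helper names `page_count_from`, `extra_page_urls` and `rows_to_csv`, the sample pager text, the `district_id=1&cuisine_id=2` values and the sample row are all made up for illustration. It also assumes Ruby 1.9 or later, where the standard `csv` library takes the place of FasterCSV.

```ruby
require 'csv'

# Hypothetical helper: pull the last page number out of pager text such as
# "Jump to page(1-12)". Returns nil when the text contains no pager.
def page_count_from(text)
  match = text.match(/Jump to page\(1-(\d+)\)/)
  match && match[1].to_i
end

# Hypothetical helper: build the URLs for pages 2..page_count of one search,
# using the same "&page=N" suffix as the script above.
def extra_page_urls(base_url, page_count)
  return [] unless page_count
  (2..page_count).map { |page_index| "#{base_url}&page=#{page_index}" }
end

# Hypothetical helper: serialise a header and rows with the standard csv library.
def rows_to_csv(header, rows)
  CSV.generate do |csv|
    csv << header
    rows.each { |row| csv << row }
  end
end

# Made-up inputs, for illustration only.
url = "http://www.openrice.com/english/restaurant/sr1.htm?district_id=1&cuisine_id=2"
count = page_count_from("Jump to page(1-3)")
puts extra_page_urls(url, count)
puts rows_to_csv(["shop_name", "tel"], [["Example Cafe", "2345 6789"]])
```

Keeping the regex and URL building in small functions like these makes it possible to check the paging logic against saved pager text before starting a run that takes several minutes.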