@kaid
Created September 14, 2011 10:59
A script that fetches and parses the hostel list from yhachina.com, then processes it into a serialized JSON file.
require 'nokogiri'
require 'open-uri'
base = 'http://www.yhachina.com/3g/'
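# Fetch a page and parse it as HTML. URI.open is used because Kernel#open
# no longer handles URLs under open-uri as of Ruby 3.0.
def nopen(page)
  Nokogiri::HTML(URI.open(page))
end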
doc = nopen(base + 'allthehostel.html')
# Resolve each link against base so relative hrefs become absolute URLs.
uri_list = doc.xpath('//table//a/@href').map do |href|
  URI.join(base, href.value).to_s
end
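hostels = uri_list.map do |l|
  h = nopen(l)
  h_pic_url = h.xpath('//img[contains(@src, "imgls")]/@src').first.value
  # Text nodes expose their content through #text.
  h_title = h.xpath('//td[@class="hostel_t"]/text()').first.text
  h_meta = h.xpath('//img[contains(@src, "images/hostel")]/following-sibling::text()').map(&:text)
  # Trim the trailing ' (' from the first entry and drop the second one.
  h_meta.tap { |x| x[0].chomp!(' ('); x.delete_at(1) }
  h_desc = h.xpath('//*[@class="dotted"]/text()').first.text
  h_more_urls = h.xpath('//table[3]//a/@href').map { |href| base + href.value }
  # Fetch the detail page once and reuse it for both queries.
  more = nopen(h_more_urls[0])
  h_facilities = more.xpath('//*[@class="administrative"]/text()').first.text
  # uniq (not uniq!) because uniq! returns nil when there are no duplicates.
  h_roomtypes_urls = more.xpath('//table[3]//a/@href').map { |href| base + href.value }.uniq
  # Return every collected field so that hostels holds one record per hostel.
  { pic_url: h_pic_url, title: h_title, meta: h_meta, desc: h_desc,
    facilities: h_facilities, roomtype_urls: h_roomtypes_urls }
end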
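The description mentions a serialized JSON file, but the script stops before writing one. Below is a minimal sketch of that last step, assuming each hostel ends up as a hash of the scraped fields. The helper names (hostel_record, write_hostels), the hash keys, the output filename, and the sample values are all hypothetical placeholders, not part of the original script and not real scraped data.

```ruby
require 'json'
require 'tmpdir'

# Hypothetical record shape: the keys loosely mirror the h_* variables in
# the script. String keys are used so the data round-trips through JSON
# unchanged.
def hostel_record(title:, pic_url:, meta:, desc:, facilities:, roomtype_urls:)
  {
    'title'         => title,
    'pic_url'       => pic_url,
    'meta'          => meta,
    'desc'          => desc,
    'facilities'    => facilities,
    'roomtype_urls' => roomtype_urls
  }
end

# Write the records as pretty-printed JSON and return the path written to.
def write_hostels(hostels, path)
  File.write(path, JSON.pretty_generate(hostels))
  path
end

# Placeholder data standing in for the result of the scraping loop.
sample = [
  hostel_record(
    title:         'Example Hostel',
    pic_url:       'http://example.com/imgls/1.jpg',
    meta:          ['1 Example Road', '000-0000'],
    desc:          'A placeholder description.',
    facilities:    'Placeholder facilities text.',
    roomtype_urls: ['http://example.com/rooms/1.html']
  )
]

# Assumed output location; any writable path would do.
out_path = write_hostels(sample, File.join(Dir.tmpdir, 'hostels.json'))

# Read the file back to confirm the data survives serialization.
round_trip = JSON.parse(File.read(out_path))
```

In the real script, the array returned by the scraping loop would take the place of sample. JSON.pretty_generate keeps the file readable; JSON.generate would produce a more compact file if size matters.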