@keating
Created July 15, 2012 15:32
crawl a website
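# encoding: utf-8
require "nokogiri"
require "open-uri"

domain = 'http://sample.com&page='
max_attempts = 3 # give up on a page after this many failures

1.upto(1000) do |i|
  attempts = 0
  begin
    # URI.open replaces Kernel#open for URLs (required on Ruby 3.0+)
    html = URI.open(domain + i.to_s, :proxy => "http://127.0.0.1:8087", :read_timeout => 1).read
    # The page is GBK-encoded: relabel the bytes, then keep the UTF-8 copy
    html = html.force_encoding("gbk").encode("utf-8")
    main_doc = Nokogiri::HTML(html)
    main_doc.css('table td').each do |td|
      # Print cells whose content contains a line made up of digits only
      page_num = /^\d+$/.match(td.content)
      puts page_num.to_s.to_i if page_num
    end
  rescue => e
    puts "error:#{e}"
    attempts += 1
    retry if attempts < max_attempts # retry only the current page
  end
end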
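
A minimal sketch of the GBK-to-UTF-8 step on its own, with no network access. The sample bytes and the `to_utf8` helper name are made up for illustration. `String#encode` returns a new string and leaves the receiver unchanged, so its result has to be kept.

```ruby
# Hypothetical helper: relabel raw bytes as GBK, then transcode to UTF-8.
def to_utf8(raw_bytes)
  # dup so the caller's string is not relabelled in place
  raw_bytes.dup.force_encoding("GBK").encode("UTF-8")
end

# Sample input: the GBK bytes for two Chinese characters (U+4E2D, U+6587),
# as they might arrive from a GBK-encoded page.
raw = "\xD6\xD0\xCE\xC4".b
converted = to_utf8(raw)

puts converted
puts converted.encoding
```

If a page contains bytes that are not valid GBK, `encode` raises an exception; passing `invalid: :replace` to `encode` is one way to tolerate that.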
keating commented Dec 21, 2012

To write to a file (append mode; note the mode string is "a+", not "+a"):
f = File.new("somefile.format", "a+")
f.print "some content"
f.close
http://stackoverflow.com/questions/1581674/differences-between-ruby-file-access-mode-r-and-w
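
A sketch of the same append using the block form of `File.open`, which closes the file when the block exits, even if an exception is raised inside it. The path under `Dir.tmpdir` and the sample lines are placeholders, not part of the original script.

```ruby
require "tmpdir"

# Placeholder path, for illustration only.
path = File.join(Dir.tmpdir, "crawl_output_example.txt")
File.delete(path) if File.exist?(path) # start clean so the demo is repeatable

# "a" opens the file for appending and creates it if it does not exist.
File.open(path, "a") { |f| f.puts "first line" }
File.open(path, "a") { |f| f.puts "second line" }

contents = File.read(path)
puts contents
```

Mode "a+" also allows reading from the same handle, while "w" would truncate the file on every open.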
