@parano
Last active October 1, 2015 05:28
scrape jingdian
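#!/usr/bin/ruby
# coding: utf-8
# Scrape scenic-spot (jingdian) names and their two breadcrumb labels from
# jingdian.tuniu.com and write them to data.txt as comma-separated lines.
require 'open-uri'

if $0 == __FILE__
  url = 'http://jingdian.tuniu.com/fengjing/'
  # Header fragment of a page: from the <h1> tag through the last "span>" on that line.
  regexp = /<h1>.*span>/
  # Name of the scenic spot inside the <h1> tag.
  regexp_jingdian = /<h1>(.*)<\/h1>/
  # Alternative patterns kept from the original script; they are not used below.
  regexp_jingdian1 = /title=".*">(.*)<\/a>\//
  regexp_jingdian2 = /href=.*">(.*)<\/a>"/
  # Two adjacent breadcrumb links separated by "/": <a ...>A</a>/<a ...>B</a>.
  regexp_title = /href=.*">(.*)<\/a>\/<a.*">(.*)<\/a>/

  # The block form closes the file even if an exception is raised mid-run.
  File.open('data.txt', 'w') do |file|
    (1..38144).each do |num|
      begin
        # URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs.
        text = URI.open(url + num.to_s, &:read)
      rescue OpenURI::HTTPError, SocketError => e
        # Skip missing or unreachable pages instead of aborting the whole run.
        warn "skipping #{num}: #{e.message}"
        next
      end

      # Join all header fragments into one string before matching.
      header = text.scan(regexp).join

      if (m = regexp_title.match(header))
        print "#{m[1]},#{m[2]},"
        file << "#{m[1]},#{m[2]},"
      end
      if (m = regexp_jingdian.match(header))
        print "#{m[1]}\n"
        file << "#{m[1]}\n"
      end
    end
  end
end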