@googya
Created April 26, 2011 12:43
Extract data from rottentomatoes
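# coding: utf-8
require 'rubygems'
require 'nokogiri'
require 'open-uri'

MAX_ATTEMPTS = 5 # stop retrying a page after this many failed fetches

# Input file: one base URL per line. Each URL gets its own output file.
uris = File.open("c:\\url1.txt", 'r')
#ratings = File.open("ratings.txt",'w')
i = 0
uris.each_line do |uri|
  # strip (not chop!) so a last line without a trailing newline keeps its final character
  uri = uri.strip
  next if uri.empty?
  # Name the output file after the last path segment of the URL.
  rating = File.open(uri.split("/").last + ".txt", 'w')
  page = 1
  loop do
    ext = "?cats=&genreid=&letter=&switches=&sortby=&limit=50&page=#{page}"
    attempts = 0
    begin
      attempts += 1
      # URI.open is provided by open-uri; plain Kernel#open no longer fetches URLs on Ruby 3.
      doc = Nokogiri::HTML(URI.open(uri + ext))
      puts uri + ext
    rescue StandardError => err
      # A bare `retry` would loop forever on a permanent failure, so bound it.
      retry if attempts < MAX_ATTEMPTS
      warn "giving up on #{uri + ext}: #{err.message}"
      break
    end

    # Absolute XPath tied to the page layout; it stops matching if the markup changes.
    doc.xpath('/html/body/div/div[5]/div[2]/div[2]/div[2]/table/tbody/tr').each do |e|
      cells = e.elements
      next if cells.size < 3 # row has no rating/caption cells
      t = cells[0].text
      caption = cells[2].text
      begin
        t = t.strip
        caption = caption.strip
      rescue ArgumentError
        # e.g. an invalid byte sequence in the cell text
        print "argument exception: ", t, caption, "\n"
        next
      else
        # Keep rows whose first cell looks like a score (e.g. 3/5) or starts with a capital letter.
        rating.puts "#{i},#{t},#{caption}" if t =~ /\d\/\d{1,2}|^[A-Z]/
      end
    end

    # Pagination: with one or two children, the last one is the "Next" link if there is a next page.
    nextpage = doc.xpath('/html/body/div/div[5]/div[2]/div[2]/div[2]/div[2]/div[2]/a')
    links = nextpage.children
    break unless [1, 2].include?(links.size) && links.last.text =~ /Next/
    page += 1
  end
  rating.close
  i += 1
end
uris.close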
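
The row filter in the script above decides which rows are written to the output file. A minimal sketch of what that pattern accepts, pulled out into a standalone check; the constant `RATING_PATTERN`, the method `keep_rating?`, and the sample strings are hypothetical names and made-up values for illustration, not part of the original script:

```ruby
# Same pattern the script uses: a digit, a slash, then one or two digits
# anywhere in the text, OR an uppercase letter at the start of a line.
RATING_PATTERN = /\d\/\d{1,2}|^[A-Z]/

# Hypothetical helper: true when the first-cell text would be written out.
def keep_rating?(text)
  RATING_PATTERN.match?(text)
end

# Made-up sample values, not real scraped data.
["3/5", "8/10", "B+", "4.5/5", "n/a", ""].each do |sample|
  puts "#{sample.inspect} kept? #{keep_rating?(sample)}"
end
```

Note that `^` in Ruby anchors at the start of any line, not only the start of the string, so multi-line cell text with a capitalised second line also passes; use `\A` if only the very first character should count.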