@timothyklim
Created May 18, 2011 17:55
require 'nokogiri'
require 'open-uri'

heroes = []
descriptions = []
page = 0

puts "Parsing..."
loop do
  page += 1
  # URI.open (Ruby 2.5+); Kernel#open no longer accepts URLs as of Ruby 3.0.
  doc = Nokogiri::HTML.parse(URI.open("http://www.topnews.ru/photo_id_5341_#{page}.html"))

  # Captions look like "1. First Last (...)"; keep the two words after the number.
  # Captions that do not fit the two-word pattern yield nil and are dropped.
  names = doc.css("p b")
             .map(&:text)
             .select { |text| text =~ /^\d+\..*\)$/ }
             .map { |text| text[/^\d+\.\s+(\S+\s\S+)/, 1] }
             .compact

  descriptions << doc.css("div.pvtElement p._ga1_on_").map(&:text).join("\n")

  unless names.empty?
    heroes += names
    puts "found: #{names.join(", ")}"
  end

  # The last page has no "next" link.
  if doc.css("a.pvtNext").empty?
    puts "\nLast page: #{page}"
    break
  end
end

puts "\nHeroes: #{heroes.uniq.join(", ")}"
puts "\nDescriptions: #{descriptions.join(", ")}"
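
The name-matching step can be checked without network access. This is a minimal sketch, assuming captions look like "1. First Last (...)"; the sample strings and the `extract_name` helper are hypothetical and not part of the original script.

```ruby
# Pulls the two words that follow the leading "N." in a caption.
NAME_PATTERN = /^\d+\.\s+(\S+\s\S+)/

# Hypothetical helper: returns the name, or nil if the caption is not an entry.
def extract_name(caption)
  # Only captions shaped like "1. First Last (...)" are treated as entries.
  return nil unless caption =~ /^\d+\..*\)$/
  caption[NAME_PATTERN, 1]
end

# Made-up sample captions; the real page markup may differ.
samples = [
  "1. Tony Stark (Iron Man)",
  "Advertisement",
  "2. Bruce Wayne (Batman)"
]

names = samples.map { |caption| extract_name(caption) }.compact
puts names.join(", ")
```

Using `String#[]` with a capture index returns nil on a failed match instead of raising, so one malformed caption does not abort the whole crawl.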