@serghost
Created June 12, 2011 13:13
wikiheroes
require 'nokogiri'
require 'open-uri'

heroes = []
descriptions = []
image_urls = []

puts "Parsing..."
page = 0
loop do
  page += 1
  # Kernel#open stopped handling URLs in Ruby 3.0; URI.open works on Ruby 2.5 and later.
  doc = Nokogiri::HTML.parse(URI.open("http://www.topnews.ru/photo_id_5341_#{page}.html"))

  # Captions look like "1. First Last (...)"; keep the two words after the number.
  names = (doc / "p" / "b")
          .map(&:text)
          .select { |text| text =~ /^\d+\..*\)$/ }
          .map { |text| text[/^\d+\.\s+(\S+\s\S+)/, 1] }
          .compact

  # Collect the description paragraphs and image URLs for this page.
  descriptions << (doc / "div.pvtElement" / "p._ga1_on_").map(&:text).join("\n")
  image_urls << doc.xpath('//*[(@class = "pvtPic")]//img').map { |img| img['src'] }.join("\n")

  unless names.empty?
    heroes += names
    puts "Found: #{names.join(", ")}"
  end

  # The last page has no "next" link.
  if (doc / "a.pvtNext").empty?
    puts "\nLast page: #{page}"
    break
  end
end

puts "\nHeroes: #{heroes.compact.uniq.join(", ")}"
puts "\nDescriptions: #{descriptions.join(", ")}"
puts "\nImages: #{image_urls.join(", ")}"