# = Top 100 Greatest Movie Characters Scraper
# This program just scrapes the Empire Online 100
# greatest movie characters web site (http://www.empireonline.com/100-greatest-movie-characters/)
# and generates a simple page to display all the
# characters on one page so you don't have to go
# clicking through 100 pages
#
# This was just built b/c I'd rather learn something new
# during the time it took me to view all 100 characters
# and still get to see who they are. I'm too lazy to click
# through all of them
# Author:: Jason Amster (mailto:jayamster@gmail.com)
# Copyright:: Copyright (c) 2008 Jason Amster
# License:: Distributes under the same terms as Ruby

require 'rubygems'
require 'nokogiri'
require 'open-uri'

# This class just scrapes the pages and collects
# the relevant information. Then it can generate HTML
# based upon that.
class Top100
  BASE_URL = "http://www.empireonline.com/100-greatest-movie-characters/default.asp?c="

  def initialize
    @top100 = []
    @html = ""
  end

  # Iterates 100 times and just scrapes each page, collecting the position (redundant),
  # the name of the character, and the image
  def scrape
    (1..100).each do |num|
      # URI.open comes from open-uri (Ruby 2.5+); plain Kernel#open does not fetch URLs on Ruby 3.0+
      doc = Nokogiri::HTML(URI.open(BASE_URL + num.to_s))
      elements = doc.xpath('//head/title').first.inner_html.split("|")[1].split(". ")
      pos = elements.delete_at(0)
      name = elements.join(". ") # For the few names that have a period in them... lazy hack
      @top100 << {
        :pos => pos,
        :name => name,
        :image => "http://www.empireonline.com/images/features/100greatestcharacters/photos/#{num}.jpg"
      }
    end
    @top100
  end

  # Checks to see if the *@top100* array has been set. If so, it returns it. If not,
  # it runs the *scrape* method
  def top100
    @top100.empty? ? scrape : @top100
  end

  # Generates simple HTML for display of the scraped data
  def generate
    top100.each do |entry|
      @html << <<-EOS