Scrapes top 100 greatest movie characters with Ruby and Nokogiri
# = Top 100 Greatest Movie Characters Scraper
# This program scrapes the Empire Online 100 greatest
# movie characters feature (http://www.empireonline.com/100-greatest-movie-characters/)
# and generates a simple page that displays all the
# characters at once, so you don't have to click
# through 100 pages.
#
# I built this because I'd rather learn something new
# in the time it would take me to view all 100 characters
# and still get to see who they are. I'm too lazy to click
# through all of them.
# Author:: Jason Amster (mailto:jayamster@gmail.com)
# Copyright:: Copyright (c) 2008 Jason Amster
# License:: Distributed under the same terms as Ruby
require 'rubygems'
require 'nokogiri'
require 'open-uri'
# This class scrapes the pages and collects
# the relevant information. It can then generate HTML
# from that data.
class Top100
  BASE_URL = "http://www.empireonline.com/100-greatest-movie-characters/default.asp?c="

  def initialize
    @top100 = []
    @html = ""
  end

  # Iterates over pages 1 to 100 and scrapes each one, collecting the position
  # (redundant, since it matches the page number), the name of the character,
  # and the image URL
  def scrape
    (1..100).each do |num|
      # URI.open comes from open-uri; a bare Kernel#open no longer fetches URLs on Ruby 3.0+
      doc = Nokogiri::HTML(URI.open(BASE_URL + num.to_s))
      # Take the text after the first "|" in the page title and split it into position and name
      elements = doc.xpath('//head/title').first.inner_html.split("|")[1].split(". ")
      pos = elements.delete_at(0)
      name = elements.join(". ") # Rejoins the few names that contain a period... lazy hack
      @top100 << {
        :pos => pos,
        :name => name, # String#to_a was removed in Ruby 1.9, so the name is used directly
        :image => "http://www.empireonline.com/images/features/100greatestcharacters/photos/#{num}.jpg"
      }
    end
    @top100
  end

  # Checks whether the *@top100* array has been populated. If so, returns it.
  # If not, runs the *scrape* method
  def top100
    @top100.empty? ? scrape : @top100
  end

  # Generates simple HTML for displaying the scraped data
  def generate
    top100.each do |entry|
      # The squiggly heredoc (Ruby 2.3+) strips the common leading indentation
      @html << <<~EOS
        <div class="entry">
          <h1>#{entry[:pos]}. #{entry[:name]}</h1>
          <div class="image">
            <img src="#{entry[:image]}" />
          </div>
        </div>
      EOS
    end
    @html
  end
end
top100 = Top100.new
puts top100.generate
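
# A minimal offline sketch of the title parsing done in Top100#scrape, handy
# for trying the split logic without fetching any pages. The helper name
# parse_title and the sample title in the example are hypothetical (made up
# for illustration, not taken from the site), and the strip calls are an
# addition that the scraper above does not perform.
def parse_title(title)
  # Everything after the first "|" is expected to look like "<pos>. <name>"
  parts = title.split("|")[1].split(". ")
  pos = parts.shift
  # Rejoin the remaining pieces so names containing ". " stay intact
  { :pos => pos.strip, :name => parts.join(". ").strip }
end

# Example with a made-up title:
#   parse_title("Some Site | 7. Dr. Example")
# yields position "7" and name "Dr. Example".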