jamster (owner)

gist: 36636
public
Description:
Scrapes top 100 greatest movie characters with Ruby and Nokogiri
Public Clone URL: git://gist.github.com/36636.git
top_100_greatest_movie_characters_scraper.rb
# = Top 100 Greatest Movie Characters Scraper
# This program scrapes Empire Online's 100 greatest
# movie characters feature (http://www.empireonline.com/100-greatest-movie-characters/)
# and generates a simple page to display all the
# characters on one page so you don't have to go
# clicking through 100 pages
#
# This was just built b/c I'd rather learn something new
# during the time it took me to view all 100 characters
# and still get to see who they are. I'm too lazy to click
# through all of them
 
 
# Author:: Jason Amster (mailto:jayamster@gmail.com)
# Copyright:: Copyright (c) 2008 Jason Amster
# License:: Distributed under the same terms as Ruby
      
require 'rubygems'
require 'nokogiri'
require 'open-uri'
 
# This class scrapes the pages and collects
# the relevant information. It can then generate HTML
# based upon that.
class Top100
  
  BASE_URL = "http://www.empireonline.com/100-greatest-movie-characters/default.asp?c="

  def initialize
    @top100 = []
    @html = ""
  end
 
  # Iterates 100 times and just scrapes each page collecting the position (redundant),
  # name of the character, and the image
  def scrape
    (1..100).each do |num|
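      # URI.open (from open-uri) is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
      doc = Nokogiri::HTML(URI.open(BASE_URL + num.to_s))
      # The ranking and name are taken from the part of the page title after the "|"
      elements = doc.xpath('//head/title').first.inner_html.split("|")[1].strip.split(". ")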
      pos = elements.delete_at(0)
      name = elements.join(". ") # Rejoin the few names that contain a period... lazy hack
      @top100 << {
        :pos=>pos,
        :name=>name, # String#to_a no longer exists as of Ruby 1.9, so use the string directly
        :image=>"http://www.empireonline.com/images/features/100greatestcharacters/photos/#{num}.jpg"
      }
    end
    @top100
  end
  
  # Checks whether the *@top100* array has been populated. If so, it returns it. If not,
  # it runs the *scrape* method
  def top100
    @top100.empty? ? scrape : @top100
  end
  
  # Generates simple HTML for displaying the scraped data
  def generate
    top100.each do |entry|
      @html << <<-EOS
 
<div class="entry">
<h1>#{entry[:pos]}. #{entry[:name]}</h1>
<div class="image">
<img src="#{entry[:image]}" />
</div>
</div>
 
EOS
    end
    @html
  end
 
end
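
The title parsing in *scrape* can be tried offline, without hitting the site. The sketch below mirrors the same split logic in a hypothetical helper, `parse_title`, which is not part of the original script. The sample title string is an assumption about what the page titles look like, not something copied from the site. Limiting the second split to two parts keeps names containing a period intact without the rejoin step.

```ruby
# Hypothetical helper mirroring the split logic in Top100#scrape.
# Assumes the title has the form "<site name> | <position>. <character name>".
def parse_title(title)
  # Take the part after the "|" and trim the surrounding whitespace
  ranked = title.split("|")[1].strip
  # Split on the first ". " only, so "Dr. Emmett Brown" stays in one piece
  pos, name = ranked.split(". ", 2)
  { :pos => pos, :name => name }
end

# Made-up sample title, used only for illustration
sample = "Empire Features | 42. Dr. Emmett Brown"
entry = parse_title(sample)
puts "#{entry[:pos]} -> #{entry[:name]}"
```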
 
top100 = Top100.new
puts top100.generate
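
Since *generate* interpolates scraped text straight into markup, a name containing `&` or `<` could produce invalid HTML. A minimal sketch of one way to guard against that, assuming the standard library's `CGI.escapeHTML` is acceptable here; `entry_html` is a hypothetical helper and the sample entry is made up:

```ruby
require 'cgi'

# Hypothetical helper: builds the markup for one entry, escaping the
# scraped values before they are interpolated into the HTML.
def entry_html(entry)
  <<-EOS
<div class="entry">
<h1>#{CGI.escapeHTML(entry[:pos].to_s)}. #{CGI.escapeHTML(entry[:name].to_s)}</h1>
<div class="image">
<img src="#{CGI.escapeHTML(entry[:image].to_s)}" />
</div>
</div>
EOS
end

# Made-up entry, used only to show the escaping
sample_html = entry_html({
  :pos   => "7",
  :name  => "Tom & Jerry",
  :image => "http://example.com/7.jpg"
})
puts sample_html
```

The same escaping could be applied inside *generate* itself; it is kept separate here so the original method stays as the author wrote it.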