@shuma
Created April 7, 2012 13:08
Ruby crawler, csv
require 'rubygems' # only needed on Ruby 1.8; later versions load RubyGems automatically
require 'nokogiri'
require 'ap'       # awesome_print; not used below
require 'csv'

# SAX handler that collects selected element values and writes them to info.csv.
class MyDocument < Nokogiri::XML::SAX::Document
  def initialize
    @infodata = {}
    @infodata[:titles] = []
  end

  # Remember the attributes and start a fresh text buffer for every element.
  def start_element(name, attrs)
    @attrs = attrs
    @content = ''
  end

  # Store the buffered text under a key that depends on the element name.
  def end_element(name)
    if name == 'title'
      Hash[@attrs]['xml:lang'] # result is discarded
      @infodata[:titles] << @content.inspect
      @content = nil
    end
    if name == 'identifier'
      @infodata[:identifier] = @content.inspect
      @content = nil
    end
    if name == 'typeOfLevel'
      @infodata[:typeOfLevel] = @content.inspect
      @content = nil
    end
    if name == 'typeOfResponsibleBody'
      @infodata[:typeOfResponsibleBody] = @content.inspect
      @content = nil
    end
    if name == 'type'
      @infodata[:type] = @content.inspect
      @content = nil
    end
    if name == 'exact'
      @infodata[:exact] = @content.inspect
      @content = nil
    end
    if name == 'degree'
      @infodata[:degree] = @content.inspect
      @content = nil
    end
    if name == 'academic'
      @infodata[:academic] = @content.inspect
      @content = nil
    end
    if name == 'code'
      Hash[@attrs]['source="vhs"'] # result is discarded
      @infodata[:code] = @content.inspect
      @content = nil
    end
    if name == 'ct:text'
      @infodata[:beskrivning] = @content.inspect
      @content = nil
    end
  end

  # Append text to the buffer; skipped once the buffer has been reset to nil.
  def characters(string)
    @content << string if @content
  end

  # Treat CDATA sections like ordinary text.
  def cdata_block(string)
    characters(string)
  end

  # Write all collected values as a single CSV row.
  def end_document
    CSV.open("info.csv", "wb") do |csv|
      csv << @infodata.values
    end
    puts "Finished..."
  end
end

parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
# The block form closes the file once parsing is done.
File.open("arkivvetenskap.xml", 'rb') { |file| parser.parse(file) }
@SixArm

SixArm commented Apr 7, 2012

Ruby has a concise way to write comparisons like this, the case statement:

 case name
 when "foo"
   # ...do something
 when "bar"
   # ...do something else
 end
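
A minimal runnable sketch of the same idea. The `describe` method and its return strings are hypothetical, chosen only for illustration. It shows that one `when` clause can list several values separated by commas, and that `case` returns the value of the branch that matched:

```ruby
# Hypothetical helper, for illustration only.
def describe(name)
  case name
  when 'title'
    'collect into a list'
  when 'identifier', 'degree' # several values can share one branch
    'store under its own key'
  else
    'ignore'
  end
end

puts describe('title')
puts describe('degree')
puts describe('something else')
```

Each `when` value is compared against `name` with `===`, which for strings behaves like `==`.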

@SixArm

SixArm commented Apr 7, 2012

You can make your code cleaner like this:

case name
when 'title'
  Hash[@attrs]['xml:lang']
  @infodata[:titles] << @content.inspect
  @content = nil
when 'identifier', 'typeOfLevel', 'typeOfResponsibleBody', 'type', 'exact', 'degree', 'academic'
  @infodata[name.to_sym] = @content.inspect
  @content = nil
when 'code'
  Hash[@attrs]['source="vhs"']
  @infodata[:code] = @content.inspect
  @content = nil
when 'ct:text'
  @infodata[:beskrivning] = @content.inspect
  @content = nil
end
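
To try the dispatch without Nokogiri or an input file, here is a stand-alone sketch. The method name `record_element`, the constant `SIMPLE_FIELDS`, and the sample values are all assumptions made for illustration, not part of the gist. It differs from the code above in two ways, both deliberate: it stores the raw text rather than `@content.inspect`, and it folds `'code'` into the shared branch because the `Hash[@attrs][...]` line there only evaluates an expression whose result is discarded.

```ruby
# Hypothetical stand-alone version of the dispatch; no Nokogiri required.
# Elements whose text is stored under a key equal to the element name.
SIMPLE_FIELDS = %w[
  identifier typeOfLevel typeOfResponsibleBody type exact degree academic code
].freeze

def record_element(infodata, name, content)
  case name
  when 'title'
    # Titles can repeat, so they are collected in an array.
    (infodata[:titles] ||= []) << content
  when *SIMPLE_FIELDS
    # The splat expands the array into a list of `when` values.
    infodata[name.to_sym] = content
  when 'ct:text'
    infodata[:beskrivning] = content
  end
  infodata
end

# Sample element names and values are made up for this sketch.
info = {}
record_element(info, 'title', 'Arkivvetenskap')
record_element(info, 'degree', 'Kandidat')
record_element(info, 'ct:text', 'Kort beskrivning')
record_element(info, 'unknown', 'ignored')
p info
```

Keeping the field names in one array means adding a new element only requires a new entry there rather than another `when` branch.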

@shuma
Author

shuma commented Apr 7, 2012

Thanks, SixArm! That is awesome.