@shuma
Created April 7, 2012 13:08
Ruby crawler, csv
require 'rubygems' # only needed on Ruby 1.8
require 'nokogiri'
require 'csv'
# SAX handler that collects the text of selected elements and writes
# them as a single row to info.csv when the document ends.
class MyDocument < Nokogiri::XML::SAX::Document
  # Element name => key under which its text is stored.
  # :beskrivning is Swedish for "description".
  FIELDS = {
    'identifier'            => :identifier,
    'typeOfLevel'           => :typeOfLevel,
    'typeOfResponsibleBody' => :typeOfResponsibleBody,
    'type'                  => :type,
    'exact'                 => :exact,
    'degree'                => :degree,
    'academic'              => :academic,
    'code'                  => :code,
    'ct:text'               => :beskrivning
  }

  def initialize
    @infodata = { :titles => [] }
  end

  # Start a fresh text buffer for every element that opens.
  # attrs holds the element's attributes (such as xml:lang or source);
  # they are kept here but not used when building the row.
  def start_element(name, attrs = [])
    @attrs = attrs
    @content = ''
  end

  # Store the buffered text when a tracked element closes.
  # inspect saves the text in quoted, escaped form.
  def end_element(name)
    if name == 'title'
      # A document can carry several titles, so collect all of them.
      @infodata[:titles] << @content.inspect
    elsif FIELDS.key?(name)
      @infodata[FIELDS[name]] = @content.inspect
    else
      return
    end
    @content = nil
  end

  # Append text nodes to the buffer of the current element.
  def characters(string)
    @content << string if @content
  end

  # Treat CDATA sections like ordinary text.
  def cdata_block(string)
    characters(string)
  end

  # Write everything collected as one CSV row.
  def end_document
    CSV.open("info.csv", "wb") do |csv|
      csv << @infodata.values
    end
    puts "Finished..."
  end
end
parser = Nokogiri::XML::SAX::Parser.new(MyDocument.new)
# The block form closes the file handle once parsing is done.
File.open("arkivvetenskap.xml", 'rb') { |file| parser.parse(file) }
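
The script above writes only the collected values, so info.csv has no header row and the titles array ends up in a single field. As a rough sketch of one way to make the output easier to read, the snippet below builds CSV text with a header row and joins array values with "; ". The helper name infodata_to_csv and the sample data are hypothetical and not part of the gist; the helper assumes a hash shaped like @infodata.

```ruby
require 'csv'

# Hypothetical helper (not part of the gist): turn a hash shaped like
# @infodata into CSV text with a header row. Array values such as
# :titles are joined with "; " so they occupy a single field.
def infodata_to_csv(infodata)
  CSV.generate do |csv|
    csv << infodata.keys
    csv << infodata.values.map { |v| v.is_a?(Array) ? v.join('; ') : v }
  end
end

# Sample data invented for illustration only.
sample = {
  :titles     => ['Arkivvetenskap', 'Archival Science'],
  :identifier => 'abc-123'
}

puts infodata_to_csv(sample)
```

If this approach fits, end_document could write the returned string to info.csv instead of appending the raw values.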

shuma commented Apr 7, 2012

Thanks, SixArm! That is awesome.
