@joalbertg
Created May 10, 2022 20:14
Parsing huge XML in Ruby
<?xml version="1.0"?>
<catalog>
<book id="bk101">
<author>Gambardella, Matthew</author>
<title>XML Developer's Guide</title>
<genre>Computer</genre>
<price>44.95</price>
<publish_date>2000-10-01</publish_date>
<description>An in-depth look at creating applications
with XML.</description>
</book>
<!-- comment -->
<book id="bk102">
<author>Ralls, Kim</author>
<title>Midnight Rain</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-12-16</publish_date>
<description>A former architect battles corporate zombies,
an evil sorceress, and her own childhood to become queen
of the world.</description>
</book>
<book id="bk103">
<author>Corets, Eva</author>
<title>Maeve Ascendant</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2000-11-17</publish_date>
<description>After the collapse of a nanotechnology
society in England, the young survivors lay the
foundation for a new society.</description>
</book>
<book id="bk104">
<author>Corets, Eva</author>
<title>Oberon's Legacy</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-03-10</publish_date>
<description>In post-apocalypse England, the mysterious
agent known only as Oberon helps to create a new life
for the inhabitants of London. Sequel to Maeve
Ascendant.</description>
</book>
<book id="bk105">
<author>Corets, Eva</author>
<title>The Sundered Grail</title>
<genre>Fantasy</genre>
<price>5.95</price>
<publish_date>2001-09-10</publish_date>
<description>The two daughters of Maeve, half-sisters,
battle one another for control of England. Sequel to
Oberon's Legacy.</description>
</book>
<book id="bk106">
<author>Randall, Cynthia</author>
<title>Lover Birds</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-09-02</publish_date>
<description>When Carla meets Paul at an ornithology
conference, tempers fly as feathers get ruffled.</description>
</book>
<book id="bk107">
<author>Thurman, Paula</author>
<title>Splish Splash</title>
<genre>Romance</genre>
<price>4.95</price>
<publish_date>2000-11-02</publish_date>
<description>A deep sea diver finds true love twenty
thousand leagues beneath the sea.</description>
</book>
<book id="bk108">
<author>Knorr, Stefan</author>
<title>Creepy Crawlies</title>
<genre>Horror</genre>
<price>4.95</price>
<publish_date>2000-12-06</publish_date>
<description>An anthology of horror stories about roaches,
centipedes, scorpions and other insects.</description>
</book>
<book id="bk109">
<author>Kress, Peter</author>
<title>Paradox Lost</title>
<genre>Science Fiction</genre>
<price>6.95</price>
<publish_date>2000-11-02</publish_date>
<description>After an inadvertent trip through a Heisenberg
Uncertainty Device, James Salway discovers the problems
of being quantum.</description>
</book>
<book id="bk110">
<author>O'Brien, Tim</author>
<title>Microsoft .NET: The Programming Bible</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-09</publish_date>
<description>Microsoft's .NET initiative is explored in
detail in this deep programmer's reference.</description>
</book>
<book id="bk111">
<author>O'Brien, Tim</author>
<title>MSXML3: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>36.95</price>
<publish_date>2000-12-01</publish_date>
<description>The Microsoft MSXML3 parser is covered in
detail, with attention to XML DOM interfaces, XSLT processing,
SAX and more.</description>
</book>
<book id="bk112">
<author>Galos, Mike</author>
<title>Visual Studio 7: A Comprehensive Guide</title>
<genre>Computer</genre>
<price>49.95</price>
<publish_date>2001-04-16</publish_date>
<description>Microsoft Visual Studio 7 is explored in depth,
looking at how Visual Basic, Visual C++, C#, and ASP+ are
integrated into a comprehensive development
environment.</description>
</book>
</catalog>
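# frozen_string_literal: true

# File: big_sax.rb

require 'ox'

module BigXML
  # SAX handler that collects every <book> into a hash and passes the
  # finished catalog to the callable supplied by the caller.
  class SaxParse < ::Ox::Sax
    def initialize(block)
      @yield_to = block
    end

    def start_element(name)
      case name
      when :catalog
        @catalog = {}
      when :book
        @book = {}
      else
        @current_node = name
      end
    end

    def end_element(name)
      case name
      when :catalog
        yield_to.call(catalog)
        @catalog = nil
      when :book
        catalog.merge!("book_#{attr_id}": book)
        @book = nil
      else
        @current_node = nil
      end
    end

    # Called for every attribute; only the id attribute is kept.
    def attr(name, value)
      @attr_id = value if name == :id
    end

    def text(value)
      # Ignore text that is not inside a child element of <book>.
      return unless book && current_node

      book.merge!("#{current_node}": value)
    end

    def comment(value); end

    private

    attr_reader :yield_to, :catalog, :book, :attr_id, :current_node
  end
end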
# frozen_string_literal: true

# File: big_xml.rb

require 'nokogiri'

module BigXML
  # Streams an XML file with Nokogiri::XML::Reader and yields each element node.
  class Parser
    TYPE_ELEMENT = Nokogiri::XML::Reader::TYPE_ELEMENT

    def initialize(xml_file)
      raise(ArgumentError, "Please provide the path of the XML file, not a #{xml_file.class}") unless xml_file.is_a?(String)

      @xml_file = xml_file
    end

    def each_node
      raise(SyntaxError, 'Please provide a valid XML file') unless valid?

      # The block form closes the file even if the caller's block raises.
      File.open(xml_file) do |file|
        Nokogiri::XML::Reader(file).each do |node|
          yield(node) if node.node_type == TYPE_ELEMENT
        end
      end
    end

    private

    attr_reader :xml_file

    # Reads the whole document once. Returns false if the file cannot be
    # read or the XML is not well-formed.
    def valid?
      File.open(xml_file) { |file| Nokogiri::XML::Reader(file).each { |_node| } }
      true
    rescue StandardError
      false
    end
  end
end
# frozen_string_literal: true

# File: read_big_xml.rb
#
# Quickstart:
# > ruby read_big_xml.rb big.xml

require_relative 'big_xml.rb'
require_relative 'big_sax.rb'
require 'ox'
require 'pry'
require 'pry-nav'
require 'json'

# Prints every element name (and the inner XML of leaf elements) using Nokogiri.
def execute_with_nokogiri
  xml = BigXML::Parser.new(ARGV[0])

  xml.each_node do |node|
    name = node.name

    case name
    when 'catalog'
      puts "#{name.capitalize}:"
    when 'book'
      puts " #{name.capitalize}:"
    else
      puts " #{name.capitalize}: #{node.inner_xml}"
    end
  end
end

# Builds the whole catalog as a hash with the Ox SAX handler and prints it.
def execute_with_ox
  pro = proc { |catalog| puts "Catalog: #{Pry::ColorPrinter.pp(catalog.to_h)}" }
  # pro = proc { |catalog| puts "Catalog: #{JSON.pretty_generate(catalog.to_h)}" }
  handler = BigXML::SaxParse.new(pro)

  # The block form closes the file even if parsing raises.
  File.open(ARGV[0]) { |file| Ox.sax_parse(handler, file) }
end

puts '- With Nokogiri'
execute_with_nokogiri

puts "\n- With OX"
execute_with_ox