A small nokogiri xml reader DSL.
# A small DSL to help parse documents using Nokogiri::XML::Reader. The
# XML Reader is a good way to move a cursor through a (large) XML document fast,
# but is not as cumbersome as writing a full SAX document handler. Read about
# it here: http://nokogiri.org/Nokogiri/XML/Reader.html
#
# Just pass the reader to this parser and use a block to specify the nodes that you
# are interested in. You can parse every node or only look inside certain nodes.
#
# A small example:
#
#   Xml::Parser.new(Nokogiri::XML::Reader(open(file))) do
#     inside_element 'User' do
#       for_element 'Name' do puts "Username: #{inner_xml}" end
#       for_element 'Email' do puts "Email: #{inner_xml}" end
#
#       for_element 'Address' do
#         puts 'Start of address:'
#         inside_element do
#           for_element 'Street' do puts "Street: #{inner_xml}" end
#           for_element 'Zipcode' do puts "Zipcode: #{inner_xml}" end
#           for_element 'City' do puts "City: #{inner_xml}" end
#         end
#         puts 'End of address'
#       end
#     end
#   end
#
# It does NOT fail on missing tags, and does not guarantee order of execution. It parses
# every tag regardless of nesting. The only way to guarantee scope is by using
# the `inside_element` method. This limits the parsing to the current or the named tag.
# If tags are encountered multiple times, their blocks will be called multiple times.
require 'nokogiri'

module Xml
  class Parser
    def initialize(node, &block)
      @node = node
      @node.each do
        self.instance_eval(&block)
      end
    end

    def name
      @node.name
    end

    def inner_xml
      @node.inner_xml.strip
    end

    def is_start?
      @node.node_type == Nokogiri::XML::Reader::TYPE_ELEMENT
    end

    def is_end?
      @node.node_type == Nokogiri::XML::Reader::TYPE_END_ELEMENT
    end

    def attribute(attribute)
      @node.attribute(attribute)
    end

    def for_element(name, &block)
      return unless self.name == name and is_start?
      self.instance_eval(&block)
    end

    def inside_element(name=nil, &block)
      return if @node.self_closing?
      return unless name.nil? or (self.name == name and is_start?)

      name = @node.name
      depth = @node.depth

      @node.each do
        return if self.name == name and is_end? and @node.depth == depth
        self.instance_eval(&block)
      end
    end
  end
end

dimitarvp Jun 17, 2011

This is the most amazing Ruby code I have seen (and used) this year. Thanks a ton, mate!

kmile Jun 17, 2011

You're very welcome, glad you've found a use for it :)

dimitarvp Jun 17, 2011

Your code brings to the XML Pull approach the one thing it seriously lacks -- readability.

With it, you just need to set off the "for_element" / "inside_element" starting statements with a blank line above and below, and the XML Pull parsing code becomes a charm to read. Plus, the indentation as you go deeper is intuitive, since we are parsing XML (which is also hierarchical).

In short, I love you. :)

dimitarvp Jun 17, 2011

BTW, I removed the "name", "inner_xml" and "attribute" methods and just added a "method_missing" handler:

def method_missing(sym, *args, &block)
  @node.send sym, *args, &block
end

If there are security concerns I might end up limiting the method names which will get propagated to the node, but for now this works well enough.

I'm considering making a mini-gem and sharing it here on GitHub. My laziness will have the final word. :)
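
One way to limit which method names get propagated to the node is an explicit allow-list. This is only a minimal sketch under assumptions: `NodeProxy`, `FORWARDED` and `FakeNode` are hypothetical names that do not appear in the gist, and `FakeNode` merely stands in for the reader so that the snippet runs without Nokogiri.

```ruby
# Hypothetical stand-in for a Nokogiri::XML::Reader node (assumption, not part of the gist).
FakeNode = Struct.new(:name, :depth, :inner_xml)

class NodeProxy
  # Assumed allow-list; adjust it to the reader methods you actually need.
  FORWARDED = %i[name depth attribute attributes value].freeze

  def initialize(node)
    @node = node
  end

  # Kept explicit because of the strip, as in the original gist.
  def inner_xml
    @node.inner_xml.strip
  end

  # Forward only allow-listed public methods; everything else raises NoMethodError.
  def method_missing(sym, *args, &block)
    if FORWARDED.include?(sym) && @node.respond_to?(sym)
      @node.public_send(sym, *args, &block)
    else
      super
    end
  end

  # Keeps respond_to? consistent with what method_missing accepts.
  def respond_to_missing?(sym, include_private = false)
    (FORWARDED.include?(sym) && @node.respond_to?(sym)) || super
  end
end

proxy = NodeProxy.new(FakeNode.new('User', 1, "\n  Alice\n"))
puts proxy.name       # forwarded to the node
puts proxy.inner_xml  # handled by the explicit, stripping method
```

Using `public_send` instead of `send` means private methods of the node stay unreachable even if a name ends up on the list by mistake.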

kmile Jun 17, 2011

I decided to write this DSL after I looked at my 300-line XML reader code and could not understand what was going on even a day later. It turns out that with this DSL the code got easier to read and maintain, faster, and less prone to errors as well!

The reason I wanted to define the inner_xml explicitly is because of the strip. If there are newlines and indenting in the XML, you also get them in your result which is usually not what you want.

Otherwise the method_missing is a nice addition if you end up hooking into the other node properties a lot as well.

My laziness had the final word for me: I created this gist instead of a full-fledged gem or project ;)

dimitarvp Jun 26, 2011

BTW, do you know a faster replacement for the "instance_eval(&block)"? I did some profiling and I didn't like the results in this regard. ;)
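
One candidate replacement is to pass the parser to the block with `yield self` instead of rebinding `self`. Whether that is actually faster depends on the Ruby version and the workload, so it is worth measuring. The sketch below is a rough comparison under assumptions: `EvalStyle`, `YieldStyle` and `time_it` are hypothetical names, and the classes only imitate the gist's `for_element`.

```ruby
class EvalStyle
  def name
    'User'
  end

  # Style used by the gist: the block runs with self set to the parser.
  def for_element(&block)
    instance_eval(&block)
  end
end

class YieldStyle
  def name
    'User'
  end

  # Alternative: hand the parser to the block as an argument.
  def for_element
    yield self
  end
end

# Prints the wall-clock time taken by the given block.
def time_it(label)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  yield
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  puts format('%-14s %.3fs', label, elapsed)
end

ITERATIONS = 200_000
eval_parser = EvalStyle.new
yield_parser = YieldStyle.new

time_it('instance_eval') { ITERATIONS.times { eval_parser.for_element { name } } }
time_it('yield self')    { ITERATIONS.times { yield_parser.for_element { |parser| parser.name } } }
```

The trade-off is readability: with `yield self` every call inside the block needs an explicit receiver (`parser.inner_xml` rather than `inner_xml`).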

joshuaflanagan Jan 9, 2012

This is very helpful, thanks a lot. I fixed a bug that occurs when inside_element matches on a self-closing tag:

def inside_element(name=nil, &block)
  return if @node.self_closing? # there is nothing inside this element, get out now
  return unless name.nil? or (self.name == name and is_start?)

kmile Jan 9, 2012

Ah, I have not taken self-closing elements into account. You're right, thanks!

dimitarvp Feb 28, 2012

Just a quick question -- I haven't been able to decipher what the method "state" returns? Does anyone have any idea of what is the possible set of values it might return?

kmile Feb 28, 2012

This appears to be from libxml2. It seems that it references the xmlTextReaderMode enum:

typedef enum {
    XML_TEXTREADER_MODE_INITIAL = 0,
    XML_TEXTREADER_MODE_INTERACTIVE = 1,
    XML_TEXTREADER_MODE_ERROR = 2,
    XML_TEXTREADER_MODE_EOF = 3,
    XML_TEXTREADER_MODE_CLOSED = 4,
    XML_TEXTREADER_MODE_READING = 5
} xmlTextReaderMode;

But I cannot tell for sure without looking at the source what these states/modes mean, or if this is a complete list.
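
If those are indeed the values that `state` returns, a small lookup table makes them readable in logs. This is a sketch resting on that assumption: the names are copied from the enum above, while `READER_MODES` and `reader_mode_name` are hypothetical helpers, not part of Nokogiri.

```ruby
# Mode names copied from the libxml2 enum quoted above.
# Assumption: the reader's `state` method returns these integers unchanged.
READER_MODES = {
  0 => :initial,
  1 => :interactive,
  2 => :error,
  3 => :eof,
  4 => :closed,
  5 => :reading
}.freeze

# Returns a symbolic name for a state value, or :unknown for anything unlisted.
def reader_mode_name(state)
  READER_MODES.fetch(state, :unknown)
end

puts reader_mode_name(3)   # prints: eof
puts reader_mode_name(42)  # prints: unknown
```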

santuxus Feb 12, 2013

That's a really nice and useful gist.
Thanks!

cmartin81 Mar 1, 2013

Very nice code! Thanks a lot!

lesterz Nov 15, 2013

Is there any more documentation or examples on how to use this anywhere? I'm having a hard time instantiating classes inside for_element. Keep getting NoMethodErrors...
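
A likely cause, though this is an assumption since the failing code is not shown, is that the gist runs every block with `instance_eval`. Inside the block `self` is the parser, so methods and instance variables of the calling object are no longer reachable, while local variables still are. The sketch below reproduces this with a hypothetical `MiniParser` that imitates the gist; `Importer` and `build_user` are made-up names as well.

```ruby
# Minimal stand-in for the gist's parser: it runs the block with instance_eval,
# so `self` inside the block is the MiniParser, not the caller.
class MiniParser
  def initialize(&block)
    instance_eval(&block)
  end

  def inner_xml
    'Alice'
  end
end

class Importer
  attr_reader :users

  def initialize
    @users = []
  end

  def build_user(name)
    { name: name }
  end

  # Raises NoMethodError: build_user is looked up on MiniParser, not on Importer.
  def broken
    MiniParser.new { build_user(inner_xml) }
  end

  # Works: the local variable `importer` is still visible inside the block.
  def working
    importer = self
    MiniParser.new { importer.users << importer.build_user(inner_xml) }
    users
  end
end

p Importer.new.working
```

The same applies to instance variables: `@users` inside the block would refer to the parser's (unset) variable, so collect results through a local variable instead.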

joonty Feb 7, 2014

This really is fantastic - excellent work!

gal-at-aljazeera May 25, 2014

I just woke up in the middle of the night envisioning something like this.

And it already exists.

Good work.

nicka Jan 13, 2015

OMG! This is awesome!!!! +1

twigbranch May 12, 2015

inner_xml doesn't seem to unescape &amp; -- what's the recommended way to do this?
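
Since `inner_xml` returns markup, an ampersand in the text comes back as the entity `&amp;`. When the element is known to contain only text, one option is to decode the entities afterwards with Ruby's `CGI.unescapeHTML`. This is a sketch rather than an official recommendation, and `unescaped_text` is a hypothetical helper name.

```ruby
require 'cgi'

# Hypothetical helper: decode entities in a fragment that holds plain text only.
# Do not use it on fragments with child elements, or text and markup get mixed up.
def unescaped_text(fragment)
  CGI.unescapeHTML(fragment)
end

puts unescaped_text('Tom &amp; Jerry')  # prints: Tom & Jerry
```

In the gist this could be applied inside the `inner_xml` method, after the `strip`.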

saroar Apr 7, 2016

Can anybody help me? I have an XML file which is 1 GB, and I need to find a certain category and import 100 products from it.
Here is my code.

In the controller:

def import
  if params[:xml_file]
    file = params[:xml_file]
    doc = Nokogiri::XML::Document.parse(file)
    total_product = doc.xpath('//shop/offers/offer').take(2).length

    Product.import(doc, params[:category_id])
    redirect_to products_path, notice: "#{total_product} Product added."
  end
end

And in the product model:

def self.import(doc, category)
  parsed_products = doc.xpath('//shop/offers/offer').take(2)

  if !self.fashion.nil?
    self.transaction do
      parsed_products.each do |product|
        if product.at_xpath('categoryId').text == category
          Product.create!(
            price: product.at_xpath('price').text,
            category_id: product.at_xpath('categoryId').text,
            remote_image_url: product.at_xpath('picture').text.strip,
            brand_id: product.at_xpath('vendor').text,
            title: product.at_xpath('name').text,
            description: product.at_xpath('description').text,
            gender: product.at_xpath('fashion/gender').present? ? product.at_xpath('fashion/gender').text.gsub("m","Male").gsub("f","Female") : nil,
            product_type: product.at_xpath('fashion/type').present? ? product.at_xpath('fashion/type').text : '',
          )
        end
      end
    end
  end
end

The form:

h2.text-center Import Products

= form_tag import_products_path, multipart: true do |f|
  = file_field_tag :xml_file
  br

  br
  br
  = submit_tag "Import"

Any advice will be appreciated. Thanks in advance.

wdiechmann Dec 6, 2016

Had a 60+GB xml on my hands - and until @kmile showed me the path I was utterly lost in XML up above my ears :)

Thank you - from the bottom of my ❤️

WendyBeth Dec 23, 2016

This is beautiful and saved me so much time and pain. Thank you @kmile.

EricDuminil Feb 17, 2017

Thanks a lot for this wonderful piece of code. Did anyone get it to work with JRuby?

cmalpeli Mar 7, 2017

@kmile this is awesome! Is there a way to prevent the text coming back with CDATA wrappers?

<![CDATA[My Text]]>
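
Because `inner_xml` returns the raw markup, the CDATA wrapper is part of the result. One simple option is to strip the wrapper from the returned string. This is a sketch, and `strip_cdata` is a hypothetical helper name rather than anything provided by the gist or by Nokogiri.

```ruby
# Hypothetical helper: replaces every CDATA section with its bare content.
def strip_cdata(fragment)
  fragment.gsub(/<!\[CDATA\[(.*?)\]\]>/m, '\1')
end

puts strip_cdata('<![CDATA[My Text]]>')  # prints: My Text
```

In the gist this could be chained onto the existing `inner_xml` method, for example after the `strip`.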
