Created
November 17, 2008 18:25
For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000       1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)

it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long

For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10         0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)

it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long
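
To put the two runs side by side, the relative slowdown of Hpricot can be worked out from the "real" column above. This is only a sketch: the hash layout and the names `real_seconds` and `ratios` are made up for illustration, and the numbers are copied from the output rather than re-measured.

```ruby
# "real" wall-clock seconds copied from the benchmark output above.
# The hash layout and key names are illustrative, not part of the gist.
real_seconds = {
  small: { bytes: 2374,  nokogiri: 1.537546, hpricot: 6.401207 },
  large: { bytes: 97517, nokogiri: 0.322290, hpricot: 3.502819 },
}

# How many times longer Hpricot took than Nokogiri for each input size.
ratios = real_seconds.transform_values do |t|
  (t[:hpricot] / t[:nokogiri]).round(1)
end

ratios.each do |name, ratio|
  puts "#{name} (#{real_seconds[name][:bytes]} bytes): Hpricot took #{ratio}x as long as Nokogiri"
end
# small (2374 bytes): Hpricot took 4.2x as long as Nokogiri
# large (97517 bytes): Hpricot took 10.9x as long as Nokogiri
```

On these two samples the gap widens with input size, though two data points (one of them a live page fetched over the network before timing) are not enough to generalize from.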
#! /usr/bin/env ruby

require 'rubygems'
gem 'nokogiri', '>=1.0.6'
gem 'hpricot', '>=0.6.170'
require 'open-uri'
require 'benchmark'
require 'nokogiri'
require 'hpricot'

# Each entry is [iterations, source]: a small local sample post and a large live page.
[
  [1000, "#{File.dirname(__FILE__)}/sample_post.html"],
  [10, "http://slashdot.com/"],
].each do |ntimes, uri|
  html = open(uri).read
  summary = []
  puts "For an html snippet #{html.size} bytes long ..."

  # Every contender does the same two jobs: wrap link text in <span>
  # and drop script/embed-style elements.
  Benchmark.bm(20) do |x|
    x.report("regex * #{ntimes}") do
      ntimes.times do |j|
        html.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>') # broken regex
        html.gsub(/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/, '')
        html
      end
    end

    stime = Time.now
    x.report("nokogiri * #{ntimes}") do
      ntimes.times do
        doc = Nokogiri::HTML(html)
        doc.search("a/text()").wrap("<span></span>")
        doc.search("script","noscript","object","embed","style","frameset","frame","iframe").unlink
        doc.inner_html
      end
    end
    etime = Time.now
    summary << ("it took an average of %.4f seconds for Nokogiri to parse and operate on an HTML snippet #{html.size} bytes long" % ((etime - stime) / ntimes))

    stime = Time.now
    x.report("hpricot * #{ntimes}") do
      ntimes.times do
        doc = Hpricot(html)
        doc.search("a/text()").wrap("<span></span>")
        doc.search(["script","noscript","object","embed","style","frameset","frame","iframe"]).remove
        doc.inner_html
      end
    end
    etime = Time.now
    summary << ("it took an average of %.4f seconds for Hpricot to parse and operate on an HTML snippet #{html.size} bytes long" % ((etime - stime) / ntimes))
  end

  puts
  puts summary
  puts
end
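
The script flags its link-wrapping pattern as a "broken regex". One likely reason, sketched below with a made-up one-line input, is that the greedy `.*` groups span several links when they share a line. The `lazy` pattern is only an illustrative alternative, not something the benchmark uses, and it is still no substitute for a real parser.

```ruby
# Hypothetical input: two links on the same line.
sample = "<a href='/x'>one</a> and <a href='/y'>two</a>"

# Pattern from the benchmark script: both groups are greedy, so the first
# group swallows everything up to the last '>' that still lets the rest match.
greedy = sample.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>')

# Illustrative alternative (an assumption, not in the gist): stop the
# attribute group at the first '>' and make the text group lazy.
lazy = sample.gsub(/<a\s+([^>]*)>(.*?)<\/a>/i, '<a \1><span>\2</span></a>')

puts greedy
# <a href='/x'>one</a> and <a href='/y'><span>two</span></a>
puts lazy
# <a href='/x'><span>one</span></a> and <a href='/y'><span>two</span></a>
```

With the greedy pattern only the last link on the line gets its text wrapped, so the regex row in the timings is fast partly because it is not doing the same job as the two parsers.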
<p>Yesterday was a big day, and I nearly missed it, since I spent nearly all of the sunlight hours at the wheel of a car. Nine hours sitting on your butt is no way to ... oh wait, that's actually how I spend every day. Just usually not in a rental Hyundai. Never mind, I digress.
</p>
<p>It was a big day because <a href='http://nokogiri.rubyforge.org/nokogiri/'>Nokogiri</a> was released. I've spent quite a bit of time over the last couple of months working with <a href='http://tenderlovemaking.com/'>Aaron Patterson</a> (of <a href='http://rubyforge.org/projects/mechanize/'>Mechanize</a> fame) on this excellent library, and so I'm walking around, feeling satisfied.
</p>
<p>"What's Nokogiri?" Good question, I'm glad I asked it.
</p>
<p>Nokogiri is the best damn XML/HTML parsing library out there in Rubyland. What makes it so good? You can search by XPath. You can search by CSS. You can search by both XPath <i>and</i> CSS. Plus, it uses <a href='http://xmlsoft.org/'>libxml2</a> as the parsing engine, <a href='http://www.xml.com/pub/a/2007/05/09/xml-parser-benchmarks-part-1.html'>so it's fast</a>. But the best part is, it's got a dead-simple interface that we shamelessly lifted from <a href='http://code.whytheluckystiff.net/hpricot/'>Hpricot</a>, everyone's favorite delightful parser.
</p>
<p>I had big plans to do a series of posts with examples and benchmarks, but right now I'm in <a href='http://www.google.com/search?q=dst+hell'>DST Hell</a> and don't have the quality time to invest.
</p>
<p>So, as I am wont to do, I'm punting. Thankfully, Aaron was his usual prolific self, and has kindly provided lots of documentation and examples:
<ul>
<li><a href='http://tenderlovemaking.com/2008/10/30/nokogiri-is-released/'>Aaron's blog post</a>
<li><a href='http://nokogiri.rubyforge.org/nokogiri/'>Documentation (RDoc)</a>
<li><a href='http://github.com/tenderlove/nokogiri/wikis'>Nokogiri-the-Wiki</a>
<li><a href='http://rubyforge.org/projects/nokogiri'>Nokogiri on Rubyforge</a>
<li><a href='http://gist.github.com/18533'>Benchmarks</a>
<li><a href='http://github.com/tenderlove/nokogiri/'>Git repository</a>
</ul>
</p>
<p>Use it in good health! Carry on.</p>
<p>P.S. Please start following Aaron on <a href='http://twitter.com/tenderlove'>Twitter</a>. :)</p>
<object>dumb-object</object>
<embed>dumb-embed</embed>