flavorjones (owner)

gist: 25854 (public)
Public Clone URL: git://gist.github.com/25854.git
_Results.txt
For an html snippet 2374 bytes long ...
                          user     system      total        real
regex * 1000          0.160000   0.010000   0.170000 (  0.182207)
nokogiri * 1000       1.440000   0.060000   1.500000 (  1.537546)
hpricot * 1000        5.740000   0.650000   6.390000 (  6.401207)
 
it took an average of 0.0015 seconds for Nokogiri to parse and operate on an HTML snippet 2374 bytes long
it took an average of 0.0064 seconds for Hpricot to parse and operate on an HTML snippet 2374 bytes long
 
For an html snippet 97517 bytes long ...
                          user     system      total        real
regex * 10            0.100000   0.020000   0.120000 (  0.122117)
nokogiri * 10         0.310000   0.020000   0.330000 (  0.322290)
hpricot * 10          3.190000   0.300000   3.490000 (  3.502819)
 
it took an average of 0.0322 seconds for Nokogiri to parse and operate on an HTML snippet 97517 bytes long
it took an average of 0.3503 seconds for Hpricot to parse and operate on an HTML snippet 97517 bytes long
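
The per-run averages follow from dividing each "real" total by its iteration count. The script itself times with `Time.now`, but the two agree to four decimal places. As a minimal sketch of that arithmetic, the Ruby below recomputes the averages and the Hpricot-to-Nokogiri ratio, which comes out at roughly 4x for the small snippet and 11x for the large one. The names `RESULTS`, `per_run`, and `slowdown` are made up for this illustration and are not part of the gist.

```ruby
# "real" (wall-clock) seconds copied from the benchmark tables above.
# RESULTS, per_run and slowdown are hypothetical names for this sketch.
RESULTS = {
  2374  => { runs: 1000, nokogiri: 1.537546, hpricot: 6.401207 },
  97517 => { runs: 10,   nokogiri: 0.322290, hpricot: 3.502819 },
}.freeze

# Average wall-clock seconds for a single parse-and-rewrite pass.
def per_run(total_seconds, runs)
  total_seconds / runs
end

# How many times longer the slower library took.
def slowdown(slow_seconds, fast_seconds)
  slow_seconds / fast_seconds
end

RESULTS.each do |bytes, r|
  puts format("%d bytes: nokogiri %.4f s/run, hpricot %.4f s/run, hpricot/nokogiri = %.1fx",
              bytes,
              per_run(r[:nokogiri], r[:runs]),
              per_run(r[:hpricot], r[:runs]),
              slowdown(r[:hpricot], r[:nokogiri]))
end
```

The gap widening with input size suggests that parsing cost, not fixed per-call overhead, dominates the difference on the larger page.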
 
benchmark_spanify_links.rb
#! /usr/bin/env ruby
 
require 'rubygems'
gem 'nokogiri', '>=1.0.6'
gem 'hpricot', '>=0.6.170'
 
require 'open-uri'
require 'benchmark'
require 'nokogiri'
require 'hpricot'
 
[
 [1000, "#{File.dirname(__FILE__)}/sample_post.html"],
 [10, "http://slashdot.com/"],
].each do |ntimes, uri|
  
  # open-uri lets open() read both local paths and URLs here (on Ruby 3.0+, use URI.open instead).
  html = open(uri).read
  summary = []
 
  puts "For an html snippet #{html.size} bytes long ..."
  Benchmark.bm(20) do |x|
    x.report("regex * #{ntimes}") do
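      ntimes.times do
        # Both substitutions run on the original string and their results are
        # discarded; only the elapsed time matters. Note that the greedy (.*)
        # groups in the first pattern can span several links on one line.
        html.gsub(/<a\s+(.*)>(.*)<\/a>/i, '<a \1><span>\2</span></a>') # broken regex
        html.gsub(/<(script|noscript|object|embed|style|frameset|frame|iframe)[>\s\S]*<\/\1>/, '')
        html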
      end
    end
    
    stime = Time.now
    x.report("nokogiri * #{ntimes}") do
      ntimes.times do
        doc = Nokogiri::HTML(html)
        doc.search("a/text()").wrap("<span></span>")
        doc.search("script","noscript","object","embed","style","frameset","frame","iframe").unlink
        doc.inner_html
      end
    end
    etime = Time.now
    summary << ("it took an average of %.4f seconds for Nokogiri to parse and operate on an HTML snippet #{html.size} bytes long" % ((etime - stime) / ntimes))
 
    stime = Time.now
    x.report("hpricot * #{ntimes}") do
      ntimes.times do
        doc = Hpricot(html)
        doc.search("a/text()").wrap("<span></span>")
        doc.search(["script","noscript","object","embed","style","frameset","frame","iframe"]).remove
        doc.inner_html
      end
    end
    etime = Time.now
    summary << ("it took an average of %.4f seconds for Hpricot to parse and operate on an HTML snippet #{html.size} bytes long" % ((etime - stime) / ntimes))
  end
 
  puts
  puts summary
  puts
end
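
The script marks its link pattern as a "broken regex". One way it misbehaves, shown here as a minimal sketch using only the standard library, is that the greedy `(.*)` groups can swallow several links on the same line, so only the last link gets wrapped. `TIGHT_LINK` and `spanify` are hypothetical names introduced for comparison; the tightened pattern is an assumption about one possible fix, not something the gist proposes, and it is still no substitute for a real parser (it does not handle link text that spans lines, for example).

```ruby
# The link pattern from the benchmark: both (.*) groups are greedy.
GREEDY_LINK = /<a\s+(.*)>(.*)<\/a>/i

# Hypothetical tightened variant: attributes stop at the first ">",
# and the link text stops at the first "</a>".
TIGHT_LINK = /<a\s+([^>]*)>(.*?)<\/a>/i

REPLACEMENT = '<a \1><span>\2</span></a>'

# Wrap the text of every link matched by `pattern` in a <span>.
def spanify(html, pattern)
  html.gsub(pattern, REPLACEMENT)
end

line = "<a href='x'>one</a> and <a href='y'>two</a>"

puts spanify(line, GREEDY_LINK) # one match covering both links
puts spanify(line, TIGHT_LINK)  # one match per link
```

With the greedy pattern the whole line is a single match, so the first link is left without a span. This is the kind of silent error that makes the regex row in the results a speed baseline, not a correct implementation.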
 
sample_post.html
<p>Yesterday was a big day, and I nearly missed it, since I spent nearly all of the sunlight hours at the wheel of a car. Nine hours sitting on your butt is no way to ... oh wait, that's actually how I spend every day. Just usually not in a rental Hyundai. Never mind, I digress.
</p>
 
<p>It was a big day because <a href='http://nokogiri.rubyforge.org/nokogiri/'>Nokogiri</a> was released. I've spent quite a bit of time over the last couple of months working with <a href='http://tenderlovemaking.com/'>Aaron Patterson</a> (of <a href='http://rubyforge.org/projects/mechanize/'>Mechanize</a> fame) on this excellent library, and so I'm walking around, feeling satisfied.
</p>
 
<p>"What's Nokogiri?" Good question, I'm glad I asked it.
</p>
 
<p>Nokogiri is the best damn XML/HTML parsing library out there in Rubyland. What makes it so good? You can search by XPath. You can search by CSS. You can search by both XPath <i>and</i> CSS. Plus, it uses <a href='http://xmlsoft.org/'>libxml2</a> as the parsing engine, <a href='http://www.xml.com/pub/a/2007/05/09/xml-parser-benchmarks-part-1.html'>so it's fast</a>. But the best part is, it's got a dead-simple interface that we shamelessly lifted from <a href='http://code.whytheluckystiff.net/hpricot/'>Hpricot</a>, everyone's favorite delightful parser.
</p>
 
<p>I had big plans to do a series of posts with examples and benchmarks, but right now I'm in <a href='http://www.google.com/search?q=dst+hell'>DST Hell</a> and don't have the quality time to invest.
</p>
 
<p>So, as I am wont to do, I'm punting. Thankfully, Aaron was his usual prolific self, and has kindly provided lots of documentation and examples:
<ul>
<li><a href='http://tenderlovemaking.com/2008/10/30/nokogiri-is-released/'>Aaron's blog post</a>
<li><a href='http://nokogiri.rubyforge.org/nokogiri/'>Documentation (RDoc)</a>
<li><a href='http://github.com/tenderlove/nokogiri/wikis'>Nokogiri-the-Wiki</a>
<li><a href='http://rubyforge.org/projects/nokogiri'>Nokogiri on Rubyforge</a>
<li><a href='http://gist.github.com/18533'>Benchmarks</a>
<li><a href='http://github.com/tenderlove/nokogiri/'>Git repository</a>
</ul>
</p>
 
<p>Use it in good health! Carry on.</p>
 
<p>P.S. Please start following Aaron on <a href='http://twitter.com/tenderlove'>Twitter</a>. :)</p>
 
<object>dumb-object</object>
<embed>dumb-embed</embed>