@miyohide
Created October 13, 2014 14:43
JRuby 1.7.15 & Nokogiri 1.6.3.1 (java) encoding problem?
[~/work/hypermicrodata]$ ruby -v
jruby 1.7.15 (1.9.3p392) 2014-09-03 82b5cc3 on Java HotSpot(TM) 64-Bit Server VM 1.8.0_11-b12 +jit [darwin-x86_64]
[~/work/hypermicrodata]$ bundle exec gem list | grep nokogiri
nokogiri (1.6.3.1 java)
[~/work/hypermicrodata]$ cat test/data/example.html
<!doctype html>
<html>
<!-- shameless -->
<head>
<title>Jason Ronallo</title>
</head>
<body>
<span itemscope itemtype="http://schema.org/Person"
itemid="http://ronallo.com#me">
<a itemprop="url" href="http://twitter.com/ronallo">
<span itemprop="name">Jason Ronallo</span>
</a> is the
<span itemprop="jobTitle">Associate Head of Digital Library Initiatives</span> at
<span itemprop="affiliation" itemscope itemtype="http://schema.org/Library" itemid="http://lib.ncsu.edu">
<span itemprop="name">
<a itemprop="url" href="http://www.lib.ncsu.edu">NCSU Libraries</a>
</span>
</span>.
</span>
</body>
</html>
[~/work/hypermicrodata]$ sw_vers
ProductName: Mac OS X
ProductVersion: 10.9.5
BuildVersion: 13F34
[~/work/hypermicrodata]$
irb(main):001:0> require 'nokogiri'
=> true
irb(main):002:0> Nokogiri.HTML(open('test/data/example.html'))
=> #<Nokogiri::HTML::Document:0xca0 name="document" children=[#<Nokogiri::XML::Element:0xc9e name="html" children=[#<Nokogiri::XML::Element:0xc9a name="head">, #<Nokogiri::XML::Element:0xc9c name="body">]>]>
irb(main):003:0> Nokogiri.HTML(open('test/data/example.html'), nil, 'UTF-8')
=> #<Nokogiri::HTML::Document:0xd02 name="document" children=[#<Nokogiri::XML::DTD:0xca2 name="html">, #<Nokogiri::XML::Element:0xd00 name="html" children=[#<Nokogiri::XML::Text:0xca4 "\n ">, #<Nokogiri::XML::Comment:0xca6 " shameless ">, #<Nokogiri::XML::Text:0xca8 "\n ">, #<Nokogiri::XML::Element:0xcb2 name="head" children=[#<Nokogiri::XML::Text:0xcaa "\n ">, #<Nokogiri::XML::Element:0xcae name="title" children=[#<Nokogiri::XML::Text:0xcac "Jason Ronallo">]>, #<Nokogiri::XML::Text:0xcb0 "\n ">]>, #<Nokogiri::XML::Text:0xcb4 "\n\n ">, #<Nokogiri::XML::Element:0xcfe name="body" children=[#<Nokogiri::XML::Text:0xcb6 "\n ">, #<Nokogiri::XML::Element:0xcfa name="span" attributes=[#<Nokogiri::XML::Attr:0xcb8 name="itemid" value="http://ronallo.com#me">, #<Nokogiri::XML::Attr:0xcba name="itemscope">, #<Nokogiri::XML::Attr:0xcbc name="itemtype" value="http://schema.org/Person">] children=[#<Nokogiri::XML::Text:0xcbe "\n ">, #<Nokogiri::XML::Element:0xcce name="a" attributes=[#<Nokogiri::XML::Attr:0xcc0 name="href" value="http://twitter.com/ronallo">, #<Nokogiri::XML::Attr:0xcc2 name="itemprop" value="url">] children=[#<Nokogiri::XML::Text:0xcc4 "\n ">, #<Nokogiri::XML::Element:0xcca name="span" attributes=[#<Nokogiri::XML::Attr:0xcc6 name="itemprop" value="name">] children=[#<Nokogiri::XML::Text:0xcc8 "Jason Ronallo">]>, #<Nokogiri::XML::Text:0xccc "\n ">]>, #<Nokogiri::XML::Text:0xcd0 " is the \n ">, #<Nokogiri::XML::Element:0xcd6 name="span" attributes=[#<Nokogiri::XML::Attr:0xcd2 name="itemprop" value="jobTitle">] children=[#<Nokogiri::XML::Text:0xcd4 "Associate Head of Digital Library Initiatives">]>, #<Nokogiri::XML::Text:0xcd8 " at \n ">, #<Nokogiri::XML::Element:0xcf6 name="span" attributes=[#<Nokogiri::XML::Attr:0xcda name="itemid" value="http://lib.ncsu.edu">, #<Nokogiri::XML::Attr:0xcdc name="itemprop" value="affiliation">, #<Nokogiri::XML::Attr:0xcde name="itemscope">, #<Nokogiri::XML::Attr:0xce0 name="itemtype" value="http://schema.org/Library">] 
children=[#<Nokogiri::XML::Text:0xce2 "\n ">, #<Nokogiri::XML::Element:0xcf2 name="span" attributes=[#<Nokogiri::XML::Attr:0xce4 name="itemprop" value="name">] children=[#<Nokogiri::XML::Text:0xce6 "\n ">, #<Nokogiri::XML::Element:0xcee name="a" attributes=[#<Nokogiri::XML::Attr:0xce8 name="href" value="http://www.lib.ncsu.edu">, #<Nokogiri::XML::Attr:0xcea name="itemprop" value="url">] children=[#<Nokogiri::XML::Text:0xcec "NCSU Libraries">]>, #<Nokogiri::XML::Text:0xcf0 "\n ">]>, #<Nokogiri::XML::Text:0xcf4 "\n ">]>, #<Nokogiri::XML::Text:0xcf8 ".\n ">]>, #<Nokogiri::XML::Text:0xcfc "\n \n">]>]>]>
irb(main):004:0>
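
In the session above, the call without an encoding returns a document whose head and body are empty, while the call with 'UTF-8' as the third argument returns the full tree. A minimal sketch of that workaround follows, assuming the difference comes from encoding detection on the input (the transcript does not confirm the cause). The helper names read_html_utf8 and parse_html_utf8 are hypothetical; only the Nokogiri::HTML(input, url, encoding) call mirrors the session.

```ruby
# Hypothetical helpers illustrating the workaround from the session above.
# Assumption: forcing UTF-8 on both the read and the parse avoids the
# empty <head>/<body> seen when no encoding is given.

# Read the whole file as a UTF-8 string instead of handing Nokogiri an IO.
def read_html_utf8(path)
  File.read(path, encoding: 'UTF-8')
end

# Parse the file, passing the encoding explicitly as the third argument
# (the second argument, the document URL, is left nil as in the session).
def parse_html_utf8(path)
  require 'nokogiri' # loaded lazily so read_html_utf8 works without the gem
  Nokogiri::HTML(read_html_utf8(path), nil, 'UTF-8')
end
```

With the gem installed, parse_html_utf8('test/data/example.html') should correspond to the second call in the session; whether it behaves the same on other JRuby or Nokogiri versions is untested here.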