Skip to content

Instantly share code, notes, and snippets.

@flavorjones
Created July 8, 2010 16:06
Show Gist options
  • Star 1 You must be signed in to star a gist
  • Fork 0 You must be signed in to fork a gist
  • Save flavorjones/468226 to your computer and use it in GitHub Desktop.
Save flavorjones/468226 to your computer and use it in GitHub Desktop.
# String Encoding, Nokogiri and You: A Primer
## First: A Few Thing To Consider
1. It is common for HTML pages to declare
1. Without information instructing it differently, Nokogiri's core library ([libxml2][]) will
[libxml2]: http://xmlsoft.org/
[meta-bug]: https://bugzilla.gnome.org/show_bug.cgi?id=579317
useful?: http://www.w3schools.com/XML/xml_encoding.asp
TODO:
- https://gist.github.com/2fc6c8b3dc42282dc081
- what happens when meta tag says ISO-859-1 and user specifies UTF-8?
- xchat log:
Dec 30 13:47:17 <rnewman> hi. how do I get nokogiri to respect the encoding I pass to Nokogiri.parse?
Dec 30 13:47:51 <rnewman> I do this, and it tells me the string is utf-8, and then the regular expression match fails with an incompatible encoding error:
Dec 30 13:48:00 <rnewman> url = "http://liberation.fr"
Dec 30 13:48:00 <rnewman> doc = Nokogiri.parse(open(url), url, "iso-8859-1")
Dec 30 13:48:00 <rnewman> title = (doc / 'title').text
Dec 30 13:48:00 <rnewman> puts title
Dec 30 13:48:00 <rnewman> puts title.encoding
Dec 30 13:48:01 <rnewman> puts title.gsub(/[èéëê]/, 'e')
Dec 30 13:53:47 <flavorjones> rnewman: instead of text(), try inner_html()
Dec 30 13:54:52 <flavorjones> the convention followed is that serialized contents (with to_html, inner_html, etc.) are in the requested encoding ...
Dec 30 13:55:03 <flavorjones> but .text et al always return utf-8
Dec 30 13:55:58 <rnewman> but I don't want the whole HTML tag, I just want the text contents.
Dec 30 13:56:26 <rnewman> (is this documented somewhere that I have missed?)
Dec 30 13:58:21 <rnewman> I tried changing .text to .first.child.content but that did not help
Dec 30 13:58:57 <tenderlove> rnewman: strings returned from nokogiri are always UTF-8
Dec 30 13:59:03 <tenderlove> I take it you're using ruby 1.9?
Dec 30 14:01:17 <rnewman> yes, ruby 1.9
Dec 30 14:01:35 <rnewman> so what does the encoding parameter of Nokogiri.parse() do? I am very confused.
Dec 30 14:01:35 <tenderlove> rnewman: okay, cool
Dec 30 14:01:54 * pauldix (n=pauldix@65.213.72.189) has joined #nokogiri
Dec 30 14:02:02 <tenderlove> so, the parameter to parse() means that the document you're giving to nokogiri is encoded as iso-8859-1
Dec 30 14:02:09 * ujihisa (n=ujihisa@S01060003520a5ddd.vc.shawcable.net) has joined #nokogiri
Dec 30 14:02:26 <tenderlove> no matter what the document you're parsing is encoded as, nokogiri will return strings as utf-8
Dec 30 14:02:37 <tenderlove> so internally, it converts the iso-8859-1 document to UTF-8
Dec 30 14:03:05 <tenderlove> since you're using 1.9
Dec 30 14:03:15 <tenderlove> you can convert the UTF-8 string to whatever encoding you want
Dec 30 14:03:23 <tenderlove> one sec, and I'll whip up an example
Dec 30 14:04:00 <rnewman> now suppose I'm using 1.8.[67] instead -- what will nokogiri do?
Dec 30 14:04:15 <tenderlove> rnewman: even in 1.8, the strings are returned as UTF-8
Dec 30 14:04:27 <tenderlove> it's just that 1.8 doesn't know about string encodings vs regular expressions
Dec 30 14:04:42 <rnewman> in 1.8, this doesn't do the substitution and I don't know why:
Dec 30 14:04:42 <rnewman> puts title.gsub(/[èéëê]/, 'e')
Dec 30 14:05:05 <rnewman> (I'd like to write a program that works correctly for both 1.8 and 1.9)
Dec 30 14:07:19 <tenderlove> rnewman: this works in 1.8 and 1.9 https://gist.github.com/d96b037c0a01612cae52
Dec 30 14:07:32 <tenderlove> probably you needed the magic comment
Dec 30 14:08:13 <rnewman> I have the magic comment but I don't think it does anything in 1.8
Dec 30 14:08:29 <tenderlove> no, the magic comment doesn't do anything in 1.8
Dec 30 14:09:27 <flavorjones> tenderlove, rnewman: that gist outputs differently for me in 1.8 and 1.9:
Dec 30 14:09:45 <flavorjones> "Toute l'actualitee avec Libeeration" versus "Toute l'actualite avec Liberation"
Dec 30 14:09:56 <flavorjones> note the doubled 'e' in 1.8
Dec 30 14:10:25 <tenderlove> flavorjones: ya, that's happening to me too
Dec 30 14:10:25 <tenderlove> wtf
Dec 30 14:11:28 <tenderlove> flavorjones, rnewman: here we go: https://gist.github.com/2fc6c8b3dc42282dc081
Dec 30 14:11:34 <tenderlove> need to tell ruby the regex is UTF-8
Dec 30 14:11:42 <flavorjones> ah, nice
Dec 30 14:12:15 <flavorjones> tenderlove: Imma put together a short tutorial on encoding and use this example. mmkey?
Dec 30 14:12:26 <tenderlove> sounds like a good plan to me
Dec 30 14:12:38 <tenderlove> we should demonstrate using 1.9
Dec 30 14:12:40 <flavorjones> do you care if I push straight to heroku? or do you want to be the point man for that
Dec 30 14:12:45 <tenderlove> since we can show the string encoding easily
Dec 30 14:12:52 <tenderlove> flavorjones: nope, totally don't care
Dec 30 14:13:03 <flavorjones> tenderlove: well, I'll try not to f it up
Dec 30 14:13:05 <tenderlove> in fact, I think we need to push the new docco to heroku anyway
Dec 30 14:15:08 <rnewman> am trying to use your code; I get:
Dec 30 14:15:18 <rnewman> encoding-test.rb:7:in `<main>': undefined local variable or method ` ' for main:Object (NameError)
Dec 30 14:17:38 <tenderlove> rnewman: are you actually checking out the gist, or did you copy and paste?
Dec 30 14:19:18 <rnewman> just copied and pasted.
Dec 30 14:19:44 <tenderlove> can you try cloning the gist and running from there?
Dec 30 14:19:52 <tenderlove> I just want to make sure there are no copy-paste errors
Dec 30 14:20:04 <rnewman> i don't know what 'clone the gist' or 'checking out the gist' means.
Dec 30 14:20:15 <tenderlove> hmmm
Dec 30 14:20:16 <tenderlove> okay
Dec 30 14:20:29 <rnewman> I just copied and pasted from the 'raw' link.
Dec 30 14:20:43 <tenderlove> what OS are you on?
Dec 30 14:20:49 <rnewman> Leopard
Dec 30 14:21:18 <rnewman> this is all pretty new to me. it should be simple but seems to be very confusing.
Dec 30 14:21:32 <tenderlove> dealing with encodings is confusing
Dec 30 14:21:40 <tenderlove> try this:
Dec 30 14:21:45 <tenderlove> curl https://gist.github.com/raw/2fc6c8b3dc42282dc081/8ffe609c9057c043c2595151e2d5a724cd536eba/gistfile1.txt > foo.rb
Dec 30 14:21:56 <tenderlove> then run foo.rb
Dec 30 14:22:22 * cwalcott (n=cwalcott@209-20-65-112.slicehost.net) has joined #nokogiri
Dec 30 14:23:42 <rnewman> it printed this (in 1.9.1):
Dec 30 14:23:43 <rnewman> {"Toute l'actualité avec Libération"=>"Toute l'actualité avec Libération"}
Dec 30 14:23:51 <rnewman> that doesn't look right to me
Dec 30 14:24:38 <tenderlove> what is the output of "nokogiri -v" ?
Dec 30 14:25:06 <rnewman> nokogiri: 1.4.1
Dec 30 14:25:06 <rnewman> libxml:
Dec 30 14:25:06 <rnewman> binding: extension
Dec 30 14:25:06 <rnewman> compiled: 2.7.3
Dec 30 14:25:06 <rnewman> loaded: 2.7.3
Dec 30 14:25:38 <tenderlove> huh
Dec 30 14:25:38 <rnewman> in 1.8.7 I had to do ruby -r rubygems foo.rb, and I got this:
Dec 30 14:25:46 <rnewman> {"Toute l'actualit\303\203\302\251 avec Lib\303\203\302\251ration"=>"Toute l'actualit\303\203\302\251 avec Lib\303\203\302\251ration"}
Dec 30 14:26:03 <tenderlove> flavorjones: any ideas?
Dec 30 14:28:23 <tenderlove> rnewman: I'm not sure what the difference between our machines is. Something is different on your box than on mine.
Dec 30 14:30:50 * stepheneb (n=stephene@otrunk/stepheneb) has joined #nokogiri
Dec 30 14:34:49 * ujihisa_ (n=ujihisa@S01060003520a5ddd.vc.shawcable.net) has joined #nokogiri
Dec 30 14:35:37 <rnewman> so i'm not understanding what i am doing wrong here or what I should try next.
Dec 30 14:36:12 <tenderlove> I don't think you're doing anything wrong.
Dec 30 14:36:17 <tenderlove> I'm at a loss for next steps
Dec 30 14:36:27 <tenderlove> that code does what you want on my machine with 1.8 and 1.9
Dec 30 14:36:37 <tenderlove> seems that way for flavorjones too
Dec 30 14:36:49 <tenderlove> so there must be a difference in our environments that I don't know about
Dec 30 14:38:25 * serabe (n=sergio@89.131.182.39) has joined #nokogiri
Dec 30 14:38:28 <serabe> hi
Dec 30 14:38:36 <tenderlove> serabe: hey
Dec 30 14:38:51 <rnewman> what should I look at ?
Dec 30 14:39:13 * ujihisa has quit (Read error: 60 (Operation timed out))
Dec 30 14:39:57 <tenderlove> rnewman: I'm not sure. You're running libxml2 that's a couple revisions behind, but I don't think that's the problem
Dec 30 14:40:51 <tenderlove> rnewman: what does "locale" say?
Dec 30 14:41:42 <rnewman> $ locale
Dec 30 14:41:42 <rnewman> LANG="en_US.UTF-8"
Dec 30 14:41:42 <rnewman> LC_COLLATE="en_US.UTF-8"
Dec 30 14:41:42 <rnewman> LC_CTYPE="en_US.UTF-8"
Dec 30 14:41:42 <rnewman> LC_MESSAGES="en_US.UTF-8"
Dec 30 14:41:43 <rnewman> LC_MONETARY="en_US.UTF-8"
Dec 30 14:41:45 <rnewman> LC_NUMERIC="en_US.UTF-8"
Dec 30 14:41:47 <rnewman> LC_TIME="en_US.UTF-8"
Dec 30 14:41:49 <rnewman> LC_ALL=
Dec 30 14:41:51 <rnewman> $ locale
Dec 30 14:41:53 <rnewman> LANG="en_US.UTF-8"
Dec 30 14:41:55 <rnewman> LC_COLLATE="en_US.UTF-8"
Dec 30 14:41:57 <rnewman> LC_CTYPE="en_US.UTF-8"
Dec 30 14:41:59 <rnewman> LC_MESSAGES="en_US.UTF-8"
Dec 30 14:42:01 <rnewman> LC_MONETARY="en_US.UTF-8"
Dec 30 14:42:03 <rnewman> LC_NUMERIC="en_US.UTF-8"
Dec 30 14:42:05 <rnewman> LC_TIME="en_US.UTF-8"
Dec 30 14:42:07 <rnewman> LC_ALL=
Dec 30 14:42:09 <rnewman> $ locale
Dec 30 14:42:11 <rnewman> LANG="en_US.UTF-8"
Dec 30 14:42:13 <rnewman> LC_COLLATE="en_US.UTF-8"
Dec 30 14:42:15 <rnewman> LC_CTYPE="en_US.UTF-8"
Dec 30 14:42:16 <tenderlove> uh
Dec 30 14:42:17 <rnewman> LC_MESSAGES="en_US.UTF-8"
Dec 30 14:42:19 <rnewman> LC_MONETARY="en_US.UTF-8"
Dec 30 14:42:21 <rnewman> LC_NUMERIC="en_US.UTF-8"
Dec 30 14:42:23 <rnewman> LC_TIME="en_US.UTF-8"
Dec 30 14:42:25 <rnewman> LC_ALL=
Dec 30 14:42:27 <rnewman> oops, sorry for the repeat
Dec 30 14:43:40 <tenderlove> that's all the same as mine
Dec 30 14:45:42 <rnewman> I added in p title.chars.to_a and p title.bytes.to_a (before the gsub) and got:
Dec 30 14:45:56 <rnewman> ["T", "o", "u", "t", "e", " ", "l", "'", "a", "c", "t", "u", "a", "l", "i", "t", "Ã", "©", " ", "a", "v", "e", "c", " ", "L", "i", "b", "Ã", "©", "r", "a", "t", "i", "o", "n"]
Dec 30 14:45:56 <rnewman> [84, 111, 117, 116, 101, 32, 108, 39, 97, 99, 116, 117, 97, 108, 105, 116, 195, 131, 194, 169, 32, 97, 118, 101, 99, 32, 76, 105, 98, 195, 131, 194, 169, 114, 97, 116, 105, 111, 110]
Dec 30 14:47:09 <tenderlove> oh!
Dec 30 14:47:16 <tenderlove> that's helpful!
Dec 30 14:47:17 <tenderlove> okay
Dec 30 14:47:36 <rnewman> (obviously that will only work in 1.9)
Dec 30 14:48:47 <tenderlove> rnewman: try changing this Nokogiri::HTML(open(url))
Dec 30 14:48:52 <tenderlove> to this: Nokogiri::HTML(open(url), url, 'UTF-8')
Dec 30 14:49:18 <rnewman> will do so now. utf-8 is not the default then?
Dec 30 14:49:47 <tenderlove> not for html
Dec 30 14:49:57 <tenderlove> html spec says ISO-8859-1
Dec 30 14:50:03 <tenderlove> :-(
Dec 30 14:50:34 <rnewman> but ... the html of that page says: <meta http-equiv="content-type" content="text/html; charset=UTF-8" />
Dec 30 14:51:07 <rnewman> (and that still doesn't explain why it worked for you without the 'UTF-8" parameter)
Dec 30 14:52:18 <rnewman> in ruby 1.9.1 your revised script works correctly.
Dec 30 14:52:24 <rnewman> however in ruby 1.8.7 I get:
Dec 30 14:52:31 <rnewman> "Toute l'actualit\303\251 avec Lib\303\251ration"
Dec 30 14:52:31 <rnewman> "Toute l'actualitee avec Libeeration"
Dec 30 14:54:08 <tenderlove> rnewman: I suspect the meta tag problem is a bug in libxml2 (I recall submitting a patch for that)
Dec 30 14:54:23 <tenderlove> does your script have the "u" switch on the regex?
Dec 30 14:54:45 * stepheneb has quit ()
Dec 30 14:54:48 <rnewman> oops, i took it out while expeimenting ... let me put it back right now and try again.
Dec 30 14:56:04 <rnewman> much better now. so the answer seems to be (a) always specify 'UTF-8' explicitly to Nokogiri.parse; (b) always use the /u switch in regex if I have accented characters in the regex
Dec 30 14:56:54 <rnewman> if I don't specify a character encoding to Nokogirl.parse then you say it ignores the meta tag and always uses ISO-8859-1 instead of UTF-8 ?
Dec 30 14:57:57 <tenderlove> rnewman: no, it doesn't always do that. It will use the meta tag
Dec 30 14:58:24 <tenderlove> but there was a bug if libxml2 saw non iso-8859-1 characters before the meta tag
Dec 30 14:58:30 <tenderlove> it would try to guess then encoding
Dec 30 14:58:40 <tenderlove> then ignore what the meta tag says
Dec 30 14:59:10 <rnewman> and that page has a <head> tag with accented chars before the <meta> tag.
Dec 30 15:00:18 <rnewman> now ... is this all still going to work for a page whose declared encoding is ISO-8859-1 instead of utf-8 ?
Dec 30 15:00:42 <tenderlove> that is a good question. I don't know
Dec 30 15:00:55 <rnewman> (as this started with my trying to read pages at http://extension.harvard.edu with professor names that have accented chars, and those pages are iso-8859-1)
Dec 30 15:01:43 <tenderlove> huh, http://www.extension.harvard.edu/default.jsp has a meta tag that says UTF-8
Dec 30 15:01:50 <rnewman> it would be good if you can put advice on the nokogiri website for how to write a program that is portable between ruby 1.8.[67] and 1.9.1 when multi-byte chars are involved.
Dec 30 15:02:11 <tenderlove> ya, that's what flavorjones and I were just talking about
Dec 30 15:02:13 <flavorjones> rnewman: we discussed that about an hour ago
Dec 30 15:02:31 <flavorjones> i'm going to be borrowing heavily from this conversation and put together a tutorial on nokogiri.org
Dec 30 15:02:55 <rnewman> but http://www.extension.harvard.edu/2009-10/about/faculty/a_c.jsp has iso-8859-1
Dec 30 15:02:57 <tenderlove> flavorjones: we need to figure out the behavior wrt specifying an encoding vs the encoding in the meta tag
Dec 30 15:03:09 <flavorjones> tenderlove: indeed. i thought that was a libxml2 bug
Dec 30 15:03:24 <tenderlove> flavorjones: I think rnewman is seeing a libxml2 bug
Dec 30 15:03:29 <flavorjones> ya
Dec 30 15:03:45 <tenderlove> but we need to know what happens when a user specifies UTF-8 and the meta tag declares ISO-8859-1
Dec 30 15:03:45 <rnewman> i'm using the libxml2 from XAMPP which is more recent than the one that came with Leopard.
Dec 30 15:03:47 <flavorjones> libxml2's meta tag handling is pretty brittle all around
Dec 30 15:04:07 <flavorjones> tenderlove: ya, i'll do a matrix of libxml2 and ruby versions and figure out what happens
Dec 30 15:04:16 <tenderlove> IMO, the behavior should be to always use what the user specifies
Dec 30 15:04:19 <rnewman> when I used the Leopard one, your basic example on your home page did not work.
Dec 30 15:04:41 <tenderlove> rnewman: the Leopard version is ~5 years old
Dec 30 15:04:52 <tenderlove> we're hoping to drop support for it soon
Dec 30 15:04:59 <flavorjones> fingers crossed
Dec 30 15:05:52 <tenderlove> foods. bbiab
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment