Text encodings (Ruby-oriented)

@andrewblim, created January 26, 2015 20:14
Introduction

Text is a bunch of bytes, just like anything else. An encoding is a standard that prescribes how to turn those bytes into human-readable characters. There are many standards, but UTF-8 is now clearly dominant on the web. However, other encodings are common enough that we have to watch out for them. If you interpret text encoded one way using another encoding, you could get:

  1. the correct text, if the read encoding is backwards compatible with the right encoding
  2. invalid strings, if the text specifies sequences of bytes that do not correspond to any characters in the read encoding
  3. valid but incorrect text, if all of the byte sequences in the text correspond to characters in the read encoding
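Each of the three outcomes can be sketched in Ruby (using force_encoding, covered further below; the particular strings here are arbitrary examples):

```ruby
# 1. Correct text: pure ASCII bytes read as UTF-8 are unchanged,
#    because UTF-8 is backwards compatible with ASCII.
"cafe".dup.force_encoding("UTF-8").valid_encoding?   # => true

# 2. Invalid string: ISO-8859-1 bytes for "café" (E9 for 'é') read as UTF-8.
#    The lone E9 byte does not start any valid UTF-8 sequence.
"caf\xE9".dup.force_encoding("UTF-8").valid_encoding?  # => false

# 3. Valid but wrong: UTF-8 bytes for "café" (C3 A9 for 'é') read as
#    ISO-8859-1, where every byte maps to *some* character.
"café".dup.force_encoding("ISO-8859-1").encode("UTF-8")  # => "cafÃ©"
```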

It is common to write characters' byte representations in hexadecimal; for example, 5E corresponds to 0101 1110.
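A quick Ruby check of that correspondence (5E is the character '^'):

```ruby
# The byte 5E in hex, 0101 1110 in binary, is the character '^'.
"^".ord.to_s(16)               # => "5e"
"^".ord.to_s(2).rjust(8, "0")  # => "01011110"
0x5E.chr                       # => "^"
```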

ASCII

The old standard encoding in the US is ASCII, in which 1 character is 1 byte. ASCII uses only 7 bits, covering the range 00-7F. It encompasses a very limited set of characters, and in fact when people say "ASCII" these days they probably mean...

"Extended ASCII" is an umbrella term to describe a few different schemes that are backwards compatible with ASCII. The most common extended ASCII scheme is ISO-8859-1, which uses 1 byte. 00-7F are the regular ASCII characters, 80-9F are not used, and A0-FF are other characters. Notably, this includes Latin alphabet characters with common accents used in Western European languages. ISO-8859-1 is still reasonably common.
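As a sketch in Ruby (the methods used here are covered in more detail below), the single ISO-8859-1 byte E9 is 'é' and converts cleanly to UTF-8:

```ruby
# Build the one-byte ISO-8859-1 representation of 'é' and convert it.
e_acute = [0xE9].pack("C").force_encoding("ISO-8859-1")
e_acute.valid_encoding?  # => true
e_acute.encode("UTF-8")  # => "é"
```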

There are lots of other encodings, both historical and still in use, which I won't get into with the exception of Unicode. Here is a list of common encodings used on the web.

Unicode and UTF-8

Unicode is not an encoding per se, but a standard that maps numbers, or "code points," to characters. Unicode handles a very large number of alphabets, symbols, shapes, and other things you didn't realize you'd ever want. While you could design a Unicode encoding that naively maps code points directly to bytes, with zero padding to fill out the bytes, this is not done for reasons discussed below. It is common to write out a Unicode code point with U+[at least four digits], for example U+005E in the example above.

UTF-8 is a popular encoding that implements the Unicode standard by representing characters with a variable number of bytes. It's easiest to just read the Wikipedia article, which both clearly explains the standard and points out the important features behind the at-first funny-looking encoding rules.
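One consequence of those rules is that different characters occupy different numbers of bytes. A small Ruby illustration (the specific characters are arbitrary examples):

```ruby
# UTF-8 uses 1 to 4 bytes per character depending on the code point:
"e".bytesize          # => 1  (U+0065, ASCII range)
"é".bytesize          # => 2  (U+00E9)
"€".bytesize          # => 3  (U+20AC)
"\u{1F600}".bytesize  # => 4  (U+1F600, outside the Basic Multilingual Plane)
```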

With a variable-byte encoding like UTF-8, if you are reading a fixed number of bytes at a time, you may end up with part of the encoding for a character and the result may not be valid even if the full text was.
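For instance, slicing a UTF-8 string at a byte offset that falls inside a character leaves an invalid fragment; a quick Ruby sketch:

```ruby
s = "café"               # 5 bytes: the trailing 'é' is the pair C3 A9
head = s.byteslice(0, 4) # cuts between C3 and A9, mid-character
head.valid_encoding?     # => false
s.byteslice(0, 5).valid_encoding?  # => true, whole characters only
```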

UTF-8 is backwards compatible with basic ASCII but not backwards compatible with ISO-8859-1. For example:

  • 'e' is 65 (0110 0101) in ASCII, ISO-8859-1, and UTF-8
  • 'é' (lower case e with an acute accent) is not in ASCII, is E9 (1110 1001) in ISO-8859-1, and is U+00E9 in Unicode => 11000011 10101001 in UTF-8.

Note that even though the numerical value of the byte for 'é' in ISO-8859-1 is the same as the Unicode code point, the encodings are not compatible because of the way UTF-8 encodes code points to bytes.
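A Ruby sketch comparing the byte sequences directly:

```ruby
# Same character, same code point number, different bytes per encoding:
"é".bytes.map { |b| b.to_s(16) }                      # => ["c3", "a9"] (UTF-8)
"é".encode("ISO-8859-1").bytes.map { |b| b.to_s(16) } # => ["e9"]
"é".ord.to_s(16)                                      # => "e9" (code point U+00E9)
```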

Encodings in Ruby

This next part assumes you have at least Ruby 2.0.0. Some of this behaved differently in 1.9, and before 1.9 Ruby had only rudimentary support for encodings.

Strings in Ruby have encodings, which you can retrieve with (surprise) the encoding method. By default string literals are UTF-8, which has been the default source encoding since Ruby 2.0; you can override it per file with a magic comment such as # encoding: iso-8859-1 at the top.

[18] pry(main)> "foo".encoding
=> #<Encoding:UTF-8>
[19] pry(main)> "foo".encoding == Encoding::UTF_8
=> true

Raw bytes can be specified with '\x' followed by two hex digits, and Unicode code points with '\u'.

[23] pry(main)> "\xC3\xA9"
=> "é"
[24] pry(main)> "\u00E9"
=> "é"
[25] pry(main)> "\xC3\xA9" == "\u00E9"
=> true

You can convert strings in one encoding to another using the encode method of String. This converts the "meaning" of the byte sequences across encodings, for example the following will convert a string "é" in UTF-8 to a string "é" in ISO-8859-1.

[26] pry(main)> "\u00E9".encode("ISO-8859-1")
=> "\xE9"
[27] pry(main)> "\u00E9".encode("ISO-8859-1").encoding
=> #<Encoding:ISO-8859-1>

You get an exception if you try to encode a character that is not represented in the target encoding, unless you use some of the extra arguments to encode:

[32] pry(main)> "\u2622".encode("ISO-8859-1")  # radioactive sign
Encoding::UndefinedConversionError: U+2622 from UTF-8 to ISO-8859-1
from (pry):20:in `encode'
[33] pry(main)> "\u2622".encode("ISO-8859-1", undef: :replace, replace: "FOO")
=> "FOO"
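Two more examples of the replacement options; if you pass undef: :replace without an explicit replace:, the default replacement for a non-Unicode target encoding is "?":

```ruby
"\u2622".encode("ISO-8859-1", undef: :replace)               # => "?"
"\u2622".encode("ISO-8859-1", undef: :replace, replace: "")  # drops the character
```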

You can force a string to use a certain encoding with force_encoding, which doesn't attempt to make any changes to the bytes; it simply lays a new encoding interpretation on top of them:

[41] pry(main)> "\u2622".force_encoding("ISO-8859-1")
=> "\xE2\x98\xA2"
[42] pry(main)> "\u2622".force_encoding("ISO-8859-1").encode("UTF-8")  # valid but garbage
=> "â\u0098¢"

You can check if a string has validly encoded text according to its associated encoding:

[43] pry(main)> "\u2622".force_encoding("ISO-8859-1").encode("UTF-8").valid_encoding?  # valid but garbage
=> true
[44] pry(main)> "\xFF".valid_encoding?   # not valid UTF-8
=> false

To properly understand Unicode and encodings, it is helpful to carry out some of these conversions by hand on paper and verify that your results tie out with Ruby's.
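As a sketch of such an exercise, here is the two-byte UTF-8 rule (for code points U+0080 through U+07FF) done by hand and checked against Ruby's encoder; utf8_two_bytes is a made-up helper name:

```ruby
# Two-byte UTF-8: 11 bits of code point packed as 110xxxxx 10xxxxxx.
def utf8_two_bytes(codepoint)
  raise ArgumentError unless (0x80..0x7FF).cover?(codepoint)
  byte1 = 0b1100_0000 | (codepoint >> 6)    # 110xxxxx: top 5 bits
  byte2 = 0b1000_0000 | (codepoint & 0x3F)  # 10xxxxxx: bottom 6 bits
  [byte1, byte2]
end

utf8_two_bytes(0xE9).map { |b| b.to_s(16) }  # => ["c3", "a9"]
utf8_two_bytes(0xE9) == "é".bytes            # => true
```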

ASCII-8BIT

One gotcha that I came across is the ASCII-8BIT encoding. According to the Ruby docs:

Encoding::ASCII_8BIT is a special encoding that is usually used for a byte string, not a character string. But as the name insists, its characters in the range of ASCII are considered as ASCII characters. This is useful when you use ASCII-8BIT characters with other ASCII compatible characters.

You may get these if you do something like a read call from an IO object. If you are reading a fixed number of bytes, it is possible that what you get may not be valid text in the encoding, because you may have read only up to the middle of a character encoding, or you may have started in the middle of a character encoding. For example, note the encoding differences in what is returned by read and gets below:

[48] pry(main)> require 'stringio'
=> true
[49] pry(main)> StringIO.new("abcdefg").read(4)
=> "abcd"
[50] pry(main)> StringIO.new("abcdefg").read(4).encoding
=> #<Encoding:ASCII-8BIT>
[51] pry(main)> StringIO.new("abcdefg").gets("d")
=> "abcd"
[52] pry(main)> StringIO.new("abcdefg").gets("d").encoding
=> #<Encoding:UTF-8>
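A sketch of one way to cope with this: accumulate fixed-size chunks into a binary buffer and only interpret the bytes as UTF-8 once the full read is assembled (the 3-byte chunk size here is an arbitrary choice that deliberately splits characters):

```ruby
require 'stringio'

io = StringIO.new("café et thé")  # UTF-8 text; each 'é' is two bytes
buffer = "".b                     # ASCII-8BIT (binary) accumulator

# Individual 3-byte chunks may end mid-character, but the
# reassembled whole is valid.
while (chunk = io.read(3))
  buffer << chunk
end

text = buffer.force_encoding("UTF-8")
text.valid_encoding?  # => true
text                  # => "café et thé"
```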