@harrisj
Created October 15, 2013 19:23
A bit of text about encodings from a piece I'll never run, but I felt like sharing this bit

Encodings are the bane of any developer who works with the web. The problem is that the world is filled with many different languages, and the ways of representing them have evolved along with computers themselves. In the beginning, each character was represented as a single byte. That was enough for many Western languages, and the ISO standards defined a family of single-byte character sets to cover them. For instance, most Western European languages can be represented with the ISO-8859-1 (aka ISO-Latin-1) character set, ISO-8859-2 has the characters used by Central European languages, and so on. Asian languages have many more characters and thus used more complicated encoding schemes like Shift JIS. What this means is that each page on the World Wide Web needs to specify its encoding to ensure it's rendered correctly in the browser (although the browser can sometimes guess).
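To make that concrete, here's a minimal Ruby sketch (mine, not the original piece's) of the same raw byte decoded under two different ISO-8859 character sets:

    # The same byte means different things depending on which single-byte
    # character set you read it with.
    byte = "\xB1".b                                  # one raw byte, 0xB1

    latin1 = byte.dup.force_encoding("ISO-8859-1")   # Western European
    latin2 = byte.dup.force_encoding("ISO-8859-2")   # Central European

    puts latin1.encode("UTF-8")   # => "±" (plus-minus sign)
    puts latin2.encode("UTF-8")   # => "ą" (a with ogonek)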

As computers became more powerful and computer memory/storage became markedly cheaper, the Unicode standard was proposed to represent every language in a single standardized encoding. The key was to just use more bytes per character, 2 of them to be exact (it was later expanded to 4 to incorporate ancient languages). Supersizing string characters really freaked some developers out, especially since, for Western languages, half of those bytes would just be 0s (and the C language uses null characters to terminate strings). And the Unicode standard was proposed in the early 1990s, when even its 2-byte version seemed too much for the hardware and developers of the time. So the UTF-8 standard was proposed: a variable-length scheme that uses 1 to 4 bytes to represent a character depending on where it lies in the Unicode specification. The best part of this approach for American and English developers was that UTF-8 is byte-for-byte identical to 7-bit ASCII, diverging only when you get to accented characters (which are 1 byte in ISO-8859-1 and 2 bytes in UTF-8).
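A quick Ruby illustration of that variable-length behavior (again just a sketch I'm adding, not part of the original piece):

    # Plain ASCII characters take one byte either way; accented and other
    # non-ASCII characters grow under UTF-8 (this file is assumed to be UTF-8).
    "e".bytes                        # => [101]            same byte as ASCII
    "é".bytes                        # => [195, 169]       two bytes in UTF-8
    "é".encode("ISO-8859-1").bytes   # => [233]            one byte in Latin-1
    "€".bytes                        # => [226, 130, 172]  three bytes in UTF-8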

The problem is we can no longer pretend we live in a 7-bit ASCII world. Even if you only work with data and pages in American English, you will almost certainly encounter accented characters in names and places. And that's where it all goes wrong. If your code accidentally interprets a UTF-8 string as ISO-8859-1, each accented character is displayed as two junk characters ("é" shows up as "Ã©", for instance). If your code accidentally reads an ISO-8859-1 string with accented characters as UTF-8, it may crash or silently drop the offending characters. That's annoying enough as it is, but then Microsoft made things worse. Back in the early days of computing, Microsoft forked the ISO-8859-1 standard into a new character set called Windows-1252 to add extra display elements like curly quotes and em dashes. It did this by reusing a range of bytes (0x80 to 0x9F) that ISO-8859-1 reserves for non-printing control codes. The problem is that many sites identify Windows-1252 pages as vanilla ISO-8859-1, and some converters will blow up when they encounter those extra characters. Argh.
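Both failure modes are easy to reproduce. A short Ruby sketch (assuming a UTF-8 source file):

    utf8 = "café"                        # "é" is the two bytes 0xC3 0xA9 in UTF-8

    # UTF-8 bytes wrongly read as ISO-8859-1: each byte becomes its own junk character.
    mangled = utf8.dup.force_encoding("ISO-8859-1")
    puts mangled.encode("UTF-8")         # => "cafÃ©"

    # ISO-8859-1 bytes wrongly read as UTF-8: 0xE9 on its own is not a valid sequence.
    latin1 = utf8.encode("ISO-8859-1")
    broken = latin1.dup.force_encoding("UTF-8")
    broken.valid_encoding?               # => false
    begin
      broken.encode("UTF-16")            # any real conversion trips over the bad byte
    rescue Encoding::InvalidByteSequenceError => e
      puts "conversion failed: #{e.message}"
    end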

So, that's where we stand today. What does it mean for your code? You should standardize on a single encoding: convert all your scraped documents to UTF-8 before you work with them, and make sure your code and wherever you store those documents handle UTF-8 correctly.
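Here's a minimal sketch of that advice in Ruby, assuming you already know (or have guessed) each document's source encoding; the helper name and file name are hypothetical:

    # Convert raw scraped bytes to UTF-8, given a known (or guessed) source
    # encoding. The :replace options keep a single stray bad byte from
    # crashing the whole conversion.
    def to_utf8(raw_bytes, source_encoding)
      raw_bytes.encode("UTF-8", source_encoding,
                       invalid: :replace,   # bytes not valid in the source encoding
                       undef:   :replace,   # characters with no mapping in the target
                       replace: "?")
    end

    raw  = File.binread("scraped_page.html")   # hypothetical file name
    utf8 = to_utf8(raw, "Windows-1252")
    utf8.encoding                              # => #<Encoding:UTF-8>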

Converting a document means figuring out what the original encoding is. Since UTF-8 is the default encoding for XML, those documents are usually easy to work with. Microsoft documents will likely be in Windows-1252. HTML is a lot more complicated. Although an HTTP response can declare a document's encoding in its Content-Type header, the actual process of determining an HTML document's encoding is more convoluted (a rough sketch in Ruby follows the list):

  • First, look at the Content-Type header to see if the encoding is explicitly specified there.
  • The encoding might also be specified within the document itself, using the <meta http-equiv="content-type"> or <meta charset> tags.
  • The presence of a byte order mark indicates both that a document is encoded in UTF-16 and which byte order to use. You're not likely to encounter one of those documents, but your code will choke on the invalid bytes if you try to parse it as UTF-8, and produce garbage if you treat it as ISO-8859-1.
  • If you still can't figure out the encoding, you'll have to guess whether it's ISO-8859-1 or UTF-8. There is probably a code library you can use (here's one for fixing encodings in Ruby, for instance). You might even decide to guess when the document does explicitly specify an encoding, because I have seen many cases where sites are blatantly mistaken about their own encodings.
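Here is a rough, standard-library-only sketch of that whole cascade in Ruby. It's illustrative rather than production code: the regexes are crude stand-ins for a real HTML parser, and the helper name and the final fallback rule (valid UTF-8 wins, otherwise assume Windows-1252) are my own assumptions, not something prescribed above.

    # sniff_encoding is a hypothetical helper, not a real library function.
    def sniff_encoding(content_type_header, body)
      bytes = body.b   # work on raw bytes so invalid sequences can't derail the regexes

      # 1. Content-Type header, e.g. "text/html; charset=utf-8"
      return $1 if content_type_header.to_s =~ /charset=([\w-]+)/i

      # 2. <meta http-equiv="content-type" ...> or <meta charset="..."> in the markup
      return $1 if bytes =~ /<meta[^>]+charset=["']?([\w-]+)/i

      # 3. Byte order marks
      return "UTF-8"    if bytes.start_with?("\xEF\xBB\xBF".b)
      return "UTF-16BE" if bytes.start_with?("\xFE\xFF".b)
      return "UTF-16LE" if bytes.start_with?("\xFF\xFE".b)

      # 4. Guess: if the bytes happen to be valid UTF-8, call it UTF-8;
      #    otherwise fall back to Windows-1252 (covers plain ISO-8859-1 too).
      bytes.dup.force_encoding("UTF-8").valid_encoding? ? "UTF-8" : "Windows-1252"
    end

In practice you'd probably reach for a library (like the Ruby one mentioned above) rather than hand-rolling this, but the steps it takes are the same.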