@jtacoma
Last active August 29, 2015 14:08
UTF is not a character encoding

Unicode Transformation Format (UTF) is not a single character encoding. It is a family of mutually incompatible character encodings (UTF-8, UTF-16, and UTF-32, the latter two in both little-endian and big-endian byte orders), each capable of expressing the full range of Unicode characters.
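To make the incompatibility concrete, here is a minimal sketch that encodes the same two characters with several members of the family and then decodes UTF-8 bytes as if they were UTF-16LE. The sample text and variable names are chosen purely for illustration.

```python
# Encode the same text with several UTF encodings and compare the bytes.
text = "\u00e9\u20ac"  # U+00E9 (e with acute accent) and U+20AC (euro sign)

utf8 = text.encode("utf-8")
utf16le = text.encode("utf-16-le")
utf16be = text.encode("utf-16-be")
utf32le = text.encode("utf-32-le")

for name, data in [("UTF-8", utf8), ("UTF-16LE", utf16le),
                   ("UTF-16BE", utf16be), ("UTF-32LE", utf32le)]:
    print(name, data.hex())

# Every encoding round-trips when decoded with the matching codec...
assert utf16be.decode("utf-16-be") == text

# ...but bytes written in one encoding are not readable as another.
wrong = utf8.decode("utf-16-le", errors="replace")
assert wrong != text
```

Each encoding produces a different byte sequence for the same characters, which is why a reader has to know which one was used before it can decode anything.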

Microsoft desktop applications that deal with plain text files, e.g. Notepad and Excel, use UTF-16LE under the name "Unicode". Newer versions also offer UTF-16BE under the name "Unicode big endian". An idiosyncrasy of Microsoft applications is that the character encoding of a plain text file is declared by a byte order mark (BOM) at the beginning of the file. This works like magic in many cases, but when the software reading the file does not recognize the BOM, its bytes show up as a few garbled characters at the beginning of the file.
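The following is a rough sketch of how a reader might respect a BOM, and of what the garbling looks like when it does not. The helper name `decode_with_bom` and the fallback to UTF-8 are assumptions made for this example, not a description of how any Microsoft application works. The table also includes the UTF-8 and UTF-32 BOMs, which the text above does not discuss.

```python
import codecs

# BOM signatures, with UTF-32 first: the UTF-32LE BOM begins with the
# same two bytes as the UTF-16LE BOM, so it must be tested earlier.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def decode_with_bom(data, default="utf-8"):
    """Decode bytes using the encoding named by a leading BOM, if any."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding)
    # No BOM found: fall back to an assumed default encoding.
    return data.decode(default)

# The bytes of a UTF-16LE file with a BOM that contains the text "hi".
data = codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")

respected = decode_with_bom(data)  # BOM consumed, text recovered
ignored = data.decode("latin-1")   # BOM treated as ordinary characters

print(repr(respected))
print(repr(ignored))
```

When the BOM is ignored, its two bytes turn into stray characters at the start of the text, and in this UTF-16 case every other byte also decodes to a NUL character.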

While the preferred encoding for web applications these days is UTF-8, not all platforms allow custom content to declare its character encoding. Even Microsoft's own IIS doesn't respect the BOM. Plain text formats that cannot reliably declare their own character encoding the way XML and HTML can, such as JavaScript (and CSS, unless an `@charset` rule is used), should therefore be encoded in ASCII. Both JavaScript and CSS support Unicode escape sequences, so expressivity is not lost.
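As an illustration of the escape-sequence approach, the sketch below rewrites text destined for a JavaScript string literal so that it contains only ASCII. The function name `to_js_ascii` is hypothetical. The sketch assumes the non-ASCII characters appear only inside string literals, where `\uXXXX` escapes are valid. Characters outside the Basic Multilingual Plane are written as UTF-16 surrogate pairs, since that is how JavaScript strings represent them.

```python
def to_js_ascii(text):
    # Replace every non-ASCII character with a JavaScript \uXXXX escape.
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)
        elif cp <= 0xFFFF:
            out.append("\\u%04x" % cp)
        else:
            # Outside the BMP: emit a UTF-16 surrogate pair.
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)

source = 'var s = "caf\u00e9 \U0001F600";'
escaped = to_js_ascii(source)

# The result is pure ASCII, so it survives any ASCII-compatible guess
# about the file's encoding.
escaped.encode("ascii")
print(escaped)
```

Because the output is plain ASCII, it reads the same under UTF-8, Latin-1, or Windows-1252, which is the point of the recommendation above.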
