
@xem
Last active August 29, 2015 14:00
Character encodings on the Web

Introduction

Every component of the Web handles strings, and Unicode character encoding in particular, differently.

This document aims to gather all the important information about character encodings on the Web.

File charsets

  • HTML, CSS and JS files can theoretically be encoded with any charset registered at the IANA. (source).
  • UTF-8 and ISO-8859-1 are generally used because of their wide browser support.
  • It is recommended to use UTF-8 without BOM.
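To illustrate what "without BOM" means at the byte level, here is a minimal sketch. The helper name stripUtf8Bom is hypothetical (it does not come from this document); it only assumes that the UTF-8 byte order mark is the three-byte sequence EF BB BF and that TextDecoder is available as a global, as it is in current browsers and Node.js.

```javascript
// Hypothetical helper (an illustration, not part of the original document):
// drops a leading UTF-8 byte order mark (bytes EF BB BF) if present.
function stripUtf8Bom(bytes) {
  const hasBom =
    bytes.length >= 3 &&
    bytes[0] === 0xef &&
    bytes[1] === 0xbb &&
    bytes[2] === 0xbf;
  // subarray() returns a view on the same buffer, so nothing is copied.
  return hasBom ? bytes.subarray(3) : bytes;
}

// BOM followed by the ASCII bytes for "hi".
const bytesWithBom = new Uint8Array([0xef, 0xbb, 0xbf, 0x68, 0x69]);
const bomFreeText = new TextDecoder("utf-8").decode(stripUtf8Bom(bytesWithBom));

console.log(bomFreeText); // "hi"
```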

File charsets, as seen by the browsers

  • Browsers use complex rules to guess the encoding of Web pages (HTML, XML, ...) unless a charset is declared explicitly. It is recommended to declare it via an HTTP header (Content-Type: text/html; charset=UTF-8) and/or in the file itself (<meta charset="utf-8"> for HTML, <?xml version="1.0" encoding="UTF-8"?> for XML). (source)
  • Browsers parse JavaScript and CSS files using the same encoding as the document that includes them, unless a different charset is declared explicitly. It is recommended to declare it via an HTTP header and/or a charset attribute on link / script tags (<script charset="utf-8" src="...">, <link rel="stylesheet" charset="utf-8" href="...">).
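A small sketch of why the declared charset matters: the same bytes read under the wrong encoding produce different characters. ISO-8859-1 decoding is simulated by hand here (each byte mapped to the code point of the same value), which is an assumption made to keep the example independent of the encodings a given runtime ships with.

```javascript
// Encode U+00E9 (é) as UTF-8: this yields the two bytes 0xC3 0xA9.
const utf8Bytes = new TextEncoder().encode("\u00E9");

// Correct interpretation: decode the bytes as UTF-8.
const readAsUtf8 = new TextDecoder("utf-8").decode(utf8Bytes);

// Wrong interpretation: treat every byte as one ISO-8859-1 character.
// In ISO-8859-1, byte values map directly to code points U+0000..U+00FF.
const readAsLatin1 = Array.from(utf8Bytes, (byte) =>
  String.fromCharCode(byte)
).join("");

console.log(readAsUtf8);   // "é"
console.log(readAsLatin1); // "Ã©" (U+00C3 followed by U+00A9)
```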

Internal charsets

  • JavaScript uses a mix of two encodings (UTF-16 in the JS engine, UCS-2 in the language itself). (source)
  • URLs can only use a subset of the ASCII characters: 0-9a-zA-Z$_.+!*'(),-. (source)
  • HTTP headers can only contain ASCII characters. (source)
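The UTF-16 / UCS-2 split in JavaScript can be observed directly. The sketch below uses U+1D306, a character outside the Basic Multilingual Plane chosen only for illustration: length and charCodeAt() count 16-bit code units, while string iteration and codePointAt() (ES6+) work on whole code points.

```javascript
// U+1D306 written as its UTF-16 surrogate pair.
const astralChar = "\uD834\uDF06";

// The string's length counts 16-bit code units, not characters.
console.log(astralChar.length); // 2

// String iteration (used by Array.from) is code-point aware.
console.log(Array.from(astralChar).length); // 1

// charCodeAt() exposes a single code unit: here, the high surrogate.
console.log(astralChar.charCodeAt(0).toString(16)); // "d834"

// codePointAt() combines the surrogate pair into the full code point.
console.log(astralChar.codePointAt(0).toString(16)); // "1d306"
```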

Unicode support

  • HTML and XML support almost all Unicode characters (source), except some that are illegal: C0 and C1 control characters, DEL character, UTF-16 surrogate halves and BOM-related characters. (source)
  • CSS and JS files can use all Unicode characters.

Displaying Unicode

By default, each browser, on each OS, is able to display some Unicode characters, but not all of them. Custom fonts can be used to fill the gaps. (source)

Escaping

  • In HTML and XML files, all the legal Unicode characters that are not part of the markup may be escaped. Some of them can be written as HTML entities (source) and all of them can use the forms &#DDDD; or &#xXXXX;. The only characters that officially need to be escaped, in theory, are <, > and &, as well as quotes (" and ') inside attributes surrounded by the same quotes. In practice, escaping > is never necessary.
  • In CSS selectors, all the characters may be escaped using the form \xxxx . In practice, only special characters (!"#$%&'()*+,-./:;<=>?@[]^`{|}~) and leading digits / hyphens / underscores actually need to be escaped. (source)
  • In CSS @font-face rules, Unicode ranges are written like this: unicode-range: U+E000-E005. (source)
  • In JS strings, all the characters that are represented on two bytes in UTF-16 may be escaped using the forms \uXXXX, \xXX or \OOO. The four-byte characters can be encoded using a surrogate pair \uXXXX\uXXXX (on ES5-) (source) and can use a new format (on ES6+): \u{XXXXX}. (source)
  • The only escaping required in JavaScript is in querySelector() and querySelectorAll(), like in CSS selectors, except that the backslashes need to be escaped too: "\\xxxx " (source)
  • In URLs, all the illegal characters are converted into byte sequences, and each byte of these sequences is escaped with a percent sign (%XX).
  • In HTTP headers, other encodings can be used as byte sequences, with quoted-printable escaping: =?UTF-8?Q?=XX=XX?= or base64: =?UTF-8?B?.......?= (source)
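A few of the escaping forms listed above can be checked side by side. This is a minimal sketch, not a complete reference: the octal form is left out because it is rejected in strict mode, and the base64 encoded-word is built with Node.js's Buffer, which is an assumption about the runtime (Buffer does not exist in browsers).

```javascript
// JS string escapes: three spellings of the same character, U+00E9 (é).
const sameBmpChar = "\u00E9" === "\xE9" && "\xE9" === "\u{E9}";

// Four-byte character: surrogate pair (ES5) vs. code point escape (ES6+).
const sameAstralChar = "\uD834\uDF06" === "\u{1D306}";

// URL escaping: é is first encoded as UTF-8 (0xC3 0xA9),
// then each byte is written as %XX.
const percentEncoded = encodeURIComponent("\u00E9");

// HTTP header encoded-word, base64 variant: =?charset?B?...?=
const encodedWord =
  "=?UTF-8?B?" + Buffer.from("\u00E9", "utf8").toString("base64") + "?=";

console.log(sameBmpChar);    // true
console.log(sameAstralChar); // true
console.log(percentEncoded); // "%C3%A9"
console.log(encodedWord);    // "=?UTF-8?B?w6k=?="
```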

Escaping with JavaScript

  • For HTML, we can use a shadow Option element: link
  • For URLs/URIs, we can use the functions escape() / encodeURI() / encodeURIComponent(), and reuse the result for other byte-sequence escapings. link
  • For JavaScript (converting surrogate pairs charcodes into Unicode codepoints and vice-versa), here is the formula: link
  • Converting UTF-8 code points in ISO-8859-1 code points and vice-versa, can be done with the following table: link
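Two of the conversions above can be sketched in plain JavaScript. The function names toSurrogates, fromSurrogates and escapeHtml are hypothetical and chosen for illustration. The surrogate arithmetic follows the standard UTF-16 definition. The HTML escaper is a string-based alternative to the shadow Option element technique mentioned above, not that technique itself, and it only covers the characters that need escaping in text and quoted attribute values.

```javascript
// Code point above U+FFFF -> [high surrogate, low surrogate].
function toSurrogates(codePoint) {
  const offset = codePoint - 0x10000;
  const high = 0xd800 + (offset >> 10);   // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits
  return [high, low];
}

// [high surrogate, low surrogate] -> code point.
function fromSurrogates(high, low) {
  return (high - 0xd800) * 0x400 + (low - 0xdc00) + 0x10000;
}

// Escapes &, <, >, " and ' so the text is safe in HTML content
// and in quoted attribute values.
function escapeHtml(text) {
  const replacements = {
    "&": "&amp;",
    "<": "&lt;",
    ">": "&gt;",
    '"': "&quot;",
    "'": "&#39;",
  };
  return text.replace(/[&<>"']/g, (ch) => replacements[ch]);
}

const [highUnit, lowUnit] = toSurrogates(0x1d306);
console.log(highUnit.toString(16), lowUnit.toString(16));    // "d834" "df06"
console.log(fromSurrogates(highUnit, lowUnit).toString(16)); // "1d306"
console.log(escapeHtml('<a href="x">T&C</a>'));
// "&lt;a href=&quot;x&quot;&gt;T&amp;C&lt;/a&gt;"
```

In ES6+ environments, String.fromCodePoint() and codePointAt() perform the same surrogate conversion natively.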

Existing escaping tools in JS

@mathiasbynens

In CSS selectors, […]

Note that CSSOM defines CSS.escape() to deal with this. Here’s a polyfill: http://mths.be/cssescape

In HTML and XML files, all the legal Unicode characters that are not part of the markup may be escaped. Some of them can be written as HTML entities (source) and all of them can use the forms &#DDDD; or &#xXXXX;.

This is not true for HTML. See http://www.whatwg.org/specs/web-apps/current-work/multipage/tokenization.html#table-charref-overrides (or https://github.com/mathiasbynens/he/blob/master/data/decode-map-overrides.json).

The only characters that officially need to be escaped, in theory, are <, > and &, as well as quotes (" and ') inside attributes surrounded by the same quotes. In practice, escaping > is never necessary.

Escaping > is necessary if it’s part of an unquoted attribute value.

URLs can only use a subset of the ASCII characters: 0-9a-zA-Z$_.+!*'(),-. (source)

Not really – see the URL Standard.

In JS strings, all the characters that are represented on two bytes in UTF-16 may be escaped using the forms \uXXXX, \xXX or \OOO.

I think you meant \0xx (zero instead of capital letter O).

Note that there are other escape sequences too. See http://mathiasbynens.be/notes/javascript-escapes.

Existing escaping tools in JS

FWIW, here’s my old “JavaScript escapes” tool: http://mothereff.in/js-escapes

For JavaScript (converting surrogate pairs charcodes into Unicode codepoints and vice-versa), here is the formula: link

Direct link to the formula on that page: http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae

Converting UTF-8 code points in ISO-8859-1 code points and vice-versa, can be done with the following table: link

There is no such thing as a “UTF-8 code point” or an “ISO-8859-1 code point”. Did you mean encoding a string of Unicode symbols in ISO-8859-1 a.k.a. windows-1252? See the Encoding Standard for the index table. (I have a library for it, too: https://github.com/mathiasbynens/windows-1252)

@xem
Author

xem commented May 9, 2014

@mathiasbynens,

Thank you very much for all those comments!

Actually, I'm still a noob in charsets vs the Web and wrote this page while learning things here and there (but mostly from your site). I knew I was making some mistakes in the writing, thanks for finally pointing them out to me! :)

I'll include your remarks in the next revision of this draft (a big rewrite is planned, actually).
