@mathiasbynens
Last active October 27, 2016 14:46
Let’s create a JavaScript-compatible regular expression that matches any URL code point, as per the URL Standard.
// “The URL code points are ASCII alphanumeric, "!", "$", "&", "'", "(", ")",
// "*", "+", ",", "-", ".", "/", ":", ";", "=", "?", "@", "_", "~", and code
// points in the ranges U+00A0 to U+D7FF, U+E000 to U+FDCF, U+FDF0 to U+FFEF,
// U+10000 to U+1FFFD, U+20000 to U+2FFFD, U+30000 to U+3FFFD, U+40000 to
// U+4FFFD, U+50000 to U+5FFFD, U+60000 to U+6FFFD, U+70000 to U+7FFFD, U+80000
// to U+8FFFD, U+90000 to U+9FFFD, U+A0000 to U+AFFFD, U+B0000 to U+BFFFD,
// U+C0000 to U+CFFFD, U+D0000 to U+DFFFD, U+E1000 to U+EFFFD, U+F0000 to
// U+FFFFD, U+100000 to U+10FFFD.”
// — http://url.spec.whatwg.org/#url-code-points
// Let’s create a JavaScript-compatible regular expression that matches any URL
// code point, as per the above definition.
var regenerate = require('regenerate'); // http://mths.be/regenerate
var set = regenerate()
  .addRange(0x0030, 0x0039) // ASCII digits
  .addRange(0x0041, 0x005A).addRange(0x0061, 0x007A) // ASCII alpha
  .add(
    '!', '$', '&', '\'', '(', ')', '*', '+', ',', '-', '.', '/', ':', ';',
    '=', '?', '@', '_', '~'
  )
  .addRange(0x00A0, 0xD7FF)
  .addRange(0xE000, 0xFDCF)
  .addRange(0xFDF0, 0xFFEF)
  .addRange(0x10000, 0x1FFFD)
  .addRange(0x20000, 0x2FFFD)
  .addRange(0x30000, 0x3FFFD)
  .addRange(0x40000, 0x4FFFD)
  .addRange(0x50000, 0x5FFFD)
  .addRange(0x60000, 0x6FFFD)
  .addRange(0x70000, 0x7FFFD)
  .addRange(0x80000, 0x8FFFD)
  .addRange(0x90000, 0x9FFFD)
  .addRange(0xA0000, 0xAFFFD)
  .addRange(0xB0000, 0xBFFFD)
  .addRange(0xC0000, 0xCFFFD)
  .addRange(0xD0000, 0xDFFFD)
  .addRange(0xE1000, 0xEFFFD)
  .addRange(0xF0000, 0xFFFFD)
  .addRange(0x100000, 0x10FFFD);
console.log(set.toString());
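
The ranges from U+10000 upward lie outside the Basic Multilingual Plane. JavaScript strings are sequences of UTF-16 code units, and a regular expression without the `u` flag matches one code unit at a time, so each of these code points has to be expressed as a high surrogate followed by a low surrogate. The sketch below shows that arithmetic for a few range endpoints. `toSurrogatePair` and `formatPair` are hypothetical helpers written for illustration; they are not part of regenerate's API.

```javascript
// Hypothetical helper: split an astral code point (U+10000..U+10FFFF)
// into its UTF-16 high and low surrogate code units.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;
  var high = 0xD800 + (offset >> 10);   // top 10 bits of the offset
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits of the offset
  return [high, low];
}

// Hypothetical helper: render a pair as \uXXXX\uXXXX escape text.
function formatPair(codePoint) {
  return toSurrogatePair(codePoint).map(function(unit) {
    return '\\u' + unit.toString(16).toUpperCase();
  }).join('');
}

console.log(formatPair(0x10000)); // \uD800\uDC00
console.log(formatPair(0x1FFFD)); // \uD83F\uDFFD
console.log(formatPair(0xE1000)); // \uDB44\uDC00
```

U+1FFFD maps to a low surrogate of U+DFFD rather than U+DFFF, which suggests why the final code points of each plane need separate handling from the rest of the plane.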
@mathiasbynens
Author

The set.toString() call at the end returns the following string, shown here as a JavaScript string literal (so every backslash is escaped):

'[\\x21\\x24\\x26-\\x3B\\x3D\\x3F-Z\\x5Fa-z\\x7E\\xA0-\\uD7FF\\uE000-\\uFDCF\\uFDF0-\\uFFEF]|[\\uD800-\\uD83E\\uD840-\\uD87E\\uD880-\\uD8BE\\uD8C0-\\uD8FE\\uD900-\\uD93E\\uD940-\\uD97E\\uD980-\\uD9BE\\uD9C0-\\uD9FE\\uDA00-\\uDA3E\\uDA40-\\uDA7E\\uDA80-\\uDABE\\uDAC0-\\uDAFE\\uDB00-\\uDB3E\\uDB44-\\uDB7E\\uDB80-\\uDBBE\\uDBC0-\\uDBFE][\\uDC00-\\uDFFF]|[\\uD83F\\uD87F\\uD8BF\\uD8FF\\uD93F\\uD97F\\uD9BF\\uD9FF\\uDA3F\\uDA7F\\uDABF\\uDAFF\\uDB3F\\uDB7F\\uDBBF\\uDBFF][\\uDC00-\\uDFFD]'

Logging it prints the pattern itself, with single backslashes:

[\x21\x24\x26-\x3B\x3D\x3F-Z\x5Fa-z\x7E\xA0-\uD7FF\uE000-\uFDCF\uFDF0-\uFFEF]|[\uD800-\uD83E\uD840-\uD87E\uD880-\uD8BE\uD8C0-\uD8FE\uD900-\uD93E\uD940-\uD97E\uD980-\uD9BE\uD9C0-\uD9FE\uDA00-\uDA3E\uDA40-\uDA7E\uDA80-\uDABE\uDAC0-\uDAFE\uDB00-\uDB3E\uDB44-\uDB7E\uDB80-\uDBBE\uDBC0-\uDBFE][\uDC00-\uDFFF]|[\uD83F\uD87F\uD8BF\uD8FF\uD93F\uD97F\uD9BF\uD9FF\uDA3F\uDA7F\uDABF\uDAFF\uDB3F\uDB7F\uDBBF\uDBFF][\uDC00-\uDFFD]

The logged pattern can be pasted as-is into a regular expression literal in JavaScript, or the string can be passed to the RegExp constructor.
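
As a rough illustration of using the generated pattern, the sketch below wraps it in an anchored, repeated group to check whether a string consists only of URL code points. The pattern text is copied from the output above and split into its three alternatives for readability. `isUrlCodePointString` is a hypothetical name chosen for this example.

```javascript
// Alternative 1: the allowed ASCII symbols plus the BMP ranges.
var bmp = '[\\x21\\x24\\x26-\\x3B\\x3D\\x3F-Z\\x5Fa-z\\x7E\\xA0-\\uD7FF' +
  '\\uE000-\\uFDCF\\uFDF0-\\uFFEF]';

// Alternative 2: high surrogates whose low surrogate may be anything.
var astral = '[\\uD800-\\uD83E\\uD840-\\uD87E\\uD880-\\uD8BE\\uD8C0-\\uD8FE' +
  '\\uD900-\\uD93E\\uD940-\\uD97E\\uD980-\\uD9BE\\uD9C0-\\uD9FE' +
  '\\uDA00-\\uDA3E\\uDA40-\\uDA7E\\uDA80-\\uDABE\\uDAC0-\\uDAFE' +
  '\\uDB00-\\uDB3E\\uDB44-\\uDB7E\\uDB80-\\uDBBE\\uDBC0-\\uDBFE]' +
  '[\\uDC00-\\uDFFF]';

// Alternative 3: the last high surrogate of each plane, where the
// low surrogate stops at U+DFFD.
var planeEnd = '[\\uD83F\\uD87F\\uD8BF\\uD8FF\\uD93F\\uD97F\\uD9BF\\uD9FF' +
  '\\uDA3F\\uDA7F\\uDABF\\uDAFF\\uDB3F\\uDB7F\\uDBBF\\uDBFF]' +
  '[\\uDC00-\\uDFFD]';

var urlCodePoint = bmp + '|' + astral + '|' + planeEnd;
var onlyUrlCodePoints = new RegExp('^(?:' + urlCodePoint + ')*$');

// Hypothetical helper: true if every symbol in `string` is a URL code point.
function isUrlCodePointString(string) {
  return onlyUrlCodePoints.test(string);
}

console.log(isUrlCodePointString('http://example.com/?q=1')); // true
console.log(isUrlCodePointString('a#b'));                     // false ('#' is not in the set)
console.log(isUrlCodePointString('\uD83D\uDCA9'));            // true (U+1F4A9)
console.log(isUrlCodePointString('\uD83F\uDFFE'));            // false (U+1FFFE)
```

The alternatives are kept as separate character classes joined with `|` because, without the `u` flag, a character class matches a single UTF-16 code unit, while each astral symbol occupies two.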

@rhgb

rhgb commented Aug 8, 2014

Thanks, this script helps me a lot.
There's something in the generated string I don't understand: is the alternation pattern, [.....]|[.....], necessary? Can I simply replace it with a single [..........]?
