@BenjaminVerble
Last active January 14, 2018 22:55
why string.length is not always reliable
// BMP: Basic Multilingual Plane (U+0000 to U+FFFF)
// UTF-16 (16-bit Unicode Transformation Format) is an extension of UCS-2 that allows representing code points outside the BMP.
// It produces a variable-length result of either one or two 16-bit code units per code point.
// This way, it can encode code points in the range from 0 to 0x10FFFF. (source: https://mathiasbynens.be/notes/javascript-encoding)
// "Unicode code points 2^16 and above are represented in JavaScript by two code units, known as a surrogate pair." Effective JS, Herman (29)
import stringToCodePointArray from 'string-to-code-point-array'
const outsideBMP = '𝌆'
const insideBMP = 'a'
console.log('string length outside BMP')
console.log(outsideBMP.length) // 2
console.log('string length inside BMP')
console.log(insideBMP.length) // 1
console.log('code point array length outside BMP')
console.log(stringToCodePointArray(outsideBMP).length) // 1
console.log('code point array length inside BMP')
console.log(stringToCodePointArray(insideBMP).length) // 1
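// The same comparison can likely be made without a dependency. This is a minimal sketch,
// assuming an ES2015+ runtime: the built-in string iterator walks code points rather than
// 16-bit code units, so spreading a string into an array counts code points.
// countCodePoints and toSurrogatePair are hypothetical helper names introduced here for
// illustration; they are not part of the original gist or of string-to-code-point-array.

```javascript
// Count code points: the string iterator (ES2015) yields code points, not code units.
const countCodePoints = (str) => [...str].length

// Split a code point above U+FFFF into its UTF-16 surrogate pair.
const toSurrogatePair = (codePoint) => {
  const offset = codePoint - 0x10000
  const high = 0xD800 + (offset >> 10) // top 10 bits of the offset
  const low = 0xDC00 + (offset & 0x3FF) // bottom 10 bits of the offset
  return [high, low]
}

const tetragram = '\u{1D306}' // the same character as outsideBMP above
console.log(tetragram.length) // 2
console.log(countCodePoints(tetragram)) // 1
console.log(tetragram.codePointAt(0).toString(16)) // 1d306
console.log(toSurrogatePair(0x1D306).map((unit) => unit.toString(16))) // [ 'd834', 'df06' ]
console.log(tetragram.charCodeAt(0) === 0xD834 && tetragram.charCodeAt(1) === 0xDF06) // true
```

// The two code units reported by .length are exactly the surrogate pair computed by
// toSurrogatePair, which is why .length returns 2 for a single code point outside the BMP.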