Skip to content

Instantly share code, notes, and snippets.

@joni
Last active June 30, 2023 04:02
Show Gist options
  • Star 46 You must be signed in to star a gist
  • Fork 12 You must be signed in to fork a gist
  • Save joni/3760795 to your computer and use it in GitHub Desktop.
Save joni/3760795 to your computer and use it in GitHub Desktop.
toUTF8Array: Javascript function for encoding a string in UTF8.
function toUTF8Array(str) {
var utf8 = [];
for (var i=0; i < str.length; i++) {
var charcode = str.charCodeAt(i);
if (charcode < 0x80) utf8.push(charcode);
else if (charcode < 0x800) {
utf8.push(0xc0 | (charcode >> 6),
0x80 | (charcode & 0x3f));
}
else if (charcode < 0xd800 || charcode >= 0xe000) {
utf8.push(0xe0 | (charcode >> 12),
0x80 | ((charcode>>6) & 0x3f),
0x80 | (charcode & 0x3f));
}
// surrogate pair
else {
i++;
// UTF-16 encodes 0x10000-0x10FFFF by
// subtracting 0x10000 and splitting the
// 20 bits of 0x0-0xFFFFF into two halves
charcode = 0x10000 + (((charcode & 0x3ff)<<10)
| (str.charCodeAt(i) & 0x3ff))
utf8.push(0xf0 | (charcode >>18),
0x80 | ((charcode>>12) & 0x3f),
0x80 | ((charcode>>6) & 0x3f),
0x80 | (charcode & 0x3f));
}
}
return utf8;
}
@andywangevertz
Copy link

toUTF8Array() contains an error which results in incorrect utf-8 encoding for surrogate pairs branch.
In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) its necessary to add 0x010000
to make it work properly.
The error can be easily reproduced using glyph '🜄' which stands for 'Alchemical Symbol For Water'.
This should be rendered to: [240, 159, 156, 132] or hex 0xF0 0x9F 0x9C 0x84 but ends up to be [240, 143, 144, 128].
Correcting the according line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes that error.
Btw. function fromUTF8Array() does not suffer from this issue.

The latest version (from Aug 6 2013) no longer has this bug. Did you perchance refer to a previous version of this gist?

The latest version on the top render toUTF8Array('🜄') as (4) [240, 143, 156, 132]
Where as mos0711 pointed out, it should be [240, 159, 156, 132] with his fix.

@MariuszGajewski
Copy link

Hi, I really would like to use this useful snippet in my code but I wondering about a legal stuff like a copyright, licensing etc. I can't see any copyright notice in this snippet nor on @joni github profile. In this case, according to GitHub policy I can't use this code :(
https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository

image

Any thoughts how can I deal with such legal problems? I wondering if maybe @joni you could add some commonly used open source license to your snippet? Or any other ideas?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment