@joni
Last active June 30, 2023 04:02
toUTF8Array: Javascript function for encoding a string in UTF8.
function toUTF8Array(str) {
    var utf8 = [];
    for (var i = 0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        // 1 byte: code units below 0x80 (ASCII)
        if (charcode < 0x80) utf8.push(charcode);
        // 2 bytes: 0x80-0x7FF
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        // 3 bytes: 0x800-0xFFFF, excluding the surrogate range 0xD800-0xDFFF
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        // surrogate pair
        else {
            // advance to the second code unit of the pair
            i++;
            // UTF-16 encodes 0x10000-0x10FFFF by
            // subtracting 0x10000 and splitting the
            // 20 bits of 0x0-0xFFFFF into two halves
            charcode = 0x10000 + (((charcode & 0x3ff) << 10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >> 18),
                      0x80 | ((charcode >> 12) & 0x3f),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}
@saumyamehta17

@frozn How do I use fromUTF8Array for an array of bytes like [0, 1, 0, -1]?

@bkdotcom

There's "something" wrong with the surrogate pair portion of your code.

I think

    charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff))

should be

    charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i+1) & 0x3ff))

All I know is that you never add in the following pair and increment the index accordingly

@pgqueme

pgqueme commented Dec 11, 2017

This worked great when creating a .doc file from a Blob. For some reason, my old method using charCodeAt(i) stopped working and showed some weird characters when opening the file. This is how my method works:

var htmlString = '<div>Your html á é í ó ú</div>';
var arrayUTF8 = toUTF8Array(htmlString); //Your function
var byteNumbers = new Uint8Array(arrayUTF8.length);
for (var i = 0; i < arrayUTF8.length; i++) {
    byteNumbers[i] = arrayUTF8[i]; 
}
var blob = new Blob([byteNumbers], {type: 'text/html;charset=UTF-8;' });
FileSaver.saveAs(blob, 'yourfile.doc');
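
The copy loop above can probably be shortened, since a typed array can be built directly from a plain array of byte values. A minimal sketch, assuming a runtime that provides the standard `TextEncoder`; the sample bytes stand in for what `toUTF8Array` returns for 'á', and `TextEncoder` is a swapped-in built-in alternative, not part of the gist:

```javascript
// Stand-in for toUTF8Array('á'): U+00E1 encodes to the two bytes 0xC3 0xA1.
const arrayUTF8 = [0xc3, 0xa1];

// Uint8Array.from copies the values, replacing the manual for loop.
const byteNumbers = Uint8Array.from(arrayUTF8);

// Alternative: the built-in encoder returns a Uint8Array directly.
const encoded = new TextEncoder().encode('\u00e1');

// Both arrays hold the bytes 195, 161.
console.log(Array.from(byteNumbers));
console.log(Array.from(encoded));
```

Either `byteNumbers` or `encoded` can then be passed to the `Blob` constructor as in the comment above.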

@kachkaev

kachkaev commented Nov 29, 2018

@bkdotcom looks like there should not be that extra i+1 in str.charCodeAt, because there is i++; right after else. WDYT?

What I don't quite understand is

} else if (charcode < 0x800) {
// ...
} else if (charcode < 0xd800 || charcode >= 0xe000) {
//                  ^ never true given previous if

**UPD:** Here is a similar function inside the Google Closure Library: [stringToUtf8ByteArray()](https://github.com/google/closure-library/blob/8598d87242af59aac233270742c8984e2b2bdbe0/closure/goog/crypt/crypt.js#L117-L143). The fact that strings are UTF-16 in JavaScript's memory has been an eye-opener for me!
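
Both points in this comment can be checked with a few lines. The sketch below (variable and helper names are illustrative, not from the gist) shows that after `i++` the index already points at the low surrogate, and that the third condition is reached and true for code units from 0x800 to 0xD7FF as well as from 0xE000 upward:

```javascript
// '🜄' (U+1F704) is stored as two UTF-16 code units: a surrogate pair.
const str = '\ud83d\udf04';
let i = 0;
const high = str.charCodeAt(i); // 0xD83D, the high surrogate
i++;                            // same step as in the gist's else branch
const low = str.charCodeAt(i);  // 0xDF04, the low surrogate; no extra +1 needed

// Same test as the gist's third branch, which only runs once charcode >= 0x800.
const isThreeByte = (c) => c >= 0x800 && (c < 0xd800 || c >= 0xe000);

console.log(high.toString(16), low.toString(16)); // d83d df04
console.log(isThreeByte(0x20ac)); // true: '€' lies below 0xD800
console.log(isThreeByte(high));   // false: surrogates fall through to the else branch
```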

@mos0711

mos0711 commented May 7, 2019

toUTF8Array() contains an error that results in incorrect UTF-8 encoding in the surrogate pair branch.

In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) it's necessary to add 0x010000
to make it work properly.

The error can be easily reproduced using the glyph '🜄', which stands for 'Alchemical Symbol For Water'.
This should be encoded as [240, 159, 156, 132], or hex 0xF0 0x9F 0x9C 0x84, but ends up as [240, 143, 144, 128].

Correcting that line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes the error.

Btw. function fromUTF8Array() does not suffer from this issue.
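
The arithmetic can be traced by hand. '🜄' is U+1F704, stored in UTF-16 as the surrogate pair 0xD83D 0xDF04. The following is a minimal sketch (the helper `fourBytes` and the variable names are hypothetical, not from the gist) that combines the pair with and without the 0x10000 offset and applies the four-byte UTF-8 layout to each result:

```javascript
// Apply the four-byte UTF-8 layout (intended for code points 0x10000-0x10FFFF).
function fourBytes(cp) {
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

const high = 0xd83d;
const low = 0xdf04;
// Low 10 bits of each surrogate, joined into a 20-bit value: 0xF704.
const combined = ((high & 0x3ff) << 10) | (low & 0x3ff);

const withOffset = fourBytes(0x10000 + combined); // code point 0x1F704
const withoutOffset = fourBytes(combined);        // offset deliberately omitted

console.log(withOffset);    // 240, 159, 156, 132
console.log(withoutOffset); // 240, 143, 156, 132
```

In this sketch, omitting the offset changes only the second byte (143 instead of 159).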

@HoldOffHunger

"क्षति" gives: (15) [224, 164, 149, 224, 165, 141, 224, 164, 183, 224, 164, 164, 224, 164, 191]

I can't imagine this being anything else than wrong.

@joni
Author

joni commented Jul 24, 2020

> "क्षति" gives: (15) [224, 164, 149, 224, 165, 141, 224, 164, 183, 224, 164, 164, 224, 164, 191]
>
> I can't imagine this being anything else than wrong.

That is in fact the correct result: "क्षति" is a string of 5 Unicode code points, and that is how they are encoded in UTF-8.
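
A quick way to confirm this count is sketched below, assuming a runtime with the standard `TextEncoder`. The word is written with escapes so the five code points are explicit; each lies between U+0800 and U+FFFF and therefore takes three bytes:

```javascript
// "क्षति" spelled out as its five code points.
const word = '\u0915\u094d\u0937\u0924\u093f';

// Array.from iterates a string by code point, not by UTF-16 code unit.
const codePoints = Array.from(word, (ch) => ch.codePointAt(0).toString(16));
const bytes = Array.from(new TextEncoder().encode(word));

console.log(codePoints.length); // 5
console.log(bytes.length);      // 15
```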

@joni
Author

joni commented Jul 24, 2020

> toUTF8Array() contains an error that results in incorrect UTF-8 encoding in the surrogate pair branch.
>
> In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) it's necessary to add 0x010000
> to make it work properly.
>
> The error can be easily reproduced using the glyph '🜄', which stands for 'Alchemical Symbol For Water'.
> This should be encoded as [240, 159, 156, 132], or hex 0xF0 0x9F 0x9C 0x84, but ends up as [240, 143, 144, 128].
>
> Correcting that line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes the error.
>
> Btw. function fromUTF8Array() does not suffer from this issue.

The latest version (from Aug 6 2013) no longer has this bug. Did you perchance refer to a previous version of this gist?

@andywangevertz

> > toUTF8Array() contains an error that results in incorrect UTF-8 encoding in the surrogate pair branch.
> > In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) it's necessary to add 0x010000
> > to make it work properly.
> > The error can be easily reproduced using the glyph '🜄', which stands for 'Alchemical Symbol For Water'.
> > This should be encoded as [240, 159, 156, 132], or hex 0xF0 0x9F 0x9C 0x84, but ends up as [240, 143, 144, 128].
> > Correcting that line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes the error.
> > Btw. function fromUTF8Array() does not suffer from this issue.
>
> The latest version (from Aug 6 2013) no longer has this bug. Did you perchance refer to a previous version of this gist?

The latest version at the top renders toUTF8Array('🜄') as (4) [240, 143, 156, 132],
whereas, as mos0711 pointed out, it should be [240, 159, 156, 132] with his fix.
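
One way to settle which output a given version produces is to compare it with the platform's own encoder. The sketch below assumes a runtime with the standard `TextEncoder`; `toUTF8Array` is copied from the listing at the top of this page, and `sameAsBuiltin` is a hypothetical helper added only for this check:

```javascript
// Copy of the function as listed at the top of this page.
function toUTF8Array(str) {
    var utf8 = [];
    for (var i = 0; i < str.length; i++) {
        var charcode = str.charCodeAt(i);
        if (charcode < 0x80) utf8.push(charcode);
        else if (charcode < 0x800) {
            utf8.push(0xc0 | (charcode >> 6),
                      0x80 | (charcode & 0x3f));
        }
        else if (charcode < 0xd800 || charcode >= 0xe000) {
            utf8.push(0xe0 | (charcode >> 12),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
        else {
            i++;
            charcode = 0x10000 + (((charcode & 0x3ff) << 10)
                      | (str.charCodeAt(i) & 0x3ff));
            utf8.push(0xf0 | (charcode >> 18),
                      0x80 | ((charcode >> 12) & 0x3f),
                      0x80 | ((charcode >> 6) & 0x3f),
                      0x80 | (charcode & 0x3f));
        }
    }
    return utf8;
}

// Hypothetical helper: true when the output matches the built-in encoder byte for byte.
function sameAsBuiltin(s) {
    const expected = Array.from(new TextEncoder().encode(s));
    const actual = toUTF8Array(s);
    return actual.length === expected.length &&
        actual.every((b, k) => b === expected[k]);
}

// 'A', 'é', '€', and '🜄' cover the 1-, 2-, 3-, and 4-byte branches.
for (const s of ['A', '\u00e9', '\u20ac', '\ud83d\udf04']) {
    console.log(JSON.stringify(s), toUTF8Array(s), sameAsBuiltin(s));
}
// For '🜄' this copy yields 240, 159, 156, 132.
```

Strings containing a lone surrogate are left out of the samples on purpose: `TextEncoder` replaces those with U+FFFD, while this function does not handle them.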

@MariuszGajewski

Hi, I would really like to use this snippet in my code, but I am wondering about legal matters such as copyright and licensing. I can't see any copyright notice in this snippet or on @joni's GitHub profile. In that case, according to GitHub's policy, I can't use this code :(
https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/licensing-a-repository

Any thoughts on how I can deal with such legal problems? @joni, could you perhaps add a commonly used open source license to your snippet? Or does anyone have other ideas?
