Skip to content

Instantly share code, notes, and snippets.

Embed
What would you like to do?
toUTF8Array: Javascript function for encoding a string in UTF8.
function toUTF8Array(str) {
var utf8 = [];
for (var i=0; i < str.length; i++) {
var charcode = str.charCodeAt(i);
if (charcode < 0x80) utf8.push(charcode);
else if (charcode < 0x800) {
utf8.push(0xc0 | (charcode >> 6),
0x80 | (charcode & 0x3f));
}
else if (charcode < 0xd800 || charcode >= 0xe000) {
utf8.push(0xe0 | (charcode >> 12),
0x80 | ((charcode>>6) & 0x3f),
0x80 | (charcode & 0x3f));
}
// surrogate pair
else {
i++;
// UTF-16 encodes 0x10000-0x10FFFF by
// subtracting 0x10000 and splitting the
// 20 bits of 0x0-0xFFFFF into two halves
charcode = 0x10000 + (((charcode & 0x3ff)<<10)
| (str.charCodeAt(i) & 0x3ff))
utf8.push(0xf0 | (charcode >>18),
0x80 | ((charcode>>12) & 0x3f),
0x80 | ((charcode>>6) & 0x3f),
0x80 | (charcode & 0x3f));
}
}
return utf8;
}
@nene85

This comment has been minimized.

Copy link

@nene85 nene85 commented Nov 17, 2012

Very usefull! Thank you..
But how shoud I do to decode a UTF-8 array to a Javascript (utf16) String?

@frozn

This comment has been minimized.

Copy link

@frozn frozn commented Sep 13, 2014

Maybe a little late, but if someone find it useful, here's the solution to decode a UTF-8 array:

function fromUTF8Array(data) { // array of bytes
    var str = '',
        i;

    for (i = 0; i < data.length; i++) {
        var value = data[i];

        if (value < 0x80) {
            str += String.fromCharCode(value);
        } else if (value > 0xBF && value < 0xE0) {
            str += String.fromCharCode((value & 0x1F) << 6 | data[i + 1] & 0x3F);
            i += 1;
        } else if (value > 0xDF && value < 0xF0) {
            str += String.fromCharCode((value & 0x0F) << 12 | (data[i + 1] & 0x3F) << 6 | data[i + 2] & 0x3F);
            i += 2;
        } else {
            // surrogate pair
            var charCode = ((value & 0x07) << 18 | (data[i + 1] & 0x3F) << 12 | (data[i + 2] & 0x3F) << 6 | data[i + 3] & 0x3F) - 0x010000;

            str += String.fromCharCode(charCode >> 10 | 0xD800, charCode & 0x03FF | 0xDC00); 
            i += 3;
        }
    }

    return str;
}
@JavaScript-Packer

This comment has been minimized.

Copy link

@JavaScript-Packer JavaScript-Packer commented Jan 14, 2016

Simple decode UTF-8 array in JavaScript:

function fromUTF8Array($) {
  return eval("String.fromCharCode(" + $ + ")");
}
alert(fromUTF8Array("119,119,119,46,87,72,65,75,46,99,111,109"));
@saumyamehta17

This comment has been minimized.

Copy link

@saumyamehta17 saumyamehta17 commented Jan 20, 2017

@frozn How to use fromUTF8Array for array of bytes like [0,1,0,-1] ??

@bkdotcom

This comment has been minimized.

Copy link

@bkdotcom bkdotcom commented May 31, 2017

there's "something" wrong with the surrogate pair portion of your code.

I think

    charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i) & 0x3ff))

should be

    charcode = 0x10000 + (((charcode & 0x3ff)<<10)
                      | (str.charCodeAt(i+1) & 0x3ff))

All I know is that you never add in the following pair and increment the index accordingly

@pgqueme

This comment has been minimized.

Copy link

@pgqueme pgqueme commented Dec 11, 2017

This worked great when creating a .doc file from Blob. For some reason, my old method using charCodeAt(i) stopped working, showing some weir characters when opening the file. This is how my method works:

var htmlString = '<div>Your html á é í ó ú</div>';
var arrayUTF8 = toUTF8Array(htmlString); //Your function
var byteNumbers = new Uint8Array(arrayUTF8.length);
for (var i = 0; i < arrayUTF8.length; i++) {
    byteNumbers[i] = arrayUTF8[i]; 
}
var blob = new Blob([byteNumbers], {type: 'text/html;charset=UTF-8;' });
FileSaver.saveAs(blob, 'yourfile.doc');
@kachkaev

This comment has been minimized.

Copy link

@kachkaev kachkaev commented Nov 29, 2018

@bkdotcom looks like there should not be that extra i+1 in str.charCodeAt, because there is i++; right after else. WDYT?

What I don't quite understand is

} else if (charcode < 0x800) {
// ...
} else if (charcode < 0xd800 || charcode >= 0xe000) {
//                  ^ never true given previous if

**UPD:** Here is a similar function inside google closure library: [stringToUtf8ByteArray()](https://github.com/google/closure-library/blob/8598d87242af59aac233270742c8984e2b2bdbe0/closure/goog/crypt/crypt.js#L117-L143). The fact that strings are UTF16 in the JavaScript's memory has been an opening to me!
@mos0711

This comment has been minimized.

Copy link

@mos0711 mos0711 commented May 7, 2019

toUTF8Array() contains an error which results in incorrect utf-8 encoding for surrogate pairs branch.

In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) its necessary to add 0x010000
to make it work properly.

The error can be easily reproduced using glyph '🜄' which stands for 'Alchemical Symbol For Water'.
This should be rendered to: [240, 159, 156, 132] or hex 0xF0 0x9F 0x9C 0x84 but ends up to be [240, 143, 144, 128].

Correcting the according line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes that error.

Btw. function fromUTF8Array() does not suffer from this issue.

@HoldOffHunger

This comment has been minimized.

Copy link

@HoldOffHunger HoldOffHunger commented Jul 19, 2020

"क्षति" gives: (15) [224, 164, 149, 224, 165, 141, 224, 164, 183, 224, 164, 164, 224, 164, 191]

I can't imagine this being anything else than wrong.

@joni

This comment has been minimized.

Copy link
Owner Author

@joni joni commented Jul 24, 2020

"क्षति" gives: (15) [224, 164, 149, 224, 165, 141, 224, 164, 183, 224, 164, 164, 224, 164, 191]

I can't imagine this being anything else than wrong.

That is in fact the correct result: "क्षति" is a string of 5 Unicode code points, and that is how they are encoded in UTF-8.

@joni

This comment has been minimized.

Copy link
Owner Author

@joni joni commented Jul 24, 2020

toUTF8Array() contains an error which results in incorrect utf-8 encoding for surrogate pairs branch.

In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) its necessary to add 0x010000
to make it work properly.

The error can be easily reproduced using glyph '🜄' which stands for 'Alchemical Symbol For Water'.
This should be rendered to: [240, 159, 156, 132] or hex 0xF0 0x9F 0x9C 0x84 but ends up to be [240, 143, 144, 128].

Correcting the according line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes that error.

Btw. function fromUTF8Array() does not suffer from this issue.

The latest version (from Aug 6 2013) no longer has this bug. Did you perchance refer to a previous version of this gist?

@andywangevertz

This comment has been minimized.

Copy link

@andywangevertz andywangevertz commented Feb 12, 2021

toUTF8Array() contains an error which results in incorrect utf-8 encoding for surrogate pairs branch.
In the line containing charcode = ((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff) its necessary to add 0x010000
to make it work properly.
The error can be easily reproduced using glyph '🜄' which stands for 'Alchemical Symbol For Water'.
This should be rendered to: [240, 159, 156, 132] or hex 0xF0 0x9F 0x9C 0x84 but ends up to be [240, 143, 144, 128].
Correcting the according line to charcode = (((charcode&0x3ff)<<10)|(str.charCodeAt(i)&0x3ff)) + 0x010000; fixes that error.
Btw. function fromUTF8Array() does not suffer from this issue.

The latest version (from Aug 6 2013) no longer has this bug. Did you perchance refer to a previous version of this gist?

The latest version on the top render toUTF8Array('🜄') as (4) [240, 143, 156, 132]
Where as mos0711 pointed out, it should be [240, 159, 156, 132] with his fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment