UTF-8 byte counter in 49 bytes
function(string) {
  return unescape( // convert each `%xx` escape into a single character
    encodeURI(string) // URL-encode the string (this uses UTF-8)
  ).length; // read out the length (i.e. the number of UTF-8 bytes)
}
// Note: this throws a URIError for input that contains lone surrogates.
// Use http://mths.be/utf8js if you need something more robust.
function(s){return unescape(encodeURI(s)).length}
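To see the two-step trick in action, here is a minimal sketch that names the one-liner `byteSize` (as the demo page does) and cross-checks it against `TextEncoder`. It assumes an environment where `TextEncoder` and the legacy `unescape` are both available as globals; the helper name `byteSizeViaTextEncoder` and the sample strings are made up for illustration.

```javascript
// Named version of the one-liner above.
var byteSize = function (s) {
  return unescape(encodeURI(s)).length;
};

// Cross-check helper (assumption: TextEncoder exists as a global).
var byteSizeViaTextEncoder = function (s) {
  return new TextEncoder().encode(s).length;
};

// "a", "é", "日", U+1D306 (as a surrogate pair), and a string with a space.
var samples = ['a', '\u00E9', '\u65E5', '\uD834\uDF06', 'a b'];

var results = samples.map(function (s) {
  return [byteSize(s), byteSizeViaTextEncoder(s)];
});

console.log(results);
```

Both columns should agree for well-formed input. The space in the last sample is percent-encoded by `encodeURI` and then turned back into one character by `unescape`, so it still counts as a single byte.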
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2011 Mathias Bynens <http://mathiasbynens.be/>
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
{
"name": "byteSize",
"description": "This function will return the byte size of any UTF-8 string you pass to it.",
"keywords": [
"utf-8",
"utf8",
"byte",
"byte-size"
]
}
<!DOCTYPE html>
<!-- online demo: http://mothereff.in/byte-counter -->
<meta charset=utf-8>
<title>Get the byte size of any UTF-8 string</title>
<input autofocus>
<p>Byte size: <span></span>
<script>
var byteSize = function(s){return unescape(encodeURI(s)).length};
var el = document.getElementsByTagName('span')[0];
document.getElementsByTagName('input')[0].oninput = function() {
el.innerHTML = byteSize(this.value);
};
</script>

mathiasbynens Jun 6, 2011

Alternative in 80 bytes:

function(s){return s.length-s.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length+1}

Can be shorter (76 bytes) if we depend on encodeURIComponent:

function(s){return encodeURIComponent(s).replace(/%[A-F\d]{2}/g,'x').length}

But that would be cheating, no?
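At golfed size it is easy to miss that the two snippets above measure different things. As far as I can tell, the first subtracts one for every surrogate pair and therefore returns the number of code points, while the second returns the UTF-8 byte count. A small sketch, with the hypothetical names `countCodePoints` and `countBytes` and a made-up sample string:

```javascript
// First snippet, unchanged apart from the name: string length minus surrogate pairs.
var countCodePoints = function (s) {
  return s.length - s.split(/[\uD800-\uDBFF][\uDC00-\uDFFF]/g).length + 1;
};

// Second snippet, unchanged apart from the name: one character per encoded byte.
var countBytes = function (s) {
  return encodeURIComponent(s).replace(/%[A-F\d]{2}/g, 'x').length;
};

var sample = 'a\u00E9\uD834\uDF06'; // "a", "é", U+1D306

console.log(sample.length);           // UTF-16 code units
console.log(countCodePoints(sample)); // code points
console.log(countBytes(sample));      // UTF-8 bytes (1 + 2 + 4)
```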

jed Jun 6, 2011

isn't encodeURIComponent cross-browser? if so, would this work?

function(s){return encodeURIComponent(s).split(/%..|./).length-1}

or even

function(s){return~-encodeURIComponent(s).split(/%..|./).length}
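The `split` trick works because the regex matches every token in the encoded string (either a `%xx` escape or a single literal character), so splitting on it leaves one empty string more than there are tokens. A minimal sketch with a made-up sample string, assuming a modern engine that keeps the empty strings:

```javascript
// Assumption: the sample string is mine, chosen to mix ASCII and non-ASCII.
var encoded = encodeURIComponent('a\u00E9'); // "a" followed by "é"

// Every token is a separator, so only empty strings remain between them.
var pieces = encoded.split(/%..|./);

var viaMinusOne = pieces.length - 1; // first variant
var viaTilde = ~-pieces.length;      // second variant: ~-n is n - 1

console.log(encoded, pieces.length, viaMinusOne, viaTilde);
```

Here `encoded` is `a%C3%A9`, which holds three tokens, so `pieces` has four entries and both variants report three bytes.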

jed Jun 6, 2011

there's still room in the non-encodeURIComponent version too:

function(s,b,i,c){for(b=i=0;s[i];i++){c=s.charCodeAt(i);b+=1+(c>127)+(c>2047)}return b}

and also

function(s,b,i,c){for(b=i=0;c=s.charCodeAt(i++);b+=1+(c>127)+(c>2047));return b}
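Both loops lean on the UTF-8 size boundaries: a code unit up to 127 takes one byte, up to 2047 two bytes, and anything above that is counted as three. An ungolfed sketch of the same arithmetic (the name `byteCountPerUnit` is hypothetical) makes the rule visible, along with its blind spot: the two halves of a surrogate pair are counted as 3 + 3 rather than 4.

```javascript
// Counts bytes one UTF-16 code unit at a time, like the golfed loops above.
function byteCountPerUnit(s) {
  var bytes = 0;
  for (var i = 0; i < s.length; i++) {
    var c = s.charCodeAt(i);
    // Booleans coerce to 0 or 1, so each passed threshold adds one byte.
    bytes += 1 + (c > 127) + (c > 2047);
  }
  return bytes;
}

console.log(byteCountPerUnit('abc'));          // ASCII only
console.log(byteCountPerUnit('\u00E9'));       // "é"
console.log(byteCountPerUnit('\u65E5'));       // "日"
console.log(byteCountPerUnit('\uD834\uDF06')); // U+1D306, overcounted
```

Unlike the second golfed version, this sketch loops on `s.length`, so it does not stop early at a NUL character.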

subzey Jun 7, 2011

Another way, without encodeURIComponent and charCodeAt:

function(s){return s.replace(/[\0-\x7f]|([0-\u07ff]|(.))/g,"$&$1$2").length}

This regexp replaces each char in the source with 1 to 3 chars, depending on how many capture groups matched.
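To make the capture-group mechanics concrete, here is a small sketch that returns the expanded string instead of its length. The name `expand` is hypothetical; the regex and replacement pattern are copied from the snippet above.

```javascript
// Same replace call as above, but returning the expanded string itself.
var expand = function (s) {
  return s.replace(/[\0-\x7f]|([0-\u07ff]|(.))/g, '$&$1$2');
};

// ASCII: no group captures, so only $& is emitted (1 char).
console.log(expand('a'));
// Two-byte range: group 1 captures, so $& and $1 are emitted (2 chars).
console.log(expand('\u00E9'));
// Everything else: groups 1 and 2 both capture (3 chars).
console.log(expand('\u65E5'));
```

Unmatched groups expand to the empty string in the replacement, which is what lets the output length track the byte count.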

mathiasbynens Jun 7, 2011

You guys blow my mind.

@subzey: That solution seems to return incorrect results for astral symbols. For example, try U+1D306: x('\uD834\uDF06') == 6, but it should be 4.
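One way to get 4 for astral symbols is to walk the string by code point rather than by code unit. The following is only a sketch, not one of the golfed entries: it assumes ES2015 features (`for...of` over a string and `codePointAt`), and the name `byteCountByCodePoint` is hypothetical.

```javascript
// Iterating a string with for...of yields whole code points,
// so a surrogate pair arrives as a single two-unit string.
function byteCountByCodePoint(s) {
  var bytes = 0;
  for (var ch of s) {
    var cp = ch.codePointAt(0);
    bytes += cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
  }
  return bytes;
}

console.log(byteCountByCodePoint('\uD834\uDF06')); // U+1D306
```

Note that a lone surrogate is simply counted as three bytes here instead of raising an error, which may or may not be what you want.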

jed Jun 7, 2011

good lord, @subzey, this is crazytown!

p01 Jun 7, 2011

IINM all these implementations are limited to the Basic Multilingual Plane of Unicode and do not support 4-byte UTF-8 characters.

Here's a 67-byte version using charCodeAt that should support characters 1 to 4 bytes long:

function(s,b,i,c){for(;c=c>>8||s.charCodeAt(i=-~i);b=-~b);return b}

jed Jun 7, 2011

@p01, this doesn't work for me. for example, the length of "日本語ée" should be 12, not 6.

subzey Jun 7, 2011

@p01 that's true. But in JavaScript a char can't have a charCode above 0xffff anyway.
See the ECMAScript spec, 3rd edition, section 15.5.3.2.
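A quick sketch of what that limit looks like in practice: an astral symbol is stored as two 16-bit code units, and `charCodeAt` only ever reports one unit at a time. The `codePointAt` call is an assumption on my part, since it comes from ES2015 and is not part of the 3rd edition cited above.

```javascript
var astral = '\uD834\uDF06'; // U+1D306 as a surrogate pair

// charCodeAt sees the two halves separately, each at most 0xFFFF.
var units = [astral.charCodeAt(0), astral.charCodeAt(1)];

// codePointAt (ES2015) combines the pair into the full code point.
var codePoint = astral.codePointAt(0);

console.log(astral.length);
console.log(units.map(function (u) { return u.toString(16); }));
console.log(codePoint.toString(16));
```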

p01 Jun 7, 2011

Oopsy daisy. Sorry my function did work fine, I guess I messed it up at some point, then got distracted when my baby girl woke up.

michaelficarra Jul 11, 2011

@mathiasbynens: save 2 characters from your current method with some bit shifting:

  • current:

    function(s,b,i,c){for(b=i=0;c=s.charCodeAt(i++);b+=1+(c>127)+(c>2047));return b}
  • improved:

    function(s,b,i,c){for(b=i=0;c=s.charCodeAt(i++);b+=c>>11?3:c>>7?2:1);return b}
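The bit-shift form relies on `c>>7` being non-zero exactly when `c` is above 127, and `c>>11` being non-zero exactly when `c` is above 2047. A brute-force sketch over every possible code unit can confirm the two expressions agree (the names `viaComparison` and `viaShift` are hypothetical):

```javascript
// The per-unit byte size from the "current" version.
var viaComparison = function (c) {
  return 1 + (c > 127) + (c > 2047);
};

// The per-unit byte size from the "improved" version.
var viaShift = function (c) {
  return c >> 11 ? 3 : c >> 7 ? 2 : 1;
};

// charCodeAt only returns values from 0 to 0xFFFF, so check them all.
var mismatches = 0;
for (var c = 0; c <= 0xFFFF; c++) {
  if (viaComparison(c) !== viaShift(c)) mismatches++;
}

console.log(mismatches);
```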

mathiasbynens Nov 18, 2011

We could just use encodeURI instead of encodeURIComponent; this saves 9 bytes.

Anyway, here’s an online tool you can use to check the length & byte count of a string (useful for @140bytes): http://mothereff.in/byte-counter

API: http://mothereff.in/byte-counter#%s where %s is the URL-encoded input string. I’ve added it to my browser’s bookmarks / search engines :)
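Building such a link from script could look like the sketch below. The helper name `byteCounterUrl` is hypothetical, and using `encodeURIComponent` for the "URL-encoded" part is my assumption about what the tool expects.

```javascript
// Hypothetical helper: appends the URL-encoded input after the "#".
function byteCounterUrl(input) {
  return 'http://mothereff.in/byte-counter#' + encodeURIComponent(input);
}

console.log(byteCounterUrl('\u00E9')); // input is "é"
```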

atk Sep 20, 2012

encodeURI and encodeURIComponent will throw "URI malformed" errors on certain strings in Google Chrome.

mathiasbynens Dec 17, 2013

@atk Yeah, if the input contains lone surrogates.
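If the input may contain lone surrogates, the call can be guarded. This is a minimal sketch under the assumption that returning `null` is an acceptable way to signal malformed input; the wrapper name `tryByteSize` is hypothetical.

```javascript
var byteSize = function (s) {
  return unescape(encodeURI(s)).length;
};

// Returns the byte size, or null when encodeURI rejects the string.
function tryByteSize(s) {
  try {
    return byteSize(s);
  } catch (e) {
    if (e instanceof URIError) return null; // lone surrogate in the input
    throw e; // anything else is unexpected, so rethrow
  }
}

console.log(tryByteSize('\uD834\uDF06')); // well-formed surrogate pair
console.log(tryByteSize('\uD834'));       // lone high surrogate
```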

FuweiChin Jan 21, 2016

//count UTF-8 bytes of a string
function byteLengthOf(s){
    //assuming the string is UTF-16 encoded (as JavaScript strings are)
    var n=0;
    for(var i=0,l=s.length; i<l; i++){
        var hi=s.charCodeAt(i);
        if(hi<0x0080){ //[0x0000, 0x007F]
            n+=1;
        }else if(hi<0x0800){ //[0x0080, 0x07FF]
            n+=2;
        }else if(hi<0xD800){ //[0x0800, 0xD7FF]
            n+=3;
        }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
            var lo=s.charCodeAt(++i);
            if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
                n+=4;
            }else{
                throw new Error("UCS-2 String malformed");
            }
        }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
            throw new Error("UCS-2 String malformed");
        }else{ //[0xE000, 0xFFFF]
            n+=3;
        }
    }
    return n;
}