function(string) {
  return unescape( // convert every `%xx` escape into a single character
    encodeURI(string) // URL-encode the string (this uses UTF-8)
  ).length; // read out the length (i.e. the number of UTF-8 bytes)
}
// Note: this fails for input that contains lone surrogates.
// Use http://mths.be/utf8js if you need something more robust.
function(s){return unescape(encodeURI(s)).length}
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
Version 2, December 2004
Copyright (C) 2011 Mathias Bynens <http://mathiasbynens.be/>
Everyone is permitted to copy and distribute verbatim or modified
copies of this license document, and changing it is allowed as long
as the name is changed.
DO WHAT THE FUCK YOU WANT TO PUBLIC LICENSE
TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION
0. You just DO WHAT THE FUCK YOU WANT TO.
{
  "name": "byteSize",
  "description": "This function will return the byte size of any UTF-8 string you pass to it.",
  "keywords": [
    "utf-8",
    "utf8",
    "byte",
    "byte-size"
  ]
}
<!DOCTYPE html>
<!-- online demo: http://mothereff.in/byte-counter -->
<meta charset=utf-8>
<title>Get the byte size of any UTF-8 string</title>
<input autofocus>
<p>Byte size: <span></span>
<script>
  var byteSize = function(s){return unescape(encodeURI(s)).length};
  var el = document.getElementsByTagName('span')[0];
  document.getElementsByTagName('input')[0].oninput = function() {
    el.innerHTML = byteSize(this.value);
  };
</script>
isn't encodeURIComponent cross-browser? if so, would this work?
function(s){return encodeURIComponent(s).split(/%..|./).length-1}
or even
function(s){return~-encodeURIComponent(s).split(/%..|./).length}
there's still room in the non-encodeURIComponent version too:
function(s,b,i,c){for(b=i=0;s[i];i++){c=s.charCodeAt(i);b+=1+(c>127)+(c>2047)}return b}
and also
function(s,b,i,c){for(b=i=0;c=s.charCodeAt(i++);b+=1+(c>127)+(c>2047));return b}
Another way, without encodeURIComponent and charCodeAt:
function(s){return s.replace(/[\0-\x7f]|([0-\u07ff]|(.))/g,"$&$1$2").length}
This regexp replaces 1 char in source with 1 to 3 chars depending on how many parens were captured.
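To make the padding visible, here is a small sketch that returns the expanded string rather than its length. The helper name `expandToByteWidth` is made up for illustration; the regex and replacement pattern are the ones from the comment above. Like the original, it only handles characters in the Basic Multilingual Plane.

```javascript
// Hypothetical helper: same regex and replacement as above, but it returns
// the padded string instead of its length.
function expandToByteWidth(s) {
  // $& is the whole match, $1 and $2 are the two capture groups.
  // ASCII: no group captured, so 1 char. U+0080..U+07FF: $1 only, so 2 chars.
  // Anything else matched by `.`: both $1 and $2, so 3 chars.
  return s.replace(/[\0-\x7f]|([0-\u07ff]|(.))/g, "$&$1$2");
}

// "a" (1 byte), U+00E9 "é" (2 bytes), U+65E5 "日" (3 bytes)
var expanded = expandToByteWidth("a\u00e9\u65e5");
console.log(expanded);        // "aéé日日日"
console.log(expanded.length); // 6
```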
You guys blow my mind.
@subzey: That solution seems to return incorrect results for astral symbols. For example, try U+1D306: x('\uD834\uDF06') == 6, but it should be 4.
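A quick way to reproduce the discrepancy, as a sketch: the names `viaRegex` and `viaEncodeURI` are labels chosen here, and the function bodies are copied from earlier in this gist and thread. The regex version counts each surrogate half as a separate 3-byte character, while `encodeURI` treats the pair as a single code point.

```javascript
// Regex-based counter from the comment above (counts per UTF-16 code unit).
var viaRegex = function(s){return s.replace(/[\0-\x7f]|([0-\u07ff]|(.))/g,"$&$1$2").length};
// encodeURI-based counter from the gist itself.
var viaEncodeURI = function(s){return unescape(encodeURI(s)).length};

var astral = '\uD834\uDF06'; // U+1D306, stored as a surrogate pair
console.log(astral.length);        // 2 code units
console.log(viaRegex(astral));     // 6 (3 bytes per surrogate half)
console.log(viaEncodeURI(astral)); // 4 (the actual UTF-8 length)
```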
good lord, @subzey, this is crazytown!
IINM all these implementations are limited to the Basic Multilingual Plane of Unicode characters and do not support 4-byte-long UTF-8 characters.
Here's a 67-byte version using charCodeAt that should support 1-4 byte long characters:
function(s,b,i,c){for(;c=c>>8||s.charCodeAt(i=-~i);b=-~b);return b}
@p01, this doesn't work for me. For example, the length of "日本語ée" should be 12, not 6.
@p01 it is so. But we can't have a char with a charCode greater than 0xffff in JavaScript anyway.
See the ECMAScript docs, 3rd edition, 15.5.3.2.
Oopsy daisy. Sorry, my function did work fine. I guess I messed it up at some point, then got distracted when my baby girl woke up.
@mathiasbynens: save 2 characters from your current method with some bit shifting:
- current:
function(s,b,i,c){for(b=i=0;c=s.charCodeAt(i++);b+=1+(c>127)+(c>2047));return b}
- improved:
function(s,b,i,c){for(b=i=0;c=s.charCodeAt(i++);b+=c>>11?3:c>>7?2:1);return b}
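The two loop bodies should agree because `c>>11` is non-zero exactly when `c` is at least 2048, and `c>>7` is non-zero exactly when `c` is at least 128. Below is a minimal check over every possible UTF-16 code unit; `viaComparisons` and `viaShifts` are names invented here to wrap the two expressions.

```javascript
// The per-code-unit increment from the "current" version.
function viaComparisons(c) { return 1 + (c > 127) + (c > 2047); }
// The per-code-unit increment from the "improved" version.
function viaShifts(c) { return c >> 11 ? 3 : c >> 7 ? 2 : 1; }

// Compare both expressions for every code unit value 0x0000..0xFFFF.
var same = true;
for (var c = 0; c <= 0xFFFF; c++) {
  if (viaComparisons(c) !== viaShifts(c)) { same = false; break; }
}
console.log(same); // true
```

Note that both loops in the thread stop at the first U+0000 character, since `c=s.charCodeAt(i++)` is then falsy; that behaviour is the same before and after the change.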
We could just use encodeURI instead of encodeURIComponent; this saves 9 bytes.
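A sketch of why the swap does not change the result: the two functions differ only in which ASCII characters they escape, and an ASCII character counts as one byte whether or not it is written as `%xx`. The sample string here is arbitrary.

```javascript
var sample = 'a/\u00e9'; // "a", "/", and U+00E9 "é"

var viaURI = encodeURI(sample);                // "/" is left as-is
var viaComponent = encodeURIComponent(sample); // "/" becomes "%2F"

console.log(viaURI);       // "a/%C3%A9"
console.log(viaComponent); // "a%2F%C3%A9"

// After unescape, both collapse to one character per byte.
console.log(unescape(viaURI).length);       // 4
console.log(unescape(viaComponent).length); // 4
```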
Anyway, here’s an online tool you can use to check the length & byte count of a string (useful for @140bytes): http://mothereff.in/byte-counter
API: http://mothereff.in/byte-counter#%s where %s is the URL-encoded input string. I’ve added it to my browser’s bookmarks / search engines :)
encodeURI and encodeURIComponent throw "URI malformed" errors on certain strings in Google Chrome.
@atk Yeah, if the input contains lone surrogates.
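For anyone who wants to see the failure, here is a small sketch. The wrapper `tryByteSize` and its choice to return `null` on malformed input are assumptions made for this example, not part of the gist.

```javascript
// Hypothetical wrapper: returns the byte size, or null when the input
// contains a lone surrogate and encodeURI throws a URIError.
function tryByteSize(s) {
  try {
    return unescape(encodeURI(s)).length;
  } catch (e) {
    if (e instanceof URIError) return null;
    throw e; // anything else is unexpected, so rethrow
  }
}

console.log(tryByteSize('\uD834\uDF06')); // 4 (well-formed surrogate pair)
console.log(tryByteSize('\uD834'));       // null (lone high surrogate)
console.log(tryByteSize('\uDF06'));       // null (lone low surrogate)
```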
//count UTF-8 bytes of a string
function byteLengthOf(s){
  //assuming the string is a sequence of UTF-16 code units (as JavaScript strings are)
  var n=0;
  for(var i=0,l=s.length; i<l; i++){
    var hi=s.charCodeAt(i);
    if(hi<0x0080){ //[0x0000, 0x007F]
      n+=1;
    }else if(hi<0x0800){ //[0x0080, 0x07FF]
      n+=2;
    }else if(hi<0xD800){ //[0x0800, 0xD7FF]
      n+=3;
    }else if(hi<0xDC00){ //[0xD800, 0xDBFF]
      var lo=s.charCodeAt(++i);
      if(i<l&&lo>=0xDC00&&lo<=0xDFFF){ //followed by [0xDC00, 0xDFFF]
        n+=4;
      }else{
        throw new Error("UCS-2 String malformed");
      }
    }else if(hi<0xE000){ //[0xDC00, 0xDFFF]
      throw new Error("UCS-2 String malformed");
    }else{ //[0xE000, 0xFFFF]
      n+=3;
    }
  }
  return n;
}
Alternative in 80 bytes:
Can be shorter (76 bytes) if we depend on encodeURIComponent:
But that would be cheating, no?