@ktomk
Created October 4, 2011 19:10
filter valid utf-8 byte sequences
<?php
/**
 * Filter valid UTF-8 byte sequences.
 *
 * Copies every structurally valid sequence to the output. An invalid
 * sequence is dropped up to the first non-matching byte, and scanning
 * starts over at that byte.
 *
 * Only the lead/continuation byte structure is checked: overlong forms,
 * UTF-16 surrogates and the obsolete 5-byte form are not rejected.
 *
 * @param string $str
 * @return string
 */
function valid_utf8_bytes($str)
{
    $return  = '';
    $length  = strlen($str);
    $invalid = array_flip(array("\xEF\xBF\xBF" /* U+FFFF */, "\xEF\xBF\xBE" /* U+FFFE */));

    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$o = $i]);                # $o marks the start of the sequence

        if ($c < 0x80) $n = 0;                  # 0bbbbbbb
        elseif (($c & 0xE0) === 0xC0) $n = 1;   # 110bbbbb
        elseif (($c & 0xF0) === 0xE0) $n = 2;   # 1110bbbb
        elseif (($c & 0xF8) === 0xF0) $n = 3;   # 11110bbb
        elseif (($c & 0xFC) === 0xF8) $n = 4;   # 111110bb (obsolete 5-byte form)
        else continue;                          # does not match any lead byte

        # do n bytes matching 10bbbbbb follow? ($n becomes the total length)
        for ($j = ++$n; --$j;) {
            if (++$i === $length) {
                continue 2;                     # input ends inside the sequence
            }
            if ((ord($str[$i]) & 0xC0) !== 0x80) {
                $i--;                           # start over at the non-matching byte
                continue 2;
            }
        }

        $match = substr($str, $o, $n);
        if ($n === 3 && isset($invalid[$match])) {  # drop U+FFFE and U+FFFF
            continue;
        }
        $return .= $match;
    }

    return $return;
}
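
The lead-byte tests in the function above can be tried in isolation. The following is a small sketch, not part of the gist: `utf8_continuation_count()` is a hypothetical helper name, and it only repeats the bit masks used in `valid_utf8_bytes()` to show which bytes are accepted as the start of a sequence.

```php
<?php
/**
 * Hypothetical helper (not part of the gist): number of continuation
 * bytes announced by a lead byte, or null if the byte cannot start a
 * sequence. Uses the same masks as valid_utf8_bytes().
 */
function utf8_continuation_count($byte)
{
    $c = ord($byte);
    if ($c < 0x80)            return 0; // 0bbbbbbb: ASCII
    if (($c & 0xE0) === 0xC0) return 1; // 110bbbbb
    if (($c & 0xF0) === 0xE0) return 2; // 1110bbbb
    if (($c & 0xF8) === 0xF0) return 3; // 11110bbb
    if (($c & 0xFC) === 0xF8) return 4; // 111110bb (obsolete 5-byte form)
    return null;                        // continuation byte or 0xFC..0xFF
}

// Print the classification for a few sample bytes.
foreach (array("A", "\xC3", "\xE2", "\xF0", "\xF8", "\x80", "\xFF") as $byte) {
    $n = utf8_continuation_count($byte);
    printf("0x%02X => %s\n", ord($byte), $n === null ? 'not a lead byte' : $n);
}
```

A byte such as `0x80` matches none of the masks, which is why the filter skips stray continuation bytes one at a time.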
ktomk commented Nov 21, 2011

@todo: Check for invalid UTF-8 bytes, check for UTF-16 range.
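
One possible reading of this @todo, sketched under assumptions: "invalid UTF-8 bytes" is taken to mean lead bytes that can never occur in well-formed UTF-8 (0xC0, 0xC1, 0xF5 and above) together with overlong encodings, and "UTF-16 range" is taken to mean the surrogate code points U+D800 to U+DFFF and anything above U+10FFFF. The function name `valid_utf8_bytes_strict()` is hypothetical, and this is not the gist author's code. It decodes each sequence to a code point and checks the result against those rules.

```php
<?php
/**
 * Hypothetical stricter variant of the gist's filter. The name and the
 * exact rule set are assumptions made for this sketch.
 */
function valid_utf8_bytes_strict($str)
{
    // Smallest code point that legitimately needs 1..4 bytes.
    $min    = array(1 => 0x00, 2 => 0x80, 3 => 0x800, 4 => 0x10000);
    $return = '';
    $length = strlen($str);

    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80)                { $n = 1; $cp = $c; }
        elseif (($c & 0xE0) === 0xC0) { $n = 2; $cp = $c & 0x1F; }
        elseif (($c & 0xF0) === 0xE0) { $n = 3; $cp = $c & 0x0F; }
        elseif (($c & 0xF8) === 0xF0) { $n = 4; $cp = $c & 0x07; }
        else continue; // stray continuation byte or 0xF8..0xFF

        $start = $i;
        for ($j = 1; $j < $n; $j++) {
            // Peek at the next byte without consuming it, so that a
            // non-matching byte is re-examined by the outer loop.
            if ($i + 1 === $length || (ord($str[$i + 1]) & 0xC0) !== 0x80) {
                continue 2;
            }
            $cp = ($cp << 6) | (ord($str[++$i]) & 0x3F);
        }

        if ($cp < $min[$n]                       // overlong, covers lead bytes 0xC0 and 0xC1
            || $cp > 0x10FFFF                    // beyond Unicode, covers lead bytes 0xF5..0xF7
            || ($cp >= 0xD800 && $cp <= 0xDFFF)  // UTF-16 surrogates
            || $cp === 0xFFFE || $cp === 0xFFFF  // noncharacters the gist already drops
        ) {
            continue;
        }
        $return .= substr($str, $start, $n);
    }

    return $return;
}

// Overlong "/" (C0 AF) and an encoded surrogate (ED A0 80) are dropped.
echo bin2hex(valid_utf8_bytes_strict("a\xC0\xAFb\xED\xA0\x80c")), "\n";
```

Decoding the code point costs a few extra operations per byte but keeps all range checks in one place. The output of this sketch could additionally be cross-checked with `preg_match('//u', $result)`, which fails on input that PCRE does not accept as UTF-8.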

@BenMakesGames

this saved my life.

you are a savior.

thank you.