@qiu8310
Last active December 14, 2015 02:07
php whitespace regexp warning
<?php
/*
http://php.net/manual/en/regexp.reference.escape.php
The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32).
However, if locale-specific matching is happening, characters with code points in
the range 128-255 may also be considered as whitespace characters,
for instance, NBSP (A0).
For how to specify the locale, see a bug I previously submitted to Phabricator: https://secure.phabricator.com/D14441
*/
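/*
A minimal sketch for checking which LC_CTYPE locale is active before running
the tests below. Passing "0" as the locale makes setlocale() return the current
setting without changing it. The variable name $current_ctype is only
illustrative. Whether a given locale makes "\s" match bytes above 0x7F depends
on the system's locale tables, so this only reports the setting.
*/
$current_ctype = setlocale(LC_CTYPE, '0');
echo "LC_CTYPE: " . $current_ctype . "\n";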
/*
Get all single-byte values (0-255) that "\s" will match.
Result:
0x09
0x0A
0x0C
0x0D
0x20
0x85 => 1000 0101
0xA0 => 1010 0000
*/
print_r(whitespace_characters_in_ascii('/\s/'));
/*
http://php.net/manual/en/function.utf8-encode.php
bytes  bits  representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb
So if you are using the regexp "/\s/" under such a locale, any multibyte
character whose UTF-8 encoding contains the byte 0x85 or 0xA0 will be matched.
For example:
*/
// 忠 => E5 BF A0 => 11100101 10111111 10100000
echo preg_match('/\s/', '忠') . "\n"; // => 1
echo preg_match('/\s/', '濠') . "\n"; // => 1
echo preg_match('/\s/', '翠') . "\n"; // => 1
echo preg_match('/\s/', '迠') . "\n"; // => 1
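/*
A minimal sketch of why those characters match: it checks whether the raw
UTF-8 bytes of a string include 0x85 or 0xA0. The helper name
has_locale_whitespace_byte() is hypothetical, and the sample character is
written as explicit byte escapes so the result does not depend on how this
file is saved.
*/
function has_locale_whitespace_byte($str) {
    // strpbrk() scans byte by byte and returns false when neither byte occurs.
    return strpbrk($str, "\x85\xA0") !== false;
}
echo bin2hex("\xE5\xBF\xA0") . "\n"; // => e5bfa0
var_dump(has_locale_whitespace_byte("\xE5\xBF\xA0")); // => bool(true)
var_dump(has_locale_whitespace_byte("abc")); // => bool(false)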
/*
Whitespace should only be matched among single-byte characters (0bbbbbbb),
so it is safer to use "[\t\n\f\r ]" instead of "\s".
*/
print_r(whitespace_characters_in_ascii('/[\t\n\f\r ]/'));
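// Return the hex codes of every single-byte value (0-255) matched by the given regexp.
function whitespace_characters_in_ascii ($whitespace_regexp) {
    $result = array();
    for ($i = 0; $i <= 255; $i++) {
        // chr($i) yields a one-byte string, so each byte is tested on its own.
        if (preg_match($whitespace_regexp, chr($i))) {
            $hex = sprintf("0x%02X", $i);
            array_push($result, $hex);
        }
    }
    return $result;
}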
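/*
A short usage sketch of the explicit character class: splitting on ASCII-only
whitespace leaves a multibyte character intact. The sample string is made up
for illustration; "\xE5\xBF\xA0" is the byte sequence E5 BF A0 discussed above.
*/
$words = preg_split('/[\t\n\f\r ]+/', "\xE5\xBF\xA0 \t foo\nbar");
echo count($words) . "\n";      // => 3
echo bin2hex($words[0]) . "\n"; // => e5bfa0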