Last active
December 14, 2015 02:07
php whitespace regexp warning
<?php
/*
http://php.net/manual/en/regexp.reference.escape.php

The "whitespace" characters are HT (9), LF (10), FF (12), CR (13), and space (32).
However, if locale-specific matching is happening, characters with code points in
the range 128-255 may also be considered as whitespace characters,
for instance, NBSP (A0).

For how to specify the locale, see a bug I previously submitted to Phabricator:
https://secure.phabricator.com/D14441
*/
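/*
A minimal sketch, not part of the original gist: one way to keep "\s" independent
of the environment is to switch LC_CTYPE to the "C" locale around the match.
The helper name preg_match_in_c_locale is hypothetical, and the sketch assumes
that PHP derives the PCRE character tables from the current LC_CTYPE locale.
*/
function preg_match_in_c_locale($pattern, $subject) {
    $previous = setlocale(LC_CTYPE, '0');   // "0" only queries the current locale
    setlocale(LC_CTYPE, 'C');               // plain C locale: ASCII-only character classes
    $matched = preg_match($pattern, $subject);
    if ($previous !== false) {
        setlocale(LC_CTYPE, $previous);     // restore the caller's locale
    }
    return $matched;
}
// Usage: preg_match_in_c_locale('/\s/', $text) instead of preg_match('/\s/', $text).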
/*
Get all whitespace characters that "\s" will match.

Result:
  0x09
  0x0A
  0x0C
  0x0D
  0x20
  0x85 => 1000 0101
  0xA0 => 1010 0000
*/
print_r(whitespace_characters_in_ascii('/\s/'));
/*
http://php.net/manual/en/function.utf8-encode.php

bytes  bits  representation
1      7     0bbbbbbb
2      11    110bbbbb 10bbbbbb
3      16    1110bbbb 10bbbbbb 10bbbbbb
4      21    11110bbb 10bbbbbb 10bbbbbb 10bbbbbb

Both 0x85 (10000101) and 0xA0 (10100000) have the form 10bbbbbb, so they can
appear as continuation bytes of a multi-byte character.
So if you use the regexp "/\s/", any character whose UTF-8 encoding
contains a 0x85 or 0xA0 byte will be matched.
For example:
*/
// 忠 => E5 BF A0 => 11100101 10111111 10100000
echo preg_match('/\s/', '忠') . "\n"; // => 1
echo preg_match('/\s/', '濠') . "\n"; // => 1
echo preg_match('/\s/', '翠') . "\n"; // => 1
echo preg_match('/\s/', '迠') . "\n"; // => 1
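/*
A small sketch for checking the byte sequences quoted above. The helper name
utf8_bytes_hex is hypothetical (not from the original gist), and it assumes
this file is saved as UTF-8.
*/
function utf8_bytes_hex($string) {
    // bin2hex() dumps the raw bytes; split them into two-digit groups.
    return implode(' ', str_split(strtoupper(bin2hex($string)), 2));
}
echo utf8_bytes_hex('忠') . "\n"; // => E5 BF A0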
/*
Whitespace should only be matched among single-byte (0bbbbbbb) characters,
so it is better to use "[\t\n\f\r ]" instead of "\s".
*/
print_r(whitespace_characters_in_ascii('/[\t\n\f\r ]/'));
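/*
To illustrate the recommendation above, a minimal sketch of a helper built on
the explicit character class. The name contains_ascii_whitespace is hypothetical.
None of the five listed bytes can occur inside a multi-byte UTF-8 sequence,
so the characters from the earlier example no longer match.
*/
function contains_ascii_whitespace($string) {
    return preg_match('/[\t\n\f\r ]/', $string) === 1;
}
var_dump(contains_ascii_whitespace('忠'));  // => bool(false)
var_dump(contains_ascii_whitespace("a\tb")); // => bool(true)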
/**
 * Return every single-byte value (0x00-0xFF) that the given regexp matches,
 * formatted as hex strings such as "0x0A".
 */
function whitespace_characters_in_ascii($whitespace_regexp) {
    $result = array();
    for ($i = 0; $i <= 255; $i++) {
        if (preg_match($whitespace_regexp, chr($i))) {
            $result[] = sprintf("0x%02X", $i);
        }
    }
    return $result;
}
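
/*
An alternative sketch, not from the original gist: with the "u" modifier PCRE
treats pattern and subject as UTF-8 and matches whole code points rather than
single bytes. The helper name contains_whitespace_utf8 is hypothetical.
Caveats: under "u", "\s" may also match non-ASCII whitespace such as NBSP, so
this is not equivalent to "[\t\n\f\r ]", and preg_match() returns false when
the subject is not valid UTF-8.
*/
function contains_whitespace_utf8($subject) {
    $matched = preg_match('/\s/u', $subject);
    if ($matched === false) {
        return null; // invalid UTF-8 (or another PCRE error)
    }
    return $matched === 1;
}
// Usage: contains_whitespace_utf8('忠') returns false, contains_whitespace_utf8('a b') returns true.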