Unicode and Unicode encodings

Let's use the character ụ (code point U+1EE5) as an example.

Unicode defines a mapping from numbers (code points) to characters. In this sense, Unicode is a coded character set, which doesn't define how the characters or code points should be stored.

UTF-8, UTF-16 and UTF-32 are character encodings, which define how to encode code points into a sequence of 8-bit, 16-bit and 32-bit code units respectively. In this sense, each of them is a character encoding form.

Using the example, the character is encoded as 0xE1 0xBB 0xA5 in UTF-8, 0x1EE5 in UTF-16 and 0x00001EE5 in UTF-32.
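The values above can be checked with a few lines of code. This is a minimal sketch in Python 3; the helper name `code_units` is made up for illustration, and the big-endian codec variants are used only so that no byte order mark appears in the output.

```python
def code_units(text, codec, unit_size):
    """Encode text and split the bytes into hex code units of unit_size bytes."""
    data = text.encode(codec)
    return [data[i:i + unit_size].hex().upper()
            for i in range(0, len(data), unit_size)]

ch = "\u1EE5"  # the example character, written as an escape

utf8 = code_units(ch, "utf-8", 1)       # 8-bit code units
utf16 = code_units(ch, "utf-16-be", 2)  # 16-bit code units
utf32 = code_units(ch, "utf-32-be", 4)  # 32-bit code units

print("U+%04X" % ord(ch))
print(utf8, utf16, utf32)
```

The same code point comes out as three, one and one code units respectively, which is the difference between the coded character set and its encoding forms.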

Python bug and workaround

As mentioned by Amadan in the comment, the code below works in Python 2.7.6.


I also independently confirmed that the code above works in versions 2.6.6, 2.7.8, 3.2.5 and 3.4.3.

However, as Amadan also mentioned, (?<!\1.) should have been used instead of (?<!\1..), since we only want to back off a single character and look at the character before it. Since the working pattern is clearly a wrong pattern, I strongly recommend against using the code above in production.
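To illustrate why the lookbehind only needs to step back over one character plus the captured one, here is a minimal sketch. The pattern `(.)(?<!\1.)`, the helper `collapse_runs` and the sample string are my own illustration, not the code referred to above. It assumes Python 3.5 or later, where the `re` documentation lists fixed-length group references inside lookbehind as supported; older versions may behave differently, as described above.

```python
import re

# Hypothetical pattern: capture one character, then assert that the two
# characters ending at this position are NOT "same character twice".
# Inside the lookbehind, \1 is compared against the character before the
# captured one, and . consumes the captured character itself.
first_of_run = re.compile(r"(.)(?<!\1.)")

def collapse_runs(text):
    """Keep only the first character of every run of identical characters."""
    return "".join(first_of_run.findall(text))

print(collapse_runs("aabbbcd"))
```

At the very start of the string there are not two characters to step back over, so the negative lookbehind succeeds and the first character is always kept.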

Perl bug backreference and optional capturing group
print "$^V\n";
while ("aa" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
    print "[$&] [$1] [$2] [$3] [$4]\n";
}
while ("aba" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
    print "[$&] [$1] [$2] [$3] [$4]\n";
}

Octal escape sequence in RegExp and String literal
<div id="output"></div>
<script type="text/javascript">
document.getElementById("output").innerHTML =
"<p>" + /a\1b/.test("a\u0001b") + "</p>" +
"<p>" + /a\11b/.test("a\tb") + "</p>" +
"<p>" + /a\1b()/.test("a\u0001b") + "</p>" +
"<p>" + /a\1b()/.test("ab") + "</p>";
</script>

Unicode and UTF-8, UTF-16, UTF-32 encoding

Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.

In Unicode, a character maps to a single code point, but it can have different representations depending on how it is encoded.

I don't want to rehash this discussion all over again, so if you are still not clear about this, please read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!).

Using the example in the question, 𡃁 maps to the code point U+210C1, but it can be encoded as F0 A1 83 81 in UTF-8, D844 DCC1 in UTF-16 and 000210C1 in UTF-32.
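The UTF-16 form is the least obvious of the three, because a code point above U+FFFF has to be split into a surrogate pair. The following is a minimal sketch of that arithmetic in Python 3; the function name `utf16_surrogates` is made up for illustration, and the result is cross-checked against Python's built-in codecs.

```python
def utf16_surrogates(code_point):
    """Split a supplementary code point (above U+FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000    # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)   # top 10 bits go into the high surrogate
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits go into the low surrogate
    return high, low

cp = 0x210C1
high, low = utf16_surrogates(cp)
print("%04X %04X" % (high, low))

# Cross-check against the built-in codecs (big-endian, so no byte order mark).
ch = chr(cp)
assert ch.encode("utf-16-be").hex().upper() == "%04X%04X" % (high, low)
print(ch.encode("utf-8").hex().upper())
print(ch.encode("utf-32-be").hex().upper())
```

In all three encodings the string still contains exactly one code point; only the number and size of the code units change.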

(*UCP) PCRE_UCP and (*UTF) PCRE_UTF mode in PHP
// `PCRE_UCP` is set along with `PCRE_UTF` in the PHP wrapper around the PCRE library
// Test against U+00A0 and U+2028
var_dump(preg_match( "/(*UCP)^[A-Za-z0-9?.,-=$@!&%';:)(_\s]+$/", "ASDjf\t sdf"));
var_dump(preg_match( "/(*UCP)^[A-Za-z0-9?.,-=$@!&%';:)(_\s]+$/", "ASDjf\xC2\xA0sdf"));
// From
$re = <<<'EOF'

POSIX Extended Regular Expression checks number divisible by 9
import java.util.regex.*;
import java.util.*;
class DivisibleBy9 {
// Need to set stack size 20MB
// java -Xss20M DivisibleBy9
public static void main(String args[]) {
Scanner sc;
try {
// Original code by SergeyS
class Test16377079 {
static int[] GetAssignments(int[] studentsPerLetter, int[] rooms)
int numberOfRooms = rooms.length;
int numberOfLetters = studentsPerLetter.length;
int roomSets = 1 << numberOfRooms; // 2 ^ (number of rooms)
