Unicode and Unicode encodings

Let's use the character ụ (code point U+1EE5) as an example.

Unicode defines a mapping from numbers (code points) to characters. In this sense, Unicode is a coded character set, which doesn't define how the characters or code points should be stored.

UTF-8, UTF-16 and UTF-32 are character encodings, which define how to encode code points into a sequence of 8-bit, 16-bit and 32-bit code values respectively. In this sense, each of them is a character encoding form.

Using the example, the character is encoded as 0xE1 0xBB 0xA5 in UTF-8, 0x1EE5 in UTF-16 and 0x00001EE5 in UTF-32.
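The byte values above can be reproduced with a short Python sketch. It assumes Python 3.5 or later (for `bytes.hex()`) and uses the big-endian codec variants so that no byte order mark is prepended; the variable names are only illustrative.

```python
# Encode the example character U+1EE5 with each encoding form.
ch = "\u1ee5"

utf8 = ch.encode("utf-8")       # sequence of 8-bit code values
utf16 = ch.encode("utf-16-be")  # one 16-bit code value, big-endian, no BOM
utf32 = ch.encode("utf-32-be")  # one 32-bit code value, big-endian, no BOM

print(utf8.hex())   # e1bba5
print(utf16.hex())  # 1ee5
print(utf32.hex())  # 00001ee5
```

The same code point therefore occupies three, two and four bytes depending on the encoding form chosen.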

Python bug and workaround

As mentioned by Amadan in the comment, the code below works in Python 2.7.6.

re.findall(r'(\d)(?<!\1..)\1{2}(?!\1)','122333444455555666666')

I have also independently confirmed that the code above works in versions 2.6.6, 2.7.8, 3.2.5 and 3.4.3.

However, as Amadan also mentioned, (?<!\1.) should have been used instead of (?<!\1..), since we only want to step back over a single character (the digit just captured) and look at the character before it. Since the pattern that happens to work is clearly the wrong one, I strongly recommend against using the code above in production.
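As a minimal sketch of the corrected pattern, the snippet below assumes Python 3.5 or later, where the `re` module supports fixed-length group references inside a lookbehind. It also assumes the intent is to find digits that occur in a run of exactly three, and cross-checks the regex against a plain `itertools.groupby` scan; the variable names are illustrative only.

```python
import re
from itertools import groupby

text = "122333444455555666666"

# Corrected pattern: the lookbehind steps back over the captured digit (.)
# and rejects the match if the character before it is the same digit (\1).
pattern = r"(\d)(?<!\1.)\1{2}(?!\1)"
by_regex = re.findall(pattern, text)

# Independent cross-check without regex: digits whose run length is exactly 3.
by_groupby = [digit for digit, run in groupby(text) if len(list(run)) == 3]

print(by_regex, by_groupby)
```

Both approaches should agree on this input; if they differ on a given interpreter, that points to the lookbehind behaviour discussed above rather than to the input.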

@nhahtdh
nhahtdh / YK9rUv.pl
Last active August 29, 2015 14:23
Perl bug backreference and optional capturing group
#!/usr/bin/perl
# Print the Perl version, then every match with its capture groups.
print "$^V\n";
while ("aa" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
    print "[$&] [$1] [$2] [$3] [$4]\n";
}
while ("aba" =~ /\b(\w)?(\w)(\w?)(\2\1)/g) {
    print "[$&] [$1] [$2] [$3] [$4]\n";
}
nhahtdh / octal.html
Last active August 29, 2015 14:23
Octal escape sequence in RegExp and String literal
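<html>
<body>
<div id="output"></div>
<script type="text/javascript">
// Without a capturing group in the pattern, \1 and \11 are read as octal escapes
// (non-Unicode mode). Once a group exists, even later in the pattern, \1 is a
// backreference, and a backreference to a group that has not matched yet matches
// the empty string.
document.getElementById("output").innerHTML =
    "<p>" + /a\1b/.test("a\u0001b") + "</p>" +    // true: \1 is U+0001
    "<p>" + /a\11b/.test("a\tb") + "</p>" +       // true: \11 is U+0009 (tab)
    "<p>" + /a\1b()/.test("a\u0001b") + "</p>" +  // false: \1 is a backreference
    "<p>" + /a\1b()/.test("ab") + "</p>";         // true: \1 matches empty
</script>
</body>
</html>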

Unicode and UTF-8, UTF-16, UTF-32 encoding

Unicode is a character set, which specifies a mapping from characters to code points, and the character encodings (UTF-8, UTF-16, UTF-32) specify how to store the Unicode code points.

In Unicode, a character maps to a single code point, but it can have different representations depending on how it is encoded.

I don't want to rehash this discussion all over again, so if you are still not clear about this, please read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

Using the example in the question, 𡃁 maps to the code point U+210C1, but it can be encoded as F0 A1 83 81 in UTF-8, D844 DCC1 in UTF-16 and 000210C1 in UTF-32.
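The UTF-16 value is the least obvious of the three, because U+210C1 lies above U+FFFF and has to be split into a surrogate pair. The arithmetic can be sketched as follows; `utf16_code_units` is a hypothetical helper written for this illustration, and it assumes its input is a valid code point outside the surrogate range.

```python
def utf16_code_units(code_point):
    """Return the UTF-16 code units for one code point (illustrative helper)."""
    if code_point < 0x10000:
        # Code points in the Basic Multilingual Plane are stored as-is.
        return [code_point]
    offset = code_point - 0x10000       # 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits -> low surrogate
    return [high, low]

units = utf16_code_units(0x210C1)
print([format(u, "04X") for u in units])  # ['D844', 'DCC1']

# Cross-check against Python's built-in codecs.
ch = chr(0x210C1)
print(ch.encode("utf-8").hex().upper())      # F0A18381
print(ch.encode("utf-16-be").hex().upper())  # D844DCC1
print(ch.encode("utf-32-be").hex().upper())  # 000210C1
```

For U+210C1 the offset is 0x110C1, whose top ten bits are 0x44 and bottom ten bits are 0xC1, which gives the pair D844 DCC1 quoted above.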

nhahtdh / SO30087050.php
Created May 7, 2015 03:07
(*UCP) PCRE_UCP and (*UTF) PCRE_UTF mode in PHP
<?php
/* http://ideone.com/HRiFW3 */
// `PCRE_UCP` is set along with `PCRE_UTF` in PHP wrapper around PCRE library
// https://github.com/php/php-src/blob/bf59acdea75cf13d179f10ce89d296a30f38676d/ext/pcre/php_pcre.c#L364
// Test against U+00A0 and U+2028
var_dump(preg_match( "/(*UCP)^[A-Za-z0-9?.,-=$@!&%';:)(_\s]+$/", "ASDjf\t sdf"));
var_dump(preg_match( "/(*UCP)^[A-Za-z0-9?.,-=$@!&%';:)(_\s]+$/", "ASDjf\xC2\xA0sdf"));
// From https://github.com/firasdib/Regex101/issues/216
<?php
$re = <<<'EOF'
/(?J)
(?&R)\K\s*(?<sign>[=+*-])\s*(?=(?&R))
|
(?&R)\s*\K\s*(?<sign>=)\s*(?=-(?&R))
nhahtdh / DivisibleBy9.java
Last active August 29, 2015 13:57
POSIX Extended Regular Expression checks number divisible by 9
import java.util.regex.*;
import java.util.*;
import java.io.*;
class DivisibleBy9 {
// Need to set stack size 20MB
// java -Xss20M DivisibleBy9
public static void main(String args[]) {
Scanner sc;
try {
// Original code by SergeyS
// http://stackoverflow.com/a/16377911/1400768
class Test16377079 {
static int[] GetAssignments(int[] studentsPerLetter, int[] rooms)
{
int numberOfRooms = rooms.length;
int numberOfLetters = studentsPerLetter.length;
int roomSets = 1 << numberOfRooms; // 2 ^ (number of rooms)
Length per string: 50
replaceFirstApproach: 23088 17328 18494 22733 15356 31326 17638 15537 17300 13799 16854 19929 14445 14930 14586 13627 14467 14766 13694 15263 14240 15394 14684 13504 14807 14658 16370 14601 14101 44152 | 17389
substringApproach: 4393 4311 4286 4288 4305 4306 4359 4516 4287 4356 4285 4333 4269 4413 4500 4533 4441 4391 4483 4378 4432 4528 4437 4370 4394 4296 5212 4483 4744 4560 | 4429
appendStringBuilder: 4481 4474 4452 4452 4526 4424 4482 4511 4435 4494 6339 5377 4569 4569 4492 4534 4927 4607 4663 4475 4464 4461 4482 4408 4350 4457 4883 4557 4429 4375 | 4604
Length per string: 100
replaceFirstApproach: 21407 23916 29624 33423 20241 19464 20427 20577 20285 22107 20064 20543 20664 19893 21505 21806 20637 20247 19763 20263 20183 20953 22448 20597 19741 20398 19607 22270 55240 23265 | 22718
substringApproach: 9336 5595 5543 5574 5477 6355 5499 5589 16557 5430 5407 5351 6126 34439 6184 6470 5457 7012 5897 5444 5515 5623 5521 5918 6332 5790 5830 6281 5914 5929 | 7246
appendStringBuilder: 5715