@cygx

cygx/day7.html Secret

Last active December 7, 2015 01:22
Quick (rhetorical) question: how many of you either try your best to ignore Unicode, or groan at the thought of having to deal with it again?
It's fair, after all, considering Unicode is big. Really big. (You may think it's a long walk down the ASCII table, but that's peanuts compared to <del>space</del> Unicode.) It certainly doesn't help that many languages, particularly older ones, give you, the average programmer, little support for working with it. Some don't deal with encoding standards at all, meaning some familiarity is mandatory, while others claim to support Unicode but really just balk once you get past the BMP (the Basic Multilingual Plane, i.e. the codepoints that fit in a 16-bit number).
Perl 6, as you might guess, does handle Unicode well. It's actually necessary to go about this day in a twofold manner: half of the story is how to process Unicode text, and half is how to use Unicode syntax. Let's start with the one more likely to be of concern when actually programming, that of...
<h1>How do I Handle Unicode Text?</h1>
No matter your level of experience in handling Unicode (or anything involving different encodings), you'll be pleased to learn that in Perl 6, it goes just about the way you'd expect.
Perl 6's strings are interesting in that they by default work on the notion of <em>graphemes</em> — a collection of codepoints that look like a distinct thing; what you'd call a "character" if you didn't know better. Not every distinct "character" you could come up with has its own codepoint in the standard, so usually handling visual elements naturally can be quite painful.
However, Perl 6 does this work for you, keeping track of these collections of codepoints internally, so that you just have to think in terms of what you would see the characters as. If you've ever had to dance around with substring operations to make sure you didn't split between a letter and a diacritic, this will be your happiest day in programming.
As an example, here's a Devanagari syllable in a string. The <code>.codes</code> method returns the number of codepoints in the string, while <code>.chars</code> returns the number of characters (aka graphemes):
[code]
say "नि".codes; # returns 2
say "नि".chars; # returns 1
[/code]
Even though there isn't a singular assigned codepoint for this syllable, Perl 6 still treats it as one character, suiting any purpose that doesn't involve messing with the text at a lower level.
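The same distinction shows up with Latin letters. Here's a minimal sketch (the strings are my own examples, not anything special to Perl 6): a combining acute accent attaches to the letter before it, whether or not Unicode happens to have a precomposed codepoint for the result.
[code]
# "x" followed by U+0301 COMBINING ACUTE ACCENT; no precomposed codepoint exists
my $x-acute = "x\x[301]";
say $x-acute.codes; # 2
say $x-acute.chars; # 1

# "e" plus the same accent does have a precomposed form (U+00E9),
# and Str normalizes to NFC, so only one codepoint remains
my $e-acute = "e\x[301]";
say $e-acute.codes;       # 1
say $e-acute.chars;       # 1
say $e-acute eq "\x[E9]"; # True
[/code]
Either way <code>.chars</code> gives the answer a reader would expect, which is the point of working in graphemes.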
<em>That's cool, but does it matter much to me, a simple English-speaking programmer who's never had to deal with other languages or scripts?</em>, I can imagine some of you thinking. And the answer is yes, because regardless of your background, there is most definitely one grapheme you've encountered before:
[code]
say "\r\n".chars; # returns 1
[/code]
Yep, the Windows end-of-line sequence is explicitly counted by Unicode's "extended grapheme cluster" definition as one grapheme.
And of course it's not just for looks; that's how operations on strings work:
[code]
say "नि\r\n".substr(1,1).perl # returns "\r\n"
[/code]
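For the "don't split a letter from its diacritic" case mentioned earlier, here's a short sketch using a string I made up for the purpose. Methods such as <code>.comb</code>, <code>.flip</code> and <code>.substr</code> all step through the string one grapheme at a time:
[code]
my $s = "x\x[301]y";       # x + combining acute accent, then y
say $s.chars;              # 2
say $s.comb.elems;         # 2, .comb splits into graphemes
say $s.flip.ords;          # (121 120 769), the accent stays on the x
say $s.substr(0, 1).codes; # 2, one grapheme made of two codepoints
[/code]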
Of course, that's all just for the default <code>Str</code> type. If you don't want to work at the grapheme level, you have several other string types to choose from. If you're interested in working within a particular normalization form, there are the self-explanatory types <code>NFC</code>, <code>NFD</code>, <code>NFKC</code>, and <code>NFKD</code>. If you just want to work with codepoints and not bother with normalization, there's the <code>Uni</code> string type (which may be most appropriate when you don't want the NFC normalization that comes with a normal <code>Str</code> and would rather keep the text as-is). And if you want to work at the binary level, well, there's always the <code>Blob</code> family of types :) .
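To make those levels a bit more concrete, here's a rough sketch that looks at one Hangul syllable (my own choice of example) as a <code>Str</code>, in two normalization forms, as a raw <code>Uni</code>, and as encoded bytes:
[code]
my $syllable = "\x[D55C]";   # 한, a precomposed Hangul syllable
say $syllable.chars;         # 1
say $syllable.NFC.elems;     # 1 codepoint when composed
say $syllable.NFD.elems;     # 3 codepoints when decomposed into jamo
say $syllable.NFD.list;      # (4370 4449 4523), i.e. U+1112 U+1161 U+11AB

# Uni keeps codepoints exactly as given; converting to Str normalizes them
my $raw = Uni.new(0x1112, 0x1161, 0x11AB);
say $raw.elems;              # 3
say $raw.Str.codes;          # 1
say $raw.Str eq $syllable;   # True

# The binary level: the UTF-8 encoding of that single grapheme
say $syllable.encode('utf8').bytes; # 3
[/code]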
We also have several methods that let you examine the various bits of Unicode info associated with characters:
[code]
say "a".uniname; # get name of first Unicode character in string.
say "\r\nhello!".ord # get number of first codepoint
# (*not* grapheme) in string
say "\r\nhello!".ords # get numbers of all codepoints
say "0".uniprop("Numeric_Type") # get associated property
[/code]
And so on :) . Note that the <code>ord</code>/<code>ords</code> part shows you that you'll really <em>never</em> get the internal numbers used to keep track of graphemes. When <code>ord</code> sees a grapheme cluster, it just returns the codepoint number for the first codepoint of that cluster.
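If you do want to look inside a grapheme, one way (a sketch, not the only approach) is to walk over <code>.ords</code> and turn each codepoint back into a one-codepoint string with <code>.chr</code>. The string below is the Devanagari syllable from earlier, written with escapes so the codepoints are unambiguous:
[code]
my $text = "\x[928]\x[93F]"; # नि
say $text.chars; # 1
say $text.ord;   # 2344, the first codepoint only
say $text.ords;  # (2344 2367), every codepoint of the single grapheme

# name each codepoint separately
for $text.ords -> $cp {
    say $cp.fmt("U+%04X"), " ", $cp.chr.uniname;
}
# U+0928 DEVANAGARI LETTER NA
# U+093F DEVANAGARI VOWEL SIGN I
[/code]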
<h2>Not Just Strings</h2>
Of course, our Unicode support wouldn't be complete without regex support! Of particular note is the ability to match based on properties, so for example
[code]
/ <:Alpha>+ /
[/code]
will match multiple alphabetic characters (<code>&lt;alpha&gt;</code> will do almost the same thing, just with the addition of matching underscore), and
[code]
/ '0x' <:Nv(0..9) + :Hex_Digit>+ | '0b' <:Nv(0..1)>+ /
[/code]
is a regex that lets you match against either hexadecimal numbers or binary ones, in a Unicode-friendly way. And if you wanted to write the Unicode standard's "extended grapheme cluster" pattern in regexes (the same pattern we use to determine the grapheme handling mentioned earlier), it would look like this:
[code]
grammar EGC {
    token Hangul-Syllable {
        || <:GCB<L>>* <:GCB<V>>+ <:GCB<T>>*
        || <:GCB<L>>* <:GCB<LV>> <:GCB<V>>* <:GCB<T>>*
        || <:GCB<L>>* <:GCB<LVT>> <:GCB<T>>*
        || <:GCB<L>>+
        || <:GCB<T>>+
    }
    token TOP {
        || <:GCB<CR>> <:GCB<LF>>
        || <:GCB<PP>>*
           [
               || <:GCB<RI>>
               || <.Hangul-Syllable>
               || <!:GCB<Control>>
           ]
           [
               || <:Grapheme_Extend>
               || <:GCB<Spacing_Mark>>
           ]*
        || .
    }
}
[/code]
A bit wordy, but just imagine how much more painful that would be without built-in Unicode support in your regexes!
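On a much smaller scale than that grammar, here's a quick sketch of property matching on an ordinary string (the sample text is made up), including the difference between <code>&lt;:Alpha&gt;</code> and <code>&lt;alpha&gt;</code> mentioned above:
[code]
my $text = "naïve café_42";
say $text.comb(/ <:Alpha>+ /); # (naïve café)
say $text.comb(/ <alpha>+ /);  # (naïve café_), <alpha> also accepts the underscore
say $text.comb(/ <:Nd>+ /);    # (42), Nd is the "decimal number" general category
[/code]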
And aside from all the programming-related stuff, there's also...
<h1>Using Unicode to Write Perl 6</h1>
As part of our tireless support of Unicode, we also parse your source code with the same regex engine you just saw demonstrated above (though the Perl 6 parser doesn't need to bother with Unicode properties nearly that often). This means we're able to support syntax using Unicode in Perl 6, and have been taking advantage of it for a long time now. Observe:
[code]
say 0 ∈ «42 -5 1».map(&amp;log ∘ &amp;abs);
say 0.1e0 + 0.2e0 ≅ 0.3e0;
say 「There is no \escape in here!」
[/code]
That's just a small sampling of the Unicode built into Perl 6 by default, featuring setops, interpolating quote-words lists, function composition, and approximate equality. Oh, and the delimiters for the most basic level of string quoting.
Don't worry though, standard Perl 6 does not demand that you be able to type Unicode. If you can't, there are so-called "Texas" variants:
[code]
say 0 (elem) <<42 -5 1>>.map(&amp;log o &amp;abs);
say 0.1e0 + 0.2e0 =~= 0.3e0;
say Q[[[There is no \escape in here!]]]
[/code]
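The Unicode and Texas spellings name the same operators, so they can be mixed freely. A small sketch with examples of my own to show that the results agree:
[code]
say (3 ∈ (1, 2, 3)) == (3 (elem) (1, 2, 3)); # True
say 0.1e0 + 0.2e0 == 0.3e0;  # False, floating-point rounding
say 0.1e0 + 0.2e0 ≅ 0.3e0;   # True
say 0.1e0 + 0.2e0 =~= 0.3e0; # True
say π == pi;                 # True
[/code]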
This is fine of course, but if it's feasible for you to set up Unicode input, I heartily recommend it. Here's a short list of various ways to do it:
<ul>
<li><strong>Get an awesome text editor</strong> — The more featureful text editors (such as emacs or vim, to name a couple) will have functionality in place to insert arbitrary characters. Go look it up in your editor's documentation, and consider petitioning if it doesn't support Unicode entry :) .</li>
<li><strong>Use your OS's hex input</strong> — Some systems, such as Windows or applications using GTK, support key shortcuts to let you type the hexadecimal codepoint numbers for characters. You'll have to memorize codepoints, but chances are you'd get used to it eventually.</li>
<li><strong>Set up your keyboard's third/fourth/etc. levels</strong> — If your system supports it, you can enable third/fourth level modifiers and so on for your keyboard to access those levels (if you don't know what those are, your 'Shift' key counts as a second-level modifier, and the characters it lets you type are considered on the second level, as an example). Depending on the amount of time and/or patience you have you could even customize those extra levels.</li>
<li><strong>(X11) Set up your Compose key</strong> — This is the method I myself use, and it involves setting up a key to use as the "Compose key" or "Multi key", and use of a file in <code>~/.XCompose</code> (or some other place, as long as you configure it) to set up key combos. The Compose key works by letting you type any configured sequence of keys after pressing the Compose key, which will insert the character(s) of your choice.
<ul>
<li>Which key you sacrifice of course depends on which keys you don't make use of; it could be the caps lock, or one of those extra Shift/Alt/Ctrl keys. It can even be that useless Menu key, which you probably just remembered was on your keyboard :P .</li>
<li>An absolutely wonderful starting <code>.XCompose</code> can be found <a href="https://github.com/kragen/xcompose">in this GitHub repository</a>. You'll still want to add some combinations to it for Perl 6, and perhaps do other tinkering with it¹, but it's still quite a lot better than having to start from scratch :) .</li>
</ul>
</li>
</ul>
<h1>In Conclusion</h1>
This of course isn't an exhaustive coverage of all that Perl 6 has to offer Unicode, but the underlying takeaway is that Perl 6 makes handling Unicode much nicer than other languages do (at least out of the box).
<strong>Bonus!</strong> Partly in the spirit of Christmastime, and partly in the spirit of "I love this, and what better time to share it?", allow me to present for your historical interest <a href="https://rt.perl.org/Public/Bug/Display.html?id=66498">Perl 6's legendary "snowman comet" bug</a>:
[code]
say "abc" ~~ m☃.(.).☄ # this used to work. Really.
[/code]
Basically this old old old old bug that (sadly) doesn't exist anymore was about the regex part of the parser messing up a bit and interpreting <code>☃☄</code> as just as valid a pair of brackets as <code>()</code> or <code>⦃⦄</code>.
Is there a relevant lesson in this bug? Nope. Is it only vaguely connected to a winter blog post on Unicode? You bet. It's just that it's thanks to Unicode support that we were able to get that kind of bug way back in 2009, and it's Unicode support (among other things) that would let someone re-implement this as a slang or something ☺ .
So go forth confident in your newfound ability to handle international text with much greater ease than you're perhaps used to, and spend more time building ☃☃☃☃ the rest of this month.
Have the appropriate amount of fun! ❄
¹<span style="font-size:smaller;">Psst! Use the Texas variants for your compose combos if you're stuck on coming up with them, e.g. <code>&lt;Multi_key&gt; &lt;equal&gt; &lt;asciitilde&gt; &lt;equal&gt;</code></span>