Some thoughts on Unicode in Perl 6.
Unicode Notes
=-===========
(or, the quest to create S15)
A quick example:
================
For the syllable
नि (U+0928 U+093F)
The encodings are
UTF-8 : E0 A4 A8 E0 A4 BF
UTF-16BE : 0928 093F
UTF-32BE : 00000928 0000093F
The counts are:
|------------+-------+--------+--------|
| (count by) | UTF-8 | UTF-16 | UTF-32 |
|------------+-------+--------+--------|
| bytes | 6 | 4 | 8 |
| code units | 6 | 2 | 2 |
| codepoints | 2 | 2 | 2 |
| graphemes | 1 | 1 | 1 |
|------------+-------+--------+--------|
Notice how the codepoint and grapheme counts are independent of encoding.
Counting by characters should be equivalent to counting by graphemes, unless
certain scripts/languages are shown not to consider all of their graphemes
complete characters.
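As an illustration only, the table above can be reproduced with a short Python
sketch (Python rather than Perl 6, since the Perl 6 spellings in these notes are
still proposals). The helper name counts and its unit_size parameter are
assumptions made for this example. The grapheme count is a crude estimate that
treats every combining mark as part of the preceding character; real
segmentation follows the rules in UAX #29.

```python
import unicodedata

def counts(text, encoding, unit_size):
    """Count one string four ways for a given encoding.

    unit_size is the width of one code unit in bytes (1, 2, or 4).
    """
    data = text.encode(encoding)
    # Crude grapheme estimate: every codepoint that is not a combining mark
    # (general category M*) starts a new grapheme. Not full UAX #29.
    graphemes = sum(
        1 for ch in text if not unicodedata.category(ch).startswith("M")
    )
    return {
        "bytes": len(data),
        "code units": len(data) // unit_size,
        "codepoints": len(text),
        "graphemes": graphemes,
    }

syllable = "\u0928\u093F"  # the syllable from the example, U+0928 U+093F
for name, size in (("utf-8", 1), ("utf-16-be", 2), ("utf-32-be", 4)):
    print(name, counts(syllable, name, size))
```

Only the first two rows change with the encoding; the codepoint and grapheme
counts come from the text itself.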
PRAGMAS
==-====
(The specific spellings of these are of course subject to change.)
use utf8;
use utf16 :be/:le;
use utf32 :be/:le;
Other encoding pragmas may be created by module authors, but they would
undermine the carefully crafted support for Unicode. "utf8" is the default.
:be is the default for the utf16 and utf32 pragmas, because Unicode itself
assumes big endian when it has no other information about the data
(http://www.unicode.org/faq/utf_bom.html#gen7).
[ Even though setting the endianness of UTF-16 or UTF-32 may not be immediately
useful, there will almost certainly be someone who needs to do it. ]
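To make the big-endian default concrete, here is a minimal Python sketch of a
UTF-16 decoder that honors a byte order mark when one is present and otherwise
falls back to big endian. The function name decode_utf16 is hypothetical and is
not part of any proposed Perl 6 API.

```python
import codecs

def decode_utf16(data):
    """Decode UTF-16 bytes, assuming big endian when there is no BOM."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    # No BOM: fall back to big endian, mirroring the :be default.
    return data.decode("utf-16-be")

text = "\u0928\u093F"
big = text.encode("utf-16-be")     # bytes 09 28 09 3F
little = text.encode("utf-16-le")  # bytes 28 09 3F 09

print(big.hex(), little.hex())
print(decode_utf16(big) == text)
print(decode_utf16(codecs.BOM_UTF16_LE + little) == text)
```

Little-endian data without a BOM would be misread under this rule, which is why
an explicit :le setting is still worth having.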
I currently don't expect to have ucs1, ucs2, or ucs4 as synonyms, as they don't
necessarily mean quite the same thing, and most people will mean the more
featureful utf anyway (at least in the case of UCS-2 and UTF-16).
use graphemes;
use codepoints;
use codeunits;
use bytes;
These set the perspective Perl 6 takes when counting "characters". "graphemes"
is the default.
use NFC;
use NFD;
Whether Perl 6 composes or decomposes the characters in strings. Neither of
these is the default; either one forces all Unicode data to undergo
de/composition when handed to Perl 6.
If you want to make sure no forceful de/composition occurs, you should be able
to just do this:
no NFC;
no NFD;
[ An alternate form could be C<use normalization :NFC/:NFD/:any>, with :any
being the "don't do any auto-normalization" default. ]
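For readers unfamiliar with the two forms, the following Python sketch shows
what composition and decomposition do to a single accented letter, using the
standard unicodedata module. It illustrates the normalization forms themselves
and not the proposed pragmas.

```python
import unicodedata

composed = "\u00E9"     # e with acute as one codepoint
decomposed = "e\u0301"  # e followed by COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", decomposed)  # what a forced NFC would give
nfd = unicodedata.normalize("NFD", composed)    # what a forced NFD would give

print(len(nfc), len(nfd))      # codepoint counts differ between the forms
print(composed == decomposed)  # without normalization the two spellings differ
print(nfc == composed, nfd == decomposed)
```

With no automatic normalization, the two spellings stay distinct strings, which
is the behavior the "don't do any auto-normalization" default would preserve.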
STR METHODS
===-=======
Str.chars;
Whatever the current perspective on characters is.
Str.graphs;
Number of graphemes.
Str.codes;
Number of codepoints.
Str.units;
Number of code units (encoding-dependent).
Str.bytes;
Number of bytes (encoding-dependent).
Str.compose;
Force all characters in the string to NFC form, if possible.
Str.decompose;
Force all characters in the string to NFD form, if possible.
Str.comb;
Creates an array of "characters", based on perspective. Adverbs can split along
non-default perspectives. Splitting along codeunits or bytes creates either a
properly-sized Buf or an array of integers. Splitting along codepoints or
graphemes creates an array of Strs.
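A rough Python sketch of how splitting along each perspective might behave. The
function name comb, its by keyword, and the unit_size parameter are assumptions
made for this illustration; the grapheme branch only attaches combining marks to
the preceding character and does not implement full UAX #29 segmentation.

```python
import unicodedata

def comb(text, by="graphemes", encoding="utf-8", unit_size=1):
    """Split text along one perspective (hypothetical helper)."""
    if by == "bytes":
        return list(text.encode(encoding))  # list of integers 0..255
    if by == "codeunits":
        data = text.encode(encoding)
        order = "little" if encoding.endswith("le") else "big"
        return [
            int.from_bytes(data[i:i + unit_size], order)
            for i in range(0, len(data), unit_size)
        ]
    if by == "codepoints":
        return list(text)  # one-codepoint strings
    if by == "graphemes":
        clusters = []
        for ch in text:
            # Attach combining marks to the cluster they follow.
            if clusters and unicodedata.category(ch).startswith("M"):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters
    raise ValueError("unknown perspective: " + by)

sample = "\u0928\u093Fa"
print(comb(sample))                   # two strings
print(comb(sample, by="codepoints"))  # three strings
print(comb(sample, by="codeunits", encoding="utf-16-be", unit_size=2))
print(comb(sample, by="bytes"))
```

Splitting by bytes or code units yields integers, while splitting by codepoints
or graphemes yields strings, matching the distinction drawn above.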
Str.encode;
Uses the default encoding unless an alternate is specified.
BUF METHODS
====-======
Buf.decode;
Uses the default encoding unless an alternate is specified.
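The encode/decode pairing can be sketched in Python as below. DEFAULT_ENCODING
and the two wrapper functions are stand-ins invented for this example; they only
mimic the idea of a default that an explicit argument overrides.

```python
DEFAULT_ENCODING = "utf-8"  # stands in for the default set by the encoding pragma

def encode(text, encoding=DEFAULT_ENCODING):
    """Turn a string into a buffer of bytes (the Str.encode direction)."""
    return text.encode(encoding)

def decode(buf, encoding=DEFAULT_ENCODING):
    """Turn a buffer of bytes back into a string (the Buf.decode direction)."""
    return buf.decode(encoding)

text = "\u0928\u093F"
print(encode(text).hex())                     # default encoding
print(encode(text, "utf-16-be").hex())        # alternate encoding
print(decode(encode(text)) == text)           # round trip with the default
print(decode(encode(text, "utf-32-be"), "utf-32-be") == text)
```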
OTHER NOTES
=====-=====
A valid Unicode Str can be counted by bytes, code units, codepoints, and
graphemes.
A Str will need to keep track of its encoding to properly count bytes and code
units.
A Str will need to know the codepoints that make up the Str (internal UTF-32
string, so units==codes and bytes==4*codes?) to properly count codepoints.
A Str will need to know what graphemes it holds (internal *NFC* UTF-32 string,
to make it a little easier?), likely through the "unique internal ID" thing
mentioned in the spec (S02, IIRC) to properly count graphemes.
It should be possible to change the encoding of a Str from the default. You
*could* do Str.encode.decode("non-default-encoding"), but there should be a
convenience method.
Yes, I'd like Str to know its encoding (it has to, for things like .bytes to do
anything).
Buf.decode, as implied above, creates a Str from its decoded contents, one that
knows its encoding (which may be non-default).
Changing the default does not change the encoding of Strs set to the previous
default, naturally.
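One way to picture a Str that knows its encoding is the Python sketch below.
Everything in it (the EncodedStr class, the recode method, make_str, and the
settings dictionary) is hypothetical and exists only to illustrate the notes
above, not to suggest how Perl 6 would implement them.

```python
from dataclasses import dataclass

UNIT_SIZE = {"utf-8": 1, "utf-16-be": 2, "utf-16-le": 2,
             "utf-32-be": 4, "utf-32-le": 4}
settings = {"encoding": "utf-8"}  # stands in for the current default

@dataclass(frozen=True)
class EncodedStr:
    text: str
    encoding: str

    @property
    def bytes(self):
        return len(self.text.encode(self.encoding))

    @property
    def units(self):
        return self.bytes // UNIT_SIZE[self.encoding]

    @property
    def codes(self):
        return len(self.text)

    def recode(self, encoding):
        """Convenience method: same text, different encoding."""
        return EncodedStr(self.text, encoding)

def make_str(text):
    # A new string picks up whatever the default is at creation time.
    return EncodedStr(text, settings["encoding"])

a = make_str("\u0928\u093F")
settings["encoding"] = "utf-16-be"  # change the default
b = make_str("\u0928\u093F")
c = a.recode("utf-32-be")

print(a.encoding, a.bytes, a.units)  # a keeps the earlier default
print(b.encoding, b.bytes, b.units)
print(c.units == c.codes, c.bytes == 4 * c.codes)
```

The string created before the default changed keeps its original encoding, and
in UTF-32 the unit count equals the codepoint count while the byte count is four
times as large.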
I expect this is only the beginning of our necessary Unicode support. :)