Some thoughts on Unicode in Perl 6.

Unicode Notes
=-===========
 
(or, the quest to create S15)
 
A quick example:
================
 
For the syllable
 
नि (U+0928 U+093F)
 
The encodings are
 
UTF-8 : E0 A4 A8 E0 A4 BF
UTF-16BE : 0928 093F
UTF-32BE : 00000928 0000093F
 
The counts are:
 
|------------+-------+--------+--------|
| (count by) | UTF-8 | UTF-16 | UTF-32 |
|------------+-------+--------+--------|
| bytes      |     6 |      4 |      8 |
| code units |     6 |      2 |      2 |
| codepoints |     2 |      2 |      2 |
| graphemes  |     1 |      1 |      1 |
|------------+-------+--------+--------|
 
Notice how the codepoint and grapheme counts are independent of the encoding.
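
The counts in the table can be checked mechanically. The sketch below uses
Python rather than Perl 6, only because its standard library makes the numbers
easy to reproduce. The helper count_graphemes_approx is a hypothetical name,
and it is only an approximation: it treats every combining mark as belonging to
the previous codepoint, which is enough for this syllable but is not full
UAX #29 segmentation.

```python
import unicodedata

# The syllable from the example: U+0928 followed by U+093F.
s = "\u0928\u093f"

def count_graphemes_approx(text):
    # Approximation (an assumption of this sketch): a codepoint starts a new
    # grapheme unless its general category is a mark (Mn, Mc, Me).
    return sum(1 for ch in text if not unicodedata.category(ch).startswith("M"))

# Width of one code unit, in bytes, for each encoding form.
unit_width = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}

counts = {}
for encoding, width in unit_width.items():
    data = s.encode(encoding)
    counts[encoding] = {
        "bytes": len(data),
        "code units": len(data) // width,
        "codepoints": len(s),                      # independent of encoding
        "graphemes": count_graphemes_approx(s),    # independent of encoding
    }

for encoding, row in counts.items():
    print(encoding, row)
```

Only the first two rows depend on the encoding; the last two are computed from
the string alone.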
 
Counting by characters should be equivalent to counting by graphemes, unless
certain scripts or languages turn out not to treat every grapheme as a complete
character.
 
PRAGMAS
==-====
 
(The specific spellings of these are of course subject to change.)
 
use utf8;
use utf16 :be/:le;
use utf32 :be/:le;
 
Other encoding pragmas may be created by module authors, but they could
undermine the carefully crafted support for Unicode. "utf8" is the default.
 
:be is the default for the utf16 and utf32 pragmas, because Unicode itself
assumes big endian when it has no other information about the data
(http://www.unicode.org/faq/utf_bom.html#gen7).
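
To see why the byte order has to be known, here is a small Python sketch (not
Perl 6) that encodes the example syllable both ways. The codec names
"utf-16-be" and "utf-16-le" are Python's, not the proposed pragma spellings.

```python
import codecs

s = "\u0928\u093f"

big = s.encode("utf-16-be")     # explicit big endian, no BOM
little = s.encode("utf-16-le")  # explicit little endian, no BOM

print(big.hex())     # 0928093f
print(little.hex())  # 28093f09

# Reading big-endian bytes as little endian silently yields other codepoints.
misread = big.decode("utf-16-le")
print([f"U+{ord(c):04X}" for c in misread])  # ['U+2809', 'U+3F09']

# A byte order mark removes the ambiguity: the generic codec reads the BOM,
# picks the matching byte order, and drops the BOM from the result.
marked = codecs.BOM_UTF16_BE + big
assert marked.decode("utf-16") == s
```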
 
[ Even though setting the endianness of UTF-16 or UTF-32 may not be immediately
useful, there will almost certainly be someone who needs to do it. ]
 
I don't currently expect to have ucs1, ucs2, or ucs4 as synonyms, as they don't
necessarily mean quite the same thing, and most people will mean the more
featureful utf anyway (at least in the case of UCS-2 and UTF-16).
 
use graphemes;
use codepoints;
use codeunits;
use bytes;
 
These set the perspective Perl 6 takes on what a "character" is. "graphemes" is
the default.
 
use NFC;
use NFD;
 
These control whether Perl 6 composes or decomposes the characters in strings.
Neither is the default; either one forces all Unicode data to undergo
de/composition when handed to Perl 6.
 
If you want to make sure no forceful de/composition occurs, you should be able
to just do this:
 
no NFC;
no NFD;
 
[ An alternate form could be C<use normalization :NFC/:NFD/:any>, with :any
being the "don't do any auto-normalization" default. ]
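
As an illustration of what composing and decomposing would do to the data, the
following Python sketch (not Perl 6) uses unicodedata.normalize on a character
that has both a precomposed and a decomposed spelling. It also shows that the
example syllable from the start of these notes is left alone by both forms.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE, one codepoint
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

nfc = unicodedata.normalize("NFC", decomposed)
nfd = unicodedata.normalize("NFD", composed)

print(len(nfc), len(nfd))        # 1 2
print(composed == decomposed)    # False: without normalization they differ

# The Devanagari syllable has no precomposed form, so neither NFC nor NFD
# changes it.
syllable = "\u0928\u093f"
unchanged = all(
    unicodedata.normalize(form, syllable) == syllable for form in ("NFC", "NFD")
)
print(unchanged)                 # True
```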
 
STR METHODS
===-=======
 
Str.chars;
 
Whatever the current perspective on characters is.
 
Str.graphs;
 
Number of graphemes.
 
Str.codes;
 
Number of codepoints.
 
Str.units;
 
Number of code units (encoding-dependent).
 
Str.bytes;
 
Number of bytes (encoding-dependent).
 
Str.compose;
 
Force all characters in the string to NFC form, if possible.
 
Str.decompose;
 
Force all characters in the string to NFD form, if possible.
 
Str.comb;
 
Creates an array of "characters", based on perspective. Adverbs can split along
non-default perspectives. Splitting along codeunits or bytes creates either a
properly-sized Buf or an array of integers. Splitting along codepoints or
graphemes creates an array of Strs.
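
A rough Python sketch of the splitting behaviour described above. The function
comb and its "by" parameter are hypothetical stand-ins for the proposed method
and its adverbs, plain lists stand in for Buf, and the grapheme branch is the
same combining-mark approximation as before rather than real UAX #29
segmentation.

```python
import unicodedata

# Width of one code unit, in bytes, for the encodings this sketch supports.
UNIT_WIDTH = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}

def comb(text, by="graphemes", encoding="utf-8"):
    """Split text along the given perspective (hypothetical helper)."""
    if by == "bytes":
        return list(text.encode(encoding))          # list of integers
    if by == "codeunits":
        width = UNIT_WIDTH[encoding]
        data = text.encode(encoding)
        return [
            int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)
        ]                                           # list of integers
    if by == "codepoints":
        return list(text)                           # list of strings
    if by == "graphemes":
        clusters = []
        for ch in text:
            # Attach combining marks to the preceding cluster (approximation).
            if clusters and unicodedata.category(ch).startswith("M"):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters                             # list of strings
    raise ValueError(f"unknown perspective: {by}")

s = "\u0928\u093f"
print(comb(s, by="bytes"))
print(comb(s, by="codeunits", encoding="utf-16-be"))
print(comb(s, by="codepoints"))
print(comb(s))
```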
 
Str.encode;
 
Uses the default encoding unless an alternate is specified.
 
BUF METHODS
====-======
 
Buf.decode;
 
Uses the default encoding unless an alternate is specified.
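
The encode/decode pairing can be sketched in Python as below. The names
DEFAULT_ENCODING, encode, and decode are assumptions made for the illustration;
Python's bytes type plays the role of Buf.

```python
DEFAULT_ENCODING = "utf-8"  # assumed default, matching the "utf8" pragma

def encode(text, encoding=DEFAULT_ENCODING):
    # Str -> Buf: turn codepoints into bytes in the chosen encoding.
    return text.encode(encoding)

def decode(buf, encoding=DEFAULT_ENCODING):
    # Buf -> Str: interpret bytes as text in the chosen encoding.
    return buf.decode(encoding)

s = "\u0928\u093f"

default_buf = encode(s)                  # uses the default encoding
utf16_buf = encode(s, "utf-16-be")       # alternate encoding specified

print(default_buf.hex())  # e0a4a8e0a4bf
print(utf16_buf.hex())    # 0928093f

# Round trips only work when both directions agree on the encoding.
assert decode(default_buf) == s
assert decode(utf16_buf, "utf-16-be") == s
```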
 
OTHER NOTES
=====-=====
 
A valid Unicode Str can be counted by bytes, code units, codepoints, and
graphemes.
 
A Str will need to keep track of its encoding to properly count bytes and code
units.
 
A Str will need to know the codepoints that make up the Str (internal UTF-32
string, so units==codes and bytes==4*codes?) to properly count codepoints.
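
The UTF-32 relationship can be checked in Python (not Perl 6) with a string
that also contains a codepoint outside the Basic Multilingual Plane, where
UTF-16 needs a surrogate pair and its unit count stops matching the codepoint
count. The function name measure is hypothetical.

```python
def measure(text):
    # Returns (codepoints, UTF-16 code units, UTF-32 code units, UTF-32 bytes).
    codes = len(text)
    utf16_units = len(text.encode("utf-16-be")) // 2
    utf32_bytes = len(text.encode("utf-32-be"))
    utf32_units = utf32_bytes // 4
    return codes, utf16_units, utf32_units, utf32_bytes

samples = {
    "syllable": "\u0928\u093f",   # two BMP codepoints
    "astral": "\U0001F600",       # one codepoint above U+FFFF
}

for name, text in samples.items():
    codes, utf16_units, utf32_units, utf32_bytes = measure(text)
    # In UTF-32, one code unit is one codepoint is four bytes.
    assert utf32_units == codes
    assert utf32_bytes == 4 * codes
    print(name, codes, utf16_units, utf32_units, utf32_bytes)
```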
 
A Str will need to know what graphemes it holds (internal *NFC* UTF-32 string,
to make it a little easier?), likely through the "unique internal ID" thing
mentioned in the spec (S02, IIRC) to properly count graphemes.
 
It should be possible to change the encoding of a Str from the default. You
*could* do Str.encode("non-default-encoding").decode("non-default-encoding"),
but there should be a convenience method.
 
Yes, I'd like Str to know its encoding (it has to, for things like .bytes to do
anything).
 
Buf.decode, as implied above, creates a Str from its decoded contents, one that
knows its encoding (which may be non-default).
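
One way to picture a Str that knows its encoding is the Python sketch below.
Everything in it is hypothetical: the class EncodedStr, its method names, and
the recode convenience method are illustrations of the idea, not a proposed
API.

```python
from dataclasses import dataclass

# Width of one code unit, in bytes, for the encodings this sketch supports.
UNIT_WIDTH = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}

@dataclass(frozen=True)
class EncodedStr:
    text: str                  # the codepoints
    encoding: str = "utf-8"    # the encoding this string remembers

    def byte_count(self):
        return len(self.text.encode(self.encoding))

    def unit_count(self):
        return self.byte_count() // UNIT_WIDTH[self.encoding]

    def recode(self, encoding):
        # Convenience method: same codepoints, different remembered encoding.
        return EncodedStr(self.text, encoding)

def decode(buf, encoding="utf-8"):
    # Buf -> Str, where the resulting Str remembers the encoding it came from.
    return EncodedStr(buf.decode(encoding), encoding)

s16 = decode(bytes([0x09, 0x28, 0x09, 0x3F]), "utf-16-be")
s8 = s16.recode("utf-8")

print(s16.encoding, s16.byte_count(), s16.unit_count())  # utf-16-be 4 2
print(s8.encoding, s8.byte_count(), s8.unit_count())     # utf-8 6 6
```

The codepoints are shared between s16 and s8; only the byte and code unit
counts change with the remembered encoding.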
 
Changing the default does not change the encoding of Strs set to the previous
default, naturally.
 
I expect this is only the beginning of our necessary Unicode support. :)
