jnthn/nfg.md

NFG and Unicode "plan of attack"

Goals

Do the following on Moar without busting JVM support along the way (it likely
gets approximations of things here as needed until we have time to do the full
NFG work there):

Make Str and str NFG by default
Implement types Uni/NFC/NFD/NFKC/NFKD
Implement Str.NFC/Str.NFD/Str.NFKC/Str.NFKD to drop from NFG to Uni
Implement Uni.NFC/Uni.NFD/Uni.NFKC/Uni.NFKD to switch normalization
Figure out how Uni types and Buf types play with regexes

While NFG is the Big Goal, since NFG is NFC and then some, it makes sense to
tackle NFC first; NFC is also defined in terms of NFD, meaning you end up
pretty much having to do the three anyway. We can leave NFKD until later (and
NFKC is trivial once you have NFKD).
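
The dependency between the forms can be checked against any existing normalization library. The following is a minimal sketch using Python's standard `unicodedata` module as a stand-in for the MoarVM code that does not exist yet; the sample string is an arbitrary choice made for illustration.

```python
import unicodedata

# "ﬁ" (U+FB01) only has a compatibility decomposition;
# "e" followed by U+0301 composes canonically to "é".
sample = "\ufb01ance\u0301"

nfd = unicodedata.normalize("NFD", sample)
nfc = unicodedata.normalize("NFC", sample)
nfkd = unicodedata.normalize("NFKD", sample)
nfkc = unicodedata.normalize("NFKC", sample)

# NFC is canonical decomposition (NFD) followed by canonical composition,
# so composing the NFD form gives the same result as NFC on the raw input.
nfc_via_nfd = unicodedata.normalize("NFC", nfd)

# NFKC is that same composition step applied to NFKD output.
nfkc_via_nfkd = unicodedata.normalize("NFC", nfkd)

print(nfc_via_nfd == nfc, nfkc_via_nfkd == nfkc)
```

Both comparisons hold for this sample, which is why implementing decomposition first and composition second covers all four forms.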

Observations

The synopses are torn between a couple of world views:


The one expressed in S15, and supported by TimToady on #perl6 while it was
being written: Str is just about NFG, the other codepoint-level views are
handled by Uni types, and we do explicit coercion between them. This is the
view I'm running with.


The one hinted at in S02 and S05, where Str is capable of working at
many different levels; bits of S02 even hint at mutability of Str, but
we've long settled on it being immutable.


Further to this, it has been suggested that Uni is more like Buf than Str, in
so far as it doesn't support a great deal of string operations directly, and
that you need to go to Str level to perform them. The difference is that we
can always do a Str coercion on a Uni.

The consequences of adopting the first view are:


StrPos and StrLen probably become uninteresting, since Str only ever works
at one level of abstraction.


We need to decide which operations keep you in the Uni paradigm and which
do not. For Buf we define infix:<~>, prefix:<~^>, infix:<~&>, infix:<~|>,
infix:<~^>, infix:<eqv>, infix:<cmp>, infix:<eq>, infix:<ne>, infix:<lt>,
infix:<gt>, infix:<le>, and infix:<ge>.


We need to decide if Uni provides array-like access like Blob/Buf (it's very
probably useful).


We need to decide what operations we might define directly on Uni (for
example, .ord and .ords make sense at this level). Talking of .ord and
.ords, we need to figure out what they do on a Str with synthetics in it.


The nature of Uni

It's almost certain that Uni and its subclasses are immutable types, since (as
with utf8, which is an immutable Blob) mutability would let you make them not
normalized. Further, we can be sure that only in-range codepoints are held in
a Uni, differentiating it in another interesting way from a Buf/Blob. The
other difference is Uni would always be in native endian, whereas a Buf/Blob
may be holding UTF-32 in some other endian.
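
The two invariants described here (immutability and in-range codepoints) can be sketched as follows. This is a hypothetical Python model, not the Perl 6 types themselves: the class names mirror the proposal, but `to_str` and the constructor signatures are assumptions made for the example.

```python
import unicodedata

MAX_CODEPOINT = 0x10FFFF


class Uni(tuple):
    """Hypothetical immutable sequence of Unicode codepoints."""

    __slots__ = ()

    def __new__(cls, *codepoints):
        # Reject anything outside the Unicode range up front, which is
        # what distinguishes this from a plain integer buffer.
        for cp in codepoints:
            if not isinstance(cp, int) or not 0 <= cp <= MAX_CODEPOINT:
                raise ValueError(f"codepoint out of range: {cp!r}")
        return super().__new__(cls, codepoints)

    def to_str(self):
        return "".join(chr(cp) for cp in self)


class NFC(Uni):
    """Hypothetical subclass that normalizes on construction."""

    __slots__ = ()

    def __new__(cls, *codepoints):
        text = Uni(*codepoints).to_str()
        normalized = unicodedata.normalize("NFC", text)
        return super().__new__(cls, *(ord(ch) for ch in normalized))


rur = Uni(114, 117, 114)
composed = NFC(0x65, 0x301)   # "e" + combining acute becomes U+00E9
```

Because the model is built on a tuple, item assignment raises `TypeError`, so an `NFC` value cannot be edited into a non-normalized state after construction.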

At a guts level, we need to decide on a representation. Str is a P6opaque, and
holds a str. The str is in turn a VM-defined representation; on MoarVM it is a
rope-y thing. By contrast, Buf/Blob use VMArray, a compact integer array that
is parametrized on size. Uni could go either way.


We could easily represent it as a uint32 array underneath. This'd mean that,
like with Buf and Blob types, you cannot mix into them and they are in some
sense "primitive". If we do that we might want to consider going with a
lower-case naming (uni, nfc, nfd, nfkc, nfkd).


We could also try and find a way to have it as some kind of real string so
all of the various VM-provided string manipulation operations work. Since
it is valid Unicode, just not NFG, we can safely do so - but this raises
all kinds of tricky questions about what return types we end up with when
we do operations. It also raises the question of what str is, and if we
need parallel forms of the native unboxed primitive too. Finally, this
forces our hand on immutability: Uni would have to be immutable if we go
this way.


The first is certainly simpler, and means that most string operations you do
will be a coercion to NFG form. The downside is that you could easily end up
writing code that repeatedly coerces. This could be tackled by having Uni be
a P6opaque with both a uint32 array of the codepoints, and a slot to cache a
(lazily computed) Str. On the other hand, we allow utf8/utf16 Blob types to
auto-coerce without any such caching, so maybe it doesn't make sense to offer
it in just one of the cases, and we probably want to keep Buf and Blob as
lightweight and native things. Not to mention that you only get instances of
these - and Uni - if you explicitly go looking for them, so it's not going to
be a very common failure mode. It's also easily identified by profiling.
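
The caching variant mentioned above could look roughly like this. It is a sketch in Python under stated assumptions: `CachingUni`, `as_str`, and the `coercions` counter are hypothetical names, a tuple stands in for the uint32 array, and NFC stands in for NFG (which Python does not provide).

```python
import unicodedata
from functools import cached_property


class CachingUni:
    """Hypothetical Uni holding raw codepoints plus a lazily cached string."""

    def __init__(self, *codepoints):
        self._codepoints = tuple(codepoints)  # stands in for the uint32 array
        self.coercions = 0                    # instrumentation for the demo only

    @cached_property
    def as_str(self):
        # Runs on first access only; later accesses reuse the stored result.
        self.coercions += 1
        text = "".join(chr(cp) for cp in self._codepoints)
        return unicodedata.normalize("NFC", text)


uni = CachingUni(0x65, 0x301)
first = uni.as_str.upper()
second = uni.as_str.lower()
print(uni.coercions)
```

Two string operations trigger a single coercion here; without the cache slot each one would pay for the conversion again.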

Regexes

S05 hints at various "processing levels" that regexes can work at: :graphs,
:codes, and :bytes. It demonstrates them on modifiers to entire regexes, but
also hints at them in a given lexical scope inside of the regex too. This is
heavily tied to the "Str is multi-level" thing, which we seem to have moved
away from. Since Str is NFG, :codes and :bytes both imply coercions on
the Str we're processing, but they are also underspecified since we don't know
the normalization form nor the encoding. Generally, it's not clear exactly
how these adverbs fit in with the current world view. Even if you provide
the information on how to reach the other view, and we go ahead and do a
coercion, we need to somehow keep the positions in sync - suggesting they are
very tied to the StrPos abstraction that Str being NFG-only seems to have
rendered obsolete.
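
To make the underspecification concrete, here is a small Python sketch (using the standard `unicodedata` module, not anything from the Perl 6 toolchain) that counts codes and bytes for one piece of text. The sample word and the two encodings are arbitrary choices for illustration.

```python
import unicodedata

text = "re\u0301sume\u0301"   # "résumé" written with combining acute accents

counts = {}
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, text)
    for encoding in ("utf-8", "utf-16-le"):
        # (number of codepoints, number of bytes) for this combination
        counts[(form, encoding)] = (
            len(normalized),
            len(normalized.encode(encoding)),
        )

for key, value in counts.items():
    print(key, value)
```

The same six-grapheme word is 6 or 8 codes depending on the normalization form, and 8, 10, 12, or 16 bytes depending on form and encoding, so a position expressed in :codes or :bytes means nothing until both are pinned down.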

The current regex engine implementation can only work with VM-level strings,
which at present restricts us to Str. We can safely make it work against a
Buf/Blob assuming everything is bytes, since we can just treat it as a string
of 8-bit wide things, covering Unicode codepoints 0..255. Presumably, you would
only match a Buf against rules that are expecting to work on a byte stream.
We could do similar with Uni. In both cases we'd likely start out by creating
a VM string where we memcpy the contents of the VM array into place; we should
be able to re-structure things to avoid the copy in the future.
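
The "copy into a string, then match" approach can be imitated in Python, where decoding bytes as Latin-1 maps each byte to the codepoint with the same value. This is only an analogy for the VM-level plan; the buffer contents and patterns below are made up for the example.

```python
import re

# Byte buffer case: the first eight bytes of a PNG file.
buf = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A, 0x1A, 0x0A])

# Each byte becomes the codepoint 0..255 with the same value, so match
# offsets in the string are also byte offsets in the buffer.
as_string = buf.decode("latin-1")
byte_match = re.search(r"PNG\r\n", as_string)

# Codepoint list case (the Uni analogue): widen to a string one
# codepoint at a time, with no normalization applied.
codepoints = [0x72, 0x75, 0x301, 0x72]
uni_string = "".join(chr(cp) for cp in codepoints)
code_match = re.search("\u0301", uni_string)

print(byte_match.span(), code_match.span())
```

In both cases the match positions index the original array directly, which is the property that makes the copy a safe first implementation.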

One further question that falls out of this is the nature of Match objects if
you match against a Buf or a Uni. Since substr is ill-defined on a Buf, we
really cannot give back a Str from a Match object in the general case.
Presumably we'd also expect to be able to get a Uni back when we're matching
at that level (given substrings preserve normal form by Unicode spec, we can
hand back the precise same type as we received).
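
The substring property can be spot-checked (not proven) with Python's `unicodedata.is_normalized`, available from Python 3.8. The sample text and the pattern are arbitrary; the check simply confirms that every slice of one NFD string is itself NFD, and that a regex capture from it therefore needs no renormalizing.

```python
import re
import unicodedata

nfd_text = unicodedata.normalize("NFD", "Zo\u00eb a\u0300 la cr\u00e8me")

# Exhaustively test every contiguous slice of this one sample.
all_slices_nfd = all(
    unicodedata.is_normalized("NFD", nfd_text[i:j])
    for i in range(len(nfd_text))
    for j in range(i + 1, len(nfd_text) + 1)
)

# A capture is just one such slice, so it is already in NFD.
captured = re.search(r"cr\S+", nfd_text).group(0)
capture_is_nfd = unicodedata.is_normalized("NFD", captured)

print(all_slices_nfd, capture_is_nfd)
```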

Examples

This section tries to capture some things you might like to do and how they
could look.

Decode UTF-8 to NFG

my $str = $buf.decode('utf-8'); # returns a Str

Encode NFG to utf-8 (defaults to NFC)

my $buf = $str.encode('utf-8'); # Returns a utf8

Encode NFG to utf-8 (specific normalization)

my $buf = $str.encode('utf-8', NFC);    # Returns a utf8
my $buf = $str.encode('utf-8', NFD);    # Returns a utf8
my $buf = $str.encode('utf-8', NFKC);   # Returns a utf8
my $buf = $str.encode('utf-8', NFKD);   # Returns a utf8
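
For comparison, the effect of choosing a normalization form at encode time can be reproduced today in Python. `encode_normalized` is a hypothetical helper written for this sketch, not part of any library, and the sample text is arbitrary.

```python
import unicodedata


def encode_normalized(text, encoding="utf-8", form="NFC"):
    """Hypothetical helper: normalize to `form`, then encode."""
    return unicodedata.normalize(form, text).encode(encoding)


text = "\u00e9\ufb01"   # "é" followed by the "ﬁ" ligature

# Byte length of the UTF-8 output for each normalization form.
sizes = {
    form: len(encode_normalized(text, form=form))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(sizes)
```

The four forms give 5, 6, 4, and 5 bytes respectively for this input, which is why the form has to be an explicit (or clearly defaulted) part of the encode call.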

Create a Uni from a bunch of codepoints

my $uni = Uni.new(114, 117, 114);

Turn a Uni into an NFG string

my $str = $uni.Str;                     # Returns a Str (NFG)

Decode to a Uni, maybe with normalization

my $str = $buf.decode('utf-8');         # Decode as UTF-8, return a Str (NFG)
my $str = $buf.decode('utf-8', Str);    # Same thing
my $uni = $buf.decode('utf-8', Uni);    # Decode as UTF-8, do no normalization, return Uni
my $nfc = $buf.decode('utf-8', NFC);    # Decode as UTF-8, apply NFC, return NFC
my $nfd = $buf.decode('utf-8', NFD);    # Decode as UTF-8, apply NFD, return NFD
my $nfkc = $buf.decode('utf-8', NFKC);  # Decode as UTF-8, apply NFKC, return NFKC
my $nfkd = $buf.decode('utf-8', NFKD);  # Decode as UTF-8, apply NFKD, return NFKD
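
The decode side can be modelled the same way in Python. `decode_codepoints` is a hypothetical helper for this sketch; passing no form corresponds to the plain Uni case above, where the codepoints are returned exactly as decoded.

```python
import unicodedata


def decode_codepoints(buf, encoding="utf-8", form=None):
    """Hypothetical helper: decode, optionally normalize, return codepoints."""
    text = buf.decode(encoding)
    if form is not None:
        text = unicodedata.normalize(form, text)
    return [ord(ch) for ch in text]


buf = "e\u0301".encode("utf-8")   # "e" + combining acute, stored decomposed

raw = decode_codepoints(buf)               # no normalization applied
nfc = decode_codepoints(buf, form="NFC")   # composed to a single codepoint
nfd = decode_codepoints(buf, form="NFD")   # stays decomposed
print(raw, nfc, nfd)
```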