@jnthn
Last active August 29, 2015 14:18

NFG and Unicode "plan of attack"

Goals

Do the following on Moar without busting JVM support along the way (it likely gets approximations of things here as needed until we have time to do the full NFG works there):

  • Make Str and str NFG by default
  • Implement types Uni/NFC/NFD/NFKC/NFKD
  • Implement Str.NFC/Str.NFD/Str.NFKC/Str.NFKD to drop from NFG to Uni
  • Implement Uni.NFC/Uni.NFD/Uni.NFKC/Uni.NFKD to switch normalization
  • Figure out how Uni types and Buf types play with regexes

While NFG is the Big Goal, NFG is NFC and then some, so it makes sense to tackle NFC first. NFC is in turn defined in terms of NFD, meaning you pretty much end up having to do all three anyway. We can leave NFKD until later (and NFKC is trivial once you have NFKD).
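
The dependency between the forms can be checked against any existing normalizer. The following is a minimal sketch that uses Python's standard `unicodedata` module purely as a stand-in; it is not part of the plan above, and the sample string is made up for illustration.

```python
import unicodedata


def norm(form, text):
    """Thin wrapper over the standard library normalizer."""
    return unicodedata.normalize(form, text)


# "e" + COMBINING ACUTE ACCENT, followed by U+FB01 (the "fi" ligature),
# which only has a compatibility decomposition.
sample = "e\u0301\ufb01"

nfd = norm("NFD", sample)
nfc = norm("NFC", sample)
nfkd = norm("NFKD", sample)
nfkc = norm("NFKC", sample)

# NFC is canonical composition applied to the NFD form.
nfc_via_nfd = norm("NFC", nfd)
# NFKC is the same composition step applied to the NFKD form.
nfkc_via_nfkd = norm("NFC", nfkd)
```

Composing the NFKD result gives NFKC, which is why NFKC comes almost for free once NFKD and the composition step exist.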

Observations

The synopses are torn between a couple of world views:

  • The one expressed in S15 and supported by TimToady on #perl6 while it was being written that Str is just about NFG and the other codepoint-level views are handled by Uni types, and we do explicit coercion between them. This is the view I'm running with.

  • The one hinted at in S02 and S05, where Str is capable of working at many different levels; bits of S02 even hint at mutability of Str, but we've long settled on it being immutable.

Further to this, it has been suggested that Uni is more like Buf than Str, in so far as it doesn't support a great deal of string operations directly, and that you need to go to Str level to perform them. The difference is that we can always do a Str coercion on a Uni.

The consequences of adopting the first view are:

  • StrPos and StrLen probably become uninteresting, since Str only ever works at one level of abstraction.

  • We need to decide which operations keep you in the Uni paradigm and which do not. For Buf we define infix:<~>, prefix:<~^>, infix:<~&>, infix:<~|>, infix:<~^>, infix:<eqv>, infix:<cmp>, infix:<eq>, infix:<ne>, infix:<lt>, infix:<gt>, infix:<le>, and infix:<ge>.

  • We need to decide if Uni provides array-like access like Blob/Buf (it's very probably useful).

  • We need to decide what operations we might define directly on Uni (for example, .ord and .ords make sense at this level). Talking of .ord and .ords, we need to figure out what they do on a Str with synthetics in it.
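
To make the .ord/.ords question concrete, here is a rough sketch of what a synthetic is. It is written in Python only as a neutral stand-in. The `SyntheticTable` class, the use of negative integers for synthetics, and the "base plus trailing combining marks" segmentation are all assumptions made for illustration; real grapheme segmentation follows more rules than this.

```python
import unicodedata


def graphemes(text):
    """Split NFC text into base + trailing combining marks (a simplification)."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


class SyntheticTable:
    """Hypothetical table handing out negative ids to multi-codepoint graphemes."""

    def __init__(self):
        self._ids = {}
        self._clusters = []

    def intern(self, cluster):
        if len(cluster) == 1:
            return ord(cluster)  # a real codepoint represents itself
        if cluster not in self._ids:
            self._clusters.append(cluster)
            self._ids[cluster] = -len(self._clusters)
        return self._ids[cluster]

    def codes(self, grapheme_id):
        """Codepoints behind a grapheme id, synthetic or not."""
        if grapheme_id >= 0:
            return [grapheme_id]
        return [ord(ch) for ch in self._clusters[-grapheme_id - 1]]


table = SyntheticTable()
# "e" + acute has a precomposed form; "q" + tilde does not.
nfg = [table.intern(g) for g in graphemes("e\u0301q\u0303")]
```

Here the second grapheme has no single codepoint, so .ord on it has to pick something: the base codepoint, the first codepoint of the NFC expansion, or a failure. Which one is exactly the open question.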

The nature of Uni

It's almost certain that Uni and its subclasses are immutable types, since (as with utf8, which is an immutable Blob) mutability would let you make them not normalized. Further, we can be sure that only in-range codepoints are held in a Uni, differentiating it in another interesting way from a Buf/Blob. The other difference is Uni would always be in native endian, whereas a Buf/Blob may be holding UTF-32 in some other endian.
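
A minimal sketch of those two properties (immutability and in-range codepoints) follows, in Python as a stand-in. The class layout is an assumption for illustration only and says nothing about how Rakudo or MoarVM would implement it. Surrogates are not treated specially here.

```python
import unicodedata

MAX_CODEPOINT = 0x10FFFF


class Uni:
    """Codepoint sequence that rejects attribute assignment after construction."""

    __slots__ = ("_codes",)
    form = None  # subclasses name the normalization form they guarantee

    def __init__(self, *codepoints):
        for cp in codepoints:
            if not 0 <= cp <= MAX_CODEPOINT:
                raise ValueError(f"codepoint out of range: {cp:#x}")
        text = "".join(chr(cp) for cp in codepoints)
        if self.form is not None:
            text = unicodedata.normalize(self.form, text)
        # Bypass our own __setattr__ exactly once, during construction.
        object.__setattr__(self, "_codes", tuple(ord(ch) for ch in text))

    def __setattr__(self, name, value):
        raise AttributeError("Uni values are immutable")

    def __len__(self):
        return len(self._codes)

    def __getitem__(self, index):
        return self._codes[index]

    def __iter__(self):
        return iter(self._codes)


class NFC(Uni):
    __slots__ = ()
    form = "NFC"


class NFD(Uni):
    __slots__ = ()
    form = "NFD"
```

Because normalization happens once in the constructor and nothing can be assigned afterwards, an `NFC` instance cannot drift out of NFC, which is the argument made above for immutability.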

At a guts level, we need to decide on a representation. Str is a P6opaque, and holds a str. The str is in turn a VM-defined representation; on MoarVM it is a rope-y thing. By contrast, Buf/Blob use VMArray, a compact integer array that is parametrized on size. Uni could go either way.

  • We could easily represent it as a uint32 array underneath. This'd mean that, like with Buf and Blob types, you cannot mix into them and they are in some sense "primitive". If we do that we might want to consider going with a lower-case naming (uni, nfc, nfd, nfkc, nfkd).

  • We could also try and find a way to have it as some kind of real string so all of the various VM-provided string manipulation operations work. Since it is valid Unicode, just not NFG, we can safely do so - but this raises all kinds of tricky questions about what return types we end up with when we do operations. It also raises the question of what str is, and if we need parallel forms of the native unboxed primitive too. Finally, this forces our hand on immutability: Uni would have to be immutable if we go this way.

The first is certainly simpler, and means that most string operations you do will be a coercion to NFG form. The downside is that you could easily end up writing code that repeatedly coerces. This could be tackled by having Uni be a P6opaque with both a uint32 array of the codepoints, and a slot to cache a (lazily computed) Str. On the other hand, we allow utf8/utf16 Blob types to auto-coerce without any such caching, so maybe it doesn't make sense to offer it in just one of the cases, and we probably want to keep Buf and Blob as lightweight and native things. Not to mention that you only get instances of these - and Uni - if you explicitly go looking for them, so it's not going to be a very common failure mode. It's also easily identified by profiling.
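
The caching variant could look roughly like this. It is a Python sketch only: `CachedUni` and its `coercions` counter are hypothetical names, and NFC stands in for NFG because Python strings have no grapheme-level form.

```python
from array import array
from functools import cached_property
import unicodedata


class CachedUni:
    """Codepoints in a compact integer array plus a lazily computed string."""

    def __init__(self, codepoints):
        # Typecode "L" is an unsigned integer at least 32 bits wide.
        self._codes = array("L", codepoints)
        self.coercions = 0  # counts how often the coercion actually runs

    @cached_property
    def as_str(self):
        self.coercions += 1
        text = "".join(chr(cp) for cp in self._codes)
        return unicodedata.normalize("NFC", text)


uni = CachedUni([0x65, 0x301])
upper = uni.as_str.upper()    # first access performs the coercion
length = len(uni.as_str)      # second access reuses the cached string
```

The cost is one extra slot per object, which is the weight the paragraph above argues we may not want on something meant to be as light as Buf and Blob.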

Regexes

S05 hints at various "processing levels" that regexes can work at: :graphs, :codes, and :bytes. It demonstrates them as modifiers to entire regexes, but also hints at them applying in a given lexical scope inside of the regex. This is heavily tied to the "Str is multi-level" thing, which we seem to have moved away from. Since Str is NFG, :codes and :bytes both imply coercions on the Str we're processing, but they are also underspecified, since we know neither the normalization form nor the encoding. Generally, it's not clear exactly how these adverbs fit in with the current world view. Even if you provide the information on how to reach the other view, and we go ahead and do a coercion, we need to somehow keep the positions in sync, suggesting they are very tied to the StrPos abstraction that Str being NFG-only seems to have rendered obsolete.
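
To show what "keeping the positions in sync" involves, here is a small Python sketch that maps between grapheme positions and codepoint positions. The function names are hypothetical, and grapheme boundaries are approximated as "any codepoint that is not a combining mark starts a new grapheme".

```python
import unicodedata
from bisect import bisect_right


def grapheme_starts(codes):
    """Codepoint offsets at which each (approximate) grapheme begins."""
    starts = []
    for i, ch in enumerate(codes):
        if i == 0 or unicodedata.combining(ch) == 0:
            starts.append(i)
    return starts


def graph_to_code(starts, graph_pos):
    """Codepoint offset of the grapheme at graph_pos."""
    return starts[graph_pos]


def code_to_graph(starts, code_pos):
    """Index of the grapheme that contains the codepoint at code_pos."""
    return bisect_right(starts, code_pos) - 1


# e + acute, a + tilde + dot below, z: three graphemes, six codepoints.
codes = "e\u0301a\u0303\u0323z"
starts = grapheme_starts(codes)
```

A side table like `starts` is the kind of bookkeeping a multi-level position would need to carry, and it would have to be rebuilt for every normalization form and encoding the regex can switch to.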

The current regex engine implementation can only work with VM-level strings, which at present restricts us to Str. We can safely make it work against a Buf/Blob assuming everything is bytes, since we can just treat it as a string of 8-bit wide things, covering Unicode codepoints 0..255. Presumably, you would only match a Buf against rules that are expecting to work on a byte stream. We could do something similar with Uni. In both cases we'd likely start out by creating a VM string where we memcpy the contents of the VM array into place; we should be able to re-structure things to avoid the copy in the future.
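
The "bytes as codepoints 0..255" trick can be sketched in Python, again only as a stand-in. Python's `re` can match bytes patterns natively, so the detour through a string here is deliberate, to mirror the approach described above. `match_bytes` is a hypothetical helper.

```python
import re


def match_bytes(pattern, data):
    """Match a string-level pattern against bytes viewed as codepoints 0..255."""
    # latin-1 maps byte value n to codepoint n, so offsets are byte offsets.
    text = data.decode("latin-1")
    return re.search(pattern, text)


data = bytes([0x89, 0x50, 0x4E, 0x47, 0x0D, 0x0A])  # PNG-style header bytes
m = match_bytes("PNG\r\n", data)

# Going back to bytes is the same one-to-one mapping in reverse.
matched = m.group().encode("latin-1")
```

Since the mapping is one-to-one, match positions in the temporary string are directly usable as positions in the original buffer.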

One further question that falls out of this is the nature of Match objects if you match against a Buf or a Uni. Since substr is ill-defined on a Buf, we really cannot give back a Str from a Match object in the general case. Presumably we'd also expect to be able to get a Uni back when we're matching at that level (given substrings preserve normal form by Unicode spec, we can hand back the precise same type as we received).
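
The claim that substrings preserve normal form can be spot-checked by brute force. The following Python sketch does that for one sample string; the helper name and the sample are illustrative, and a check like this demonstrates the property rather than proving it.

```python
import unicodedata


def all_substrings_normalized(text, form):
    """True if every contiguous slice of text is already in the given form."""
    return all(
        unicodedata.normalize(form, text[i:j]) == text[i:j]
        for i in range(len(text) + 1)
        for j in range(i, len(text) + 1)
    )


sample = "\u00c5ngstr\u00f6m caf\u00e9"
nfd_text = unicodedata.normalize("NFD", sample)
nfc_text = unicodedata.normalize("NFC", sample)

nfd_closed = all_substrings_normalized(nfd_text, "NFD")
nfc_closed = all_substrings_normalized(nfc_text, "NFC")
```

This is what lets a Match against an NFD value hand back an NFD value without renormalizing the captured slice.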

Examples

This section tries to capture some things you might like to do and how they could look.

Decode UTF-8 to NFG

my $str = $buf.decode('utf-8'); # returns a Str

Encode NFG to utf-8 (defaults to NFC)

my $buf = $str.encode('utf-8'); # Returns a utf8

Encode NFG to utf-8 (specific normalization)

my $buf = $str.encode('utf-8', NFC);    # Returns a utf8
my $buf = $str.encode('utf-8', NFD);    # Returns a utf8
my $buf = $str.encode('utf-8', NFKC);   # Returns a utf8
my $buf = $str.encode('utf-8', NFKD);   # Returns a utf8
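
For comparison, the effect of the normalization argument on the resulting bytes can be seen with an existing library. This Python sketch is only an analogy to the proposed .encode signature. `encode_normalized` is a hypothetical helper, and its NFC default mirrors the default proposed above.

```python
import unicodedata


def encode_normalized(text, encoding="utf-8", form="NFC"):
    """Normalize to the requested form, then encode."""
    return unicodedata.normalize(form, text).encode(encoding)


text = "caf\u00e9"
nfc_bytes = encode_normalized(text)              # precomposed U+00E9
nfd_bytes = encode_normalized(text, form="NFD")  # "e" + U+0301
```

The same string comes out as five bytes under NFC and six under NFD, so the normalization form is part of the wire format and worth making explicit in the API.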

Create a Uni from a bunch of codepoints

my $uni = Uni.new(114, 117, 114);

Turn a Uni into an NFG string

my $str = $uni.Str;

Decode to a Uni, maybe with normalization

my $str = $buf.decode('utf-8');         # Decode as UTF-8, return a Str (NFG)
my $str = $buf.decode('utf-8', Str);    # Same thing
my $uni = $buf.decode('utf-8', Uni);    # Decode as UTF-8, do no normalization, return Uni
my $nfc = $buf.decode('utf-8', NFC);    # Decode as UTF-8, apply NFC, return NFC
my $nfd = $buf.decode('utf-8', NFD);    # Decode as UTF-8, apply NFD, return NFD
my $nfkc = $buf.decode('utf-8', NFKC);  # Decode as UTF-8, apply NFKC, return NFKC
my $nfkd = $buf.decode('utf-8', NFKD);  # Decode as UTF-8, apply NFKD, return NFKD