jnthn/encoding.md Secret

## encoding.md

      
# Encoding API

At the moment, encodings in Perl 6 are identified by string names. These
are provided by the backend. While using MoarVM/JVM "native" encoders is
fine for performance, it's also rather limiting and inflexible. It means
userspace encoders are not possible - or at least, if you write one, there
is no way to plug it in to the I/O system. On the other side of the coin,
streaming decoders are currently entirely tied up with I/O. There is no
way to use, for example, the VM-backed streaming decoder for UTF-8 to
decode an HTTP request body when Latin-1 must be used to decode the
headers (which tell you how the body should be processed).

Furthermore, the surface area of the VM I/O abstraction layer is
rather large as a result of the current design. Rather than having
sync and async I/O APIs that deal in bytes, plus a separate API for
VM-backed streaming decoding, we also end up with sync and async I/O
APIs that deal in strings. This pushes complexity into lower-level code
that would be better handled in higher-level code (for an example, see
the recently fixed issue with uncatchable decoding errors on async
sockets in character mode). It also somewhat complicates efforts to fix
the "use of handles between threads" issue.

The I/O VM API design we have today is in some ways descended from
Parrot's; the ports to the JVM and MoarVM just implemented the same
API rather than yak shaving. Now it's time to clear this up, and provide
more flexibility to Perl 6 users.

## A proposed encoding API

An encoding is represented by an object implementing the Encoding role.
Some methods in the role require an implementation; various defaults are
provided (and may be overridden for optimization purposes).

```perl6
role Encoding {
    # The name of the encoding.
    method name() returns Str { ... }

    # Other names that the encoding may be known by.
    method aliases() returns List { () }

    # The default replacement character for this encoding.
    method replacement-char() returns Str { '?' }

    # Encodes the specified string into bytes. If a replacement
    # is not provided then an unencodable character will result
    # in an error.
    method encode(Str:D $to-encode, Str :$replacement, *%options) returns Blob {
        %options<replacement> = .NFC with $replacement;
        self.encode-codes($to-encode.NFC, |%options)
    }

    # Encodes the specified Uni into bytes.
    method encode-codes(Uni:D $to-encode, Uni :$replacement) returns Blob { ... }

    # Decodes the specified bytes into a Str.
    method decode(Blob:D $to-decode, *%options) returns Str {
        my $decoder = self.decoder(|%options);
        $decoder.add-bytes($to-decode);
        return $decoder.consume-all-chars();
    }

    # Decodes the specified bytes into a Uni.
    method decode-codes(Blob:D $to-decode, *%options) returns Uni {
        my $decoder = self.decoder(|%options);
        $decoder.add-bytes($to-decode);
        return $decoder.consume-all-codes();
    }

    # Creates a new streaming Decoder object for this encoding. The
    # %options may include a replacement, which is a Str that
    # will be inserted as a replacement character if undecodable
    # bytes are encountered. If this is not provided, undecodable
    # bytes will throw an exception.
    method decoder(*%options) returns Encoding::Decoder { ... }
}
```
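
To illustrate the encoding side, here is a minimal sketch of what an `encode-codes` implementation for a single-byte encoding might do. It is written as a stand-alone sub so that it runs without the proposed role. The name `latin1-encode-codes` and its error message are hypothetical and not part of the proposal.

```perl6
# Hypothetical stand-alone helper: mirrors what encode-codes might do
# for Latin-1, where every codepoint up to U+00FF maps to one byte.
sub latin1-encode-codes(Uni:D $to-encode, Uni :$replacement --> Blob) {
    my @bytes;
    for $to-encode.list -> $code {
        if $code <= 0xFF {
            @bytes.push($code);
        }
        elsif $replacement.defined {
            # Assumes the replacement itself is encodable as Latin-1.
            @bytes.append($replacement.list);
        }
        else {
            # No replacement given: an unencodable character is an error.
            die "Cannot encode U+{$code.base(16)} as Latin-1";
        }
    }
    Blob.new(@bytes)
}

# Str input is normalized to NFC first, as the default encode method does.
my $plain    = latin1-encode-codes('café'.NFC);
my $replaced = latin1-encode-codes('€5'.NFC, replacement => '?'.NFC);
```

With a replacement supplied, the unencodable euro sign becomes `?`. Without one, the same call dies, matching the behaviour described for `encode`.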

Encoding is relatively straightforward: we have a bunch of graphemes
or codepoints and want a bunch of bytes representing them. Decoding is
harder, because we might receive bytes that only partially represent a
codepoint (with the remaining bytes to arrive later - perhaps in the
next network packet) or codepoints that partially represent a grapheme.
Therefore, we have a streaming decoder API. Further, it is designed to:

* Be useful for both push and pull scenarios (that is, async, when the
  next data will arrive whenever, and sync, when we should read bytes
  from disk/the network and top up the buffer).
* Cope with reading strings, codepoints, and bytes - possibly mixed.
  Therefore, line reading operations only decode/consume bytes up to
  the separator.

```perl6
role Encoding::Decoder {
    # Adds bytes that are to be decoded.
    method add-bytes(Blob:D $bytes --> Nil) { ... }

    # Decodes all characters (graphemes) that are available, assuming
    # that further bytes may arrive in the future (that is, no EOF
    # yet). Incomplete byte sequences will be left behind. If this is
    # an encoding that can represent combining characters, and the
    # last codepoint decoded is not a control character, it will
    # be held back in the normalization buffer since the next
    # bytes to arrive may represent a combining character.
    method consume-available-chars() returns Str { ... }

    # Decodes all characters (graphemes) that are available, assuming
    # that no further bytes will be added to the buffer. (This means
    # that any incomplete multi-byte sequences will cause an exception.)
    method consume-all-chars() returns Str { ... }

    # Decodes all Unicode codepoints that are available, assuming
    # that further bytes may arrive in the future (that is, no EOF
    # yet). Incomplete byte sequences will be left behind.
    method consume-available-codes() returns Uni { ... }

    # Decodes all Unicode codepoints that are available, assuming
    # that no further bytes will be added to the buffer. (This means
    # that any incomplete multi-byte sequences will cause an exception.)
    method consume-all-codes() returns Uni { ... }

    # Sets the line separators for the decoder. Passed as an array of
    # Str. Separators may include multiple characters.
    method set-line-separators(@seps --> Nil) { ... }

    # Decodes all characters up to the next line separator. The line
    # may be chomped (have the separator excluded). If $eof is set to
    # True then it will be assumed no more bytes are coming, and that
    # anything not followed by a separator will be considered as the
    # final line. Any incomplete multi-byte sequences should be
    # treated as an error. If $eof is False and the separator is not
    # found, then Nil will be returned.
    method consume-line-chars(Bool :$chomp = False, Bool :$eof = False)
        returns Str { ... }

    # Consumes n characters, provided they are available. If not,
    # returns Nil. (Note that the characters that were decoded will
    # have their decoding cached, and the bytes corresponding to
    # them will have been consumed.)
    method consume-chars(int $n) returns Str { ... }

    # Consumes n codepoints, provided they are available. If not,
    # returns Nil. (Note that the codepoints that were decoded will
    # have their decoding cached, and the bytes corresponding to
    # them will have been consumed.)
    method consume-codes(int $n) returns Uni { ... }

    # Consumes n bytes, provided they are available in the "to decode"
    # buffer. If they are not, returns Nil.
    method consume-bytes(int $n) returns Blob { ... }

    # Returns the number of undecoded bytes available in the decoder.
    method bytes-available() returns int { ... }

    # Returns True if there is nothing left in the decoder to read.
    method is-empty() returns Bool { ... }
}
```
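
A minimal usage sketch of a streaming decoder, covering a multi-byte sequence split across two chunks and line reading with a late-arriving separator. To make it runnable, it obtains UTF-8 decoders through `Encoding::Registry.find`, which is how later Rakudo releases look up built-in encodings; that lookup is an assumption here and not part of this proposal. Only method names that appear in the role above are used.

```perl6
# Assumption: Encoding::Registry comes from later Rakudo releases and
# is not part of this proposal.
my $decoder = Encoding::Registry.find('utf8').decoder;

# Push scenario: 'ï' is two bytes in UTF-8 (0xC3 0xAF), and the first
# chunk ends between them.
my $bytes = 'naïve'.encode('utf8');
$decoder.add-bytes($bytes.subbuf(0, 3));       # 'n', 'a', 0xC3
my $text = $decoder.consume-available-chars;   # incomplete sequence stays behind
$decoder.add-bytes($bytes.subbuf(3));          # 0xAF, 'v', 'e'
$text ~= $decoder.consume-all-chars;           # no more bytes are coming

# Line reading: a line is only handed out once its separator has
# arrived, unless :eof says that nothing more will follow.
my $lines = Encoding::Registry.find('utf8').decoder;
$lines.set-line-separators(["\r\n", "\n"]);
$lines.add-bytes("first\nsec".encode('utf8'));
my $line1 = $lines.consume-line-chars(:chomp); # a complete line
my $line2 = $lines.consume-line-chars(:chomp); # no separator yet: undefined
$lines.add-bytes("ond".encode('utf8'));
my $line3 = $lines.consume-line-chars(:chomp, :eof);
```

The caller never has to know where chunk boundaries fall. It keeps adding bytes and asks for whatever is decodable so far, which suits both the push (async) and pull (sync) scenarios.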

Perl 6 will also provide an Encoding::Normalizer class that offers
streaming normalization. Userspace encoding implementations will be able
to use this in order to return NFG strings. (XXX Design it.)

## Using encodings

We'll continue to allow the various :$enc options to be a Str, and
resolve it to one of the built-in encodings. However, we'll also allow
an object implementing the Encoding role to be provided instead, thus
allowing userspace encodings to be plugged in.
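
One way the "Str or Encoding object" resolution could look is sketched below. Everything in it is hypothetical: `Toy::Encoding` is a cut-down stand-in for the proposed role, and `Toy::Latin1`, the `%builtin` table and `resolve-encoding` are names invented for illustration.

```perl6
# Hypothetical stand-in for the proposed Encoding role, reduced to the
# two methods needed for lookup.
role Toy::Encoding {
    method name() returns Str { ... }
    method aliases() returns List { () }
}

class Toy::Latin1 does Toy::Encoding {
    method name() returns Str { 'iso-8859-1' }
    method aliases() returns List { ('latin-1', 'latin1') }
}

# Hypothetical table of built-in encodings, keyed by lower-cased
# name and aliases.
my $latin1 = Toy::Latin1.new;
my %builtin;
%builtin{.lc} = $latin1 for $latin1.name, |$latin1.aliases;

# A Str is looked up among the built-ins; an encoding object is
# passed through untouched, which is what lets userspace plug in.
multi sub resolve-encoding(Str:D $name) {
    %builtin{$name.lc} // die "Unknown encoding: $name";
}
multi sub resolve-encoding(Toy::Encoding:D $enc) { $enc }

my $by-name   = resolve-encoding('Latin-1');
my $by-object = resolve-encoding(Toy::Latin1.new);
```

Code that accepts `:$enc` would then only ever work with encoding objects after this one resolution step.
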

## Rough roadmap


* Implement the above roles for the encodings that we support today, using
  the VM-backed decoders.
* Make sure they have tests!
* Implement Encoding::Normalizer (with tests).
* Switch async socket I/O to use such encodings for its input/output (easy,
  as it doesn't have any encoding support today).
* Switch Proc::Async over to use the new Encoding role etc. This should get
  rid of all uses of the async string I/O API.
* Finish implementing the VM-backed decoder API on MoarVM (mostly done, just
  missing the Uni-level bits) and JVM.
* Refactor IO::Handle to only use binary I/O nqp:: ops, and use Encoding
  implementations to do I/O. (This will be the most user-impacting one, so
  will need some careful testing!)
* Write tests to make sure we can implement userspace encodings.
* Tweak NQP's I/O to avoid string-based I/O ops.
* Eliminate what we can from JVM/MoarVM.
* Re-work MoarVM's sync socket I/O to not depend on libuv, removing the
  limitation that handles can only be used on a single thread.
* Do similar for sync file I/O.
* Work out how we expose reading Uni (perhaps a :$norm to go with :$enc, which
  defaults to Str but can be set to NFC, NFD, etc.). To be really useful we'll
  also need to make Uni much more useful (so it supports stringy operations);
  that will need another bunch of planning! :-)