@jnthn
Last active September 14, 2016 22:15
Work in progress encodings changes/refactors

Encoding API

At the moment, encodings in Perl 6 are identified by string names, which are provided by the backend. While using MoarVM/JVM "native" encoders is fine for performance, it is also rather limiting and inflexible. Userspace encoders are not possible or, at least, if you write one, there is no way to plug it in to the I/O system. On the other side of the coin, streaming decoders are currently entirely tied up with I/O. There is no way, for example, to use the VM-backed streaming UTF-8 decoder on an HTTP request body when the headers (which tell you how the body should be processed) must first be decoded as Latin-1.

Furthermore, the current design makes the surface area of the VM I/O abstraction layer rather large. Rather than having sync and async I/O APIs that deal only in bytes, plus a separate API for VM-backed streaming decoding, we also end up with sync and async I/O APIs that deal in strings. This pushes complexity into lower-level code that would be better handled in higher-level code (for an example, see the recently fixed issue with uncatchable decoding errors on async sockets in character mode). It also somewhat complicates efforts to fix the "use of handles between threads" issue.

The VM I/O API design we have today is in some ways descended from Parrot's: the ports to the JVM and MoarVM simply implemented the same API rather than yak shaving. Now it's time to clear this up and provide more flexibility to Perl 6 users.

A proposed encoding API

An encoding is represented by an object implementing the Encoding role. Some methods in the role require an implementation; various defaults are provided (and may be overridden for optimization purposes).

role Encoding {
    # The name of the encoding.
    method name() returns Str { ... }

    # Other names that the encoding may be known by.
    method aliases() returns List { () }

    # The default replacement character for this encoding.
    method replacement-char() returns Str { '?' }

    # Encodes the specified string into bytes. If a replacement
    # is not provided then an unencodable character will result
    # in an error.
    method encode(Str:D $to-encode, Str :$replacement, *%options) returns Blob {
        %options<replacement> = .NFC with $replacement;
        self.encode-codes($to-encode.NFC, |%options)
    }

    # Encodes the specified Uni into bytes.
    method encode-codes(Uni:D $to-encode, Uni :$replacement) returns Blob { ... }

    # Decodes the specified bytes into a Str.
    method decode(Blob:D $to-decode, *%options) returns Str {
        my $decoder = self.decoder(|%options);
        $decoder.add-bytes($to-decode);
        return $decoder.consume-all-chars();
    }

    # Decodes the specified bytes into a Uni.
    method decode-codes(Blob:D $to-decode, *%options) returns Uni {
        my $decoder = self.decoder(|%options);
        $decoder.add-bytes($to-decode);
        return $decoder.consume-all-codes();
    }

    # Creates a new streaming Decoder object for this encoding. The
    # %options may include a replacement, which is a Str that
    # will be inserted as a replacement character if undecodable
    # bytes are encountered. If this is not provided, undecodable
    # bytes will throw an exception.
    method decoder(*%options) returns Encoding::Decoder { ... }
}
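
To make the encoding half of the role concrete, here is a minimal sketch of what a userspace Latin-1 encoder might look like. The class name Latin1Sketch is hypothetical, and the class only mirrors the method shapes of the proposed role rather than composing it, so the example stands on its own. It also assumes that any replacement passed in is itself encodable.

```perl6
# Hypothetical userspace encoding. It follows the method shapes of the
# proposed Encoding role but does not compose it, so the sketch stands alone.
class Latin1Sketch {
    method name() returns Str { 'iso-8859-1' }
    method aliases() returns List { ('latin-1', 'latin1') }

    # Latin-1 maps codepoints 0..255 directly onto single bytes.
    method encode-codes(Uni:D $to-encode, Uni :$replacement) returns Blob {
        my @bytes;
        for $to-encode.list -> $code {
            if $code <= 0xFF {
                @bytes.push($code);
            }
            orwith $replacement {
                # Assumes the replacement itself fits in Latin-1.
                @bytes.append($replacement.list);
            }
            else {
                die "Cannot encode U+{$code.base(16)} as Latin-1";
            }
        }
        Blob.new(@bytes)
    }
}

my $strict = Latin1Sketch.encode-codes('café'.NFC);
my $lossy  = Latin1Sketch.encode-codes('5 €'.NFC, replacement => '?'.NFC);
say $strict.list;   # (99 97 102 233)
say $lossy.list;    # (53 32 63)
```

Without a replacement, the second call would die on the euro sign, which matches the "unencodable character will result in an error" rule in the role's comments.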

Encoding is relatively straightforward: we have a bunch of graphemes or codepoints and want a bunch of bytes representing them. Decoding is harder, because we might receive bytes that only partially represent a codepoint (with the remaining bytes to arrive later - perhaps in the next network packet) or codepoints that partially represent a grapheme. Therefore, we have a streaming decoder API. Further, it is designed to:

  • Be useful for both push and pull scenarios (that is, async when the next data will arrive whenever, and sync when we should read bytes from disk/the network and top up the buffer).
  • Cope with reading strings, codepoints, and bytes - possibly mixed. Therefore, line reading operations only decode/consume bytes up to the separator.

role Encoding::Decoder {
    # Adds bytes that are to be decoded.
    method add-bytes(Blob:D $bytes --> Nil) { ... }

    # Decodes all characters (graphemes) that are available, assuming
    # that further bytes may arrive in the future (that is, no EOF
    # yet). Incomplete byte sequences will be left behind. If this is
    # an encoding that can represent combining characters, and the
    # last codepoint decoded is not a control character, it will
    # be held back in the normalization buffer since the next
    # bytes to arrive may represent a combining character.
    method consume-available-chars() returns Str { ... }

    # Decodes all characters (graphemes) that are available, assuming
    # that no further bytes will be added to the buffer. (This means
    # that any incomplete multi-byte sequences will cause an exception.)
    method consume-all-chars() returns Str { ... }

    # Decodes all Unicode codepoints that are available, assuming
    # that further bytes may arrive in the future (that is, no EOF
    # yet). Incomplete byte sequences will be left behind.
    method consume-available-codes() returns Uni { ... }

    # Decodes all Unicode codepoints that are available, assuming
    # that no further bytes will be added to the buffer. (This means
    # that any incomplete multi-byte sequences will cause an exception.)
    method consume-all-codes() returns Uni { ... }

    # Sets the line separators for the decoder. Passed as an array of
    # Str. Separators may include multiple characters.
    method set-line-separators(@seps --> Nil) { ... }

    # Decodes all characters up to the next line separator. The line
    # may be chomped (have the separator excluded). If $eof is set to
    # True then it will be assumed no more bytes are coming, and that
    # anything not followed by a separator will be considered as the
    # final line. Any incomplete multi-byte sequences should be
    # treated as an error. If $eof is False and the separator is not
    # found, then Nil will be returned.
    method consume-line-chars(Bool :$chomp = False, Bool :$eof = False)
        returns Str { ... }

    # Consumes n characters, provided they are available. If not,
    # returns Nil. (Note that the characters that were decoded will
    # have their decoding cached, and the bytes corresponding to
    # them will have been consumed.)
    method consume-chars(int $n) returns Str { ... }

    # Consumes n codepoints, provided they are available. If not,
    # returns Nil. (Note that the codepoints that were decoded will
    # have their decoding cached, and the bytes corresponding to
    # them will have been consumed.)
    method consume-codes(int $n) returns Uni { ... }

    # Consumes n bytes, provided they are available in the "to decode"
    # buffer. If they are not, returns Nil.
    method consume-bytes(int $n) returns Blob { ... }

    # Returns the number of undecoded bytes available in the decoder.
    method bytes-available() returns int { ... }

    # Returns True if there is nothing left in the decoder to read.
    method is-empty() returns Bool { ... }
}
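
The push scenario can be sketched roughly as follows. This assumes a runtime that hands out VM-backed decoders through Encoding::Registry.find, a lookup this document does not specify. It uses only methods named in the role above (add-bytes, consume-available-chars, consume-all-chars, set-line-separators, consume-line-chars). The example checks definedness rather than testing for Nil, since an implementation might signal "no line yet" with some other undefined value.

```perl6
# Assumption: built-in encodings are reachable via Encoding::Registry.find
# (not specified in this document).
my $decoder = Encoding::Registry.find('utf-8').decoder;

# "é" is the two bytes C3 A9 in UTF-8; split them across two chunks,
# as might happen at a network packet boundary.
my $bytes  = 'café'.encode('utf-8');
my $result = '';

$decoder.add-bytes($bytes.subbuf(0, 4));       # c a f C3
$result ~= $decoder.consume-available-chars;   # incomplete sequence stays buffered

$decoder.add-bytes($bytes.subbuf(4));          # A9
$result ~= $decoder.consume-all-chars;         # no more bytes coming: flush

say $result;                                   # café

# Line reading: a partial line is not handed out until its separator arrives.
my $lines = Encoding::Registry.find('utf-8').decoder;
$lines.set-line-separators(["\n"]);
$lines.add-bytes("one\ntw".encode('utf-8'));
my $first  = $lines.consume-line-chars(:chomp);   # complete line
my $second = $lines.consume-line-chars(:chomp);   # undefined: no separator yet
$lines.add-bytes("o\n".encode('utf-8'));
my $third  = $lines.consume-line-chars(:chomp);   # the rest of the second line
```

How much text the first consume-available-chars call returns depends on what the decoder holds back for normalization, so only the concatenated result is meaningful here.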

Perl 6 will also provide an Encoding::Normalizer class that offers streaming normalization. Userspace encoding implementations will be able to use this in order to return NFG strings. (XXX Design it.)

Using encodings

We'll continue to allow the various :$enc options to be a Str, and resolve it to one of the built-in encodings. However, we'll also allow an object implementing the Encoding role to be provided instead, so that userspace encodings can be plugged in.
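
One way the Str-or-object handling could be structured is with a pair of multi candidates. Everything named here is a hypothetical stand-in: EncodingSketch plays the part of the proposed Encoding role, %builtin the part of the backend's name lookup, and resolve-enc is an invented helper, not an existing routine.

```perl6
# Hypothetical stand-ins: EncodingSketch plays the part of the proposed
# Encoding role, and %builtin the part of the backend's name lookup.
role EncodingSketch { method name() returns Str { ... } }

class Utf8Stub  does EncodingSketch { method name() returns Str { 'utf-8' } }
class Rot13Stub does EncodingSketch { method name() returns Str { 'x-rot13' } }

my %builtin = 'utf-8' => Utf8Stub.new, 'utf8' => Utf8Stub.new;

# An encoding object passes straight through; a name is resolved
# to one of the built-ins, or rejected if it is unknown.
multi sub resolve-enc(EncodingSketch:D $enc) { $enc }
multi sub resolve-enc(Str:D $name) {
    %builtin{$name.lc} // die "Unknown encoding '$name'"
}

say resolve-enc('UTF-8').name;         # utf-8
say resolve-enc(Rot13Stub.new).name;   # x-rot13
```

With this shape, I/O code only ever sees an encoding object, whichever form the caller passed to :$enc.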

Rough roadmap

  • Implement the above roles for the encodings that we support today, using the VM-backed decoders.
  • Make sure they have tests!
  • Implement Encoding::Normalizer (with tests)
  • Switch async socket I/O to use such encodings for its input/output (easy, as it doesn't have any encoding support today).
  • Switch Proc::Async over to use the new Encoding role etc. This should get rid of all uses of the async string I/O API.
  • Finish implementing VM-backed decoder API on MoarVM (mostly done, just missing the Uni level bits) and JVM.
  • Refactor IO::Handle to only use binary I/O nqp:: ops, and use Encoding implementations to do IO. (This will be the most user-impacting one, so will need some careful testing!)
  • Write tests to make sure we can implement user-space encodings.
  • Tweak NQP's I/O to avoid string-based I/O ops
  • Eliminate what we can from JVM/MoarVM
  • Re-work MoarVM's sync socket I/O to not depend on libuv, removing the limitation involving using handles only on a single thread
  • Do similar for sync file I/O
  • Work out how we expose reading Uni (perhaps a :$norm to go with :$enc, which defaults to Str but can be set to NFC, NFD, etc.) For this to be really useful we'll also need to make Uni much more capable (so it supports stringy operations); that will need another bunch of planning! :-)

cygx commented Sep 14, 2016

I think a way to reset the decoder would be nice. One use case would be the ability to intermix char-based and byte-based access to files. When I made this work the last time (cf MoarVM/MoarVM#319), I just destroyed and re-created the decode streams as necessary...
