Since 4.06 we have UTF-X encoding support via the Buffer
module. The
following is a proposal to add UTF-X decoding (and bytes
encoding)
support with the following goals:
-
Provide a low-level, allocation-free codec API in
bytes
. This API provides all the information needed to implement loops for making higher-level UTF-X codec APIs (e.g.Uutf
's folding functions) operating onbytes
andstring
values. It's not geared towards the end-user but rather towards library programmers. -
Provide a less performant and less precise (no way to detect decoding errors;
Uchar.rep
replacement) decoding convenience API based on theUchar.t Seq.t
datatype for bothBytes
andString
.
In both Bytes
and String
we add the following high-level interface
(implemented via the low-level interface):
val utf_8_to_uchar_seq : ?start:int -> ?len:int -> t -> Uchar.t Seq.t
val utf_16le_to_uchar_seq : ?start:int -> ?len:int -> t -> Uchar.t Seq.t
val utf_16be_to_uchar_seq : ?start:int -> ?len:int -> t -> Uchar.t Seq.t
These are "best-effort" functions: any decoding error is substituted by
a Uchar.rep
character until a new synchronizing byte is found.
The utf_{8,16le,16be}_of_uchar_seq
functions would be a bit difficult to
provide due to dependencies (if implemented via Buffer
), so
we leave them out.
We could add Buffer.add_utf_{8,16le,16,be}_uchar_seq
but that's just
a Seq.iter
away with the existing Buffer.add_utf_{8,16le,16be}_uchar
.
So we stop here w.r.t. high-level API.
We begin with the following additions to the Uchar
module:
val utf_8_byte_length : Uchar.t -> int
val utf_16_byte_length : Uchar.t -> int
In the Bytes
module we then add the following:
exception Utf_error of int
(** The control flow exception raised by
[{get,set}_uchar_utf_{8,16le,16be}] functions. The integer is an
index where a new decode/encode can be tried or the length of the
buffer if the end was reached. {b Important.} This exception is used
for control flow you must not let it uncaught, it is raised with notrace. *)
val get_uchar_utf_8_length : t -> int -> int
(** [get_uchar_utf_8_length b i] is the byte length an UTF-8 encoded
character at index [i] in [b]. This is [0] if the byte is not a valid
UTF-8 starter byte. *)
val get_uchar_utf_8 : t -> int -> Uchar.t
(** [get_uchar_utf_8 b i] is the UTF-8 encoded Unicode character at index
[i] in [b]. The number of bytes consumed by the decode can be
determined by [get_uchar_utf_8_length b i]. In case of errors
@raise Utf_error [n] with [n] the next index were a new decode
can be tried or [length b] if there is no such index. *)
val get_uchar_utf_16le : t -> int -> Uchar.t
(** [get_uchar_utf_16le b i] is the UTF-16LE encoded Unicode character at
index [i] in [b]. The number of bytes consumed by the decode can
be determined by using {!Uchar.utf_16_byte_length} on the
result. @raise Utf_error [n] with [n] the next index were a new
decode can be tried or [length b] if there is no such index. *)
val get_uchar_utf_16be : t -> int -> Uchar.t
(** [get_uchar_utf_16be] is like {!get_uchar_utf16_le} but decodes
UTF-16BE. *)
val set_uchar_utf_8 : t -> int -> Uchar.t -> int
(** [set_uchar_utf_8 b i u] sets the UTF-8 encoding of [u] in [b]
starting at [i] and returns the next index were an encoding can
be performed or the buffer length if there is no such
index. @raise Utf_error [n] if there's not enough space for the
encode. *)
val set_utf_16le : t -> int -> Uchar.t -> int
(** [set_utf_16le] is like {!set_uchar_utf_8} but UTF-16LE encodes. *)
val set_utf_16be : t -> int -> Uchar.t -> int
(** [set_utf_16be] is like {!set_uchar_utf_8} but UTF-16BE encodes. *)
For codecing sequences of Uchar.t
as loops:
- The
set
functions are good. - The
get
functions incur a bit of overhead to compute the next index (the invoked function has the information but we can't return it without allocating). For UTF-8 either we either get the next index viaget_uchar_utf_8_length
(assuming a table-based implementation: two memory accesses) or withUchar.utf_8_byte_length
on the result (up to three branches). For UTF-16 we useUchar.utf_16_byte_length
(one branch).
Why not also include a version of the low-level "get" functions in
String
?