CAFxX/00_cpu_wishlist.md

## 00_cpu_wishlist.md

      
    Raw
  

              00_cpu_wishlist.md
            
          
    This document lists ISA or processor implementation wishlists/ideas.
They are mostly random ideas; they are not reviewed, guaranteed to work, or even just make sense, but at some point I had good use cases for each one of them.
Third party wishlists

ISA wishlists found elsewhere.

y-cruncher Technology Wishlist - ISA
Lockless Inc. X86 Instruction Wishlist
David Chisnall's CPU Feature Wishlist
Fast memcpy, A System Design (ACM SIGARCH)

Personal wishlist

ISA ideas

This section lists a collection of potential ideas for ISA extensions.
Native byte code support
  Support direct execution/interpretation of custom bytecode (in a sense, optimized support for threaded code).

Native global read/write barriers
  Support hooking up load/store instructions with custom behaviors.
  
This could be used to speed up GC, to enable sanitizers in production, or more in general to efficiently support safe language semantics. 

Instruction virtualization
  Support hooking up/override instructions with custom behaviors. This is effectively a superset/generalization of the previous point.
  
In addition to the previous usecases, this could be used for program instrumentation, virtualization, sandboxing, and so on.

Green threads support (quick stack switching)

Alternate code sequences
  Provide multiple equivalent targets for e.g. a CALL/JUMP, and let the processor dynamically pick the most efficient one based on sampling of past executions
  An example of when this may be useful is in the case of
if (a || b) {
  // ...
}
if b is side-effect free, this can be evaluated either as (respecting the short-circuit nature of ||)
if (a) {
  if (b) {
    // ...
  }
}
or as
if (b) { // swapped order of conditions
  if (a) {
    // ...
  }
}
or as
c = bool(a) | bool(b)
if (c) { // single conditional jump
  // ...
}
which one is going to be faster depends on the how predictable a and b are, how correlated they are (e.g. consider the case in which each is individually unpredictable, but bool(a)|bool(b) is highly predictable because a = !bool(b)), and how expensive it is to evaluate each of them.
With an alternate code sequence instructions (alt_inst_start/alt_inst_end) code could instead be emitted as
switch (alt_inst_start(2)) {
case 0:
  if (a) {
    if (b) {
      end1:
      alt_inst_end();
      // ..
    }
  }
  end2:
  alt_inst_end();
  break;
case 1:
  if (b) {
    if (a) {
      goto end1
    }
  }
  goto end2;  
case 2:
  if (bool(a) | bool(b)) {
    goto end1;
  }
  goto end2;
}
and this would enable the processor to dynamically sample how long each sequence takes on average, and then start to pick more frequently the one that, given the dynamic conditions found during execution, is faster on average.

Probabilistic jump
Conditional branch that jumps to its destination with a certain probability (e.g. 1/n) or with a certain rate (e.g. once every n microseconds)
if one_in(n) { // probability = 1/n
  // e.g. sampling/cleanup code
}

if once_every(t) { // taken ~ every t ns (if executed)
  // e.g. sampling/cleanup code
}
Supporting this natively in the ISA would allow CPUs to integrate it with the built-in RNG/timer and branch predictor to ensure that on average the jump is taken the appropriate number of times, while causing zero branch mispredictions and without the need for expensive interrupts.

Atomic operations on whole cachelines
  LL/SC instructions to support atomic load/store/cas of the contents of a whole cacheline (or a subset of it)

Memory operations
  Native implementation basic memory operations that can be potentially be pushed down to the memory controller or to the memory itself.
Note that some of these functions may already be partially or fully implemented by the V extension.

memcpy/memmove, extended to support element size and stride parameters for both source and destination (memcpy(dst, src, elem_sz, elem_cnt, dst_stride, src_stride) with the current memcpy behaviour being equivalent to memcpy(dst, src, sz, 1, sz, sz))
memset, extended to support multi-byte patterns (memset(dst, dst_len, src, src_len), with the current memset behavior being equivalent to memset(dst, dst_len, &value, 1)
memcmp, extended to support a mask parameter
memmem, extended to support a stride parameter (memmem(haystack, haystack_len, needle, needle_len, stride), with the current memmem behavior being equivalent to memmem(haystack, haystack_len, needle, needle_len, 1))
memswp(aptr, bptr, len) swap the contents of two memory areas
memchr, extended to support a needle parameter (memchr(haystack, haystack_len, stride, needle)) that allows to specify multiple bytes to search for, with the current memchr behavior being equivalent to bytemask needle = {0}; needle[c/8] = 1<<(c%8); memchr(haystack, haystack_len, 1, needle)
memsieve (TBD: filter out bytes)

The stride parameters indicates how far apart candidate matches should be. E.g., if stride is 4 then candidate matches are at offset 0, 4, 8, 12, ... from the starting address. A common case is having the stride equal to length of the needle (for memmem) and of the source (for memset), but the stride can be both shorter or longer than those (TODO: extend all functions to support backwards searches via negative strides).
Ideally we should support also the terminator variants to enable efficient operations on 0-terminated strings. Ideally we should also support arbitrary terminator lengths:

memccpy, extended to support terminator multi-byte patterns (memccpy(dst, src, len, stop, stop_len), with the current memccpy behavior being equivalent to memccpy(dst, src, len, &value, 1))
memcset, memccmp, memcmem, ...

A related idea, enabled by having some of these operations supported natively by the ISA, is described below in Bulk memory operations lowering.

Associative table lookup (Content-Addressable Memory)
  Implement a fully-associative table lookup instruction that can be used to quickly map sparse keys to values. 
Such an instruction could expose/reuse the TLB machinery to allow code to efficiently implement hashmaps, sparse LUTs, jump tables, global remapping or indirection, etc.
Note that the proposed memmem function, with the extension described in the previous point, could functionally perform the same task.

Range search
  Related to associative lookups, this instruction would - given a scalar and a SIMD vector - search for the range in the vector where the value is found.
Assuming a N elements vector (v[0], v[1], ..., v[N-1]) where v[i] <= v[i+1] for each i, the instruction would perform the following:
int search<typename T>(T s, T *v, int N) { // N > 0
  if (s < v[0])
    return 0;
  for (int i=1; i<N; i++)
    if (v[i-1] <= s && s < v[i])
      return i;
  return N;
}

This would be useful to implement range lookups, table lookups, jump tables, etc.
If the relationship v[i] <= v[i+1] is not valid in v, then the result is either a fault or undefined.

Deterministic timing mode
  Allow to temporarily disable all performance mechanisms that can be used as side-channels, e.g. all forms of speculation, superscalar execution, resource sharing, caching, variable instruction timing, etc.

Stack-based capability controls

Jump target instruction
  A jump target instruction is an instruction that marks valid jump targets. If a jump/call or other PC modification is executed, and the instruction at the new PC is not a jump target instruction, an exception is raised and execution is aborted.
This requires the definition of a variant of each jump/call instruction ("jump/call-and-check-target"), that instructs the processor to ensure that the next instruction executed (at the jump/call target) is a jump target instruction.
The jump target instruction should be a no-op if the last instruction executed was not a jump/call-and-check-target instruction (if this is not desirable, the jump target instruction could be immediately prefixed by an instruction that unconditionally raises an exception).

Text processing instructions
Note: some of these functionalities may be already partially or fully implemented by the RISC-V V extension.

Validating and converting between ASCII, UTF-8, UTF-16, and UTF-32, Unicode codepoints
Case-insensitive memory comparison (see memcmp in memory instructions)
Case conversion


Fast non-cryptographic hashing functions 
  To speed up workloads that require fast hashing of short fixed/variable-sized blobs (e.g. string hashing, pointer hashing, etc.), like hashmaps.

Generalized mixing function
  Generalized mixing function like
#include <type_traits>

enum op { opXor, opOr, opAnd };

template <typename T1, typename T2, op op,
        class = typename std::enable_if_t<std::is_unsigned_v<T1> &&
                                          std::is_unsigned_v<T2>>>
T1 mix(const T2 n, const T1 m[sizeof(T2) * 8],
     const T1 m2[sizeof(T2) * 8] = {}) {
  T1 out = n & 1 ? m[0] : m2[0];
  for (int i = 1; i < sizeof(T2) * 8; i++) {
      T1 f = n & (1 << i) ? m[i] : m2[i];
      switch (op) {
      case opXor: out ^= f; break;
      case opOr:  out |= f; break;
      case opAnd: out &= f; break;
      }
  }
  return out;
}
that can be used both for bit shuffling (shifts, rotations, pdep/pext, reverse, permutations, ...), other operations (clmul, ...) but especially as fast integer mixing step with ideal avalanche properties (see e.g. https://gist.github.com/badboy/6267743, http://burtleburtle.net/bob/hash/integer.html, https://marc-b-reynolds.github.io/math/2017/10/13/IntegerBijections.html, and https://github.com/skeeto/hash-prospector) for integer hashing.

Register dependency bulk clear
  Instructions to clear multiple registers to prevent register dependencies, e.g. regclear [imm32] where each bit of the immediate maps to each architectural register. If bit N in the immediate is 1, register N is cleared.; otherwise it's left untouched. Potentially could also have a regunset [imm32] variant that, instead of clearing, simply marks the registers as having "undefined value".

Number conversion instructions
  Instructions to convert from/to ASCII representation of integers and floats (itoa/atoi/ftoa/atof).

Masked move
  Perform a masked move to/from memory or register, optionally atomically. In pseudo-code:
dst = (src & mask) | (dst & ~mask)


Cross-process call
  Same-ring or cross-ring call that performs an efficient call to a procedure in a different address space.
Similar in functionality/concept to the Mill's portal calls, XPC or other efficient cross-address-space calls.

Branchless comparison instructions
  Branchless versions of common integer comparison sequences:


instruction
pseudocode


min(s|u)(8|16|32|64) D, S1, S2
D = S1 < S2 ? S1 : S2


max(s|u)(8|16|32|64) D, S1, S2
D = S1 > S2 ? S1 : S2


minmax(s|u)(8|16|32|64) SD1, SD2, S3
(SD1, SD2) = (SD1 < S3 ? SD1 : S3, SD2 > S3 ? SD2 : S3)


sort(s|u)(8|16|32|64) SD1, SD2
(SD1, SD2) = SD1 < SD2 ? (SD1, SD2) : (SD2, SD1)


clip(s|u)(8|16|32|64) SD1, S2, S3
SD1 = SD1 < S2 ? S2 : (SD1 > S3 ? S3 : SD1)


boundcheck(s|u)(8|16|32|64) SD1, S2, S3
SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0)


joob(s|u)(8|16|32|64) SD1, S2, S3, TPC
SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0); jnz SD1, TPC


For RISC-V, these could be pseudo-instructions with a canonical explicit form that can be reliably handled by macro-op fusion.
Also worth noting that Zicond now allows to code a min/max operation as a sequence of sltu/slt + czero.nez + czero.eqz + or, so most of the above can be coded with at most a couple of these sequences. It is unlikely (but not impossible) that CPUs will be able to perform macro-op merging across 4 instructions to recognize a min/max operation, and it almost guaranteed to be impossible to recognize the 8 instructions required for things like clip/minmax/sort/boundcheck.
The file cmp_unit.v contains a simple implementation of this (except joob). It consumes ~600 basic logic gates for 32 bits words, ~1200 for 64 bits.
Everything described above could be also extended to floating-point values (8, 16, 32, 64, 80, 128).
    
Implementation ideas

List of processor implementation ideas
Dual branch speculation
  When a 2-way branch is confidently predicted to be highly unpredictable (i.e. the prediction is that the branch is taken randomly 50% of times with no predictable pattern), speculatively execute both branches and then discard the one that was not taken (instead of picking one and then having to flush the pipelines and start over in ~half of the cases).
  
Some discussion here, even though there is no focus on limiting dual branch speculation only to branches that are extremely unpredicable. Tyson, Gary S. et al. “Limited Dual Path Execution” (2000) seems to be very similar to this idea.

Division/modulo strength reduction
  When a division/modulo instruction is frequently executed with a very small amount of possible divisors, the processor could compute their multiplicative inverse and store the inverses in a dedicated small cache. When a division is executed, the cache would be visited and, if it yields a hit, the division would be transparently strength-reduced to multiplications and shifts (or just shifts, in case of power-of-2 divisors).
This is normally done by compilers at compile-time, but only for constant divisors. It can be done at runtime using something like libdivide.

Data speculation
  Arithmetic and logical operations have specific input values for which the result is constant or for which the operation is an identity.
Examples include:

Constant result: x & 0 = 0, x | 0xFF..FF = 0xFF..FF, x * 0 = 0, x / x = 1, x ^ x = 0, ...
Identity: x & x = x, x & 0xFF..FF = x, x | 0 = x, x ^ 0 = x, x + 0 = x, x - 0 = x, x / 1 = x, x >> 0 = x, ...

Speculation could be extended to consider also the outputs of general operations (including logic/arithemtic operations and even loads from memory/cache). If an operation is predicted to yield a constant or identity, or if the result of the operation is somehow highly predictable, the operation result could be speculated so that subsequent operations could start executing before the results of the previous operation become available.

Shortcircuit conditional moves
Consider psuedocode like the following:
int x = 0;
if (some_condition)
  x = a + b; // assume that x has a long data dependency

it could be possible to compile it down to:
move x, 0                               // x = 0
add t, a, b                             // t = a + b
conditional_move x, t, some_condition   // if (some_condition) x = t

but this, at least in CPUs I have tested, causes the conditional_move to have a data dependencies on all of some_condition, x, and t, regardless of the value of some_condition.
Instead, as soon as some_condition is resolved, the data dependency on the value not used (i.e. either x or t, depending on the value of some_condition) should be dropped to let the conditional move execute immediately (as soon as the remaining value is resolved), and potentially any non-retired instructions the dropped value depends on as the sole user should be cancelled/flushed (to free up execution units).
In the example above, this could mean that e.g. as soon as some_condition is reolved to 0/false, the CPU should immediately drop the data dependency of the conditional move on t and, assuming x has retired already, retire immediately. At the same time, if the add is executing or queued for execution, it should be aborted and/or flushed from the execution pipeline (assuming there are no other uses of t).

Bulk memory operations lowering
Some fundamental memory operations, e.g. memset, memcpy, and memmove, could be lowered to equivalent bulk instructions to the memory subsystem to minimize data movement between memory and CPU, potentially decreasing power consumption and increasing performance.
By avoiding the data movement between memory and CPU, we reduce usage of shared resources both in the memory subsystem and in the CPU that can be used by other cores/threads instead. This would also help reduce energy consumption, as moving data from memory and back consumes significant energy. Furthermore the memory subsystem could potentially execute some of these operations with a much greater degree of parallelism, thereby reducing wall times required for the operations to complete.

Overlapping load/store merging
Recognize idioms/macros such as:
  // short non-overlapping memcpy of length off+W2
  load<W1> SRC, R1
  load<W2> SRC+off, R2 // off <= W1
  store<W1> R1, DST // SRC+off+W2 <= DST && DST+off+W2 <= SRC
  store<W2> R2, DST+off
  // optionally require explicitly killing R1 and R2

and internally execute a single operation that moves off+W2 bytes from SRC to DST (possibly subject to some additional constraints for the optimization to apply, e.g. the source and destination ranges should not individually cross cache lines)
The alternative would be new instructions like the one proposed in https://www.sigarch.org/fast-memcpy-a-system-design/.

Transparent memory compression
Allow to memory subsystem to transparently compress memory when possible (similar to in-memory texture compression).


## cmp_unit.v
// cmp_unit.v
// Branchless composite comparison logic unit
// (C) 2024 Carlo Alberto Ferraris

module cmp_unit (
    input [W-1:0] _a,
    input [W-1:0] _b,
    input [W-1:0] _c,
    input op_signed,
    input [1:0] op,
    input [1:0] l,
    input clk,

    output reg [W-1:0] x,
    output reg [W-1:0] y
);

    parameter W = 64;
    parameter ENABLE_OPERAND_EXTENSION = 0;

    localparam LT = 1<<(W-1);
    localparam EQ = 0;
    localparam GT = 1;

    function less;
        input [W-1:0] i;
        input [W-1:0] j;
        input op_signed;
        begin
            if (op_signed)
                less = $signed(i) < $signed(j);
            else
                less = $unsigned(i) < $unsigned(j);
        end
    endfunction

    function [W-1:0] ext;
        input [W-1:0] i;
        input op_signed;
        input [1:0] l; // only supported with ENABLE_OPERAND_EXTENSION

        begin
            case (l)
            0: begin
                if (op_signed)
                    ext = $signed(i[7:0]);
                else
                    ext = $unsigned(i[7:0]);
            end
            1: begin
                if (op_signed)
                    ext = $signed(i[15:0]);
                else
                    ext = $unsigned(i[15:0]);
            end
            2: begin
                if (op_signed)
                    ext = $signed(i[31:0]);
                else
                    ext = $unsigned(i[31:0]);
            end
            default: begin
                ext = i;
            end
            endcase
        end
    endfunction

    wire [W-1:0] a;
    wire [W-1:0] b;
    wire [W-1:0] c;

    always @(posedge clk) begin
        a = ENABLE_OPERAND_EXTENSION ? ext(_a, op_signed, l) : _a;
        b = ENABLE_OPERAND_EXTENSION ? ext(_b, op_signed, l) : _b;
        c = ENABLE_OPERAND_EXTENSION ? ext(_c, op_signed, l) : _c;

        case (op)
        0: begin /* cmp */
            x = less(a, b, op_signed) ? LT : ( a == b ? EQ : GT ); // cmp:ab
            y = less(a, c, op_signed) ? LT : ( a == c ? EQ : GT ); // cmp:ac
        end
        1: begin /* min, max, sort */
            // TODO: this is just a special case of (2), we could merge them
            x = less(a, b, op_signed) ? a : b; // min, sort:low
            y = less(a, b, op_signed) ? b : a; // max, sort:high
        end
        2: begin /* minmax */
            x = less(a, b, op_signed) ? a : b; // minmax:min
            y = less(a, c, op_signed) ? c : a; // minmax:max
        end
        3: begin /* clip, boundcheck */
            x = less(a, b, op_signed) ? b : (less(a, c, op_signed) ? a : c); // clip
            y = less(a, b, op_signed) ? LT : (less(a, c, op_signed) ? EQ : GT); // boundcheck
        end
        endcase
    end

endmodule

## lut.v.erb
<%
INW = 64
OUTW = 64
WW = 64
LEN = 64*8

raise "INW not power of 2" unless INW & (INW-1) == 0
Mw = Math.log2(INW)
%>

module lut(
    input [<%= INW-1 %>:0] in,
    input [<%= WW-1 %>:0] wlen,
    input [<%= LEN-1 %>:0] lut,
    output [<%= OUTW-1 %>:0] out,
    output ok
);

    always begin
        (* parallel_case, full_case *) case (wlen)
        <% (0..Mw).each {|w| W = 2**w %>
            <%= w %>: begin
                (* parallel_case, full_case *) case (in[<%= W-1 %>:0])
                <% (0...(LEN/W)).each {|o| %>
                    lut[<%= o*W+W-1 %>:<%= o*W %>]: begin out = <%= o %>; ok = 1; end
                <% } %>
                    default: begin out = 0; ok = 0; end
                endcase
            end
        <% } %>
            default: begin out = 0; ok = 0; end
        endcase
    end
endmodule

## match.v.erb
<%
INW = 64
OUTW = 64
LEN = 64*8
%>
module match(
    input [<%= INW-1 %>:0] in,
    input [<%= INW-1 %>:0] off,
    input [<%= INW-1 %>:0] nlen,
    input [<%= LEN+INW-8-1 %>:0] lut,
    output [<%= OUTW-1 %>:0] out,
    output ok
);
    always begin
        out = 0; ok = 0;
        (* parallel_case, full_case *) case (nlen)
        <% (1..(INW/8)).each {|w| W = w*8 %>
            <%= w %>: begin
                if (0) begin end
                <% (0...(LEN/8)).each {|o| O = o*8 %>
                    else if (in[<%= W-1 %>:0] == lut[<%= O+W-1 %>:<%= O %>] && <%= o %> >= off) begin out = <%= o %>; ok = 1; end
                <% } %>
            end
        <% } %>
            default: begin end
        endcase
    end
endmodule

## utf.v
// utf.v
// Combinatorial Unicode encoder/decoder for UTF-8, UTF-16, and UTF-32
// (C) 2022 Carlo Alberto Ferraris

module utf (
    input [31:0] in,
    input [3:0] op,
    input allow_surrogates, // false -> disallow (unicode-compliant); true -> allow to encode/decode code points in the surrogate ranges
    input allow_overlong, // false -> overlong encodings are invalid (unicode-compliant); true -> overlong encodings are tolerated
    input allow_supplemental_planes, // true -> allow (unicode-compliant); false -> disallow to encode/decode code points in the supplemental planes
    input allow_alien_codepoints, // false -> disallow (unicode-compliant); true -> allow codepoints higher than 0x10FFFF (0x1FFFFF for UTF-8, 0xFFFFFFFF for UTF-32)
    input big_endian, // for UTF-16 and UTF-32, whether to decode/encode big endian encodings
    output [31:0] out,
    output [2:0] delta,
    output ok);

    localparam [31:0] invalid = 32'hFFFD;
    wire [31:0] in1;
    wire [31:0] tmp;

    always begin
        in1 = in - 32'h10000;
        tmp = in;

        case (op)
        0 : (* hdlname = "utf8decode" *) begin : utf8decode
            casez (in)
            32'b0???????zzzzzzzzzzzzzzzzzzzzzzzz : begin
                ok = 1; delta = 1; out = in[30:24];
            end
            32'b10??????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
            32'b110?????10??????zzzzzzzzzzzzzzzz : begin
                ok = 1; delta = 2; out = { in[28:24], in[21:16] };
                if (out <= 32'h7F && !allow_overlong) begin ok = 0; out = invalid; end
            end
            32'b110?????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
            32'b1110????10??????10??????zzzzzzzz : begin
                ok = 1; delta = 3; out = { in[27:24], in[21:16], in[13:8] };
                if (out <= 32'h7FF && !allow_overlong) begin ok = 0; out = invalid; end
                else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
            end
            32'b1110????10??????zzzzzzzzzzzzzzzz : begin ok = 0; delta = 2; out = invalid; end
            32'b1110????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
            32'b11110???10??????10??????10?????? : begin
                ok = 1; delta = 4; out = { in[26:24], in[21:16], in[13:8], in[5:0] };
                if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
                else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
                else if (out > 32'hFFFF && out <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
                else if (out > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; out = invalid; end
            end
            32'b11110???10??????10??????zzzzzzzz : begin ok = 0; delta = 3; out = invalid; end
            32'b11110???10??????zzzzzzzzzzzzzzzz : begin ok = 0; delta = 2; out = invalid; end
            32'b11110???zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
            default                              : begin ok = 0; delta = 1; out = invalid; end
            endcase
        end
        1 : (* hdlname = "utf8decoderev" *) begin : utf8decoderev
            casez (in)
            32'bzzzzzzzzzzzzzzzzzzzzzzzz0??????? : begin
                ok = 1; delta = 1; out = in[6:0];
            end
            32'bzzzzzzzzzzzzzzzz110?????10?????? : begin
                ok = 1; delta = 2; out = { in[12:8], in[5:0] };
                if (out <= 32'h7F && !allow_overlong) begin ok = 0; out = invalid; end
            end
            32'bzzzzzzzzzzzzzzzzzzzzzzzz110????? : begin ok = 0; delta = 1; out = invalid; end
            32'bzzzzzzzz1110????10??????10?????? : begin
                ok = 1; delta = 3; out = { in[19:16], in[13:8], in[5:0] };
                if (out <= 32'h7FF && !allow_overlong) begin ok = 0; out = invalid; end
                else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
            end
            32'bzzzzzzzzzzzzzzzz1110????10?????? : begin ok = 0; delta = 2; out = invalid; end
            32'bzzzzzzzzzzzzzzzzzzzzzzzz1110???? : begin ok = 0; delta = 1; out = invalid; end
            32'b11110???10??????10??????10?????? : begin
                ok = 1; delta = 4; out = { in[26:24], in[21:16], in[13:8], in[5:0] };
                if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
                else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
                else if (out > 32'hFFFF && out <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
                else if (out > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; out = invalid; end
            end
            32'bzzzzzzzz11110???10??????10?????? : begin ok = 0; delta = 3; out = invalid; end
            32'bzzzzzzzzzzzzzzzz11110???10?????? : begin ok = 0; delta = 2; out = invalid; end
            32'bzzzzzzzzzzzzzzzzzzzzzzzz11110??? : begin ok = 0; delta = 1; out = invalid; end
            32'bzzzzzzzzzzzzzzzzzzzzzzzz10?????? : begin ok = 0; delta = 1; out = invalid; end
            default                              : begin ok = 0; delta = 1; out = invalid; end
            endcase
        end
        2 : (* hdlname = "utf8encode" *) begin : utf8encode
            if      (in <= 32'h7F)     begin ok = 1; delta = 1; out = { 1'b0, in[6:0], 24'b0 }; end
            else if (in <= 32'h7FF)    begin ok = 1; delta = 2; out = { 3'b110, in[10:6], 2'b10, in[5:0], 16'b0 }; end
            else if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
            else if (in <= 32'hFFFF)   begin ok = 1; delta = 3; out = { 4'b1110, in[15:12], 2'b10, in[11:6], 2'b10, in[5:0], 8'b0 }; end
            else if (in > 32'hFFFF && in <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
            else if (in > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
            else if (in <= 32'h1FFFFF) begin ok = 1; delta = 4; out = { 5'b11110, in[20:18], 2'b10, in[17:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
            else                       begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
        end
        3 : (* hdlname = "utf8encoderev" *) begin : utf8encoderev
            if      (in <= 32'h7F)     begin ok = 1; delta = 1; out = { 24'b0, 1'b0, in[6:0] }; end
            else if (in <= 32'h7FF)    begin ok = 1; delta = 2; out = { 16'b0, 3'b110, in[10:6], 2'b10, in[5:0] }; end
            else if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 3; out = 32'h00EFBFBD; end
            else if (in <= 32'hFFFF)   begin ok = 1; delta = 3; out = { 8'b0, 4'b1110, in[15:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
            else if (in > 32'hFFFF && in <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
            else if (in > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
            else if (in <= 32'h1FFFFF) begin ok = 1; delta = 4; out = { 5'b11110, in[20:18], 2'b10, in[17:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
            else                       begin ok = 0; delta = 3; out = 32'h00EFBFBD; end
        end
        4 : (* hdlname = "utf16decode" *) begin : utf16decode
            if (big_endian) begin
                tmp = { in[23:16], in[31:24], in[7:0], in[15:8] };
            end
            casez (tmp)
            32'b110110??????????110111?????????? : begin
                ok = 1; delta = 4; out = { tmp[25:16], tmp[9:0] } + 32'h10000;
                if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
                else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
                else if (out > 32'hFFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
            end
            32'b????????????????zzzzzzzzzzzzzzzz : begin
                ok = 1; delta = 2; out = tmp[31:16];
                if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
            end
            endcase
        end
        5 : (* hdlname = "utf16decoderev" *) begin : utf16decoderev
            if (big_endian) begin
                tmp = { in[23:16], in[31:24], in[7:0], in[15:8] };
            end
            casez (tmp)
            32'b110110??????????110111?????????? : begin
                ok = 1; delta = 4; out = { tmp[25:16], tmp[9:0] } + 32'h10000;
                if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
                else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
                else if (out > 32'hFFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
            end
            32'bzzzzzzzzzzzzzzzz???????????????? : begin
                ok = 1; delta = 2; out = tmp[15:0];
                if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
            end
            endcase
        end
        6 : (* hdlname = "utf16encode" *) begin : utf16encode
            if (in <= 32'hFFFF) begin
                if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin
                    ok = 0; delta = 2; tmp = { invalid[15:0], 16'b0 };
                end else begin
                    ok = 1; delta = 2; tmp = { in[15:0], 16'b0 };
                end
            end else begin
                if (in <= 32'h10FFFF && allow_supplemental_planes) begin
                    ok = 1; delta = 4; tmp = { 6'b110110, in1[19:10], 6'b110111, in1[9:0] };
                end else begin
                    ok = 0; delta = 2; tmp = { invalid[15:0], 16'b0 };
                end
            end
            out = tmp;
            if (big_endian) begin
                out = { tmp[23:16], tmp[31:24], tmp[7:0], tmp[15:8] };
            end
        end
        7 : (* hdlname = "utf16encoderev" *) begin : utf16encoderev
            if (in <= 32'hFFFF) begin
                if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin
                    ok = 0; delta = 2; tmp = { 16'b0, invalid[15:0] };
                end else begin
                    ok = 1; delta = 2; tmp = { 16'b0, in[15:0] };
                end
            end else begin
                if (in <= 32'h10FFFF && allow_supplemental_planes) begin
                    ok = 1; delta = 4; tmp = { 6'b110110, in1[19:10], 6'b110111, in1[9:0] };
                end else begin
                    ok = 0; delta = 2; tmp = { 16'b0, invalid[15:0] };
                end
            end
            out = tmp;
            if (big_endian) begin
                out = { tmp[23:16], tmp[31:24], tmp[7:0], tmp[15:8] };
            end
        end
        8, 9 : (* hdlname = "utf32decode" *) begin : utf32decode
            if (big_endian) begin
                tmp = { in[7:0], in[15:8], in[23:16], in[31:24] };
            end
            if (tmp >= 32'hD800 && tmp <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 4; out = invalid; end
            else if (tmp <= 32'hFFFF) begin ok = 1; delta = 4; out = tmp; end
            else if (tmp > 32'hFFFF && tmp <= 32'h10FFFF && allow_supplemental_planes) begin ok = 1; delta = 4; out = tmp; end
            else if (tmp > 32'h1FFFFF && allow_alien_codepoints) begin ok = 1; delta = 4; out = tmp; end
            else begin ok = 0; delta = 4; out = invalid; end
        end
        10, 11 : (* hdlname = "utf32encode" *) begin : utf32encode
            if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 4; tmp = invalid; end
            else if (in <= 32'hFFFF) begin ok = 1; delta = 4; tmp = in; end
            else if (in > 32'hFFFF && in <= 32'h10FFFF && allow_supplemental_planes) begin ok = 1; delta = 4; tmp = in; end
            else if (in > 32'h1FFFFF && allow_alien_codepoints) begin ok = 1; delta = 4; tmp = in; end
            else begin ok = 0; delta = 4; tmp = invalid; end
            out = tmp;
            if (big_endian) begin
                out = { tmp[7:0], tmp[15:8], tmp[23:16], tmp[31:24] };
            end
        end
        default: begin
            ok = x; delta = x; out = x;
        end
        endcase
    end
endmodule

## z_appendix.c
// Pseudo-code for the functions described above

// TODO: add temporal/nontemporal hints
#include <stdlib.h>
#include <stdint.h>

static size_t min(size_t a, size_t b) {
    return a < b ? a : b;
}

void* _memcpy(void * restrict dst, const void * restrict src, size_t len) {
  // TODO: handle overlap
  unsigned char *cdst = dst, *csrc = src;
  for (size_t p = 0; p < len; p++)
    cdst[p] = csrc[p];
  return dst;
}

void* _memcpyex(void * restrict dst, const void * restrict src, size_t elem_sz, size_t elem_cnt, size_t dst_stride, size_t src_stride) {
    void *rv = dst;
    for (size_t i=0; i<elem_cnt; i++) {
        _memcpy(dst, src, elem_sz);
        dst += dst_stride;
        src += src_stride;
    }
    return rv;
}

size_t _memcmp(void *a, void *b, size_t len) {
  const unsigned char mask = 0xFF; // TODO: extend to support a custom bitmask
  unsigned char *ca = a, *cb = b;
  for (size_t p = 0; p < len; p++) {
    if ((ca[p] & mask) != (cb[p] & mask))
      return p;
  }
  return len;
}

void* _memmem(void *haystack, size_t haystack_len, void *needle, size_t needle_len, size_t stride) {
  // TODO: handle negative stride
  for (size_t p = 0; p < haystack_len-needle_len; p += stride)
    if (_memcmp(haystack+p, needle, needle_len) == 0)
      return haystack+p;
  return NULL;
}

void _memset(void * restrict dst, size_t dst_len, void * restrict src, size_t src_len) {
    for (size_t p = 0; p < dst_len; p += src_len)
        _memcpy(dst+p, src, min(dst_len-p, src_len));
}

void _memswp(void * restrict a, void * restrict b, size_t len) {
  // TODO: handle overlap
  char ta, tb;
  for (size_t p = 0; p < len; p++) {
    _memcpy(&ta, a+p, 1);
    _memcpy(&tb, b+p, 1);
    _memcpy(a+p, &tb, 1);
    _memcpy(b+p, &ta, 1);
  }
}

typedef struct { const uint8_t x[256/8]; } bytemask __attribute__ ((aligned (64)));

void* _memchr(void *a, size_t len, size_t stride, bytemask needle) {
  for (size_t p = 0; p < len; p+=stride) {
    unsigned char b;
    _memcpy(&b, a+p, 1);
    if (needle.x[b/8]&(1<<(b%8)))
      return a+p;
  }
  return NULL;
}

size_t _memsieve(void *src, size_t src_len, void *dst, size_t dst_len, bytemask needle) {
  // TODO: add masking/non-gather support
  size_t d = 0;
  for (size_t s = 0; s < src_len && d < dst_len; s++) {
    unsigned char b;
    _memcpy(&b, src+s, 1);
    if (needle.x[b/8]&(1<<(b%8)))
      _memcpy(dst+(d++), &b, 1);
  }
  return d;
}
instruction	pseudocode
`min(s\|u)(8\|16\|32\|64) D, S1, S2`	`D = S1 < S2 ? S1 : S2`
`max(s\|u)(8\|16\|32\|64) D, S1, S2`	`D = S1 > S2 ? S1 : S2`
`minmax(s\|u)(8\|16\|32\|64) SD1, SD2, S3`	`(SD1, SD2) = (SD1 < S3 ? SD1 : S3, SD2 > S3 ? SD2 : S3)`
`sort(s\|u)(8\|16\|32\|64) SD1, SD2`	`(SD1, SD2) = SD1 < SD2 ? (SD1, SD2) : (SD2, SD1)`
`clip(s\|u)(8\|16\|32\|64) SD1, S2, S3`	`SD1 = SD1 < S2 ? S2 : (SD1 > S3 ? S3 : SD1)`
`boundcheck(s\|u)(8\|16\|32\|64) SD1, S2, S3`	`SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0)`
`joob(s\|u)(8\|16\|32\|64) SD1, S2, S3, TPC`	`SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0); jnz SD1, TPC`
	// cmp_unit.v
	// Branchless composite comparison logic unit
	// (C) 2024 Carlo Alberto Ferraris

	module cmp_unit (
	input [W-1:0] _a,
	input [W-1:0] _b,
	input [W-1:0] _c,
	input op_signed,
	input [1:0] op,
	input [1:0] l,
	input clk,

	output reg [W-1:0] x,
	output reg [W-1:0] y
	);

	parameter W = 64;
	parameter ENABLE_OPERAND_EXTENSION = 0;

	localparam LT = 1<<(W-1);
	localparam EQ = 0;
	localparam GT = 1;

	function less;
	input [W-1:0] i;
	input [W-1:0] j;
	input op_signed;
	begin
	if (op_signed)
	less = $signed(i) < $signed(j);
	else
	less = $unsigned(i) < $unsigned(j);
	end
	endfunction

	function [W-1:0] ext;
	input [W-1:0] i;
	input op_signed;
	input [1:0] l; // only supported with ENABLE_OPERAND_EXTENSION

	begin
	case (l)
	0: begin
	if (op_signed)
	ext = $signed(i[7:0]);
	else
	ext = $unsigned(i[7:0]);
	end
	1: begin
	if (op_signed)
	ext = $signed(i[15:0]);
	else
	ext = $unsigned(i[15:0]);
	end
	2: begin
	if (op_signed)
	ext = $signed(i[31:0]);
	else
	ext = $unsigned(i[31:0]);
	end
	default: begin
	ext = i;
	end
	endcase
	end
	endfunction

	wire [W-1:0] a;
	wire [W-1:0] b;
	wire [W-1:0] c;

	always @(posedge clk) begin
	a = ENABLE_OPERAND_EXTENSION ? ext(_a, op_signed, l) : _a;
	b = ENABLE_OPERAND_EXTENSION ? ext(_b, op_signed, l) : _b;
	c = ENABLE_OPERAND_EXTENSION ? ext(_c, op_signed, l) : _c;

	case (op)
	0: begin /* cmp */
	x = less(a, b, op_signed) ? LT : ( a == b ? EQ : GT ); // cmp:ab
	y = less(a, c, op_signed) ? LT : ( a == c ? EQ : GT ); // cmp:ac
	end
	1: begin /* min, max, sort */
	// TODO: this is just a special case of (2), we could merge them
	x = less(a, b, op_signed) ? a : b; // min, sort:low
	y = less(a, b, op_signed) ? b : a; // max, sort:high
	end
	2: begin /* minmax */
	x = less(a, b, op_signed) ? a : b; // minmax:min
	y = less(a, c, op_signed) ? c : a; // minmax:max
	end
	3: begin /* clip, boundcheck */
	x = less(a, b, op_signed) ? b : (less(a, c, op_signed) ? a : c); // clip
	y = less(a, b, op_signed) ? LT : (less(a, c, op_signed) ? EQ : GT); // boundcheck
	end
	endcase
	end

	endmodule
	<%
	INW = 64
	OUTW = 64
	WW = 64
	LEN = 64*8

	raise "INW not power of 2" unless INW & (INW-1) == 0
	Mw = Math.log2(INW)
	%>

	module lut(
	input [<%= INW-1 %>:0] in,
	input [<%= WW-1 %>:0] wlen,
	input [<%= LEN-1 %>:0] lut,
	output [<%= OUTW-1 %>:0] out,
	output ok
	);

	always begin
	(* parallel_case, full_case *) case (wlen)
	<% (0..Mw).each {\|w\| W = 2**w %>
	<%= w %>: begin
	(* parallel_case, full_case *) case (in[<%= W-1 %>:0])
	<% (0...(LEN/W)).each {\|o\| %>
	lut[<%= oW+W-1 %>:<%= oW %>]: begin out = <%= o %>; ok = 1; end
	<% } %>
	default: begin out = 0; ok = 0; end
	endcase
	end
	<% } %>
	default: begin out = 0; ok = 0; end
	endcase
	end
	endmodule
	// utf.v
	// Combinatorial Unicode encoder/decoder for UTF-8, UTF-16, and UTF-32
	// (C) 2022 Carlo Alberto Ferraris

	module utf (
	input [31:0] in,
	input [3:0] op,
	input allow_surrogates, // false -> disallow (unicode-compliant); true -> allow to encode/decode code points in the surrogate ranges
	input allow_overlong, // false -> overlong encodings are invalid (unicode-compliant); true -> overlong encodings are tolerated
	input allow_supplemental_planes, // true -> allow (unicode-compliant); false -> disallow to encode/decode code points in the supplemental planes
	input allow_alien_codepoints, // false -> disallow (unicode-compliant); true -> allow codepoints higher than 0x10FFFF (0x1FFFFF for UTF-8, 0xFFFFFFFF for UTF-32)
	input big_endian, // for UTF-16 and UTF-32, whether to decode/encode big endian encodings
	output [31:0] out,
	output [2:0] delta,
	output ok);

	localparam [31:0] invalid = 32'hFFFD;
	wire [31:0] in1;
	wire [31:0] tmp;

	always begin
	in1 = in - 32'h10000;
	tmp = in;

	case (op)
	0 : (* hdlname = "utf8decode" *) begin : utf8decode
	casez (in)
	32'b0???????zzzzzzzzzzzzzzzzzzzzzzzz : begin
	ok = 1; delta = 1; out = in[30:24];
	end
	32'b10??????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
	32'b110?????10??????zzzzzzzzzzzzzzzz : begin
	ok = 1; delta = 2; out = { in[28:24], in[21:16] };
	if (out <= 32'h7F && !allow_overlong) begin ok = 0; out = invalid; end
	end
	32'b110?????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
	32'b1110????10??????10??????zzzzzzzz : begin
	ok = 1; delta = 3; out = { in[27:24], in[21:16], in[13:8] };
	if (out <= 32'h7FF && !allow_overlong) begin ok = 0; out = invalid; end
	else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	end
	32'b1110????10??????zzzzzzzzzzzzzzzz : begin ok = 0; delta = 2; out = invalid; end
	32'b1110????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
	32'b11110???10??????10??????10?????? : begin
	ok = 1; delta = 4; out = { in[26:24], in[21:16], in[13:8], in[5:0] };
	if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
	else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	else if (out > 32'hFFFF && out <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
	else if (out > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; out = invalid; end
	end
	32'b11110???10??????10??????zzzzzzzz : begin ok = 0; delta = 3; out = invalid; end
	32'b11110???10??????zzzzzzzzzzzzzzzz : begin ok = 0; delta = 2; out = invalid; end
	32'b11110???zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
	default : begin ok = 0; delta = 1; out = invalid; end
	endcase
	end
	1 : (* hdlname = "utf8decoderev" *) begin : utf8decoderev
	casez (in)
	32'bzzzzzzzzzzzzzzzzzzzzzzzz0??????? : begin
	ok = 1; delta = 1; out = in[6:0];
	end
	32'bzzzzzzzzzzzzzzzz110?????10?????? : begin
	ok = 1; delta = 2; out = { in[12:8], in[5:0] };
	if (out <= 32'h7F && !allow_overlong) begin ok = 0; out = invalid; end
	end
	32'bzzzzzzzzzzzzzzzzzzzzzzzz110????? : begin ok = 0; delta = 1; out = invalid; end
	32'bzzzzzzzz1110????10??????10?????? : begin
	ok = 1; delta = 3; out = { in[19:16], in[13:8], in[5:0] };
	if (out <= 32'h7FF && !allow_overlong) begin ok = 0; out = invalid; end
	else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	end
	32'bzzzzzzzzzzzzzzzz1110????10?????? : begin ok = 0; delta = 2; out = invalid; end
	32'bzzzzzzzzzzzzzzzzzzzzzzzz1110???? : begin ok = 0; delta = 1; out = invalid; end
	32'b11110???10??????10??????10?????? : begin
	ok = 1; delta = 4; out = { in[26:24], in[21:16], in[13:8], in[5:0] };
	if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
	else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	else if (out > 32'hFFFF && out <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
	else if (out > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; out = invalid; end
	end
	32'bzzzzzzzz11110???10??????10?????? : begin ok = 0; delta = 3; out = invalid; end
	32'bzzzzzzzzzzzzzzzz11110???10?????? : begin ok = 0; delta = 2; out = invalid; end
	32'bzzzzzzzzzzzzzzzzzzzzzzzz11110??? : begin ok = 0; delta = 1; out = invalid; end
	32'bzzzzzzzzzzzzzzzzzzzzzzzz10?????? : begin ok = 0; delta = 1; out = invalid; end
	default : begin ok = 0; delta = 1; out = invalid; end
	endcase
	end
	2 : (* hdlname = "utf8encode" *) begin : utf8encode
	if (in <= 32'h7F) begin ok = 1; delta = 1; out = { 1'b0, in[6:0], 24'b0 }; end
	else if (in <= 32'h7FF) begin ok = 1; delta = 2; out = { 3'b110, in[10:6], 2'b10, in[5:0], 16'b0 }; end
	else if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
	else if (in <= 32'hFFFF) begin ok = 1; delta = 3; out = { 4'b1110, in[15:12], 2'b10, in[11:6], 2'b10, in[5:0], 8'b0 }; end
	else if (in > 32'hFFFF && in <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
	else if (in > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
	else if (in <= 32'h1FFFFF) begin ok = 1; delta = 4; out = { 5'b11110, in[20:18], 2'b10, in[17:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
	else begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
	end
	3 : (* hdlname = "utf8encoderev" *) begin : utf8encoderev
	if (in <= 32'h7F) begin ok = 1; delta = 1; out = { 24'b0, 1'b0, in[6:0] }; end
	else if (in <= 32'h7FF) begin ok = 1; delta = 2; out = { 16'b0, 3'b110, in[10:6], 2'b10, in[5:0] }; end
	else if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 3; out = 32'h00EFBFBD; end
	else if (in <= 32'hFFFF) begin ok = 1; delta = 3; out = { 8'b0, 4'b1110, in[15:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
	else if (in > 32'hFFFF && in <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
	else if (in > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
	else if (in <= 32'h1FFFFF) begin ok = 1; delta = 4; out = { 5'b11110, in[20:18], 2'b10, in[17:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
	else begin ok = 0; delta = 3; out = 32'h00EFBFBD; end
	end
	4 : (* hdlname = "utf16decode" *) begin : utf16decode
	if (big_endian) begin
	tmp = { in[23:16], in[31:24], in[7:0], in[15:8] };
	end
	casez (tmp)
	32'b110110??????????110111?????????? : begin
	ok = 1; delta = 4; out = { tmp[25:16], tmp[9:0] } + 32'h10000;
	if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
	else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	else if (out > 32'hFFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
	end
	32'b????????????????zzzzzzzzzzzzzzzz : begin
	ok = 1; delta = 2; out = tmp[31:16];
	if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	end
	endcase
	end
	5 : (* hdlname = "utf16decoderev" *) begin : utf16decoderev
	if (big_endian) begin
	tmp = { in[23:16], in[31:24], in[7:0], in[15:8] };
	end
	casez (tmp)
	32'b110110??????????110111?????????? : begin
	ok = 1; delta = 4; out = { tmp[25:16], tmp[9:0] } + 32'h10000;
	if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
	else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	else if (out > 32'hFFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
	end
	32'bzzzzzzzzzzzzzzzz???????????????? : begin
	ok = 1; delta = 2; out = tmp[15:0];
	if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
	end
	endcase
	end
	6 : (* hdlname = "utf16encode" *) begin : utf16encode
	if (in <= 32'hFFFF) begin
	if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin
	ok = 0; delta = 2; tmp = { invalid[15:0], 16'b0 };
	end else begin
	ok = 1; delta = 2; tmp = { in[15:0], 16'b0 };
	end
	end else begin
	if (in <= 32'h10FFFF && allow_supplemental_planes) begin
	ok = 1; delta = 4; tmp = { 6'b110110, in1[19:10], 6'b110111, in1[9:0] };
	end else begin
	ok = 0; delta = 2; tmp = { invalid[15:0], 16'b0 };
	end
	end
	out = tmp;
	if (big_endian) begin
	out = { tmp[23:16], tmp[31:24], tmp[7:0], tmp[15:8] };
	end
	end
	7 : (* hdlname = "utf16encoderev" *) begin : utf16encoderev
	if (in <= 32'hFFFF) begin
	if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin
	ok = 0; delta = 2; tmp = { 16'b0, invalid[15:0] };
	end else begin
	ok = 1; delta = 2; tmp = { 16'b0, in[15:0] };
	end
	end else begin
	if (in <= 32'h10FFFF && allow_supplemental_planes) begin
	ok = 1; delta = 4; tmp = { 6'b110110, in1[19:10], 6'b110111, in1[9:0] };
	end else begin
	ok = 0; delta = 2; tmp = { 16'b0, invalid[15:0] };
	end
	end
	out = tmp;
	if (big_endian) begin
	out = { tmp[23:16], tmp[31:24], tmp[7:0], tmp[15:8] };
	end
	end
	8, 9 : (* hdlname = "utf32decode" *) begin : utf32decode
	if (big_endian) begin
	tmp = { in[7:0], in[15:8], in[23:16], in[31:24] };
	end
	if (tmp >= 32'hD800 && tmp <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 4; out = invalid; end
	else if (tmp <= 32'hFFFF) begin ok = 1; delta = 4; out = tmp; end
	else if (tmp > 32'hFFFF && tmp <= 32'h10FFFF && allow_supplemental_planes) begin ok = 1; delta = 4; out = tmp; end
	else if (tmp > 32'h1FFFFF && allow_alien_codepoints) begin ok = 1; delta = 4; out = tmp; end
	else begin ok = 0; delta = 4; out = invalid; end
	end
	10, 11 : (* hdlname = "utf32encode" *) begin : utf32encode
	if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 4; tmp = invalid; end
	else if (in <= 32'hFFFF) begin ok = 1; delta = 4; tmp = in; end
	else if (in > 32'hFFFF && in <= 32'h10FFFF && allow_supplemental_planes) begin ok = 1; delta = 4; tmp = in; end
	else if (in > 32'h1FFFFF && allow_alien_codepoints) begin ok = 1; delta = 4; tmp = in; end
	else begin ok = 0; delta = 4; tmp = invalid; end
	out = tmp;
	if (big_endian) begin
	out = { tmp[7:0], tmp[15:8], tmp[23:16], tmp[31:24] };
	end
	end
	default: begin
	ok = x; delta = x; out = x;
	end
	endcase
	end
	endmodule
	// Pseudo-code for the functions described above

	// TODO: add temporal/nontemporal hints
	#include <stdlib.h>
	#include <stdint.h>

	static size_t min(size_t a, size_t b) {
	return a < b ? a : b;
	}

	void* _memcpy(void * restrict dst, const void * restrict src, size_t len) {
	// TODO: handle overlap
	unsigned char cdst = dst, csrc = src;
	for (size_t p = 0; p < len; p++)
	cdst[p] = csrc[p];
	return dst;
	}

	void* _memcpyex(void * restrict dst, const void * restrict src, size_t elem_sz, size_t elem_cnt, size_t dst_stride, size_t src_stride) {
	void *rv = dst;
	for (size_t i=0; i<elem_cnt; i++) {
	_memcpy(dst, src, elem_sz);
	dst += dst_stride;
	src += src_stride;
	}
	return rv;
	}

	size_t _memcmp(void a, void b, size_t len) {
	const unsigned char mask = 0xFF; // TODO: extend to support a custom bitmask
	unsigned char ca = a, cb = b;
	for (size_t p = 0; p < len; p++) {
	if ((ca[p] & mask) != (cb[p] & mask))
	return p;
	}
	return len;
	}

	void* _memmem(void haystack, size_t haystack_len, void needle, size_t needle_len, size_t stride) {
	// TODO: handle negative stride
	for (size_t p = 0; p < haystack_len-needle_len; p += stride)
	if (_memcmp(haystack+p, needle, needle_len) == 0)
	return haystack+p;
	return NULL;
	}

	void _memset(void * restrict dst, size_t dst_len, void * restrict src, size_t src_len) {
	for (size_t p = 0; p < dst_len; p += src_len)
	_memcpy(dst+p, src, min(dst_len-p, src_len));
	}

	void _memswp(void * restrict a, void * restrict b, size_t len) {
	// TODO: handle overlap
	char ta, tb;
	for (size_t p = 0; p < len; p++) {
	_memcpy(&ta, a+p, 1);
	_memcpy(&tb, b+p, 1);
	_memcpy(a+p, &tb, 1);
	_memcpy(b+p, &ta, 1);
	}
	}

	typedef struct { const uint8_t x[256/8]; } bytemask __attribute__ ((aligned (64)));

	void* _memchr(void *a, size_t len, size_t stride, bytemask needle) {
	for (size_t p = 0; p < len; p+=stride) {
	unsigned char b;
	_memcpy(&b, a+p, 1);
	if (needle.x[b/8]&(1<<(b%8)))
	return a+p;
	}
	return NULL;
	}

	size_t _memsieve(void src, size_t src_len, void dst, size_t dst_len, bytemask needle) {
	// TODO: add masking/non-gather support
	size_t d = 0;
	for (size_t s = 0; s < src_len && d < dst_len; s++) {
	unsigned char b;
	_memcpy(&b, src+s, 1);
	if (needle.x[b/8]&(1<<(b%8)))
	_memcpy(dst+(d++), &b, 1);
	}
	return d;
	}