@CAFxX
Last active November 18, 2024 05:20
CPU ISA and implementation wishlists

This document lists ISA or processor implementation wishlists/ideas.

They are mostly random ideas: they are not reviewed, nor guaranteed to work or even to make sense, but at some point I had a good use case for each one of them.

Third party wishlists

ISA wishlists found elsewhere.

Personal wishlist

ISA ideas

This section lists a collection of potential ideas for ISA extensions.

Native byte code support

Support direct execution/interpretation of custom bytecode (in a sense, optimized support for threaded code).

Native global read/write barriers

Support hooking up load/store instructions with custom behaviors.

This could be used to speed up GC, to enable sanitizers in production, or more in general to efficiently support safe language semantics.

Instruction virtualization

Support hooking up/override instructions with custom behaviors. This is effectively a superset/generalization of the previous point.

In addition to the previous use cases, this could be used for program instrumentation, virtualization, sandboxing, and so on.

Green threads support (quick stack switching)
Alternate code sequences

Provide multiple equivalent targets for e.g. a CALL/JUMP, and let the processor dynamically pick the most efficient one based on sampling of past executions. An example of when this may be useful is the case of
if (a || b) {
  // ...
}

if b is side-effect free, this can be evaluated either as (respecting the short-circuit nature of ||)

if (a) {
  if (b) {
    // ...
  }
}

or as

if (b) { // swapped order of conditions
  if (a) {
    // ...
  }
}

or as

c = bool(a) | bool(b)
if (c) { // single conditional jump
  // ...
}

which one is going to be faster depends on how predictable a and b are, how correlated they are (e.g. consider the case in which each is individually unpredictable, but bool(a)|bool(b) is highly predictable because a == !bool(b)), and how expensive it is to evaluate each of them.

With alternate code sequence instructions (alt_inst_start/alt_inst_end), the code could instead be emitted as

switch (alt_inst_start(2)) {
case 0:
  if (a) {
    if (b) {
      end1:
      alt_inst_end();
      // ...
    }
  }
  end2:
  alt_inst_end();
  break;
case 1:
  if (b) {
    if (a) {
      goto end1;
    }
  }
  goto end2;  
case 2:
  if (bool(a) | bool(b)) {
    goto end1;
  }
  goto end2;
}

and this would enable the processor to dynamically sample how long each sequence takes on average, and then preferentially pick the one that, given the dynamic conditions found during execution, is faster on average.

Probabilistic jump

Conditional branch that jumps to its destination with a certain probability (e.g. 1/n) or at a certain rate (e.g. once every t nanoseconds)
if one_in(n) { // probability = 1/n
  // e.g. sampling/cleanup code
}

if once_every(t) { // taken ~ every t ns (if executed)
  // e.g. sampling/cleanup code
}

Supporting this natively in the ISA would allow CPUs to integrate it with the built-in RNG/timer and branch predictor to ensure that on average the jump is taken the appropriate number of times, while causing zero branch mispredictions and without the need for expensive interrupts.

Atomic operations on whole cachelines

LL/SC instructions to support atomic load/store/cas of the contents of a whole cacheline (or a subset of it)

Memory operations

Native implementation of basic memory operations that can potentially be pushed down to the memory controller or to the memory itself.

Note that some of these functions may already be partially or fully implemented by the V extension.

  • memcpy/memmove, extended to support element size and stride parameters for both source and destination (memcpy(dst, src, elem_sz, elem_cnt, dst_stride, src_stride), with the current memcpy behavior being equivalent to memcpy(dst, src, sz, 1, sz, sz))
  • memset, extended to support multi-byte patterns (memset(dst, dst_len, src, src_len), with the current memset behavior being equivalent to memset(dst, dst_len, &value, 1))
  • memcmp, extended to support a mask parameter
  • memmem, extended to support a stride parameter (memmem(haystack, haystack_len, needle, needle_len, stride), with the current memmem behavior being equivalent to memmem(haystack, haystack_len, needle, needle_len, 1))
  • memswp(aptr, bptr, len) swap the contents of two memory areas
  • memchr, extended to support a needle parameter (memchr(haystack, haystack_len, stride, needle)) that allows specifying multiple bytes to search for, with the current memchr behavior being equivalent to bytemask needle = {0}; needle.x[c/8] = 1<<(c%8); memchr(haystack, haystack_len, 1, needle)
  • memsieve (TBD: filter out bytes)

The stride parameter indicates how far apart candidate matches should be. E.g., if stride is 4 then candidate matches are at offsets 0, 4, 8, 12, ... from the starting address. A common case is having the stride equal to the length of the needle (for memmem) or of the source (for memset), but the stride can be either shorter or longer than those (TODO: extend all functions to support backwards searches via negative strides).

Ideally we should also support the terminator variants, to enable efficient operations on 0-terminated strings, as well as arbitrary terminator lengths:

  • memccpy, extended to support terminator multi-byte patterns (memccpy(dst, src, len, stop, stop_len), with the current memccpy behavior being equivalent to memccpy(dst, src, len, &value, 1))
  • memcset, memccmp, memcmem, ...

A related idea, enabled by having some of these operations supported natively by the ISA, is described below in Bulk memory operations lowering.

Some of these instructions would benefit from strided memory access acceleration (see implementation ideas section).

Strided memory access mode

Allow any memory access instruction - as well as the D-caches - to use a strided access mode, to be able to transparently and efficiently perform SoA/AoS transforms.

This would benefit from strided memory access acceleration (see implementation ideas section).

Associative table lookup (Content-Addressable Memory)

Implement a fully-associative table lookup instruction that can be used to quickly map sparse keys to values.

Such an instruction could expose/reuse the TLB machinery to allow code to efficiently implement hashmaps, sparse LUTs, jump tables, global remapping or indirection, etc.

Note that the proposed memmem function, with the extension described in the previous point, could functionally perform the same task.

Range search

Related to associative lookups, this instruction would - given a scalar and a SIMD vector - search for the range in the vector where the value is found.

Assuming an N-element vector (v[0], v[1], ..., v[N-1]) where v[i] <= v[i+1] for each i, the instruction would perform the following:

template <typename T> int search(T s, T *v, int N) { // N > 0
  if (s < v[0])
    return 0;
  for (int i=1; i<N; i++)
    if (v[i-1] <= s && s < v[i])
      return i;
  return N;
}

This would be useful to implement range lookups, table lookups, jump tables, etc.

If the relationship v[i] <= v[i+1] does not hold for some element of v, then the result is either a fault or undefined.

Deterministic timing mode

Allow temporarily disabling all performance mechanisms that can be used as side-channels, e.g. all forms of speculation, superscalar execution, resource sharing, caching, variable instruction timing, etc.

Stack-based capability controls

Jump target instruction

A jump target instruction is an instruction that marks valid jump targets (similar in spirit to Intel CET's ENDBR or ARM's BTI). If a jump/call or other PC modification is executed, and the instruction at the new PC is not a jump target instruction, an exception is raised and execution is aborted.

This requires the definition of a variant of each jump/call instruction ("jump/call-and-check-target"), that instructs the processor to ensure that the next instruction executed (at the jump/call target) is a jump target instruction.

The jump target instruction should be a no-op if the last instruction executed was not a jump/call-and-check-target instruction (if this is not desirable, the jump target instruction could be immediately prefixed by an instruction that unconditionally raises an exception).

Text processing instructions

Note: some of these functionalities may be already partially or fully implemented by the RISC-V V extension.

Fast non-cryptographic hashing functions

To speed up workloads that require fast hashing of short fixed/variable-sized blobs (e.g. string hashing, pointer hashing, etc.), like hashmaps.

Generalized mixing function

A generalized mixing function like
#include <type_traits>

enum op { opXor, opOr, opAnd };

template <typename T1, typename T2, op op,
        class = typename std::enable_if_t<std::is_unsigned_v<T1> &&
                                          std::is_unsigned_v<T2>>>
T1 mix(const T2 n, const T1 m[sizeof(T2) * 8],
     const T1 m2[sizeof(T2) * 8]) { // m2: values used for clear bits (all-zeros in the common case)
  T1 out = n & 1 ? m[0] : m2[0];
  for (unsigned i = 1; i < sizeof(T2) * 8; i++) {
      T1 f = n & ((T2)1 << i) ? m[i] : m2[i]; // widen before shifting to avoid overflow
      switch (op) {
      case opXor: out ^= f; break;
      case opOr:  out |= f; break;
      case opAnd: out &= f; break;
      }
  }
  return out;
}

that can be used for bit shuffling (shifts, rotations, pdep/pext, reverse, permutations, ...) and other operations (clmul, ...), but especially as a fast integer mixing step with ideal avalanche properties for integer hashing (see e.g. https://gist.github.com/badboy/6267743, http://burtleburtle.net/bob/hash/integer.html, https://marc-b-reynolds.github.io/math/2017/10/13/IntegerBijections.html, and https://github.com/skeeto/hash-prospector).

Register dependency bulk clear

Instructions to clear multiple registers at once to prevent register dependencies, e.g. regclear [imm32] where each bit of the immediate maps to an architectural register. If bit N in the immediate is 1, register N is cleared; otherwise it is left untouched. Potentially there could also be a regunset [imm32] variant that, instead of clearing, simply marks the registers as having "undefined value".
Number conversion instructions

Instructions to convert from/to the ASCII representation of integers and floats (itoa/atoi/ftoa/atof).
Masked move

Perform a masked move to/from memory or a register, optionally atomically. In pseudo-code:
dst = (src & mask) | (dst & ~mask)
Cross-process call

Same-ring or cross-ring call that performs an efficient call to a procedure in a different address space.

Similar in functionality/concept to the Mill's portal calls, XPC or other efficient cross-address-space calls.

Branchless comparison instructions

Branchless versions of common integer comparison sequences:

instruction                                  pseudocode
min(s|u)(8|16|32|64) D, S1, S2               D = S1 < S2 ? S1 : S2
max(s|u)(8|16|32|64) D, S1, S2               D = S1 > S2 ? S1 : S2
minmax(s|u)(8|16|32|64) SD1, SD2, S3         (SD1, SD2) = (SD1 < S3 ? SD1 : S3, SD2 > S3 ? SD2 : S3)
sort(s|u)(8|16|32|64) SD1, SD2               (SD1, SD2) = SD1 < SD2 ? (SD1, SD2) : (SD2, SD1)
clip(s|u)(8|16|32|64) SD1, S2, S3            SD1 = SD1 < S2 ? S2 : (SD1 > S3 ? S3 : SD1)
boundcheck(s|u)(8|16|32|64) SD1, S2, S3      SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0)
joob(s|u)(8|16|32|64) SD1, S2, S3, TPC       SD1 = SD1 < S2 ? -1 : (SD1 >= S3 ? 1 : 0); jnz SD1, TPC

For RISC-V, these could be pseudo-instructions with a canonical explicit form that can be reliably handled by macro-op fusion.

Also worth noting that Zicond now allows coding a min/max operation as a sequence of sltu/slt + czero.nez + czero.eqz + or, so most of the above can be coded with at most a couple of these sequences. It is unlikely (but not impossible) that CPUs will be able to perform macro-op fusion across 4 instructions to recognize a min/max operation, and it is almost guaranteed to be impossible to recognize the 8 instructions required for things like clip/minmax/sort/boundcheck.

The file cmp_unit.v below contains a simple implementation of this (except joob). It consumes ~600 basic logic gates for 32-bit words, ~1200 for 64-bit ones.

Everything described above could be also extended to floating-point values (8, 16, 32, 64, 80, 128).

Implementation ideas

List of processor implementation ideas

Dual branch speculation

When a 2-way branch is confidently predicted to be highly unpredictable (i.e. the prediction is that the branch is taken randomly 50% of times with no predictable pattern), speculatively execute both branches and then discard the one that was not taken (instead of picking one and then having to flush the pipelines and start over in ~half of the cases).

Some discussion here, even though there is no focus on limiting dual branch speculation only to branches that are extremely unpredictable. Tyson, Gary S. et al., “Limited Dual Path Execution” (2000) seems to be very similar to this idea.

Division/modulo strength reduction

When a division/modulo instruction is frequently executed with a small set of possible divisors, the processor could compute their multiplicative inverses and store them in a dedicated small cache. When a division is executed, the cache would be consulted and, on a hit, the division would be transparently strength-reduced to multiplications and shifts (or just shifts, in the case of power-of-2 divisors).

This is normally done by compilers at compile-time, but only for constant divisors. It can be done at runtime using something like libdivide.

Data speculation

Arithmetic and logical operations have specific input values for which the result is constant or for which the operation is an identity.

Examples include:

  • Constant result: x & 0 = 0, x | 0xFF..FF = 0xFF..FF, x * 0 = 0, x / x = 1, x ^ x = 0, ...
  • Identity: x & x = x, x & 0xFF..FF = x, x | 0 = x, x ^ 0 = x, x + 0 = x, x - 0 = x, x / 1 = x, x >> 0 = x, ...

Speculation could be extended to also consider the outputs of general operations (including logic/arithmetic operations and even loads from memory/cache). If an operation is predicted to yield a constant or identity, or if the result of the operation is somehow highly predictable, the operation result could be speculated so that subsequent operations could start executing before the results of the previous operation become available.

Shortcircuit conditional moves

Consider pseudocode like the following:
int x = 0;
if (some_condition)
  x = a + b; // assume that x has a long data dependency

it could be possible to compile it down to:

move x, 0                               // x = 0
add t, a, b                             // t = a + b
conditional_move x, t, some_condition   // if (some_condition) x = t

but this, at least on the CPUs I have tested, causes the conditional_move to have a data dependency on all of some_condition, x, and t, regardless of the value of some_condition.

Instead, as soon as some_condition is resolved, the data dependency on the value not used (i.e. either x or t, depending on the value of some_condition) should be dropped to let the conditional move execute immediately (as soon as the remaining value is resolved), and potentially any non-retired instructions the dropped value depends on as the sole user should be cancelled/flushed (to free up execution units).

In the example above, this could mean that e.g. as soon as some_condition is resolved to 0/false, the CPU should immediately drop the data dependency of the conditional move on t and, assuming x has already retired, retire immediately. At the same time, if the add is executing or queued for execution, it should be aborted and/or flushed from the execution pipeline (assuming there are no other uses of t).

Bulk memory operations lowering

Some fundamental memory operations, e.g. memset, memcpy, and memmove, could be lowered to equivalent bulk instructions to the memory subsystem to minimize data movement between memory and CPU, potentially decreasing power consumption and increasing performance.

By avoiding the data movement between memory and CPU, we reduce usage of shared resources both in the memory subsystem and in the CPU that can be used by other cores/threads instead. This would also help reduce energy consumption, as moving data from memory and back consumes significant energy. Furthermore the memory subsystem could potentially execute some of these operations with a much greater degree of parallelism, thereby reducing wall times required for the operations to complete.

Overlapping load/store merging

Recognize idioms/macros such as:

  // short non-overlapping memcpy of length off+W2
  load<W1> SRC, R1
  load<W2> SRC+off, R2 // off <= W1
  store<W1> R1, DST // SRC+off+W2 <= DST && DST+off+W2 <= SRC
  store<W2> R2, DST+off
  // optionally require explicitly killing R1 and R2

and internally execute a single operation that moves off+W2 bytes from SRC to DST (possibly subject to some additional constraints for the optimization to apply, e.g. the source and destination ranges should not individually cross cache lines)

The alternative would be new instructions like the one proposed in https://www.sigarch.org/fast-memcpy-a-system-design/.

Transparent memory compression

Allow the memory subsystem to transparently compress memory when possible (similar to in-memory texture compression).

Strided access memory acceleration

Related to the strided memory access instructions: see https://dl.acm.org/doi/fullHtml/10.1145/3466752.3480091.

Reassociate operations with data dependencies

Reassociable operations that are part of a data dependency chain should be reassociated transparently to minimize the latency due to the data dependency.

For example:

  OR r0, r1
  OR r0, r2
  OR r0, r3

should be transparently reassociated, to minimize stalls waiting for the data dependency on r0, to the equivalent microcode of:

  OR r0, r1
  MOV rX, r2 # rX is an unused non-architectural register
  OR rX, r3
  OR r0, rX

This way the dependency is only across 2 instructions, not 3 as in the first case.

This would be applicable to other operations beyond OR, e.g. AND/XOR, and also to some arithmetic operations like ADD/SUB, always assuming the intermediate r0 results are not otherwise used.

For an example of this, see the OR reduction in the hot loop of https://godbolt.org/z/o7frnM14d.

// cmp_unit.v
// Branchless composite comparison logic unit
// (C) 2024 Carlo Alberto Ferraris
module cmp_unit (
input [W-1:0] _a,
input [W-1:0] _b,
input [W-1:0] _c,
input op_signed,
input [1:0] op,
input [1:0] l,
input clk,
output reg [W-1:0] x,
output reg [W-1:0] y
);
parameter W = 64;
parameter ENABLE_OPERAND_EXTENSION = 0;
localparam LT = 1<<(W-1);
localparam EQ = 0;
localparam GT = 1;
function less;
input [W-1:0] i;
input [W-1:0] j;
input op_signed;
begin
if (op_signed)
less = $signed(i) < $signed(j);
else
less = $unsigned(i) < $unsigned(j);
end
endfunction
function [W-1:0] ext;
input [W-1:0] i;
input op_signed;
input [1:0] l; // only supported with ENABLE_OPERAND_EXTENSION
begin
case (l)
0: begin
if (op_signed)
ext = $signed(i[7:0]);
else
ext = $unsigned(i[7:0]);
end
1: begin
if (op_signed)
ext = $signed(i[15:0]);
else
ext = $unsigned(i[15:0]);
end
2: begin
if (op_signed)
ext = $signed(i[31:0]);
else
ext = $unsigned(i[31:0]);
end
default: begin
ext = i;
end
endcase
end
endfunction
reg [W-1:0] a; // assigned in an always block, so these must be reg, not wire
reg [W-1:0] b;
reg [W-1:0] c;
always @(posedge clk) begin
a = ENABLE_OPERAND_EXTENSION ? ext(_a, op_signed, l) : _a;
b = ENABLE_OPERAND_EXTENSION ? ext(_b, op_signed, l) : _b;
c = ENABLE_OPERAND_EXTENSION ? ext(_c, op_signed, l) : _c;
case (op)
0: begin /* cmp */
x = less(a, b, op_signed) ? LT : ( a == b ? EQ : GT ); // cmp:ab
y = less(a, c, op_signed) ? LT : ( a == c ? EQ : GT ); // cmp:ac
end
1: begin /* min, max, sort */
// TODO: this is just a special case of (2), we could merge them
x = less(a, b, op_signed) ? a : b; // min, sort:low
y = less(a, b, op_signed) ? b : a; // max, sort:high
end
2: begin /* minmax */
x = less(a, b, op_signed) ? a : b; // minmax:min
y = less(a, c, op_signed) ? c : a; // minmax:max
end
3: begin /* clip, boundcheck */
x = less(a, b, op_signed) ? b : (less(a, c, op_signed) ? a : c); // clip
y = less(a, b, op_signed) ? LT : (less(a, c, op_signed) ? EQ : GT); // boundcheck
end
endcase
end
endmodule
<%
INW = 64
OUTW = 64
WW = 64
LEN = 64*8
raise "INW not power of 2" unless INW & (INW-1) == 0
Mw = Math.log2(INW)
%>
module lut(
input [<%= INW-1 %>:0] in,
input [<%= WW-1 %>:0] wlen,
input [<%= LEN-1 %>:0] lut,
output reg [<%= OUTW-1 %>:0] out, // assigned in an always block, so reg
output reg ok
);
always begin
(* parallel_case, full_case *) case (wlen)
<% (0..Mw).each {|w| W = 2**w %>
<%= w %>: begin
(* parallel_case, full_case *) case (in[<%= W-1 %>:0])
<% (0...(LEN/W)).each {|o| %>
lut[<%= o*W+W-1 %>:<%= o*W %>]: begin out = <%= o %>; ok = 1; end
<% } %>
default: begin out = 0; ok = 0; end
endcase
end
<% } %>
default: begin out = 0; ok = 0; end
endcase
end
endmodule
<%
INW = 64
OUTW = 64
LEN = 64*8
%>
module match(
input [<%= INW-1 %>:0] in,
input [<%= INW-1 %>:0] off,
input [<%= INW-1 %>:0] nlen,
input [<%= LEN+INW-8-1 %>:0] lut,
output reg [<%= OUTW-1 %>:0] out, // assigned in an always block, so reg
output reg ok
);
always begin
out = 0; ok = 0;
(* parallel_case, full_case *) case (nlen)
<% (1..(INW/8)).each {|w| W = w*8 %>
<%= w %>: begin
if (0) begin end
<% (0...(LEN/8)).each {|o| O = o*8 %>
else if (in[<%= W-1 %>:0] == lut[<%= O+W-1 %>:<%= O %>] && <%= o %> >= off) begin out = <%= o %>; ok = 1; end
<% } %>
end
<% } %>
default: begin end
endcase
end
endmodule
// utf.v
// Combinatorial Unicode encoder/decoder for UTF-8, UTF-16, and UTF-32
// (C) 2022 Carlo Alberto Ferraris
module utf (
input [31:0] in,
input [3:0] op,
input allow_surrogates, // false -> disallow (unicode-compliant); true -> allow to encode/decode code points in the surrogate ranges
input allow_overlong, // false -> overlong encodings are invalid (unicode-compliant); true -> overlong encodings are tolerated
input allow_supplemental_planes, // true -> allow (unicode-compliant); false -> disallow to encode/decode code points in the supplemental planes
input allow_alien_codepoints, // false -> disallow (unicode-compliant); true -> allow codepoints higher than 0x10FFFF (0x1FFFFF for UTF-8, 0xFFFFFFFF for UTF-32)
input big_endian, // for UTF-16 and UTF-32, whether to decode/encode big endian encodings
output reg [31:0] out, // assigned in an always block, so reg
output reg [2:0] delta,
output reg ok);
localparam [31:0] invalid = 32'hFFFD;
reg [31:0] in1; // assigned in an always block, so reg, not wire
reg [31:0] tmp;
always begin
in1 = in - 32'h10000;
tmp = in;
case (op)
0 : (* hdlname = "utf8decode" *) begin : utf8decode
casez (in)
32'b0???????zzzzzzzzzzzzzzzzzzzzzzzz : begin
ok = 1; delta = 1; out = in[30:24];
end
32'b10??????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
32'b110?????10??????zzzzzzzzzzzzzzzz : begin
ok = 1; delta = 2; out = { in[28:24], in[21:16] };
if (out <= 32'h7F && !allow_overlong) begin ok = 0; out = invalid; end
end
32'b110?????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
32'b1110????10??????10??????zzzzzzzz : begin
ok = 1; delta = 3; out = { in[27:24], in[21:16], in[13:8] };
if (out <= 32'h7FF && !allow_overlong) begin ok = 0; out = invalid; end
else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
end
32'b1110????10??????zzzzzzzzzzzzzzzz : begin ok = 0; delta = 2; out = invalid; end
32'b1110????zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
32'b11110???10??????10??????10?????? : begin
ok = 1; delta = 4; out = { in[26:24], in[21:16], in[13:8], in[5:0] };
if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
else if (out > 32'hFFFF && out <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
else if (out > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; out = invalid; end
end
32'b11110???10??????10??????zzzzzzzz : begin ok = 0; delta = 3; out = invalid; end
32'b11110???10??????zzzzzzzzzzzzzzzz : begin ok = 0; delta = 2; out = invalid; end
32'b11110???zzzzzzzzzzzzzzzzzzzzzzzz : begin ok = 0; delta = 1; out = invalid; end
default : begin ok = 0; delta = 1; out = invalid; end
endcase
end
1 : (* hdlname = "utf8decoderev" *) begin : utf8decoderev
casez (in)
32'bzzzzzzzzzzzzzzzzzzzzzzzz0??????? : begin
ok = 1; delta = 1; out = in[6:0];
end
32'bzzzzzzzzzzzzzzzz110?????10?????? : begin
ok = 1; delta = 2; out = { in[12:8], in[5:0] };
if (out <= 32'h7F && !allow_overlong) begin ok = 0; out = invalid; end
end
32'bzzzzzzzzzzzzzzzzzzzzzzzz110????? : begin ok = 0; delta = 1; out = invalid; end
32'bzzzzzzzz1110????10??????10?????? : begin
ok = 1; delta = 3; out = { in[19:16], in[13:8], in[5:0] };
if (out <= 32'h7FF && !allow_overlong) begin ok = 0; out = invalid; end
else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
end
32'bzzzzzzzzzzzzzzzz1110????10?????? : begin ok = 0; delta = 2; out = invalid; end
32'bzzzzzzzzzzzzzzzzzzzzzzzz1110???? : begin ok = 0; delta = 1; out = invalid; end
32'b11110???10??????10??????10?????? : begin
ok = 1; delta = 4; out = { in[26:24], in[21:16], in[13:8], in[5:0] };
if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
else if (out > 32'hFFFF && out <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
else if (out > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; out = invalid; end
end
32'bzzzzzzzz11110???10??????10?????? : begin ok = 0; delta = 3; out = invalid; end
32'bzzzzzzzzzzzzzzzz11110???10?????? : begin ok = 0; delta = 2; out = invalid; end
32'bzzzzzzzzzzzzzzzzzzzzzzzz11110??? : begin ok = 0; delta = 1; out = invalid; end
32'bzzzzzzzzzzzzzzzzzzzzzzzz10?????? : begin ok = 0; delta = 1; out = invalid; end
default : begin ok = 0; delta = 1; out = invalid; end
endcase
end
2 : (* hdlname = "utf8encode" *) begin : utf8encode
if (in <= 32'h7F) begin ok = 1; delta = 1; out = { 1'b0, in[6:0], 24'b0 }; end
else if (in <= 32'h7FF) begin ok = 1; delta = 2; out = { 3'b110, in[10:6], 2'b10, in[5:0], 16'b0 }; end
else if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
else if (in <= 32'hFFFF) begin ok = 1; delta = 3; out = { 4'b1110, in[15:12], 2'b10, in[11:6], 2'b10, in[5:0], 8'b0 }; end
else if (in > 32'hFFFF && in <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
else if (in > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
else if (in <= 32'h1FFFFF) begin ok = 1; delta = 4; out = { 5'b11110, in[20:18], 2'b10, in[17:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
else begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
end
3 : (* hdlname = "utf8encoderev" *) begin : utf8encoderev
if (in <= 32'h7F) begin ok = 1; delta = 1; out = { 24'b0, 1'b0, in[6:0] }; end
else if (in <= 32'h7FF) begin ok = 1; delta = 2; out = { 16'b0, 3'b110, in[10:6], 2'b10, in[5:0] }; end
else if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 3; out = 32'h00EFBFBD; end
else if (in <= 32'hFFFF) begin ok = 1; delta = 3; out = { 8'b0, 4'b1110, in[15:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
else if (in > 32'hFFFF && in <= 32'h10FFFF && !allow_supplemental_planes) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
else if (in > 32'h10FFFF && !allow_alien_codepoints) begin ok = 0; delta = 3; out = 32'hEFBFBD00; end
else if (in <= 32'h1FFFFF) begin ok = 1; delta = 4; out = { 5'b11110, in[20:18], 2'b10, in[17:12], 2'b10, in[11:6], 2'b10, in[5:0] }; end
else begin ok = 0; delta = 3; out = 32'h00EFBFBD; end
end
4 : (* hdlname = "utf16decode" *) begin : utf16decode
if (big_endian) begin
tmp = { in[23:16], in[31:24], in[7:0], in[15:8] };
end
casez (tmp)
32'b110110??????????110111?????????? : begin
ok = 1; delta = 4; out = { tmp[25:16], tmp[9:0] } + 32'h10000;
if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
else if (out > 32'hFFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
end
32'b????????????????zzzzzzzzzzzzzzzz : begin
ok = 1; delta = 2; out = tmp[31:16];
if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
end
endcase
end
5 : (* hdlname = "utf16decoderev" *) begin : utf16decoderev
if (big_endian) begin
tmp = { in[23:16], in[31:24], in[7:0], in[15:8] };
end
casez (tmp)
32'b110110??????????110111?????????? : begin
ok = 1; delta = 4; out = { tmp[25:16], tmp[9:0] } + 32'h10000;
if (out <= 32'hFFFF && !allow_overlong) begin ok = 0; out = invalid; end
else if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
else if (out > 32'hFFFF && !allow_supplemental_planes) begin ok = 0; out = invalid; end
end
32'bzzzzzzzzzzzzzzzz???????????????? : begin
ok = 1; delta = 2; out = tmp[15:0];
if (out >= 32'hD800 && out <= 32'hDFFF && !allow_surrogates) begin ok = 0; out = invalid; end
end
endcase
end
6 : (* hdlname = "utf16encode" *) begin : utf16encode
if (in <= 32'hFFFF) begin
if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin
ok = 0; delta = 2; tmp = { invalid[15:0], 16'b0 };
end else begin
ok = 1; delta = 2; tmp = { in[15:0], 16'b0 };
end
end else begin
if (in <= 32'h10FFFF && allow_supplemental_planes) begin
ok = 1; delta = 4; tmp = { 6'b110110, in1[19:10], 6'b110111, in1[9:0] };
end else begin
ok = 0; delta = 2; tmp = { invalid[15:0], 16'b0 };
end
end
out = tmp;
if (big_endian) begin
out = { tmp[23:16], tmp[31:24], tmp[7:0], tmp[15:8] };
end
end
7 : (* hdlname = "utf16encoderev" *) begin : utf16encoderev
if (in <= 32'hFFFF) begin
if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin
ok = 0; delta = 2; tmp = { 16'b0, invalid[15:0] };
end else begin
ok = 1; delta = 2; tmp = { 16'b0, in[15:0] };
end
end else begin
if (in <= 32'h10FFFF && allow_supplemental_planes) begin
ok = 1; delta = 4; tmp = { 6'b110110, in1[19:10], 6'b110111, in1[9:0] };
end else begin
ok = 0; delta = 2; tmp = { 16'b0, invalid[15:0] };
end
end
out = tmp;
if (big_endian) begin
out = { tmp[23:16], tmp[31:24], tmp[7:0], tmp[15:8] };
end
end
8, 9 : (* hdlname = "utf32decode" *) begin : utf32decode
if (big_endian) begin
tmp = { in[7:0], in[15:8], in[23:16], in[31:24] };
end
if (tmp >= 32'hD800 && tmp <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 4; out = invalid; end
else if (tmp <= 32'hFFFF) begin ok = 1; delta = 4; out = tmp; end
else if (tmp > 32'hFFFF && tmp <= 32'h10FFFF && allow_supplemental_planes) begin ok = 1; delta = 4; out = tmp; end
else if (tmp > 32'h1FFFFF && allow_alien_codepoints) begin ok = 1; delta = 4; out = tmp; end
else begin ok = 0; delta = 4; out = invalid; end
end
10, 11 : (* hdlname = "utf32encode" *) begin : utf32encode
if (in >= 32'hD800 && in <= 32'hDFFF && !allow_surrogates) begin ok = 0; delta = 4; tmp = invalid; end
else if (in <= 32'hFFFF) begin ok = 1; delta = 4; tmp = in; end
else if (in > 32'hFFFF && in <= 32'h10FFFF && allow_supplemental_planes) begin ok = 1; delta = 4; tmp = in; end
else if (in > 32'h1FFFFF && allow_alien_codepoints) begin ok = 1; delta = 4; tmp = in; end
else begin ok = 0; delta = 4; tmp = invalid; end
out = tmp;
if (big_endian) begin
out = { tmp[7:0], tmp[15:8], tmp[23:16], tmp[31:24] };
end
end
default: begin
ok = 1'bx; delta = 3'bx; out = 32'bx; // don't-care for unused opcodes
end
endcase
end
endmodule
// Pseudo-code for the functions described above
// TODO: add temporal/nontemporal hints
#include <stdlib.h>
#include <stdint.h>

static size_t min(size_t a, size_t b) {
  return a < b ? a : b;
}

void* _memcpy(void * restrict dst, const void * restrict src, size_t len) {
  // TODO: handle overlap
  unsigned char *cdst = dst;
  const unsigned char *csrc = src;
  for (size_t p = 0; p < len; p++)
    cdst[p] = csrc[p];
  return dst;
}

void* _memcpyex(void * restrict dst, const void * restrict src, size_t elem_sz, size_t elem_cnt, size_t dst_stride, size_t src_stride) {
  unsigned char *cdst = dst;
  const unsigned char *csrc = src;
  for (size_t i = 0; i < elem_cnt; i++) {
    _memcpy(cdst, csrc, elem_sz);
    cdst += dst_stride;
    csrc += src_stride;
  }
  return dst;
}

size_t _memcmp(const void *a, const void *b, size_t len) {
  const unsigned char mask = 0xFF; // TODO: extend to support a custom bitmask
  const unsigned char *ca = a, *cb = b;
  for (size_t p = 0; p < len; p++) {
    if ((ca[p] & mask) != (cb[p] & mask))
      return p; // position of the first mismatch
  }
  return len; // no mismatch found
}

void* _memmem(const void *haystack, size_t haystack_len, const void *needle, size_t needle_len, size_t stride) {
  // TODO: handle negative stride
  const unsigned char *ch = haystack;
  for (size_t p = 0; p + needle_len <= haystack_len; p += stride)
    if (_memcmp(ch + p, needle, needle_len) == needle_len) // _memcmp returns len when equal
      return (void*)(ch + p);
  return NULL;
}

void _memset(void * restrict dst, size_t dst_len, const void * restrict src, size_t src_len) {
  unsigned char *cdst = dst;
  for (size_t p = 0; p < dst_len; p += src_len)
    _memcpy(cdst + p, src, min(dst_len - p, src_len));
}

void _memswp(void * restrict a, void * restrict b, size_t len) {
  // TODO: handle overlap
  unsigned char *ca = a, *cb = b;
  for (size_t p = 0; p < len; p++) {
    unsigned char t = ca[p];
    ca[p] = cb[p];
    cb[p] = t;
  }
}

typedef struct { uint8_t x[256/8]; } __attribute__ ((aligned (64))) bytemask;

void* _memchr(const void *a, size_t len, size_t stride, bytemask needle) {
  const unsigned char *ca = a;
  for (size_t p = 0; p < len; p += stride) {
    unsigned char b = ca[p];
    if (needle.x[b/8] & (1 << (b%8)))
      return (void*)(ca + p);
  }
  return NULL;
}

size_t _memsieve(const void *src, size_t src_len, void *dst, size_t dst_len, bytemask needle) {
  // TODO: add masking/non-gather support
  const unsigned char *csrc = src;
  unsigned char *cdst = dst;
  size_t d = 0;
  for (size_t s = 0; s < src_len && d < dst_len; s++) {
    unsigned char b = csrc[s];
    if (needle.x[b/8] & (1 << (b%8)))
      cdst[d++] = b;
  }
  return d;
}

CAFxX commented Dec 3, 2022

TODO: efficient SoA/AoS via strided view of memory exposed by the memory subsystem

S = 2^s (stride)
E = 2^e (element size)
A = addr in view
Ar = A/E*S + A%E (real address)


CAFxX commented May 9, 2023
