ngrodzitski/LUT+SIMD.md

## LUT+SIMD.md

      
    Raw
  

              LUT+SIMD.md
            
          
    The problem

For parsers implemented as state machines, a common pattern often emerges:
with each subsequent byte, the parser either transitions to a new state
or remains in the current state. To determine which of these scenarios applies to a given byte,
you can employ a Look-Up Table (LUT) approach.
static constexpr std::array<uint8_t, 256> alphanumeric_lut{
    /* 0x00 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 0x10 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 0x20 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* 0x30 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
    /* 0x40 */ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    /* 0x50 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
    /* 0x60 */ 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    /* 0x70 */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
    /* 0x80 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    /* ... */ 
    /* 0xF0 */ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
};

while( ptr != end_ptr && alphanumeric_lut[*ptr] )
{
    ++ptr;
}
To optimize the parser's efficiency, it would be beneficial to locate
the position of the next state-changing byte. This way, we can efficiently navigate directly to it.
Solution

We can utilize SSE instructions to process 16 items at once,
which is quite efficient but it still cannot handle a LUT of size 256.
In specific scenarios, and this is not uncommon, we can optimize by replacing the initial LUT of size 256
with two LUTs of size 16 each.
_mm_shuffle_epi8
instruction enables us to perform a lookup for 16 nibbles (a nibble being a 4-bit chunk of a byte)
using a single instruction. The concept is to perform separate lookups for the low nibble and high nibble
and then merge the results. To merge them one can use some bitwise operation (like AND).
        |                 Low nibble
        |     0          1          2          3     ...      E          F 
--------+---------------------------------------------------------------------
H     0 | H[0]&L[0]  H[0]&L[1]  H[0]&L[2]  H[0]&L[3] ...  H[0]&L[E]  H[0]&L[F] 
i     1 | H[1]&L[0]  H[1]&L[1]  H[1]&L[2]  H[1]&L[3] ...  H[1]&L[E]  H[1]&L[F] 
g     2 | H[2]&L[0]  H[2]&L[1]  H[2]&L[2]  H[2]&L[3] ...  H[2]&L[E]  H[2]&L[F] 
h     3 | H[3]&L[0]  H[3]&L[1]  H[3]&L[2]  H[3]&L[3] ...  H[3]&L[E]  H[3]&L[F] 
   ...  |  ...        ...        ...        ...      ...   ...        ...   
ni ...  |  ...        ...        ...        ...      ...   ...        ...   
bb    E | H[E]&L[0]  H[E]&L[1]  H[E]&L[2]  H[E]&L[3] ...  H[E]&L[E]  H[E]&L[F] 
le    F | H[F]&L[0]  H[F]&L[1]  H[F]&L[2]  H[F]&L[3] ...  H[F]&L[E]  H[F]&L[F] 

So the next question is how to find H (high nibbles LUT) and L (low nibbles LUT)
that will effectively substitute the original LUT.
Another handy idea comes from we are not necessary constraint to value of 1 == H[i]&L[j] if LUT[i<<4 | j] == 1
we just need it to be 0 != H[i]&L[j].
Once we know H[] and L[] we can code the algo:
// Locate then in the same cache line (which also mak them aligned for SSE registers)
struct alignas(64) tlo_thi_pair_t 
{
  std::array<uint8_t, 16> tlo{};
  std::array<uint8_t, 16> thi{};
};

static constexpr tlo_thi_pair_t alphanumeric_luts{
  {254, 255, 255, 255, 255, 255, 255, 255, 255, 255, 253, 1, 1, 1, 1, 1},
  {0, 0, 0, 2, 1, 254, 1, 252, 0, 0, 0, 0, 0, 0, 0, 0}
};

const char * find_first_non_anumeric( const char * ptr,
                                      const char * end_ptr )
{
  if( end_ptr - ptr >= 16 ) 
  {
    __m128i lut_tlo;
    __m128i lut_thi;
    lut_tlo = _mm_load_si128( (const __m128i *)alphanumeric_luts.tlo.data() );
    lut_thi = _mm_load_si128( (const __m128i *)alphanumeric_luts.thi.data() );
  
    for( ;end_ptr - ptr >= 16; ptr += 16 ) 
    {
      __m128i lut_res_lo;
      __m128i lut_res_hi;
      __m128i input;
      input = _mm_loadu_si128( ( const __m128i*)ptr ); // Input is not necessarily aligned
  
                                            
      lut_res_lo = _mm_shuffle_epi8( lut_tlo, 
                                    // Mask the low nibble:
                                    _mm_and_si128( input, _mm_set1_epi8( 0x0F ) ) );

      lut_res_hi = _mm_shuffle_epi8( lut_thi, 
                                     // This effectively gives us the high nibble 
                                     // in low 4bits positions within a bytes:
                                     // 2. Shift the whole 128 bits 4 positions left
                                     _mm_srli_epi16(
                                        //  1. Mask the high nibble;
                                        _mm_and_si128( input, _mm_set1_epi8( 0xF0 ) ), 
                                        4 ) );

      // Reuse input to store comparison with zero 
      // of a bitwise-AND result of low/high lookups.
      // For bytes in `lut_res_lo&lut_res_hi` that are equal to zero
      // the corresponding byte will be set to 0xFF.
      input = _mm_cmpeq_epi8( _mm_and_si128( lut_res_lo, lut_res_hi ), 
                              _mm_setzero_si128() );

      // Get the mask of values in input.
      // This wel set bit to 1 if a corresponding byte in input is oxFF.
      auto index = _mm_movemask_epi8(input);

      if( 0 != index )
      {
        // So we have a byte that LUTs to zero in original LUT.
        return ptr + __builtin_ctz(index);
      }
    }
  }
    // Scan the tail in a simple way:
    while( ptr < end_ptr && anumeric_main_lut[ (uint8_t)*ptr ] ) 
    {
       ++ptr;
    }

  return ptr;
}
See the full example here
The very interesting remaining question though is the following:

how to get these magic numbers for alphanumeric_luts?

The answer is kind of simple and frustrating:
You need to solve a multivariable system of equations:
H[i]&L[j] != 0 iff LUT[i<<4 | j] == 1
H[i]&L[j]`== 0 iff LUT[i<<4 | j] == 0

In many cases it can be solved manually. But luckely nowadays we have z3.
So we can describe our problem to Z3 and let it find a solution.
Here is a nice "start" python script (need to install z3-solver package):
from z3 import *

lo = Array('A', IntSort(), BitVecSort(8))
hi = Array('B', IntSort(), BitVecSort(8))
lut = Array('X', IntSort(), BitVecSort(8))

solver = Solver()

etalon = [
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
    0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
    1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
]

for i in range(256):
    if etalon[i] != 0:
        solver.add(Select(lut, i) != BitVecVal(0,8))
    else:
        solver.add(Select(lut, i) == BitVecVal(0,8))

    solver.add((Select(lo, i&0xf) & Select(hi, i>>4)) == Select(lut, i))


print(solver.check())
m = solver.model()

aa = []
bb = []
for i in range(16):
    aa.append(m.evaluate(Select(lo, i)).as_long())
    bb.append(m.evaluate(Select(hi, i)).as_long())

print(f"\n\nlow = {aa}\nhi = {bb}\n")

yy = [0]*256
for i in range(16):
    for j in range(16):
        yy[(i << 4) | j] = aa[j] & bb[i]

print(f"high[i]&low[j] = {yy}\n\netalon = {etalon}")
References


https://lemire.me/blog/2023/09/04/locating-identifiers-quickly-arm-neon-edition/
https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#text=_mm_shuffle_epi8
https://github.com/Z3Prover/z3/wiki
https://en.wikipedia.org/wiki/Z3_Theorem_Prover