@auroranockert
Created July 29, 2011 12:18

This was a draft for a blog post that never really happened. But I noticed a lot of people starting to link to it, which is why I reworked some of the content into something a little more polished at http://blog.aventine.se/post/16318162396/simd

The rest of this gist is the original content, which contains a bit more rambling about an actual implementation and a lot more weird language.

SIMD

Mozilla has a bug, https://bugzilla.mozilla.org/show_bug.cgi?id=644389, about the lack of SIMD instructions in JS. What it asks for is essentially adding assembly language to the web, for a potentially very large performance increase on computation-heavy code.

This technology competes directly with WebCL and NaCl, and is in many ways a very good alternative: it would provide many of the advantages of NaCl with only a few of the disadvantages. But there is a problem: it could in theory allow you to build scripts that only run on certain browsers on certain CPUs.

Of course, WebGL and typed arrays already give JS most of this disadvantage, since typed arrays expose the native endianness of the hardware, and WebGL may or may not require specific graphics hardware that is not easily emulated.

There are a few ways to give the Javascript programmer high-performance primitives for building applications, some worse than others.

Raw Assembly - Very High

You could simply let the programmer write a piece of code in assembly as a string, or as a separate script linked to your Javascript, in the same way many common C compilers do it. This gives programmers full flexibility when writing code, and access to the lowest level possible.
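
As a sketch of what that might look like (the Assembly object and its compile function are entirely hypothetical, invented here for illustration):

var kernel = Assembly.compile('sse2',
  'movaps xmm0, [eax]\n' +
  'addps  xmm0, [ebx]\n' +
  'movaps [ecx], xmm0')

kernel(a, b, c)   // a, b, c are typed arrays bound to eax, ebx, ecx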

Advantages

  • Very powerful (all CPU-specific features exposed)
  • Very fast
  • Much code already available
  • No one expects it to be portable

Disadvantages

  • How do we know the code is safe?
  • A language within a language, with very different semantics
  • How do we keep track of register and memory usage?

And while it allows the programmer to interact with the CPU at the lowest level, that is also a significant problem. Ensuring that native code is safe is hard, really hard, which would get any inline assembly proposal shot down pretty quickly. It is also not portable, which is a disadvantage.

Standard Intrinsics - High

Let Javascript programmers use intrinsics identical to those exposed by C compilers. These are also CPU specific, or specific to a family of CPUs, and generally map to a single instruction or a few instructions. They are often designed to provide a friendlier API than the raw instructions: on Intel, for example, they are three-operand instead of two-operand, and the compiler assigns registers for you.

Advantages

  • Very powerful (all CPU-specific features exposed)
  • Very fast (with a good compiler and optimizer)
  • Much code already available
  • No one expects it to be portable

Disadvantages

  • May need to add datatypes (__m64, __m128... for Intel, float32x4_t... for ARM NEON)
  • Need to check every load/store for security issues
  • Modifies the Javascript runtime

An API like this would let programmers take existing code, rip out the C and replace it with Javascript, and the kernel written using intrinsics would run unmodified, giving a quick speedup in many common algorithms.
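
For example, a C kernel that adds two float arrays with SSE intrinsics could be transliterated almost line by line. The SIMD.* names below are hypothetical stand-ins for whatever the Javascript spellings of _mm_load_ps, _mm_add_ps and _mm_store_ps would be:

// C: for (i = 0; i < n; i += 4)
//      _mm_store_ps(c + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)))
for (var i = 0; i < n; i += 4) {
  var va = SIMD.mm_load_ps(a, i)   // a, b, c are Float32Arrays, n a multiple of 4
  var vb = SIMD.mm_load_ps(b, i)
  SIMD.mm_store_ps(c, i, SIMD.mm_add_ps(va, vb))
}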

This has some advantages and some disadvantages. The programmer will never expect this code to be portable between processors or browsers, which allows us to remove support in the future, or change the implementation. But on the other hand, future browsers or uncommon processors might simply not work, or run unaccelerated Javascript instead, making the application unnecessarily slow.

Specific Intrinsics - High

Javascript-specific intrinsics: these would still be CPU specific, and expose all or most functionality of the CPU, but with intrinsics optimized for security and use with Javascript. I imagine they would only operate on memory (typed arrays) and could therefore have automatic type recognition at compile time.

Advantages

  • Powerful (all features we need could be exposed)
  • Very fast (with a good compiler and optimizer)
  • Nicer syntax
  • No one expects it to be portable

Disadvantages

  • No 64-bit integer support
  • Need to check every load/store for security issues
  • Modifies the Javascript runtime

Such an API would have most of the advantages and disadvantages of the standard intrinsics, but would trade the ability to reuse existing code for a nicer API or better performance, depending on the implementation.
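
A rough sketch of how that could look (the simd object and its add function are invented for illustration): the element type is inferred from the views, so the same call could compile to ADDPS for Float32Array operands and ADDPD for Float64Array operands.

var buffer = new ArrayBuffer(48)
var a = new Float32Array(buffer,  0, 4)
var b = new Float32Array(buffer, 16, 4)
var c = new Float32Array(buffer, 32, 4)

simd.add(a, 0, b, 0, c, 0)   // c = a + b, type recognized from the views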

Generic Vector/Matrix API - Very Low

A Javascript vector/matrix API could expose most of the floating-point functionality that we would get with SIMD, except that it would be much slower for small vectors, making it a lot less useful for WebGL, games, etc.

Just exposing something like BLAS has some advantages: programmers are used to it, and it has very high-speed implementations on every platform with any kind of floating-point support. Also, if the system has coprocessors with significant floating-point capability (like a GPU), the chance that they implement a fast BLAS is pretty high, which could be important for performance on future embedded platforms like phones or tablets.
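
As a sketch, a thin binding over the standard sgemm routine could look like this. The blas object is hypothetical, but the argument list mirrors the classic Fortran BLAS interface, which computes C = alpha*A*B + beta*C:

// 4x4 single-precision matrix multiply: C = 1.0 * A * B + 0.0 * C
// A, B, C are Float32Arrays of length 16, leading dimension 4
blas.sgemm('N', 'N', 4, 4, 4, 1.0, A, 4, B, 4, 0.0, C, 4)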

Advantages

  • Good for scientific computing
  • Easy to use
  • Very optimized
  • Safe
  • Portable

Disadvantages

  • No media instruction support
  • Slow for small vectors/matrices

While BLAS would provide a high-performance API, I have a hard time seeing it being useful in the domain Javascript is used in: web applications.

There are probably few Javascript applications that solve large systems of linear equations or do many large matrix-matrix multiplications, so while I really like BLAS, I think it would only add complexity to the browsers for no significant gain right now.

Generic Intrinsics - Low

Write generic intrinsics that can be emulated without any SIMD unit, and accelerated with SIMD if the processor actually supports the instruction in question.

Advantages

  • Portable
  • Nice syntax

Disadvantages

  • Safety?
  • Modifies the Javascript runtime
  • Needs fallbacks
  • Needs an API to expose which instructions are accelerated
  • Does not expose all capabilities
  • Exposes non-accelerated capabilities
  • Complicated

While this is from my perspective the worst implementation, I can also see why it would work: floats are probably the most common datatype in many WebGL applications and games, they are supported pretty evenly in all the SIMD implementations we care about, and the interesting operations can easily be emulated on devices without SIMD support.

It can probably even be emulated a lot faster than normal JS operations, since the loops are unrolled and provide a bit more information about dependencies between instructions, more specific datatypes, etc., allowing the JIT to generate better code.
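
For instance, a generic add intrinsic could fall back to a plain loop like the one below when no SIMD unit is available; the fixed trip count and known element types are exactly what a JIT needs to produce good scalar code (the function name is made up for this sketch):

function simdAddFallback(a, i, b, j, c, k) {
  // a, b, c are Float32Arrays; i, j, k are vector indices (4 lanes each)
  for (var l = 0; l < 4; l++) {
    c[4 * k + l] = a[4 * i + l] + b[4 * j + l]
  }
}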

Conclusion

So there is definitely a point in doing a general API for graphics and floating-point operations; it would also be useful for audio processing and so on. But for integer and media instructions, where there is a significant spread between implementations, I cannot really see how a generic API is going to work.

LLVM provides general instructions and types for SIMD, with target-specific instructions in addition to that, so some sort of mix is definitely possible: a basic subset that is enabled on all CPUs, and a larger set of media instructions and hard-to-emulate instructions that are only available on specific targets.

For Aurora-JS, I don't think many of the floating-point operations will be of much use, and most of a generic SIMD API would probably be concerned with floats, at least to begin with. But I am still working on a possible proposal that would provide a few basic primitives that could really help some of the audio processing, and probably help graphics-related tasks a lot.

Proposal

A draft proposal to get the creative juices flowing. Most of these instructions should be accelerated on x86 with SSE and on ARM with Neon. But Neon is very limited when it comes to IEEE-754 support, so this API will either only support flush-to-zero (FTZ) or possibly make denormals implementation dependent.

Only a subset that can be easily implemented without Neon on ARM, and without AVX on Intel, will be included; this means it is mostly limited to floating-point instructions.

The operands to all instructions are Typed Array views, of the new types

Uint8x32, Uint8x16; Uint16x16, Uint16x8; Uint32x8, Uint32x4

Int8x32, Int8x16; Int16x16, Int16x8; Int32x8, Int32x4

Float32x8, Float32x4; Float64x4, Float64x2

These views represent 16-byte aligned memory, or, when the JIT optimizes, a single register or a group of registers. This is for performance reasons: some architectures (Altivec on PowerPC) can only load vectors from 16-byte aligned addresses, and some (SSE on x86) require special instructions to load from unaligned addresses. Neon only requires 8-byte alignment, but will still be limited to 16-byte alignment in this API to make sure code is reasonably portable.

The 256-bit wide SIMD instructions will probably be synthesized from two 128-bit instructions for a while, at least until AVX gets common on x86, and Float64 instructions will need to be synthesized from VFP vector instructions on ARM.

The 256-bit wide instructions may still be a little faster than the 128-bit instructions if you can use them, since they should in theory be a little easier to optimize than multiple 128-bit instructions. But they could also simply not be implemented to start with, since AVX is pretty rare.
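
In the same faux-SSE style as the examples at the end of this post, a 256-bit Float32x8 add would be synthesized from two 128-bit halves roughly like this:

MOVAPS          r0,  m0[i]        # add pt. 1, low 128 bits
ADDPS           r0,  m1[i]        # add pt. 2
MOVAPS          m2[i], r0         # add pt. 3
MOVAPS          r1,  m0[i+4]      # add pt. 4, high 128 bits
ADDPS           r1,  m1[i+4]      # add pt. 5
MOVAPS          m2[i+4], r1       # add pt. 6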

To read Javascript values from the arrays you are working on, you need to create an aliasing view of a scalar type, and read the values via that.
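
For example, assuming the new views follow the existing typed array constructor pattern (the Float32x4 constructor below is this proposal, not anything that exists today):

var buffer = new ArrayBuffer(64)        // assumed to be 16-byte aligned
var vec    = new Float32x4(buffer)      // hypothetical SIMD view, 4 vectors
var scalar = new Float32Array(buffer)   // aliasing scalar view, 16 floats

// after operating on vec, read individual results through the alias
var first = scalar[0]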

The instruction that is actually synthesized would preferably be determined automatically from the views it operates on. Instructions that take two operands need operands of the same type, or an exception should be raised at runtime. The general instructions could possibly ignore that restriction, since they don't operate on the data, just shuffle it around.

  • General Instructions
    • fill(view, index)
    • clear(view, index)
    • move(view, index, view, index)
    • swap(view, index, view, index)
    • reverse16(view, index, view, index)
    • reverse32(view, index, view, index)
    • reverse64(view, index, view, index)
    • moveduplow(view, index, view, index)
    • moveduphigh(view, index, view, index)
    • interleave8(view, index, view, index...)
    • interleave16(view, index, view, index...)
    • interleave32(view, index, view, index...)
    • deinterleave8(view, index, view, index...)
    • deinterleave16(view, index, view, index...)
    • deinterleave32(view, index, view, index...)
  • Prefetch
    • prefetchNT(view, index)
    • prefetchT0(view, index)
    • prefetchT1(view, index)
  • Bitwise Instructions
    • not(view, index, view, index)
    • or(view, index, view, index, view, index)
    • and(view, index, view, index, view, index)
    • xor(view, index, view, index, view, index)
  • Shifts
    • shl(view, index, view, index, view, index)
    • shr(view, index, view, index, view, index)
  • Arithmetic
    • abs(view, index, view, index)
    • neg(view, index, view, index)
    • acc(view, index, view, index)
    • add(view, index, view, index, view, index)
    • sub(view, index, view, index, view, index)
    • mul(view, index, view, index, view, index)
    • div(view, index, view, index, view, index)
    • subadd(view, index, view, index, view, index)
    • mulacc(view, index, view, index, view, index)
  • Compare
    • ge(view, index, view, index, view, index)
    • gt(view, index, view, index, view, index)
    • le(view, index, view, index, view, index)
    • lt(view, index, view, index, view, index)
    • eq(view, index, view, index, view, index)
  • Special
    • min(view, index, view, index, view, index)
    • max(view, index, view, index, view, index)
    • rcpe(view, index, view, index)
    • sqrt(view, index, view, index)
    • sqrte(view, index, view, index)

Many of the instructions that take a scalar can optionally be implemented as an immediate if it is constant; if the architecture does not support that natively (like x86), it must be implemented as a load/splat followed by the operation.

The only operation that takes a scalar view/index is move; the second argument can point to a scalar view of the correct type, and this argument only needs natural alignment for that type.

Other SIMD architectures will probably support a very similar set of instructions, which should be easy to introduce when they are supported by Firefox. But I am too lazy to look up the exact Power/SPARC/MIPS instructions that would make a possible implementation right now.

What some of these instructions do is obvious; what others do is not.

fill (a, i)

Fills a[i] with 0b1...

clear (a, i)

Fills a[i] with 0b0...

move (a, i, b, j)

Moves the data, b[j] = a[i].

If a[i] is a scalar, it is duplicated into every element of b[j].
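
For example, splatting a constant into a whole vector (sources first, destination last, as in the definitions above; c is a one-element Float32Array holding 2.0 and v is a Float32x4 view):

move  c, 0, v, 0        # v[0] = { 2.0, 2.0, 2.0, 2.0 }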

reverse16(a, i, b, j)

b[j] is assigned the value of a[i] with all components of every 2-byte unit reversed. For example,

(Uint8x16) { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }

would be transformed into

(Uint8x16) { 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 }.

reverse32(a, i, b, j)

b[j] is assigned the value of a[i] with all components of every 4-byte unit reversed.

(Uint8x16) { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }

would be transformed into

(Uint8x16) { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }.

reverse64(a, i, b, j)

b[j] is assigned the value of a[i] with all components of every 8-byte unit reversed.

(Uint8x16) { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }

would be transformed into

(Uint8x16) { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }.

moveduplow(a, i, b, j)

b[j] is assigned with the value of a[i], but the low part of every SIMD pair is duplicated to the high part.

(Float32x4) { 1.0, 2.0, 3.0, 4.0 }

would be transformed into

(Float32x4) { 1.0, 1.0, 3.0, 3.0 }.

moveduphigh(a, i, b, j)

b[j] is assigned with the value of a[i], but the high part of every SIMD pair is duplicated to the low part.

(Float32x4) { 1.0, 2.0, 3.0, 4.0 }

would be transformed into

(Float32x4) { 2.0, 2.0, 4.0, 4.0 }.

interleave8(a, i, b, j...)

Similar to a move instruction, but the move is interleaved: you are allowed to pass up to 4 source operands and a single destination. The data is written to the destination in an interleaved fashion,

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

would be written as

{ 1, 5, 2, 6, 3, 7, 4, 8 ... }.

interleave16(a, i, b, j...)

Similar to the previous instruction, but with 16-bit interleave,

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

would be written as

{ 1, 2, 5, 6, 3, 4, 7, 8 ... }.

interleave32(a, i, b, j...)

Similar to the previous instruction, but with 32-bit interleave,

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

would be written as

{ 1, 2, 3, 4, 5, 6, 7, 8 ... }.

deinterleave8(a, i, b, j...)

The reverse operation of interleave

(Uint8) { 1, 2, 3, 4, 5, 6, 7, 8 ... }

would be read as

(Uint8x16) { 1, 5 ... }, { 2, 6 ... }, { 3, 7 ... }, { 4, 8 ... }

with a stride of four (passing four destinations and one source).

deinterleave16(a, i, b, j...)

Similar to the previous instruction, but with 16-bit interleave,

(Uint8) { 1, 2, 3, 4, 5, 6, 7, 8 ... }

would be read as

(Uint8x16) { 1, 2 ... }, { 3, 4 ... }, { 5, 6 ... }, { 7, 8 ... }

with a stride of four (passing four destinations and one source).

deinterleave32(a, i, b, j...)

Similar to the previous instruction, but with 32-bit interleave,

(Uint8) { 1, 2, 3, 4, 5, 6, 7, 8 ... }

would be read as

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

with a stride of two (passing two destinations and one source).

prefetchNT(view, index)

Prefetches into L1-cache for read, may be interpreted as a NOP.

Should map to PREFETCHNTA on x86, probably PLD on ARM.

prefetchT0(view, index)

Prefetches into all levels of cache, may be interpreted as a NOP.

Should map to PREFETCHT0 on x86, PLD on ARM.

prefetchT1(view, index)

Prefetches into 2nd level cache or above, may be interpreted as a NOP.

Should map to PREFETCHT1 on x86, probably PLD on ARM.

not(a, i, b, j)

Flips all the bits in a[i], and assigns the value to b[j].

or(a, i, b, j, c, k)

Assign c[k] with a[i] bitwise-or b[j].

and(a, i, b, j, c, k)

Assign c[k] with a[i] bitwise-and b[j].

xor(a, i, b, j, c, k)

Assign c[k] with a[i] bitwise-exclusive-or b[j].

shl(a, i, b, j, c, k)

Shift a[i] elementwise left by the value of b[j] and assigns to c[k].

Large or negative shifts are undefined.

shr(a, i, b, j, c, k)

Shift a[i] elementwise right by the value of b[j] and assigns to c[k].

Large or negative shifts are undefined.

abs(a, i, b, j)

Assigns b[j] elementwise the absolute value of a[i].

neg(a, i, b, j)

Assigns b[j] elementwise the negation of a[i].

acc(a, i, b, j)

Adds a[i] elementwise to b[j] and assigns the sum to b[j].

add(a, i, b, j, c, k)

Adds a[i] elementwise to b[j] and assigns the sum to c[k].

sub(a, i, b, j, c, k)

Subtracts a[i] elementwise from b[j] and assigns the difference to c[k].

mul(a, i, b, j, c, k)

Multiplies a[i] elementwise with b[j] and assigns the product to c[k].

div(a, i, b, j, c, k)

Divides a[i] elementwise with b[j] and assigns the quotient to c[k].

subadd(a, i, b, j, c, k)

The first element in each pair in a[i] and b[j] is subtracted, the next is added.

(Float32x4) { 1.0, 3.0, 5.0, 7.0 }, { 0.0, 1.0, 2.0, 3.0 }

when subadded would result in,

(Float32x4) { 1.0, 4.0, 3.0, 10.0 }.

mulacc(a, i, b, j, c, k)

c[k] += a[i] * b[j]

ge(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is greater than or equal to the element in b[j], else it is filled with zeroes.

gt(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is greater than the element in b[j], else it is filled with zeroes.

le(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is less than or equal to the element in b[j], else it is filled with zeroes.

lt(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is less than the element in b[j], else it is filled with zeroes.

eq(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is equal to the element in b[j], else it is filled with zeroes.
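
These all-ones/all-zeroes masks are what make branchless selection possible when combined with the bitwise instructions above. A sketch in the proposal's notation (sources first, destination last): r gets x wherever a < b and y elsewhere, with m, t0 and t1 as scratch vectors:

lt    a, i, b, j, m, 0        # m = all ones where a[i] < b[j]
and   m, 0, x, 0, t0, 0       # t0 = x, masked
not   m, 0, m, 0              # invert the mask
and   m, 0, y, 0, t1, 0       # t1 = y, masked
or    t0, 0, t1, 0, r, 0      # r = (a < b) ? x : y, elementwise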

min(a, i, b, j, c, k)

Each element in c[k] is assigned the minimum of the corresponding elements in a[i] and b[j].

max(a, i, b, j, c, k)

Each element in c[k] is assigned the maximum of the corresponding elements in a[i] and b[j].

rcpe(a, i, b, j)

b[j] is assigned with an elementwise approximation of the reciprocal of a[i].

sqrt(a, i, b, j)

b[j] is assigned with an elementwise square root of a[i].

sqrte(a, i, b, j)

b[j] is assigned with an elementwise approximation of the square root of a[i].
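
The estimate instructions are meant to be refined by the caller. For the reciprocal, one Newton-Raphson step, e' = e * (2 - x * e), roughly doubles the number of correct bits; in the proposal's notation (two is a vector filled with 2.0, t is a scratch vector):

rcpe  x, i, e, 0              # e ≈ 1 / x
mul   x, i, e, 0, t, 0        # t = x * e
sub   t, 0, two, 0, t, 0      # t = 2 - x * e  (sub assigns b - a to c)
mul   e, 0, t, 0, e, 0        # e = e * (2 - x * e)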

Some Examples

Most of these are just sketches; they don't handle edge cases like operand lengths that are not a multiple of the SIMD length, etc. They are also not optimized, this is the first time I have ever written Neon assembly, and I haven't tried actually executing any of these fragments: they are buggy, incorrect, slow and plain stupid.

I have tried to keep all SSE examples in Intel order, and all the SIMD API proposal examples in AT&T order (sources first, destination last). I think I failed in a few cases simply because I am tired, so please read what I mean, not what I write. I tried to follow the order in the assembly manual for Neon, but might have failed in some places.

I also cheated and added memory operands for ARM; it doesn't really have them, but adding extra loads was too annoying, since I am not really differentiating registers and memory anyhow.

Most of the ideas for examples are stolen and adapted from dsp.js

Interleave

Sometimes useful in audio processing, not much use alone

m0, m1 are two channels of mono. m2 is the interleaved stereo.

Proposed Version

for (var i = 0; i < length / 2; i++) {
  interleave32  m0, i, m1, i, m2, i * 2
}

JS Version

for (var i = 0; i < length; i++) {
  m2[2 * i + 0] = m0[i]
  m2[2 * i + 1] = m1[i]
}

(Faux) SSE Version

for (var i = 0; i < length; i++) {
  MOVAPS          r0,  m0[i]        # interleave32 pt. 1
  MOVAPS          r1,  m1[i]        # interleave32 pt. 2
  MOVAPS          r2,     r0        # interleave32 pt. 3
  SHUFPS  $0x11,  r2,     r1        # interleave32 pt. 4
  SHUFPS  $0x27,  r2,     r2        # interleave32 pt. 5
  MOVAPS          m2[..], r2        # interleave32 pt. 6
  MOVAPS          r2,     r0        # interleave32 pt. 7
  SHUFPS  $0xBB,  r2,     r1        # interleave32 pt. 8
  SHUFPS  $0x27,  r2,     r2        # interleave32 pt. 9
  MOVAPS          m2[..], r2        # interleave32 pt. 10
}

(Faux) Neon Version

for (var i = 0; i < length; i++) {
  VLD1.32         r0,       m0[i]   # interleave32 pt. 1
  VLD1.32         r1,       m1[i]   # interleave32 pt. 2
  VST2.32         {r0, r1}, m2[..]  # interleave32 pt. 3
}

Deinterleave

The reverse operation, not much use alone

m0 is interleaved stereo, m0, m1 are two channels of mono.

Proposed Version

for (var i = 0; i < length / 2; i++) {
  deinterleave32  m1, i, m2, i, m0, 2 * i
}

JS Version

for (var i = 0; i < length / 2; i++) {
  m1[i] = m0[2 * i + 0]
  m2[i] = m0[2 * i + 1]
}

(Faux) SSE Version

for (var i = 0; i < length; i++) {
  MOVAPS          r0,  m0[..]       # deinterleave32 pt. 1
  MOVAPS          r1,  m0[..]       # deinterleave32 pt. 2
  MOVAPS          r2,     r0        # deinterleave32 pt. 3
  SHUFPS  $0x33,  r2,     r1        # deinterleave32 pt. 4
  MOVAPS          m1[..], r2        # deinterleave32 pt. 5
  MOVAPS          r2,     r0        # deinterleave32 pt. 6
  SHUFPS  $0x77,  r2,     r1        # deinterleave32 pt. 7
  MOVAPS          m2[..], r2        # deinterleave32 pt. 8
}

(Faux) Neon Version

for (var i = 0; i < length; i++) {
  VLD2.32         {r0, r1}, m0[i]   # deinterleave32 pt. 1
  VST1.32         r0,       m1[..]  # deinterleave32 pt. 2
  VST1.32         r1,       m2[..]  # deinterleave32 pt. 3
}

Complex Multiplication

Needed for implementing a fast DFT, very useful in general

m0, m1 contain complex numbers { ar, ai, br, bi } and { cr, ci, dr, di }. r0, r1 are 'registers'.

Proposed Version

moveduplow    m0, i, r0, 0
moveduphigh   m0, i, r1, 0
mul           r0, 0, m1, i, r0, 0
mul           r1, 0, m1, i, r1, 0
reverse64     r1, 0, r1, 0
subadd        r0, 0, r1, 0, r0, 0

JS Version

[ar, ai] = [m0[2 * i], m0[2 * i + 1]]
[br, bi] = [m1[2 * i], m1[2 * i + 1]]

cr = ar * br - ai * bi
ci = ar * bi + ai * br

(Faux) SSE Version

MOVAPS          r0,  m0[i]    # moveduplow pt. 1
SHUFPS  $0xA0,  r0,     r0    # moveduplow pt. 2
MOVAPS          r1,  m0[i]    # moveduphigh pt. 1
SHUFPS  $0xF5,  r1,     r1    # moveduphigh pt. 2
MULPS           r0,  m1[i]    # mul
MULPS           r1,  m1[i]    # mul
SHUFPS  $0xB1,  r1,     r1    # reverse64
MULPS           r1,     $c    # subadd pt. 1, $c = { -1.0f, 1.0f ... }
ADDPS           r0,     r1    # subadd pt. 2

(Faux) SSE3 Version

MOVSLDUP        r0,  m0[i]    # moveduplow
MOVSHDUP        r1,  m0[i]    # moveduphigh
MULPS           r0,  m1[i]    # mul
MULPS           r1,  m1[i]    # mul
SHUFPS  $0xB1,  r1,     r1    # reverse64
ADDSUBPS        r0,     r1    # subadd

(Faux) Neon Version

VLD1.32         r0,   m0[i]   # moveduplow pt. 1
VDUP.32         r0,   r0[0]   # moveduplow pt. 2
VLD1.32         r1,   m0[i]   # moveduphigh pt. 1
VDUP.32         r1,   r1[1]   # moveduphigh pt. 2
VMUL.F32  r0,   r0,   m1[i]   # mul
VMUL.F32  r1,   r1,   m1[i]   # mul
VREV64.32       r1,   r1      # reverse64
VMLA.F32  r0,   r1,   $c      # subadd, $c = { -1.0f, 1.0f ... }

Matrix-Matrix Multiplication (4x4)

Affine transforms etc.

m0, m1 contain matrices. s0, s1 contain the same matrices as scalar views. m2, s2 contain the result.

r0, r1, r2 are 'registers'

Proposed Version

for (var i = 0; i < 4; i++) {
  clear   r2, 0

  for (var j = 0; j < 4; j++) {
    move    s0, (4 * i + j), r0, 0
    move    m1, j, r1, 0
    mulacc  r0, 0, r1, 0, r2, 0
  }

  move    r2, 0, m2, i
}

JS Version

for (var i = 0; i < 16; i += 4) {
  for (var j = 0; j < 4; j++) {
    s2[i + j] = s1[j + 0] * s0[i + 0] + s1[j +  4] * s0[i + 1] +
                s1[j + 8] * s0[i + 2] + s1[j + 12] * s0[i + 3]
  }
}
}

(Faux) SSE Version

  for (var i = 0; i < 4; i++) {
    XORPS  r2, r2                       # clear
    
    for (var j = 0; j < 4; j++) {
      SHUFPS  $0,   r0, s0[..]          # move/splat (is not actually aligned)
      MOVAPS        r1, m1[j]           # move
      MULPS         r0,     r1          # mulacc pt. 1
      ADDPS         r2,     r0          # mulacc pt. 2
    }
    
    MOVAPS m2[i], r2                    # move
  }

(Faux) Neon Version

  for (var i = 0; i < 4; i++) {
    VEOR            r2,     r2,   r2    # clear
    
    for (var j = 0; j < 4; j++) {
      VLDR.64       r0,     s0[..]      # move
      VDUP.32       r0,     r0[0]       # splat
      VLD1.32       r1,     m1[j]       # move
      VMLA.F32      r2,     r0,     r1  # mulacc
    }
    
    VST1.32         r2,     m2[i]       # move
  }