@auroranockert
Created July 29, 2011 12:18

This was a draft for a blog post that never really happened. But I noticed a lot of people starting to link to it, which is why I reworked some of the content into something a little more polished at http://blog.aventine.se/post/16318162396/simd

The rest of this gist is the original content, which contains a bit more rambling about an actual implementation and a lot more weird language.

SIMD

Mozilla has a bug, https://bugzilla.mozilla.org/show_bug.cgi?id=644389, about the lack of SIMD instructions in JS. What it asks for is essentially adding assembly language to the web, for a potentially very large performance increase on computation-heavy code.

This technology competes directly with WebCL and NaCl, and is in many ways a very good alternative: it would provide many of the advantages of NaCl with only a few of the disadvantages. But there is a problem: it could in theory allow you to build scripts that only run on certain browsers on certain CPUs.

Of course, WebGL and typed arrays already give JS most of this disadvantage, since typed arrays expose the native endianness of the hardware, and WebGL may or may not require specific graphics hardware that is not easily emulated.

There are a few ways to give the Javascript programmer high-performance primitives for building applications, some worse than others.

Raw Assembly - Very High

You could simply let the programmer write a piece of code in assembly as a string, or as a separate script linked to your Javascript, in the same way many common C compilers do it. This gives programmers full flexibility when writing code, and access to the lowest level possible.
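
As a sketch of what that might look like (the Assembly object and its compile function are entirely hypothetical, invented here for illustration):

var kernel = Assembly.compile('sse2',
  'movaps xmm0, [eax]\n' +
  'addps  xmm0, [ebx]\n' +
  'movaps [ecx], xmm0')

kernel(a, b, c)   // a, b, c are typed arrays bound to eax, ebx, ecx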

Advantages

  • Very powerful (all CPU-specific features exposed)
  • Very fast
  • Much code already available
  • No one expects it to be portable

Disadvantages

  • How do we know the code is safe?
  • A language within a language, with very different semantics
  • How do we keep track of register and memory usage?

And while it allows the programmer to interact with the CPU at the lowest level, that is also a significant problem. Ensuring that native code is safe is hard, really hard, which would get any inline assembly proposal shot down pretty quickly. It is also not portable, which is a disadvantage.

Standard Intrinsics - High

Let Javascript programmers use intrinsics identical to those exposed by C compilers. These are also CPU specific, or specific to a family of CPUs, and generally map to a single instruction or a few instructions. They are often designed to provide a friendlier API than the raw instructions: on Intel, for example, they are three-operand instead of two-operand, and the compiler assigns registers for you.

Advantages

  • Very powerful (all CPU-specific features exposed)
  • Very fast (with a good compiler and optimizer)
  • Much code already available
  • No one expects it to be portable

Disadvantages

  • May need to add datatypes (__m64, __m128... for Intel, float32x4_t... for ARM NEON)
  • Need to check every load/store for security issues
  • Modifies the Javascript runtime

An API like this would let programmers take existing code, rip out the C and replace it with Javascript, and the kernel written using intrinsics would run unmodified, giving a quick speedup in many common algorithms.
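
For example, a C kernel that adds two float arrays with SSE intrinsics could be transliterated almost line by line. The SIMD.* names below are hypothetical stand-ins for whatever the Javascript spellings of _mm_load_ps, _mm_add_ps and _mm_store_ps would be:

// C: for (i = 0; i < n; i += 4)
//      _mm_store_ps(c + i, _mm_add_ps(_mm_load_ps(a + i), _mm_load_ps(b + i)))
for (var i = 0; i < n; i += 4) {
  var va = SIMD.mm_load_ps(a, i)   // a, b, c are Float32Arrays, n a multiple of 4
  var vb = SIMD.mm_load_ps(b, i)
  SIMD.mm_store_ps(c, i, SIMD.mm_add_ps(va, vb))
}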

This has some advantages and some disadvantages. The programmer will never expect this code to be portable between processors or browsers, which allows us to remove support in the future, or change the implementation. But on the other hand, future browsers or uncommon processors might simply not work, or run unaccelerated Javascript instead, making the application unnecessarily slow.

Specific Intrinsics - High

Javascript-specific intrinsics: these would still be CPU specific, and expose all or most functionality of the CPU, but with intrinsics optimized for security and use with Javascript. I imagine they would only operate on memory (typed arrays) and could therefore have automatic type recognition at compile time.

Advantages

  • Powerful (all features we need could be exposed)
  • Very fast (with a good compiler and optimizer)
  • Nicer syntax
  • No one expects it to be portable

Disadvantages

  • No 64-bit integer support
  • Need to check every load/store for security issues
  • Modifies the Javascript runtime

Such an API would have most of the advantages and disadvantages of the standard intrinsics, but would trade the ability to reuse existing code for a nicer API or better performance, depending on the implementation.
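
A rough sketch of how that could look (the simd object and its add function are invented for illustration): the element type is inferred from the views, so the same call could compile to ADDPS for Float32Array operands and ADDPD for Float64Array operands.

var buffer = new ArrayBuffer(48)
var a = new Float32Array(buffer,  0, 4)
var b = new Float32Array(buffer, 16, 4)
var c = new Float32Array(buffer, 32, 4)

simd.add(a, 0, b, 0, c, 0)   // c = a + b, type recognized from the views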

Generic Vector/Matrix API - Very Low

A Javascript vector/matrix API could expose most of the floating-point functionality that we would get with SIMD, except that it would be much slower for small vectors, making it a lot less useful for WebGL, games, etc.

Just exposing something like BLAS has some advantages: programmers are used to it, and it has very high-speed implementations on every platform with any kind of floating-point support. Also, if the system has coprocessors with significant floating-point capability (like a GPU), the chance that they implement a fast BLAS is pretty high, which could be important for performance on future embedded platforms like phones or tablets.
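
As a sketch, a thin binding over the standard sgemm routine could look like this. The blas object is hypothetical, but the argument list mirrors the classic Fortran BLAS interface, which computes C = alpha*A*B + beta*C:

// 4x4 single-precision matrix multiply: C = 1.0 * A * B + 0.0 * C
// A, B, C are Float32Arrays of length 16, leading dimension 4
blas.sgemm('N', 'N', 4, 4, 4, 1.0, A, 4, B, 4, 0.0, C, 4)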

Advantages

  • Good for scientific computing
  • Easy to use
  • Very optimized
  • Safe
  • Portable

Disadvantages

  • No media instruction support
  • Slow for small vectors/matrices

While BLAS would provide a high-performance API, I have a hard time seeing it being useful in the domain Javascript is used in: web applications.

There are probably few Javascript applications that solve large systems of linear equations or do many large matrix-matrix multiplications, so while I really like BLAS, I think it would only add complexity to the browsers for no significant gain right now.

Generic Intrinsics - Low

Write generic intrinsics that can be emulated without any SIMD unit, and accelerated with SIMD if the processor actually supports the instruction in question.

Advantages

  • Portable
  • Nice syntax

Disadvantages

  • Safety?
  • Modifies the Javascript runtime
  • Needs fallbacks
  • Needs an API to expose which instructions are accelerated
  • Does not expose all capabilities
  • Exposes non-accelerated capabilities
  • Complicated

While this is from my perspective the worst implementation, I can also see why it would work: floats are probably the most common datatype in many WebGL applications and games, they are supported pretty evenly in all the SIMD implementations we care about, and the interesting operations can easily be emulated on devices without SIMD support.

It can probably even be emulated a lot faster than normal JS operations, since the loops are unrolled and provide a bit more information about dependencies between instructions, more specific datatypes, etc., allowing the JIT to generate better code.
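
For instance, a generic add intrinsic could fall back to a plain loop like the one below when no SIMD unit is available; the fixed trip count and known element types are exactly what a JIT needs to produce good scalar code (the function name is made up for this sketch):

function simdAddFallback(a, i, b, j, c, k) {
  // a, b, c are Float32Arrays; i, j, k are vector indices (4 lanes each)
  for (var l = 0; l < 4; l++) {
    c[4 * k + l] = a[4 * i + l] + b[4 * j + l]
  }
}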

Conclusion

So there is definitely a point in doing a general API for graphics and floating-point operations; it would also be useful for audio processing and so on. But for integer and media instructions, where there is a significant spread between implementations, I cannot really see how a generic API is going to work.

LLVM provides general instructions and types for SIMD, with target-specific instructions in addition to that, so some sort of mix is definitely possible: a basic subset that is enabled on all CPUs, and a larger set of media instructions and hard-to-emulate instructions that are only available on specific targets.

For Aurora-JS, I don't think many of the floating-point operations will be of much use, and most of a generic SIMD API would probably be concerned with floats, at least to begin with. But I am still working on a possible proposal that would provide a few basic primitives that could really help some of the audio processing, and probably help graphics-related tasks a lot.

Proposal

A draft proposal to get the creative juices flowing. Most of these instructions should be accelerated on x86 with SSE and on ARM with Neon. But Neon is very limited when it comes to IEEE-754 support, so this API will either only support flush-to-zero (FTZ) or possibly make denormals implementation dependent.

Only a subset that can be easily implemented without Neon on ARM, and without AVX on Intel, will be included; this means it is mostly limited to floating-point instructions.

The operands to all instructions are Typed Array views, of the new types

Uint8x32, Uint8x16; Uint16x16, Uint16x8; Uint32x8, Uint32x4

Int8x32, Int8x16; Int16x16, Int16x8; Int32x8, Int32x4

Float32x8, Float32x4; Float64x4, Float64x2

These views represent 16-byte aligned memory, or, when the JIT optimizes, a single register or a group of registers. This is for performance reasons: some architectures (Altivec on PowerPC) can only load vectors from 16-byte aligned addresses, and some (SSE on x86) require special instructions to load from unaligned addresses. Neon only requires 8-byte alignment, but will still be limited to 16-byte alignment in this API to make sure code is reasonably portable.

The 256-bit wide SIMD instructions will probably be synthesized from two 128-bit instructions for a while, at least until AVX gets common on x86, and Float64 instructions will need to be synthesized from VFP vector instructions on ARM.

The 256-bit wide instructions may still be a little faster than the 128-bit instructions if you can use them, since they should in theory be a little easier to optimize than multiple 128-bit instructions. But they could also simply not be implemented to start with, since AVX is pretty rare.
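
In the same faux-SSE style as the examples at the end of this post, a 256-bit Float32x8 add would be synthesized from two 128-bit halves roughly like this:

MOVAPS          r0,  m0[i]        # add pt. 1, low 128 bits
ADDPS           r0,  m1[i]        # add pt. 2
MOVAPS          m2[i], r0         # add pt. 3
MOVAPS          r1,  m0[i+4]      # add pt. 4, high 128 bits
ADDPS           r1,  m1[i+4]      # add pt. 5
MOVAPS          m2[i+4], r1       # add pt. 6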

To read Javascript values from the arrays you are working on, you need to create an aliasing view of a scalar type, and read the values via that.
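
For example, assuming the new views follow the existing typed array constructor pattern (the Float32x4 constructor below is this proposal, not anything that exists today):

var buffer = new ArrayBuffer(64)        // assumed to be 16-byte aligned
var vec    = new Float32x4(buffer)      // hypothetical SIMD view, 4 vectors
var scalar = new Float32Array(buffer)   // aliasing scalar view, 16 floats

// after operating on vec, read individual results through the alias
var first = scalar[0]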

The instruction that is actually synthesized would preferably be determined automatically from the views it operates on. Instructions that take two operands need operands of the same type, or an exception should be raised at runtime. The general instructions could possibly ignore that restriction, since they don't operate on the data, just shuffle it around.

  • General Instructions
    • fill(view, index)
    • clear(view, index)
    • move(view, index, view, index)
    • swap(view, index, view, index)
    • reverse16(view, index, view, index)
    • reverse32(view, index, view, index)
    • reverse64(view, index, view, index)
    • moveduplow(view, index, view, index)
    • moveduphigh(view, index, view, index)
    • interleave8(view, index, view, index...)
    • interleave16(view, index, view, index...)
    • interleave32(view, index, view, index...)
    • deinterleave8(view, index, view, index...)
    • deinterleave16(view, index, view, index...)
    • deinterleave32(view, index, view, index...)
  • Prefetch
    • prefetchNT(view, index)
    • prefetchT0(view, index)
    • prefetchT1(view, index)
  • Bitwise Instructions
    • not(view, index, view, index)
    • or(view, index, view, index, view, index)
    • and(view, index, view, index, view, index)
    • xor(view, index, view, index, view, index)
  • Shifts
    • shl(view, index, view, index, view, index)
    • shr(view, index, view, index, view, index)
  • Arithmetic
    • abs(view, index, view, index)
    • neg(view, index, view, index)
    • acc(view, index, view, index)
    • add(view, index, view, index, view, index)
    • sub(view, index, view, index, view, index)
    • mul(view, index, view, index, view, index)
    • div(view, index, view, index, view, index)
    • subadd(view, index, view, index, view, index)
    • mulacc(view, index, view, index, view, index)
  • Compare
    • ge(view, index, view, index, view, index)
    • gt(view, index, view, index, view, index)
    • le(view, index, view, index, view, index)
    • lt(view, index, view, index, view, index)
    • eq(view, index, view, index, view, index)
  • Special
    • min(view, index, view, index, view, index)
    • max(view, index, view, index, view, index)
    • rcpe(view, index, view, index)
    • sqrt(view, index, view, index)
    • sqrte(view, index, view, index)

Many of the instructions that take a scalar can optionally be implemented as an immediate if it is constant; if the architecture does not support that natively (like x86), it must be implemented as a load/splat followed by the operation.

The only operation that takes a scalar view/index is move; the second argument can point to a scalar view of the correct type, and this argument only needs natural alignment for that type.

Other SIMD architectures will probably support a very similar set of instructions, which should be easy to introduce when they are supported by Firefox. But I am too lazy to look up the exact Power/SPARC/MIPS instructions that would make a possible implementation right now.

What some of these instructions do is obvious; what others do is not.

fill (a, i)

Fills a[i] with 0b1...

clear (a, i)

Fills a[i] with 0b0...

move (a, i, b, j)

Moves the data, b[j] = a[i].

If a[i] is a scalar, it is duplicated into every element of b[j].
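
For example, splatting a constant into a whole vector (sources first, destination last, as in the definitions above; c is a one-element Float32Array holding 2.0 and v is a Float32x4 view):

move  c, 0, v, 0        # v[0] = { 2.0, 2.0, 2.0, 2.0 }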

reverse16(a, i, b, j)

b[j] is assigned the value of a[i] with all components of every 2-byte unit reversed. For example,

(Uint8x16) { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }

would be transformed into

(Uint8x16) { 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 }.

reverse32(a, i, b, j)

b[j] is assigned the value of a[i] with all components of every 4-byte unit reversed.

(Uint8x16) { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }

would be transformed into

(Uint8x16) { 3, 2, 1, 0, 7, 6, 5, 4, 11, 10, 9, 8, 15, 14, 13, 12 }.

reverse64(a, i, b, j)

b[j] is assigned the value of a[i] with all components of every 8-byte unit reversed.

(Uint8x16) { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 }

would be transformed into

(Uint8x16) { 7, 6, 5, 4, 3, 2, 1, 0, 15, 14, 13, 12, 11, 10, 9, 8 }.

moveduplow(a, i, b, j)

b[j] is assigned with the value of a[i], but the low part of every SIMD pair is duplicated to the high part.

(Float32x4) { 1.0, 2.0, 3.0, 4.0 }

would be transformed into

(Float32x4) { 1.0, 1.0, 3.0, 3.0 }.

moveduphigh(a, i, b, j)

b[j] is assigned with the value of a[i], but the high part of every SIMD pair is duplicated to the low part.

(Float32x4) { 1.0, 2.0, 3.0, 4.0 }

would be transformed into

(Float32x4) { 2.0, 2.0, 4.0, 4.0 }.

interleave8(a, i, b, j...)

Similar to a move instruction, but the move is interleaved: you are allowed to pass up to 4 source operands and a single destination. The data is written to the destination in an interleaved fashion,

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

would be written as

{ 1, 5, 2, 6, 3, 7, 4, 8 ... }.

interleave16(a, i, b, j...)

Similar to the previous instruction, but with 16-bit interleave,

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

would be written as

{ 1, 2, 5, 6, 3, 4, 7, 8 ... }.

interleave32(a, i, b, j...)

Similar to the previous instruction, but with 32-bit interleave,

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

would be written as

{ 1, 2, 3, 4, 5, 6, 7, 8 ... }.

deinterleave8(a, i, b, j...)

The reverse operation of interleave

(Uint8) { 1, 2, 3, 4, 5, 6, 7, 8 ... }

would be read as

(Uint8x16) { 1, 5 ... }, { 2, 6 ... }, { 3, 7 ... }, { 4, 8 ... }

with a stride of four (passing four destinations and one source).

deinterleave16(a, i, b, j...)

Similar to the previous instruction, but with 16-bit interleave,

(Uint8) { 1, 2, 3, 4, 5, 6, 7, 8 ... }

would be read as

(Uint8x16) { 1, 2 ... }, { 3, 4 ... }, { 5, 6 ... }, { 7, 8 ... }

with a stride of four (passing four destinations and one source).

deinterleave32(a, i, b, j...)

Similar to the previous instruction, but with 32-bit interleave,

(Uint8) { 1, 2, 3, 4, 5, 6, 7, 8 ... }

would be read as

(Uint8x16) { 1, 2, 3, 4 ... }, { 5, 6, 7, 8 ... }

with a stride of two (passing two destinations and one source).

prefetchNT(view, index)

Prefetches into L1-cache for read, may be interpreted as a NOP.

Should map to PREFETCHNTA on x86, probably PLD on ARM.

prefetchT0(view, index)

Prefetches into all levels of cache, may be interpreted as a NOP.

Should map to PREFETCHT0 on x86, PLD on ARM.

prefetchT1(view, index)

Prefetches into 2nd level cache or above, may be interpreted as a NOP.

Should map to PREFETCHT1 on x86, probably PLD on ARM.

not(a, i, b, j)

Flips all the bits in a[i], and assigns the value to b[j].

or(a, i, b, j, c, k)

Assign c[k] with a[i] bitwise-or b[j].

and(a, i, b, j, c, k)

Assign c[k] with a[i] bitwise-and b[j].

xor(a, i, b, j, c, k)

Assign c[k] with a[i] bitwise-exclusive-or b[j].

shl(a, i, b, j, c, k)

Shift a[i] elementwise left by the value of b[j] and assigns to c[k].

Large or negative shifts are undefined.

shr(a, i, b, j, c, k)

Shift a[i] elementwise right by the value of b[j] and assigns to c[k].

Large or negative shifts are undefined.

abs(a, i, b, j)

Assigns b[j] elementwise the absolute value of a[i].

neg(a, i, b, j)

Assigns b[j] elementwise the negation of a[i].

acc(a, i, b, j)

Adds a[i] elementwise to b[j] and assigns the sum to b[j].

add(a, i, b, j, c, k)

Adds a[i] elementwise to b[j] and assigns the sum to c[k].

sub(a, i, b, j, c, k)

Subtracts a[i] elementwise from b[j] and assigns the difference to c[k].

mul(a, i, b, j, c, k)

Multiplies a[i] elementwise with b[j] and assigns the product to c[k].

div(a, i, b, j, c, k)

Divides a[i] elementwise with b[j] and assigns the quotient to c[k].

subadd(a, i, b, j, c, k)

The first element in each pair in a[i] and b[j] is subtracted, the next is added.

(Float32x4) { 1.0, 3.0, 5.0, 7.0 }, { 0.0, 1.0, 2.0, 3.0 }

when subadded would result in,

(Float32x4) { 1.0, 4.0, 3.0, 10.0 }.

mulacc(a, i, b, j, c, k)

c[k] += a[i] * b[j]

ge(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is greater than or equal to the element in b[j], else it is filled with zeroes.

gt(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is greater than the element in b[j], else it is filled with zeroes.

le(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is less than or equal to the element in b[j], else it is filled with zeroes.

lt(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is less than the element in b[j], else it is filled with zeroes.

eq(a, i, b, j, c, k)

Each element c[k] is filled with ones if the corresponding element in a[i] is equal to the element in b[j], else it is filled with zeroes.
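
These all-ones/all-zeroes masks are what make branchless selection possible when combined with the bitwise instructions above. A sketch in the proposal's notation (sources first, destination last): r gets x wherever a < b and y elsewhere, with m, t0 and t1 as scratch vectors:

lt    a, i, b, j, m, 0        # m = all ones where a[i] < b[j]
and   m, 0, x, 0, t0, 0       # t0 = x, masked
not   m, 0, m, 0              # invert the mask
and   m, 0, y, 0, t1, 0       # t1 = y, masked
or    t0, 0, t1, 0, r, 0      # r = (a < b) ? x : y, elementwise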

min(a, i, b, j, c, k)

Each element in c[k] is assigned the minimum of the corresponding elements in a[i] and b[j].

max(a, i, b, j, c, k)

Each element in c[k] is assigned the maximum of the corresponding elements in a[i] and b[j].

rcpe(a, i, b, j)

b[j] is assigned with an elementwise approximation of the reciprocal of a[i].

sqrt(a, i, b, j)

b[j] is assigned with an elementwise square root of a[i].

sqrte(a, i, b, j)

b[j] is assigned with an elementwise approximation of the square root of a[i].
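
The estimate instructions are meant to be refined by the caller. For the reciprocal, one Newton-Raphson step, e' = e * (2 - x * e), roughly doubles the number of correct bits; in the proposal's notation (two is a vector filled with 2.0, t is a scratch vector):

rcpe  x, i, e, 0              # e ≈ 1 / x
mul   x, i, e, 0, t, 0        # t = x * e
sub   t, 0, two, 0, t, 0      # t = 2 - x * e  (sub assigns b - a to c)
mul   e, 0, t, 0, e, 0        # e = e * (2 - x * e)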

Some Examples

Most of these are just sketches; they don't handle edge cases like operand lengths that are not a multiple of the SIMD length, etc. They are also not optimized, this is the first time I have ever written Neon assembly, and I haven't tried actually executing any of these fragments: they are buggy, incorrect, slow and plain stupid.

I have tried to keep all SSE examples in Intel order, and all the SIMD API proposal examples in AT&T order (sources first, destination last). I think I failed in a few cases simply because I am tired, so please read what I mean, not what I write. I tried to follow the order in the assembly manual for Neon, but might have failed in some places.

I also cheated and added memory operands for ARM; it doesn't really have them, but adding extra loads was too annoying, since I am not really differentiating registers and memory anyhow.

Most of the ideas for examples are stolen and adapted from dsp.js

Interleave

Sometimes useful in audio processing, not much use alone

m0, m1 are two channels of mono. m2 is the interleaved stereo.

Proposed Version

for (var i = 0; i < length / 2; i++) {
  interleave32  m0, i, m1, i, m2, i * 2
}

JS Version

for (var i = 0; i < length; i++) {
  m2[2 * i + 0] = m0[i]
  m2[2 * i + 1] = m1[i]
}

(Faux) SSE Version

for (var i = 0; i < length; i++) {
  MOVAPS          r0,  m0[i]        # interleave32 pt. 1
  MOVAPS          r1,  m1[i]        # interleave32 pt. 2
  MOVAPS          r2,     r0        # interleave32 pt. 3
  SHUFPS  $0x11,  r2,     r1        # interleave32 pt. 4
  SHUFPS  $0x27,  r2,     r2        # interleave32 pt. 5
  MOVAPS          m2[..], r2        # interleave32 pt. 6
  MOVAPS          r2,     r0        # interleave32 pt. 7
  SHUFPS  $0xBB,  r2,     r1        # interleave32 pt. 8
  SHUFPS  $0x27,  r2,     r2        # interleave32 pt. 9
  MOVAPS          m2[..], r2        # interleave32 pt. 10
}

(Faux) Neon Version

for (var i = 0; i < length; i++) {
  VLD1.32         r0,       m0[i]   # interleave32 pt. 1
  VLD1.32         r1,       m1[i]   # interleave32 pt. 2
  VST2.32         {r0, r1}, m2[..]  # interleave32 pt. 3
}

Deinterleave

The reverse operation, not much use alone

m0 is interleaved stereo, m0, m1 are two channels of mono.

Proposed Version

for (var i = 0; i < length / 2; i++) {
  deinterleave32  m1, i, m2, i, m0, 2 * i
}

JS Version

for (var i = 0; i < length / 2; i++) {
  m1[i] = m0[2 * i + 0]
  m2[i] = m0[2 * i + 1]
}

(Faux) SSE Version

for (var i = 0; i < length; i++) {
  MOVAPS          r0,  m0[..]       # deinterleave32 pt. 1
  MOVAPS          r1,  m0[..]       # deinterleave32 pt. 2
  MOVAPS          r2,     r0        # deinterleave32 pt. 3
  SHUFPS  $0x33,  r2,     r1        # deinterleave32 pt. 4
  MOVAPS          m1[..], r2        # deinterleave32 pt. 5
  MOVAPS          r2,     r0        # deinterleave32 pt. 6
  SHUFPS  $0x77,  r2,     r1        # deinterleave32 pt. 7
  MOVAPS          m2[..], r2        # deinterleave32 pt. 8
}

(Faux) Neon Version

for (var i = 0; i < length; i++) {
  VLD2.32         {r0, r1}, m0[i]   # deinterleave32 pt. 1
  VST1.32         r0,       m1[..]  # deinterleave32 pt. 2
  VST1.32         r1,       m2[..]  # deinterleave32 pt. 3
}

Complex Multiplication

Needed for implementing a fast DFT, very useful in general

m0, m1 contain complex numbers { ar, ai, br, bi } and { cr, ci, dr, di }. r0, r1 are 'registers'.

Proposed Version

moveduplow    m0, i, r0, 0
moveduphigh   m0, i, r1, 0
mul           r0, 0, m1, i, r0, 0
mul           r1, 0, m1, i, r1, 0
reverse64     r1, 0, r1, 0
subadd        r0, 0, r1, 0, r0, 0

JS Version

[ar, ai] = [m0[2 * i], m0[2 * i + 1]]
[br, bi] = [m1[2 * i], m1[2 * i + 1]]

cr = ar * br - ai * bi
ci = ar * bi + ai * br

(Faux) SSE Version

MOVAPS          r0,  m0[i]    # moveduplow pt. 1
SHUFPS  $0xA0,  r0,     r0    # moveduplow pt. 2
MOVAPS          r1,  m0[i]    # moveduphigh pt. 1
SHUFPS  $0xF5,  r1,     r1    # moveduphigh pt. 2
MULPS           r0,  m1[i]    # mul
MULPS           r1,  m1[i]    # mul
SHUFPS  $0xB1,  r1,     r1    # reverse64
MULPS           r1,     $c    # subadd pt. 1, $c = { -1.0f, 1.0f ... }
ADDPS           r0,     r1    # subadd pt. 2

(Faux) SSE3 Version

MOVSLDUP        r0,  m0[i]    # moveduplow
MOVSHDUP        r1,  m0[i]    # moveduphigh
MULPS           r0,  m1[i]    # mul
MULPS           r1,  m1[i]    # mul
SHUFPS  $0xB1,  r1,     r1    # reverse64
ADDSUBPS        r0,     r1    # subadd

(Faux) Neon Version

VLD1.32         r0,   m0[i]   # moveduplow pt. 1
VDUP.32         r0,   r0[0]   # moveduplow pt. 2
VLD1.32         r1,   m0[i]   # moveduphigh pt. 1
VDUP.32         r1,   r1[1]   # moveduphigh pt. 2
VMUL.F32  r0,   r0,   m1[i]   # mul
VMUL.F32  r1,   r1,   m1[i]   # mul
VREV64.32       r1,   r1      # reverse64
VMLA.F32  r0,   r1,   $c      # subadd, $c = { -1.0f, 1.0f ... }

Matrix-Matrix Multiplication (4x4)

Affine transforms etc.

m0, m1 contain matrices. s0, s1 contain the same matrices as scalar views. m2, s2 contain the result.

r0, r1, r2 are 'registers'

Proposed Version

for (var i = 0; i < 4; i++) {
  clear   r2, 0

  for (var j = 0; j < 4; j++) {
    move    s0, (4 * i + j), r0, 0
    move    m1, j, r1, 0
    mulacc  r0, 0, r1, 0, r2, 0
  }

  move    r2, 0, m2, i
}

JS Version

for (var i = 0; i < 16; i += 4) {
  for (var j = 0; j < 4; j++) {
    s2[i + j] = s1[j + 0] * s0[i + 0] + s1[j +  4] * s0[i + 1] +
                s1[j + 8] * s0[i + 2] + s1[j + 12] * s0[i + 3]
  }
}
}

(Faux) SSE Version

  for (var i = 0; i < 4; i++) {
    XORPS  r2, r2                       # clear
    
    for (var j = 0; j < 4; j++) {
      SHUFPS  $0,   r0, s0[..]          # move/splat (is not actually aligned)
      MOVAPS        r1, m1[j]           # move
      MULPS         r0,     r1          # mulacc pt. 1
      ADDPS         r2,     r0          # mulacc pt. 2
    }
    
    MOVAPS m2[i], r2                    # move
  }

(Faux) Neon Version

  for (var i = 0; i < 4; i++) {
    VEOR            r2,     r2,   r2    # clear
    
    for (var j = 0; j < 4; j++) {
      VLDR.64       r0,     s0[..]      # move
      VDUP.32       r0,     r0[0]       # splat
      VLD1.32       r1,     m1[j]       # move
      VMLA.F32      r2,     r0,     r1  # mulacc
    }
    
    VST1.32         r2,     m2[i]       # move
  }