@Theldus
Last active February 11, 2024 19:56
Does V8's JIT Engine Truly Compete with GCC?

JavaScript is a language that has never particularly piqued my interest, as I prefer working at the binary, CPU, operating system level, etc. However, something that has truly caught my attention recently is V8, or more precisely, its ability to perform Just-in-time Compilation (JIT).

V8's JIT is highly acclaimed for its speed and consistently ranks not far from the top in various benchmarks. While it might not always be the fastest, it is orders of magnitude ahead of purely interpreted languages such as Python (CPython), PHP, and Perl.

This article aims to address some of my questions:

  1. Is it truly as remarkable as claimed?
  2. What is the ASM (x86_64) generated by it? Is it close to that of a compiled language?
  3. Can you debug this JIT-generated code?

Disclaimer: The text reflects my personal opinion, with no affiliation to V8. Everything presented here is the result of my exploration over the past week. Please avoid drawing hasty conclusions or taking the content too seriously. I am still learning, and any assistance on the subject is highly appreciated.

My Environment

The environment for conducting my tests is as follows:

  • Slackware 14.2-current, w/ Linux v5.4.186
  • GCC v9.3.0 / Clang v14.0.6
  • V8 and V8-debug v12.3.127, obtained from jsvu
  • wasm2wat (git~1.0.34-36-gef851559), obtained from WABT: The WebAssembly Binary Toolkit

Let's JIT!

All analyses conducted here will be based on the following code snippet, specifically the mul function:

mul.js:

const SIZE = 10000000
function mul(a,b,c) {
  for (let i = 0; i < SIZE; i++)
    a[i] = b[i] * c[i];
}

var a = Array(SIZE).fill(0)
var b = Array(SIZE).fill(50)
var c = Array(SIZE).fill(40)

const iter = Number(arguments[0]);

t0 = performance.now()
for (let i = 0; i < iter; i++)
  mul(a,b,c)
t1 = performance.now()

console.log("Time: " + (t1-t0) + " ms");

The code is quite simple: it multiplies two vectors, b and c, element by element and stores the result in a. Two reasons led me to choose such simple code:

  1. It is a straightforward and potentially optimizable piece of code.
  2. I really intend to read the ASM generated by the JIT, and a small piece of source code is likely to produce concise ASM.

Basic and Optimizing Compilers

V8 does not start by optimizing everything. Code first runs in cheap, non-optimizing tiers (the Ignition bytecode interpreter and, where enabled, the Sparkplug baseline compiler). If, during execution, V8 detects 'hot' functions (functions that potentially stress the CPU), the code is then recompiled with an optimizing compiler. One such compiler is Turbofan, and that's what we are going to explore here.
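To make the tiering idea a bit more concrete, here is a minimal illustrative sketch (not taken from V8): it calls a small function repeatedly and records how long each round takes. The names hotSum and timeRounds are hypothetical, and the timings depend entirely on the machine and the engine, so no particular numbers should be expected; the first rounds may simply turn out slower than the later ones.

```javascript
// Use performance.now() where available, otherwise fall back to Date.now().
const now = (typeof performance !== "undefined")
  ? () => performance.now()
  : () => Date.now();

// Hypothetical hot function: a plain loop the engine may decide to optimize.
function hotSum(arr) {
  let s = 0;
  for (let i = 0; i < arr.length; i++)
    s += arr[i];
  return s;
}

// Hypothetical helper: runs hotSum `rounds` times and records each duration.
function timeRounds(rounds, size) {
  const data = Array(size).fill(1);
  const timings = [];
  for (let r = 0; r < rounds; r++) {
    const t0 = now();
    const sum = hotSum(data);
    timings.push({ round: r, ms: now() - t0, sum: sum });
  }
  return timings;
}

const timings = timeRounds(5, 1000000);
for (const t of timings)
  console.log("round " + t.round + ": " + t.ms + " ms");
```

Only the relative trend between rounds is of interest here; which tier actually ran each round cannot be told from the timings alone.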

GDB Enters the Scene!

Assuming the tools are properly installed, it is possible to dump the machine instructions generated by Turbofan with:

$ v8-debug --print-opt-code mul.js -- 100

producing an output like:

--- Raw source ---
(a,b,c) {
    for (let i = 0; i < SIZE; i++)
        a[i] = b[i] * c[i];
}

--- Optimized code ---
optimization_id = 1
source_position = 36
kind = TURBOFAN
name = mul
stack_slots = 21
compiler = turbofan
address = 0x304a00002549

Instructions (size = 1208)
0x7fffa8005840     0  488d1df9ffffff       REX.W leaq rbx,[rip+0xfffffff9]
0x7fffa8005847     7  483bd9               REX.W cmpq rbx,rcx
0x7fffa800584a     a  740d                 jz 0x7fffa8005859  <+0x19>
[...]
(huge wall of 284 instructions, for 2 lines of code... good luck)

But that's not very exciting:

  • The code displayed on the screen is extensive and says little about what is happening where.
  • The instruction addresses are not very useful on a system that uses ASLR.

We truly need debugging capability to decipher all this code, and we will have it!

To use GDB on the JIT-generated code, we need to be a little clever: the instruction dump above is done before the code is executed, so we need to interrupt the v8 execution before that happens and then add breakpoints wherever we want.

In version v12.3.127 of v8, this happens inside the Disassemble() function in src/objects/code.cc, specifically at line 189, just after the DisassembleCodeRange() function call (you can check this in the v8 source). Additionally, the v8 and v8-debug commands installed by jsvu are actually shell scripts that wrap the real program, so GDB has to be pointed at the real binary.

In summary, what you really need to load your script and break just after the instruction dump is:

$ gdb \
  -ex "set confirm off" \
  -ex "b _start" \
  -ex "r" \
  -ex "b code.cc:189" \
  -ex "c" \
  --args $HOME/.jsvu/engines/v8-debug/v8-debug \
  --snapshot_blob="$HOME/.jsvu/engines/v8-debug/snapshot_blob.bin" \
  --print-opt-code \
  mul.js -- 100

After that, simply set a breakpoint for some point of interest within your JITed function and analyze.

Below is the heavily commented code of the mul() function:

--- Raw source ---
(a,b,c) {
    for (let i = 0; i < SIZE; i++)
        a[i] = b[i] * c[i];
}

--- Optimized code ---
optimization_id = 1
source_position = 36
kind = TURBOFAN
name = mul
stack_slots = 21
compiler = turbofan
address = 0x304a00002549

Instructions (size = 1208)
[snip]

initial count' = 56651
mul:
                                           count = rax
/-> 0x7fffa8005a80   240  488bc1               REX.W movq rax,rcx
|
|                                          SIZE = mem[rbp-0x60]/2
|   0x7fffa8005a83   243  448b4da0             movl r9,[rbp-0x60]
|   0x7fffa8005a87   247  41d1f9               sarl r9, 1
|
|                                          if (count < SIZE)
|   0x7fffa8005a8a   24a  413bc1               cmpl rax,r9
|   0x7fffa8005a8d   24d  0f8dc0000000         jge 0x7fffa8005b53  <+0x313> (NT)
|   0x7fffa8005a93   253  3bc6                 cmpl rax,rsi
|   0x7fffa8005a95   255  0f8328020000         jnc 0x7fffa8005cc3  <+0x483> (NT)
|
|                                          r9 = b[idx]
|   0x7fffa8005a9b   25b  458b4c8307           movl r9,[r11+rax*4+0x7]
|   0x7fffa8005aa0   260  41baffffffff         movl r10,0xffffffff
|   0x7fffa8005aa6   266  4d3bca               REX.W cmpq r9,r10
|   0x7fffa8005aa9   269  760d                 jna 0x7fffa8005ab8  <+0x278> (T) -\
|   0x7fffa8005aab   26b  ba02000000           movl rdx,0x2                      |
|   0x7fffa8005ab0   270  41ff95c0530000       call [r13+0x53c0]                 |
|   0x7fffa8005ab7   277  cc                   int3l                             |
|   0x7fffa8005ab8   278  3bc2                 cmpl rax,rdx                    <-/
|   0x7fffa8005aba   27a  0f8307020000         jnc 0x7fffa8005cc7  <+0x487> (NT)
|
|                                          r15 = c[idx]
|   0x7fffa8005ac0   280  458b7c8007           movl r15,[r8+rax*4+0x7]
|   0x7fffa8005ac5   285  41baffffffff         movl r10,0xffffffff
|   0x7fffa8005acb   28b  4d3bfa               REX.W cmpq r15,r10
|   0x7fffa8005ace   28e  760d                 jna 0x7fffa8005add  <+0x29d> (T) -\
|   0x7fffa8005ad0   290  ba02000000           movl rdx,0x2                      |
|   0x7fffa8005ad5   295  41ff95c0530000       call [r13+0x53c0]                 |
|   0x7fffa8005adc   29c  cc                   int3l                             |
|   0x7fffa8005add   29d  41f6c101             testb r9,0x1                    <-/
|   0x7fffa8005ae1   2a1  0f85e4010000         jnz 0x7fffa8005ccb  <+0x48b> (NT)
|
|                                          r9 /= 2
|   0x7fffa8005ae7   2a7  41d1f9               sarl r9, 1
|   0x7fffa8005aea   2aa  41f6c701             testb r15,0x1
|   0x7fffa8005aee   2ae  0f85db010000         jnz 0x7fffa8005ccf  <+0x48f> (NT)
|
|                                          r15 /= 2
|   0x7fffa8005af4   2b4  41d1ff               sarl r15, 1
|   0x7fffa8005af7   2b7  418bd9               movl rbx,r9
|   0x7fffa8005afa   2ba  33c9                 xorl rcx,rcx
|
|                                  >>>>    tmp = rbx*r15   <<<< (40 insns, 4 jumps)
|   0x7fffa8005afc   2bc  410fafdf             imull rbx,r15
|
|                                          if (!overflow(tmp))
|   0x7fffa8005b00   2c0  0f90c1               setol cl
|   0x7fffa8005b03   2c3  85c9                 testl rcx,rcx
|   0x7fffa8005b05   2c5  0f85c8010000         jnz 0x7fffa8005cd3  <+0x493> (NT)
|   0x7fffa8005b0b   2cb  85db                 testl rbx,rbx
|   0x7fffa8005b0d   2cd  0f850c000000         jnz 0x7fffa8005b1f  <+0x2df> (T) -\
|   0x7fffa8005b13   2d3  450bf9               orl r15,r9                        |
|   0x7fffa8005b16   2d6  4585ff               testl r15,r15                     |
|   0x7fffa8005b19   2d9  0f8cb8010000         jl 0x7fffa8005cd7  <+0x497>       |
|                                                                                |
|                                          if (count < SIZE)                     |
|   0x7fffa8005b1f   2df  4439e0               cmpl rax,r12                    <-/
|   0x7fffa8005b22   2e2  0f83b3010000         jnc 0x7fffa8005cdb  <+0x49b> (NT)
|
|                                          tmp2 = tmp*2
|   0x7fffa8005b28   2e8  488bcb               REX.W movq rcx,rbx
|   0x7fffa8005b2b   2eb  03cb                 addl rcx,rbx
|
|                                          if (!overflow(tmp2))
|   0x7fffa8005b2d   2ed  0f80ac010000         jo 0x7fffa8005cdf  <+0x49f> (NT)
|
|                                          a[count] = tmp2
|   0x7fffa8005b33   2f3  894c8707             movl [rdi+rax*4+0x7],rcx
|
|                                          count = rcx+1 (count++)
|   0x7fffa8005b37   2f7  488bc8               REX.W movq rcx,rax
|   0x7fffa8005b3a   2fa  83c101               addl rcx,0x1
|
|                                          if (!overflow(count))
|   0x7fffa8005b3d   2fd  0f80a0010000         jo 0x7fffa8005ce3  <+0x4a3> (NT)
|
|                                          if (!should_not_interrupt)
|                                     StackGuard::address_of_interrupt_request
|   0x7fffa8005b43   303  41807db100           cmpb [r13-0x4f],0x0
|
\-- 0x7fffa8005b48   308  0f8432ffffff         jz  0x7fffa8005a80  <+0x240> (T)
    0x7fffa8005b4e   30e  e99c000000           jmp 0x7fffa8005bef  <+0x3af>

[snip]

The generated code may look complicated, but it's actually quite simple. However, there are some interesting points worth noting:

  1. Turbofan was indeed used for code optimization, just as expected. Note that this was intentional: the loop invoking the mul() function makes it 'hot', prompting v8 to optimize it.
  2. The loop starts from index 56651 instead of 0, and the reason for this is straightforward: the optimized code picks up where the non-optimized code left off (V8 calls this on-stack replacement, or OSR), which makes sense, doesn't it?
  3. Notice that the values read are divided by 2 and, when saved, multiplied by 2. In memory, the vector b is stored as a sequence of 100 (instead of 50), and c as a sequence of 80 (instead of 40)... don't ask me why. Any ideas?
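
One plausible explanation for the halving and doubling, stated here as an assumption rather than something verified in this write-up: V8 stores small integers ("Smis") in tagged form, shifted left by one bit so that the lowest bit can serve as a tag (0 for a small integer, 1 for a heap pointer). Under that assumption, 50 is stored as 100 and 40 as 80, the sarl instructions untag the operands, the testb ...,0x1 checks verify the tag, and the final addl re-tags the product. The helper names below are hypothetical:

```javascript
// Assumption: 31-bit small integers with a 0 tag in the lowest bit.
function smiTag(value)    { return value << 1; }          // what ends up in memory
function smiUntag(tagged) { return tagged >> 1; }         // mirrors "sarl reg, 1"
function isSmi(tagged)    { return (tagged & 1) === 0; }  // mirrors "testb reg,0x1"

const storedB = smiTag(50);
const storedC = smiTag(40);
console.log(storedB);                 // 100
console.log(storedC);                 // 80
console.log(isSmi(storedB));          // true

// Untag both operands, multiply, then re-tag the product before storing it.
const product = smiUntag(storedB) * smiUntag(storedC);
console.log(product);                 // 2000
console.log(smiTag(product));         // 4000
```

If this reading is right, the doubling is not extra arithmetic on the data but simply the cost of keeping integers and pointers distinguishable in the same slot.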

Apart from that, traditional overflow checks are performed, and the loop proceeds, performing one multiplication per iteration, with 40 instructions and 4 branches taken between each imull.
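
Read as JavaScript, the guards in the listing amount to roughly the following per-element logic. This is a sketch of one reading of the disassembly, not V8's actual implementation: bail stands in for a deoptimization exit, mulElementChecked is a hypothetical name, and the 31-bit small-integer range is an assumption.

```javascript
const INT32_MIN = -(2 ** 31), INT32_MAX = 2 ** 31 - 1;
const SMI_MIN = -(2 ** 30), SMI_MAX = 2 ** 30 - 1; // assumed 31-bit range

// Stand-in for leaving the optimized code (deoptimization).
function bail(reason) {
  throw new Error("deopt: " + reason);
}

function mulElementChecked(a, b, c, i) {
  // Bounds checks ("cmpl rax,<length>" followed by "jnc").
  if (i >= b.length || i >= c.length || i >= a.length) bail("out of bounds");

  const x = b[i], y = c[i];
  // Tag checks ("testb reg,0x1" followed by "jnz").
  if (!Number.isInteger(x) || !Number.isInteger(y)) bail("not a small integer");

  const p = x * y;
  // 32-bit overflow check after imull ("setol" / "jnz").
  if (p < INT32_MIN || p > INT32_MAX) bail("multiply overflow");
  // A zero product with a negative operand would be -0 in JavaScript.
  if (p === 0 && (x < 0 || y < 0)) bail("minus zero");
  // Re-tagging doubles the value, which may overflow again ("addl" / "jo").
  if (p < SMI_MIN || p > SMI_MAX) bail("tagging overflow");

  a[i] = p;
}

const out = [0];
mulElementChecked(out, [50], [40], 0);
console.log(out[0]); // 2000
```

None of these guards exist in the C version, which is a large part of why the JIT loop is so much longer for the same two lines of source.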

In the end, with this being the best that Turbofan can do for this code, I was slightly disappointed. I expected loop unrolling, perhaps SIMD, and so on. Well, we'll have surprises later, don't be sad =).

GCC Enters the Chat!

Having seen the previous JIT code, what code does GCC produce? Is it somewhat equivalent? Are there optimizations? Let's find out.

The equivalent C code looks like this:

mul.c:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <inttypes.h>
#include <time.h>

#define SIZE 10000000

static int64_t difftimespec_us(
  const struct timespec end, const struct timespec start)
{
    return ((int64_t)end.tv_sec - (int64_t)start.tv_sec) * (int64_t)1000000
         + ((int64_t)end.tv_nsec - (int64_t)start.tv_nsec) / 1000;
}

void mul(int *restrict a, int *restrict b, int *restrict c)
{
  for (size_t i = 0; i < SIZE; i++)
    a[i] = b[i] * c[i];
}

int main(int argc, char **argv)
{
  int64_t diff;
  struct timespec t0, t1;
  int *a = malloc(sizeof(*a) * SIZE);
  int *b = malloc(sizeof(*b) * SIZE);
  int *c = malloc(sizeof(*c) * SIZE);

  memset(a, 0, sizeof(*a) * SIZE);  /* memset counts bytes, not elements */
  for (size_t i = 0; i < SIZE; i++) {
    b[i] = 50;
    c[i] = 40;
  }

  int iter = atoi(argv[1]);

  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (int i = 0; i < iter; i++)
    mul(a,b,c);
  clock_gettime(CLOCK_MONOTONIC, &t1);

  diff = difftimespec_us(t1, t0);
  printf("Time: %f ms\n", diff/1000.0);
  return (0);
}

which produces the following asm when built with -O0 on GCC v9.3.0:

void mul(int *restrict a, int *restrict b, int *restrict c)
{
  4011be:          55                       push   %rbp
  4011bf:          48 89 e5                 mov    %rsp,%rbp
  4011c2:          48 89 7d e8              mov    %rdi,-0x18(%rbp)
  4011c6:          48 89 75 e0              mov    %rsi,-0x20(%rbp)
  4011ca:          48 89 55 d8              mov    %rdx,-0x28(%rbp)
    for (size_t i = 0; i < SIZE; i++)
  4011ce:          48 c7 45 f8 00 00 00     movq   $0x0,-0x8(%rbp)
  4011d5:          00
  4011d6:      /-- eb 47                    jmp    40121f <mul+0x61>
        a[i] = b[i] * c[i];
  4011d8:   /--|-> 48 8b 45 f8              mov    -0x8(%rbp),%rax
  4011dc:   |  |   48 8d 14 85 00 00 00     lea    0x0(,%rax,4),%rdx
  4011e3:   |  |   00
  4011e4:   |  |   48 8b 45 e0              mov    -0x20(%rbp),%rax
  4011e8:   |  |   48 01 d0                 add    %rdx,%rax
  4011eb:   |  |   8b 08                    mov    (%rax),%ecx           ecx = b[i]
  4011ed:   |  |   48 8b 45 f8              mov    -0x8(%rbp),%rax
  4011f1:   |  |   48 8d 14 85 00 00 00     lea    0x0(,%rax,4),%rdx
  4011f8:   |  |   00
  4011f9:   |  |   48 8b 45 d8              mov    -0x28(%rbp),%rax
  4011fd:   |  |   48 01 d0                 add    %rdx,%rax
  401200:   |  |   8b 00                    mov    (%rax),%eax           eax = c[i]
  401202:   |  |   48 8b 55 f8              mov    -0x8(%rbp),%rdx
  401206:   |  |   48 8d 34 95 00 00 00     lea    0x0(,%rdx,4),%rsi
  40120d:   |  |   00
  40120e:   |  |   48 8b 55 e8              mov    -0x18(%rbp),%rdx
  401212:   |  |   48 01 f2                 add    %rsi,%rdx
  401215:   |  |   0f af c1                 imul   %ecx,%eax             eax  = eax*ecx
  401218:   |  |   89 02                    mov    %eax,(%rdx)           a[i] = eax
    for (size_t i = 0; i < SIZE; i++)
  40121a:   |  |   48 83 45 f8 01           addq   $0x1,-0x8(%rbp)       i++;
  40121f:   |  \-> 48 81 7d f8 7f 96 98     cmpq   $0x98967f,-0x8(%rbp)  if (i < count)
  401226:   |      00
  401227:   \----- 76 af                    jbe    4011d8 <mul+0x1a>     (T)
}
  401229:          90                       nop
  40122a:          90                       nop
  40122b:          5d                       pop    %rbp
  40122c:          c3                       ret

The result is similar to the code generated by the JIT, but much simpler and more straightforward, with only 18 instructions and one taken jump between each imul.

Can GCC go beyond this? Certainly, with -O3 and -march=native, GCC easily makes use of AVX2:

clock_gettime(CLOCK_MONOTONIC, &t0);
  40111d:             48 8d 75 b0           lea    -0x50(%rbp),%rsi
  401121:             bf 01 00 00 00        mov    $0x1,%edi
  401126:             41 89 c6              mov    %eax,%r14d
  401129:             e8 02 ff ff ff        call   401030 <clock_gettime@plt>
    for (int i = 0; i < iter; i++)
  40112e:             45 85 ff              test   %r15d,%r15d
  401131:   /-------- 7e 36                 jle    401169 <main+0xe9>
  401133:   |         31 d2                 xor    %edx,%edx
  401135:   |  /----> 31 c0                 xor    %eax,%eax
  401137:   |  |      66 0f 1f 84 00 00 00  nopw   0x0(%rax,%rax,1)
  40113e:   |  |      00 00
        a[i] = b[i] * c[i];
  401140:   |  |  /-> c4 c1 7e 6f 14 04     vmovdqu (%r12,%rax,1),%ymm2
  401146:   |  |  |   c4 e2 6d 40 04 03     vpmulld (%rbx,%rax,1),%ymm2,%ymm0
  40114c:   |  |  |   c4 c1 7e 7f 44 05 00  vmovdqu %ymm0,0x0(%r13,%rax,1)
    for (size_t i = 0; i < SIZE; i++)
  401153:   |  |  |   48 83 c0 20           add    $0x20,%rax
  401157:   |  |  |   48 3d 00 5a 62 02     cmp    $0x2625a00,%rax
  40115d:   |  |  \-- 75 e1                 jne    401140 <main+0xc0>
    for (int i = 0; i < iter; i++)
  40115f:   |  |      ff c2                 inc    %edx
  401161:   |  |      44 39 f2              cmp    %r14d,%edx
  401164:   |  \----- 75 cf                 jne    401135 <main+0xb5>
  401166:   |         c5 f8 77              vzeroupper
        mul(a,b,c);
clock_gettime(CLOCK_MONOTONIC, &t1);
  401169:   \-------> 48 8d 75 c0           lea    -0x40(%rbp),%rsi
  40116d:             bf 01 00 00 00        mov    $0x1,%edi
  401172:             e8 b9 fe ff ff        call   401030 <clock_gettime@plt>

The mul() function was inlined, and now 8 numbers are read and multiplied at a time!
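
The constants in the vectorized loop can be sanity-checked with a little arithmetic: each iteration advances the byte offset by 0x20 (one 256-bit ymm register), and the loop ends when the offset reaches 0x2625a00. The sketch below only redoes that arithmetic; the variable names are hypothetical, and a 4-byte int is assumed.

```javascript
const SIZE = 10000000;        // element count from mul.c
const BYTES_PER_INT = 4;      // assumed sizeof(int) on x86_64 Linux
const STEP_BYTES = 0x20;      // add $0x20,%rax
const END_BYTES = 0x2625a00;  // cmp $0x2625a00,%rax

const lanes = STEP_BYTES / BYTES_PER_INT;   // ints handled per vpmulld
const iterations = END_BYTES / STEP_BYTES;  // trips through the inner loop

console.log(lanes);                               // 8
console.log(END_BYTES === SIZE * BYTES_PER_INT);  // true
console.log(iterations);                          // 1250000
```

So the inner loop runs 1,250,000 times per call instead of 10,000,000, with three vector instructions doing the work of eight scalar multiplications.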

Regarding performance: A comparative analysis will be conducted later, don't worry.

WebAssembly Comes to the Rescue

Fair enough, it's understandable that JIT-compiled code won't really match up to GCC with -O3, let alone with -march=native: deciding whether code is 'hot' or not, compiling and executing at runtime, and still finding a balance between compilation time and performance seems quite complicated. Moreover, JS being a dynamically typed language imposes more constraints on the code to be compiled and executed (a problem that Java, for example, doesn't have).

On the other hand, WebAssembly is compiled 'ahead-of-time', similar to Java or C. It has static typing (which avoids the compile-recompile cycle of JS code), its bytecode is smaller than plain-text JS, and it even skips the parsing stage of the source code. All of this creates a very favorable scenario for optimization by Clang.

Having said that, there are high expectations for WebAssembly. Let's see what we can achieve.

The new code for this test is as follows:

load.js:

const buf = read('mul.wasm', 'binary');
const mod = new WebAssembly.Module(buf);

var imports = {
  env: {
    log_time: function(arg) {
      console.log("Time elapsed: " + arg + " ms");
    },
    performance_now: function() {
      return performance.now();
    },
  }
};

const instance = new WebAssembly.Instance(mod, imports);
const { do_mul } = instance.exports;

do_mul(100);

mul_wasm.c:

#include <stdint.h>
#include <stddef.h>

extern double performance_now(void);
extern void log_time(double);

#define SIZE 10000000

int a[SIZE];
int b[SIZE];
int c[SIZE];

void mul(int *restrict a, int *restrict b, int *restrict c)
{
  for (size_t i = 0; i < SIZE; i++)
    a[i] = b[i] * c[i];
}

void do_mul(int iter)
{
  double t0, t1;

  for (size_t i = 0; i < SIZE; i++) {
    b[i] = 50;
    c[i] = 40;
  }

  t0 = performance_now();
  for (int i = 0; i < iter; i++)
    mul(a,b,c);
  t1 = performance_now();

  log_time(t1-t0);
}

The C code is basically identical to the previous one, with the only difference being the JS function calls, necessary to obtain timings.

The above example can be compiled with:

$ clang \
  --target=wasm32 \
  --no-standard-libraries \
  -Wl,--no-entry \
  -Wl,--export=do_mul \
  -Wl,--allow-undefined \
  -g -o mul.wasm mul_wasm.c -O3

This time, I'll be using -O3 to see what WebAssembly can do best...
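
Before looking at the generated code, the Module/Instance flow used by load.js can be tried in isolation with a tiny hand-assembled module. This is only a sketch: the bytes below encode a hypothetical exported function mul(i32, i32) -> i32, written out by hand for illustration, and they have nothing to do with the mul.wasm produced by Clang.

```javascript
// Hand-assembled WebAssembly module exporting mul(i32, i32) -> i32.
const bytes = new Uint8Array([
  0x00, 0x61, 0x73, 0x6d, 0x01, 0x00, 0x00, 0x00,        // "\0asm", version 1
  0x01, 0x07, 0x01, 0x60, 0x02, 0x7f, 0x7f, 0x01, 0x7f,  // type: (i32, i32) -> i32
  0x03, 0x02, 0x01, 0x00,                                // one function of type 0
  0x07, 0x07, 0x01, 0x03, 0x6d, 0x75, 0x6c, 0x00, 0x00,  // export "mul" = func 0
  0x0a, 0x09, 0x01, 0x07, 0x00,                          // code section, no locals
  0x20, 0x00, 0x20, 0x01, 0x6c, 0x0b,                    // local.get 0/1, i32.mul, end
]);

// Same synchronous API that load.js uses, minus the imports object contents.
const mod = new WebAssembly.Module(bytes);
const instance = new WebAssembly.Instance(mod, {});
const { mul } = instance.exports;

console.log(mul(50, 40));        // 2000
console.log(mul(65536, 65536));  // 0 (the 32-bit product wraps around)
```

Because the types are fixed in the module itself, the engine does not need to guess whether the operands are integers, which is exactly the kind of guard the JS version had to carry.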

Extracting WebAssembly's JIT Code & Analyzing

Running in GDB, obtaining the WebAssembly JITed ASM, and debugging it requires a procedure similar to the previous one, with slight adaptations. This time, the v8 function responsible for dumping WebAssembly instructions is WasmCode::Disassemble() from the file wasm-code-manager.cc. Specifically, we need to add a break at line 436.

In summary, something like:

$ gdb \
  -ex "set confirm off" \
  -ex "b _start" \
  -ex "r" \
  -ex "b wasm-code-manager.cc:436" \
  -ex "c" \
  --args $HOME/.jsvu/engines/v8-debug/v8-debug \
  --snapshot_blob="$HOME/.jsvu/engines/v8-debug/snapshot_blob.bin" \
  --allow-natives-syntax \
  --print-wasm-code \
  --no-liftoff \
  load.js

Note some new flags: --no-liftoff (to skip Liftoff, the baseline WebAssembly compiler, and force the use of Turbofan) and --print-wasm-code instead of --print-opt-code.

Below is the heavily commented code of the mul() function:

x86_64 code:
--- WebAssembly code ---
name: do_mul
index: 2
kind: wasm function
compiler: TurboFan
Body (size = 3712 = 3680 + 32 padding)
Instructions (size = 3660)

r9 = count
r9 = starts at 0xFFFF_FFFF - 40_000_000

mul:
[snip]
/-> 0x1c4750719e00   600  493b65a0             REX.W cmpq rsp,[r13-0x60]
|   0x1c4750719e04   604  0f860f080000         jna 0x1c475071a619  <+0xe19> (NT)
|   0x1c4750719e0a   60a  458d9900b8c404       leal r11,[r9+0x4c4b800]
|   0x1c4750719e11   611  41baffffffff         movl r10,0xffffffff
|   0x1c4750719e17   617  4d3bda               REX.W cmpq r11,r10
|   0x1c4750719e1a   61a  761d                 jna 0x1c4750719e39  <+0x639> (T) -\
|   0x1c4750719e1c   61c  bf01000000           movl rdi,0x1                      |
|   0x1c4750719e21   621  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719e24   624  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719e28   628  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719e2c   62c  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719e30   630  488b050dfaffff       REX.W movq rax,[rip+0xfffffa0d]   |
|   0x1c4750719e37   637  ffd0                 call rax                          |
|   0x1c4750719e39   639  458da1005e6202       leal r12,[r9+0x2625e00]         <-/
|   0x1c4750719e40   640  41baffffffff         movl r10,0xffffffff
|   0x1c4750719e46   646  4d3be2               REX.W cmpq r12,r10
|   0x1c4750719e49   649  761d                 jna 0x1c4750719e68  <+0x668> (T) -\
|   0x1c4750719e4b   64b  bf01000000           movl rdi,0x1                      |
|   0x1c4750719e50   650  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719e53   653  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719e57   657  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719e5b   65b  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719e5f   65f  488b05def9ffff       REX.W movq rax,[rip+0xfffff9de]   |
|   0x1c4750719e66   666  ffd0                 call rax                          |
|                                                                                |
|                                         r11 = c[idx]                           |
|   0x1c4750719e68   668  468b1c1a             movl r11,[rdx+r11*1]            <-/
|   0x1c4750719e6c   66c  41baffffffff         movl r10,0xffffffff
|
|                                         if (r11 < 32bit)
|   0x1c4750719e72   672  4d3bda               REX.W cmpq r11,r10
|   0x1c4750719e75   675  761d                 jna 0x1c4750719e94  <+0x694> (T) -\
|   0x1c4750719e77   677  bf01000000           movl rdi,0x1                      |
|   0x1c4750719e7c   67c  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719e7f   67f  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719e83   683  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719e87   687  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719e8b   68b  488b05b2f9ffff       REX.W movq rax,[rip+0xfffff9b2]   |
|   0x1c4750719e92   692  ffd0                 call rax                          |
|                                                                                |
|                                         r12 = b[idx]                           |
|   0x1c4750719e94   694  468b2422             movl r12,[rdx+r12*1]            <-/
|   0x1c4750719e98   698  41baffffffff         movl r10,0xffffffff
|
|                                         if (r12 < 32bit)
|   0x1c4750719e9e   69e  4d3be2               REX.W cmpq r12,r10
|   0x1c4750719ea1   6a1  761d                 jna 0x1c4750719ec0  <+0x6c0> (T) -\
|   0x1c4750719ea3   6a3  bf01000000           movl rdi,0x1                      |
|   0x1c4750719ea8   6a8  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719eab   6ab  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719eaf   6af  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719eb3   6b3  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719eb7   6b7  488b0586f9ffff       REX.W movq rax,[rip+0xfffff986]   |
|   0x1c4750719ebe   6be  ffd0                 call rax                          |
|   0x1c4750719ec0   6c0  458db900122707       leal r15,[r9+0x7271200]         <-/
|   0x1c4750719ec7   6c7  41baffffffff         movl r10,0xffffffff
|   0x1c4750719ecd   6cd  4d3bfa               REX.W cmpq r15,r10
|   0x1c4750719ed0   6d0  761d                 jna 0x1c4750719eef  <+0x6ef> (T) -\
|   0x1c4750719ed2   6d2  bf01000000           movl rdi,0x1                      |
|   0x1c4750719ed7   6d7  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719eda   6da  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719ede   6de  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719ee2   6e2  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719ee6   6e6  488b0557f9ffff       REX.W movq rax,[rip+0xfffff957]   |
|   0x1c4750719eed   6ed  ffd0                 call rax                          |
|                                                                                |
|                                                                                |
|                                 >>>>    r11 = r11*r12   <<<<                   |
|   0x1c4750719eef   6ef  450fafdc             imull r11,r12                   <-/
|
|   0x1c4750719ef3   6f3  458da104b8c404       leal r12,[r9+0x4c4b804]
|   0x1c4750719efa   6fa  41baffffffff         movl r10,0xffffffff
|   0x1c4750719f00   700  4d3be2               REX.W cmpq r12,r10
|   0x1c4750719f03   703  761d                 jna 0x1c4750719f22  <+0x722> (T) -\
|   0x1c4750719f05   705  bf01000000           movl rdi,0x1                      |
|   0x1c4750719f0a   70a  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719f0d   70d  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719f11   711  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719f15   715  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719f19   719  488b0524f9ffff       REX.W movq rax,[rip+0xfffff924]   |
|   0x1c4750719f20   720  ffd0                 call rax                          |
|                                                                                |
|                                         a[idx] = r11                           |
|   0x1c4750719f22   722  46891c3a             movl [rdx+r15*1],r11            <-/
|
|   0x1c4750719f26   726  458d99045e6202       leal r11,[r9+0x2625e04]
|   0x1c4750719f2d   72d  41baffffffff         movl r10,0xffffffff
|   0x1c4750719f33   733  4d3bda               REX.W cmpq r11,r10
|   0x1c4750719f36   736  761d                 jna 0x1c4750719f55  <+0x755> (T) -\
|   0x1c4750719f38   738  bf01000000           movl rdi,0x1                      |
|   0x1c4750719f3d   73d  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719f40   740  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719f44   744  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719f48   748  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719f4c   74c  488b05f1f8ffff       REX.W movq rax,[rip+0xfffff8f1]   |
|   0x1c4750719f53   753  ffd0                 call rax                          |
|                                                                                |
|                                         r12 = c[idx]                           |
|   0x1c4750719f55   755  468b2422             movl r12,[rdx+r12*1]            <-/
|   0x1c4750719f59   759  41baffffffff         movl r10,0xffffffff
|   0x1c4750719f5f   75f  4d3be2               REX.W cmpq r12,r10
|   0x1c4750719f62   762  761d                 jna 0x1c4750719f81  <+0x781> (T) -\
|   0x1c4750719f64   764  bf01000000           movl rdi,0x1                      |
|   0x1c4750719f69   769  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719f6c   76c  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719f70   770  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719f74   774  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719f78   778  488b05c5f8ffff       REX.W movq rax,[rip+0xfffff8c5]   |
|   0x1c4750719f7f   77f  ffd0                 call rax                          |
|   0x1c4750719f81   781  468b1c1a             movl r11,[rdx+r11*1]            <-/
|   0x1c4750719f85   785  41baffffffff         movl r10,0xffffffff
|   0x1c4750719f8b   78b  4d3bda               REX.W cmpq r11,r10
|   0x1c4750719f8e   78e  761d                 jna 0x1c4750719fad  <+0x7ad> (T) -\
|   0x1c4750719f90   790  bf01000000           movl rdi,0x1                      |
|   0x1c4750719f95   795  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719f98   798  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719f9c   79c  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719fa0   7a0  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719fa4   7a4  488b0599f8ffff       REX.W movq rax,[rip+0xfffff899]   |
|   0x1c4750719fab   7ab  ffd0                 call rax                          |
|   0x1c4750719fad   7ad  458db904122707       leal r15,[r9+0x7271204]         <-/
|   0x1c4750719fb4   7b4  41baffffffff         movl r10,0xffffffff
|   0x1c4750719fba   7ba  4d3bfa               REX.W cmpq r15,r10
|   0x1c4750719fbd   7bd  761d                 jna 0x1c4750719fdc  <+0x7dc> (T) -\
|   0x1c4750719fbf   7bf  bf01000000           movl rdi,0x1                      |
|   0x1c4750719fc4   7c4  4989e2               REX.W movq r10,rsp                |
|   0x1c4750719fc7   7c7  4883ec08             REX.W subq rsp,0x8                |
|   0x1c4750719fcb   7cb  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c4750719fcf   7cf  4c891424             REX.W movq [rsp],r10              |
|   0x1c4750719fd3   7d3  488b056af8ffff       REX.W movq rax,[rip+0xfffff86a]   |
|   0x1c4750719fda   7da  ffd0                 call rax                          |
|                                                                                |
|                                 >>>>    r12 = r12*r11   <<<<                   |
|   0x1c4750719fdc   7dc  450fafe3             imull r12,r11                   <-/
|
|                                         a[idx] = r12
|   0x1c4750719fe0   7e0  4689243a             movl [rdx+r15*1],r12
|
|                                         count += 8
|   0x1c4750719fe4   7e4  4183c108             addl r9,0x8
|                                         if (count == 0) (wraparound)
|   0x1c4750719fe8   7e8  0f84c8030000         jz 0x1c475071a3b6  <+0xbb6> (NT)
|
|   0x1c4750719fee   7ee  458d9900b8c404       leal r11,[r9+0x4c4b800]
|   0x1c4750719ff5   7f5  41baffffffff         movl r10,0xffffffff
|   0x1c4750719ffb   7fb  4d3bda               REX.W cmpq r11,r10
|   0x1c4750719ffe   7fe  761d                 jna 0x1c475071a01d  <+0x81d> (T) -\
|   0x1c475071a000   800  bf01000000           movl rdi,0x1                      |
|   0x1c475071a005   805  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a008   808  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a00c   80c  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a010   810  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a014   814  488b0529f8ffff       REX.W movq rax,[rip+0xfffff829]   |
|   0x1c475071a01b   81b  ffd0                 call rax                          |
|   0x1c475071a01d   81d  458da1005e6202       leal r12,[r9+0x2625e00]         <-/
|   0x1c475071a024   824  41baffffffff         movl r10,0xffffffff
|   0x1c475071a02a   82a  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a02d   82d  761d                 jna 0x1c475071a04c  <+0x84c> (T) -\
|   0x1c475071a02f   82f  bf01000000           movl rdi,0x1                      |
|   0x1c475071a034   834  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a037   837  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a03b   83b  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a03f   83f  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a043   843  488b05faf7ffff       REX.W movq rax,[rip+0xfffff7fa]   |
|   0x1c475071a04a   84a  ffd0                 call rax                          |
|                                                                                |
|                                         r11 = c[idx]                           |
|   0x1c475071a04c   84c  468b1c1a             movl r11,[rdx+r11*1]            <-/
|
|   0x1c475071a050   850  41baffffffff         movl r10,0xffffffff
|   0x1c475071a056   856  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a059   859  761d                 jna 0x1c475071a078  <+0x878> (T) -\
|   0x1c475071a05b   85b  bf01000000           movl rdi,0x1                      |
|   0x1c475071a060   860  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a063   863  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a067   867  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a06b   86b  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a06f   86f  488b05cef7ffff       REX.W movq rax,[rip+0xfffff7ce]   |
|   0x1c475071a076   876  ffd0                 call rax                          |
|                                                                                |
|                                         r12 = b[idx]                           |
|   0x1c475071a078   878  468b2422             movl r12,[rdx+r12*1]            <-/
|
|   0x1c475071a07c   87c  41baffffffff         movl r10,0xffffffff
|   0x1c475071a082   882  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a085   885  761d                 jna 0x1c475071a0a4  <+0x8a4> (T) -\
|   0x1c475071a087   887  bf01000000           movl rdi,0x1                      |
|   0x1c475071a08c   88c  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a08f   88f  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a093   893  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a097   897  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a09b   89b  488b05a2f7ffff       REX.W movq rax,[rip+0xfffff7a2]   |
|   0x1c475071a0a2   8a2  ffd0                 call rax                          |
|   0x1c475071a0a4   8a4  458db900122707       leal r15,[r9+0x7271200]         <-/
|   0x1c475071a0ab   8ab  41baffffffff         movl r10,0xffffffff
|   0x1c475071a0b1   8b1  4d3bfa               REX.W cmpq r15,r10
|   0x1c475071a0b4   8b4  761d                 jna 0x1c475071a0d3  <+0x8d3> (T) -\
|   0x1c475071a0b6   8b6  bf01000000           movl rdi,0x1                      |
|   0x1c475071a0bb   8bb  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a0be   8be  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a0c2   8c2  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a0c6   8c6  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a0ca   8ca  488b0573f7ffff       REX.W movq rax,[rip+0xfffff773]   |
|   0x1c475071a0d1   8d1  ffd0                 call rax                          |
|                                                                                |
|                                 >>>>    r11 = r11*r12   <<<<                   |
|   0x1c475071a0d3   8d3  450fafdc             imull r11,r12                   <-/
|   0x1c475071a0d7   8d7  458da104b8c404       leal r12,[r9+0x4c4b804]
|   0x1c475071a0de   8de  41baffffffff         movl r10,0xffffffff
|   0x1c475071a0e4   8e4  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a0e7   8e7  761d                 jna 0x1c475071a106  <+0x906> (T) -\
|   0x1c475071a0e9   8e9  bf01000000           movl rdi,0x1                      |
|   0x1c475071a0ee   8ee  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a0f1   8f1  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a0f5   8f5  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a0f9   8f9  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a0fd   8fd  488b0540f7ffff       REX.W movq rax,[rip+0xfffff740]   |
|   0x1c475071a104   904  ffd0                 call rax                          |
|                                                                                |
|                                         a[idx] = r11                           |
|   0x1c475071a106   906  46891c3a             movl [rdx+r15*1],r11            <-/
|
|   0x1c475071a10a   90a  458d99045e6202       leal r11,[r9+0x2625e04]
|   0x1c475071a111   911  41baffffffff         movl r10,0xffffffff
|   0x1c475071a117   917  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a11a   91a  761d                 jna 0x1c475071a139  <+0x939> (T) -\
|   0x1c475071a11c   91c  bf01000000           movl rdi,0x1                      |
|   0x1c475071a121   921  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a124   924  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a128   928  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a12c   92c  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a130   930  488b050df7ffff       REX.W movq rax,[rip+0xfffff70d]   |
|   0x1c475071a137   937  ffd0                 call rax                          |
|                                                                                |
|                                         r12 = c[idx]                           |
|   0x1c475071a139   939  468b2422             movl r12,[rdx+r12*1]            <-/
|
|   0x1c475071a13d   93d  41baffffffff         movl r10,0xffffffff
|   0x1c475071a143   943  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a146   946  761d                 jna 0x1c475071a165  <+0x965> (T) -\
|   0x1c475071a148   948  bf01000000           movl rdi,0x1                      |
|   0x1c475071a14d   94d  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a150   950  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a154   954  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a158   958  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a15c   95c  488b05e1f6ffff       REX.W movq rax,[rip+0xfffff6e1]   |
|   0x1c475071a163   963  ffd0                 call rax                          |
|                                                                                |
|                                         r11 = b[idx]                           |
|   0x1c475071a165   965  468b1c1a             movl r11,[rdx+r11*1]            <-/
|
|   0x1c475071a169   969  41baffffffff         movl r10,0xffffffff
|   0x1c475071a16f   96f  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a172   972  761d                 jna 0x1c475071a191  <+0x991> (T) -\
|   0x1c475071a174   974  bf01000000           movl rdi,0x1                      |
|   0x1c475071a179   979  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a17c   97c  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a180   980  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a184   984  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a188   988  488b05b5f6ffff       REX.W movq rax,[rip+0xfffff6b5]   |
|   0x1c475071a18f   98f  ffd0                 call rax                          |
|   0x1c475071a191   991  458db904122707       leal r15,[r9+0x7271204]         <-/
|   0x1c475071a198   998  41baffffffff         movl r10,0xffffffff
|   0x1c475071a19e   99e  4d3bfa               REX.W cmpq r15,r10
|   0x1c475071a1a1   9a1  761d                 jna 0x1c475071a1c0  <+0x9c0> (T) -\
|   0x1c475071a1a3   9a3  bf01000000           movl rdi,0x1                      |
|   0x1c475071a1a8   9a8  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a1ab   9ab  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a1af   9af  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a1b3   9b3  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a1b7   9b7  488b0586f6ffff       REX.W movq rax,[rip+0xfffff686]   |
|   0x1c475071a1be   9be  ffd0                 call rax                          |
|                                                                                |
|                                 >>>>    r12 = r12*r11   <<<<                   |
|   0x1c475071a1c0   9c0  450fafe3             imull r12,r11                   <-/
|                                         a[idx] = r12
|   0x1c475071a1c4   9c4  4689243a             movl [rdx+r15*1],r12
|
|                                         count += 8
|   0x1c475071a1c8   9c8  4183c108             addl r9,0x8
|                                         if (count == 0) (wraparound)
|   0x1c475071a1cc   9cc  0f84e4010000         jz 0x1c475071a3b6  <+0xbb6> (NT)
|
|   0x1c475071a1d2   9d2  458d9900b8c404       leal r11,[r9+0x4c4b800]
|   0x1c475071a1d9   9d9  41baffffffff         movl r10,0xffffffff
|   0x1c475071a1df   9df  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a1e2   9e2  761d                 jna 0x1c475071a201  <+0xa01> (T) -\
|   0x1c475071a1e4   9e4  bf01000000           movl rdi,0x1                      |
|   0x1c475071a1e9   9e9  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a1ec   9ec  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a1f0   9f0  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a1f4   9f4  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a1f8   9f8  488b0545f6ffff       REX.W movq rax,[rip+0xfffff645]   |
|   0x1c475071a1ff   9ff  ffd0                 call rax                          |
|   0x1c475071a201   a01  458da1005e6202       leal r12,[r9+0x2625e00]         <-/
|   0x1c475071a208   a08  41baffffffff         movl r10,0xffffffff
|   0x1c475071a20e   a0e  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a211   a11  761d                 jna 0x1c475071a230  <+0xa30> (T) -\
|   0x1c475071a213   a13  bf01000000           movl rdi,0x1                      |
|   0x1c475071a218   a18  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a21b   a1b  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a21f   a1f  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a223   a23  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a227   a27  488b0516f6ffff       REX.W movq rax,[rip+0xfffff616]   |
|   0x1c475071a22e   a2e  ffd0                 call rax                          |
|                                                                                |
|                                         r11 = c[idx]                           |
|   0x1c475071a230   a30  468b1c1a             movl r11,[rdx+r11*1]            <-/
|
|   0x1c475071a234   a34  41baffffffff         movl r10,0xffffffff
|   0x1c475071a23a   a3a  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a23d   a3d  761d                 jna 0x1c475071a25c  <+0xa5c> (T) -\
|   0x1c475071a23f   a3f  bf01000000           movl rdi,0x1                      |
|   0x1c475071a244   a44  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a247   a47  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a24b   a4b  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a24f   a4f  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a253   a53  488b05eaf5ffff       REX.W movq rax,[rip+0xfffff5ea]   |
|   0x1c475071a25a   a5a  ffd0                 call rax                          |
|                                                                                |
|                                         r12 = b[idx]                           |
|   0x1c475071a25c   a5c  468b2422             movl r12,[rdx+r12*1]            <-/
|
|   0x1c475071a260   a60  41baffffffff         movl r10,0xffffffff
|   0x1c475071a266   a66  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a269   a69  761d                 jna 0x1c475071a288  <+0xa88> (T) -\
|   0x1c475071a26b   a6b  bf01000000           movl rdi,0x1                      |
|   0x1c475071a270   a70  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a273   a73  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a277   a77  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a27b   a7b  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a27f   a7f  488b05bef5ffff       REX.W movq rax,[rip+0xfffff5be]   |
|   0x1c475071a286   a86  ffd0                 call rax                          |
|   0x1c475071a288   a88  458db900122707       leal r15,[r9+0x7271200]         <-/
|   0x1c475071a28f   a8f  41baffffffff         movl r10,0xffffffff
|   0x1c475071a295   a95  4d3bfa               REX.W cmpq r15,r10
|   0x1c475071a298   a98  761d                 jna 0x1c475071a2b7  <+0xab7> (T) -\
|   0x1c475071a29a   a9a  bf01000000           movl rdi,0x1                      |
|   0x1c475071a29f   a9f  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a2a2   aa2  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a2a6   aa6  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a2aa   aaa  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a2ae   aae  488b058ff5ffff       REX.W movq rax,[rip+0xfffff58f]   |
|   0x1c475071a2b5   ab5  ffd0                 call rax                          |
|                                                                                |
|                                 >>>>    r11 = r11*r12   <<<<                   |
|   0x1c475071a2b7   ab7  450fafdc             imull r11,r12                   <-/
|
|   0x1c475071a2bb   abb  458da104b8c404       leal r12,[r9+0x4c4b804]
|   0x1c475071a2c2   ac2  41baffffffff         movl r10,0xffffffff
|   0x1c475071a2c8   ac8  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a2cb   acb  761d                 jna 0x1c475071a2ea  <+0xaea> (T) -\
|   0x1c475071a2cd   acd  bf01000000           movl rdi,0x1                      |
|   0x1c475071a2d2   ad2  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a2d5   ad5  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a2d9   ad9  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a2dd   add  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a2e1   ae1  488b055cf5ffff       REX.W movq rax,[rip+0xfffff55c]   |
|   0x1c475071a2e8   ae8  ffd0                 call rax                          |
|                                                                                |
|                                         a[idx] = r11                           |
|   0x1c475071a2ea   aea  46891c3a             movl [rdx+r15*1],r11            <-/
|
|   0x1c475071a2ee   aee  458d99045e6202       leal r11,[r9+0x2625e04]
|   0x1c475071a2f5   af5  41baffffffff         movl r10,0xffffffff
|   0x1c475071a2fb   afb  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a2fe   afe  761d                 jna 0x1c475071a31d  <+0xb1d> (T) -\
|   0x1c475071a300   b00  bf01000000           movl rdi,0x1                      |
|   0x1c475071a305   b05  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a308   b08  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a30c   b0c  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a310   b10  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a314   b14  488b0529f5ffff       REX.W movq rax,[rip+0xfffff529]   |
|   0x1c475071a31b   b1b  ffd0                 call rax                          |
|                                                                                |
|                                         r12 = c[idx]                           |
|   0x1c475071a31d   b1d  468b2422             movl r12,[rdx+r12*1]            <-/
|
|   0x1c475071a321   b21  41baffffffff         movl r10,0xffffffff
|   0x1c475071a327   b27  4d3be2               REX.W cmpq r12,r10
|   0x1c475071a32a   b2a  761d                 jna 0x1c475071a349  <+0xb49> (T) -\
|   0x1c475071a32c   b2c  bf01000000           movl rdi,0x1                      |
|   0x1c475071a331   b31  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a334   b34  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a338   b38  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a33c   b3c  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a340   b40  488b05fdf4ffff       REX.W movq rax,[rip+0xfffff4fd]   |
|   0x1c475071a347   b47  ffd0                 call rax                          |
|                                                                                |
|                                         r11 = b[idx]                           |
|   0x1c475071a349   b49  468b1c1a             movl r11,[rdx+r11*1]            <-/
|
|   0x1c475071a34d   b4d  41baffffffff         movl r10,0xffffffff
|   0x1c475071a353   b53  4d3bda               REX.W cmpq r11,r10
|   0x1c475071a356   b56  761d                 jna 0x1c475071a375  <+0xb75> (T) -\
|   0x1c475071a358   b58  bf01000000           movl rdi,0x1                      |
|   0x1c475071a35d   b5d  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a360   b60  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a364   b64  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a368   b68  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a36c   b6c  488b05d1f4ffff       REX.W movq rax,[rip+0xfffff4d1]   |
|   0x1c475071a373   b73  ffd0                 call rax                          |
|   0x1c475071a375   b75  458db904122707       leal r15,[r9+0x7271204]         <-/
|   0x1c475071a37c   b7c  41baffffffff         movl r10,0xffffffff
|   0x1c475071a382   b82  4d3bfa               REX.W cmpq r15,r10
|   0x1c475071a385   b85  761d                 jna 0x1c475071a3a4  <+0xba4> (T) -\
|   0x1c475071a387   b87  bf01000000           movl rdi,0x1                      |
|   0x1c475071a38c   b8c  4989e2               REX.W movq r10,rsp                |
|   0x1c475071a38f   b8f  4883ec08             REX.W subq rsp,0x8                |
|   0x1c475071a393   b93  4883e4f0             REX.W andq rsp,0xf0               |
|   0x1c475071a397   b97  4c891424             REX.W movq [rsp],r10              |
|   0x1c475071a39b   b9b  488b05a2f4ffff       REX.W movq rax,[rip+0xfffff4a2]   |
|   0x1c475071a3a2   ba2  ffd0                 call rax                          |
|                                                                                |
|                                 >>>>    r12 = r12*r11   <<<<                   |
|   0x1c475071a3a4   ba4  450fafe3             imull r12,r11                   <-/
|                                         a[idx] = r12
|   0x1c475071a3a8   ba8  4689243a             movl [rdx+r15*1],r12
|
|                                         count += 8
|   0x1c475071a3ac   bac  4183c108             addl r9,0x8
|                                         if (count != 0)
\-  0x1c475071a3b0   bb0  0f854afaffff         jnz 0x1c4750719e00  <+0x600>  (T)
[snip]

Although the generated code is long, what it does is simple: the function was inlined and the loop was unrolled, performing 6 multiplications per iteration, with an average of ~22 instructions and 5 jumps between consecutive imul instructions. In practice, unrolling might provide some benefit, but apart from that, it is essentially the same ASM that the JITed JS produced, and it certainly would not yield significant gains over the best code emitted by GCC!
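To make the shape of that loop easier to follow, here is a minimal JavaScript sketch of how I read one unrolled step of the dump: a negative byte counter that runs up to zero, three array "end" addresses, and two multiplications per `count += 8`. Everything in it (`N`, `fill`, `mulUnrolled`, the tiny buffer) is a hypothetical stand-in, not something taken from V8, and `DataView` range checks merely play the role of the bounds checks seen before each access.

```javascript
// Hypothetical model of the unrolled loop; sizes and names are illustrative only.
const N = 16;                          // elements per array (must be even here)
const BYTES = N * 4;                   // one Int32 array, in bytes
const view = new DataView(new ArrayBuffer(3 * BYTES));

// Assumed layout in "linear memory": b, then c, then a.
// Each constant is the END address of its array, as in the leal instructions.
const B_END = 1 * BYTES;
const C_END = 2 * BYTES;
const A_END = 3 * BYTES;

function fill() {
  for (let i = 0; i < N; i++) {
    view.setInt32(i * 4, i + 1, true);         // b[i] = i + 1
    view.setInt32(BYTES + i * 4, 2, true);     // c[i] = 2
  }
}

function mulUnrolled() {
  let imuls = 0;
  // count is a negative byte offset; the loop ends when it reaches zero.
  for (let count = -BYTES; count !== 0; count += 8) {
    // Element at byte offset +0.
    let c = view.getInt32(C_END + count, true);
    let b = view.getInt32(B_END + count, true);
    view.setInt32(A_END + count, Math.imul(c, b), true);

    // Element at byte offset +4.
    c = view.getInt32(C_END + count + 4, true);
    b = view.getInt32(B_END + count + 4, true);
    view.setInt32(A_END + count + 4, Math.imul(c, b), true);

    imuls += 2;
  }
  return imuls;
}

fill();
console.log("scalar multiplications:", mulUnrolled());
```

The real code repeats this step three times per loop iteration, which is where the 6 multiplications come from. Even so, there is still exactly one multiplication per element, so unrolling alone cannot reduce the amount of arithmetic.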

Is it the end? Have our options run out? No!

If WebAssembly Supported SIMD... Oh Wait, It Does!

WebAssembly does support SIMD instructions, and the hope is that feeding these vector instructions to TurboFan makes the JITed code vectorized as well.

By default, Clang does not emit SIMD instructions for wasm, even in a build with -O3; they have to be enabled explicitly with the -msimd128 flag. With it, Clang is allowed to vectorize the loop and emit 128-bit vector instructions in your wasm.
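Before relying on a SIMD build, it can be handy to check whether the engine at hand accepts v128 code at all. The sketch below is one way to do that, assuming a hand-assembled probe module (`SIMD_PROBE` and `hasWasmSimd` are names I made up): it asks `WebAssembly.validate` whether a tiny function returning a v128 is valid.

```javascript
// A tiny wasm module whose single function uses v128 instructions.
// The bytes are a hand-assembled probe (an assumption, not part of the article's build).
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0,   // "\0asm" magic, version 1
  1, 5, 1, 96, 0, 1, 123,        // type section: () -> v128
  3, 2, 1, 0,                    // function section: one function of type 0
  10, 10, 1, 8, 0,               // code section: one body, 8 bytes, no locals
  65, 0,                         // i32.const 0
  253, 15,                       // i8x16.splat
  253, 98,                       // i8x16.popcnt
  11,                            // end
]);

// Returns true when the engine validates the SIMD probe, false otherwise.
function hasWasmSimd() {
  return typeof WebAssembly === "object" && WebAssembly.validate(SIMD_PROBE);
}

console.log("wasm SIMD accepted:", hasWasmSimd());
```

Validation only says the engine accepts the instructions; it says nothing about which x86_64 instructions the JIT ends up emitting for them.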

Rebuild your code with:

$ clang \
  --target=wasm32 \
  --no-standard-libraries \
  -Wl,--no-entry \
  -Wl,--export=do_mul \
  -Wl,--allow-undefined \
  -g -o mul.wasm mul_wasm.c -O3 -msimd128

Which produces WebAssembly code like:

$ wasm2wat mul.wasm
(module
  (type (;0;) (func (result f64)))
  (type (;1;) (func (param f64)))
  (type (;2;) (func (param i32)))
  (import "env" "performance_now" (func $performance_now (type 0)))
  (import "env" "log_time" (func $log_time (type 1)))
  (func $do_mul (type 2) (param i32)
    (local i32 v128 v128 f64 i32)

    [snip]

    call $performance_now
    local.set 4
    block  ;; label = @1
      local.get 0
      i32.const 1
      i32.lt_s
      br_if 0 (;@1;)
      i32.const 0
      local.set 5
      loop  ;; label = @2
        i32.const -40000000
        local.set 1
        loop  ;; label = @3
          local.get 1
          i32.const 120001024
          i32.add
          local.get 1
          i32.const 80001024
          i32.add
          v128.load             <<<< SIMD
          local.get 1
          i32.const 40001024
          i32.add
          v128.load             <<<< SIMD
          i32x4.mul             <<<< SIMD
          v128.store            <<<< SIMD
          local.get 1
          i32.const 120001040
          i32.add
          local.get 1
          i32.const 80001040
          i32.add
          v128.load             <<<< SIMD
          local.get 1
          i32.const 40001040
          i32.add
          v128.load             <<<< SIMD
          i32x4.mul             <<<< SIMD
          v128.store            <<<< SIMD
          local.get 1
          i32.const 32
          i32.add
          local.tee 1
          br_if 0 (;@3;)
        end
        local.get 5
        i32.const 1
        i32.add
        local.tee 5
        local.get 0
        i32.ne
        br_if 0 (;@2;)
      end
    end
    call $performance_now

    [snip]
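As a reading aid for the inner loop above, here is a small JavaScript model of what `i32x4.mul` computes and of the loop's stride: two 16-byte vectors, that is 8 elements or 32 bytes, per iteration. The helper names (`i32x4Mul`, `mulVectorModel`) are hypothetical, and the model assumes the array length is a multiple of 8.

```javascript
// Scalar model of i32x4.mul: four independent 32-bit wrapping multiplications.
function i32x4Mul(x, y) {
  const out = new Int32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    out[lane] = Math.imul(x[lane], y[lane]);
  }
  return out;
}

// Model of the wat inner loop: two vector multiplies per iteration.
// Assumes a, b and c are Int32Arrays of equal length, a multiple of 8.
function mulVectorModel(a, b, c) {
  let vectorMuls = 0;
  for (let i = 0; i < a.length; i += 8) {
    a.set(i32x4Mul(b.subarray(i, i + 4), c.subarray(i, i + 4)), i);
    a.set(i32x4Mul(b.subarray(i + 4, i + 8), c.subarray(i + 4, i + 8)), i + 4);
    vectorMuls += 2;
  }
  return vectorMuls;
}

const b = Int32Array.from({ length: 16 }, (_, i) => i + 1);
const c = new Int32Array(16).fill(3);
const a = new Int32Array(16);
console.log("vector multiplications:", mulVectorModel(a, b, c));
```

With 40,000,000 bytes per array (10,000,000 32-bit elements), this shape needs 2,500,000 vector multiplications where the scalar version needs 10,000,000 `imul`s.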

Of course, SIMD instructions in the wasm do not guarantee that the V8 JIT will emit SIMD for x86_64, nor do they tell us what the equivalent x86_64 code would look like. So, let's see what this actually generates:

x86_64 code:
--- WebAssembly code ---
name: do_mul
index: 2
kind: wasm function
compiler: TurboFan
Body (size = 2880 = 2868 + 12 padding)
Instructions (size = 2848)

r9 = count
r9 = starts at -40_000_000 (0x1_0000_0000 - 40_000_000 as an unsigned 32-bit value)

mul:
[snip]
/-> 0x3ce639a41c80   480  493b65a0             REX.W cmpq rsp,[r13-0x60]
|   0x3ce639a41c84   484  0f8666060000         jna 0x3ce639a422f0  <+0xaf0> (NT)
|   0x3ce639a41c8a   48a  458d9900b8c404       leal r11,[r9+0x4c4b800]
|   0x3ce639a41c91   491  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41c97   497  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41c9a   49a  761d                 jna 0x3ce639a41cb9  <+0x4b9> (T) -\
|   0x3ce639a41c9c   49c  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41ca1   4a1  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41ca4   4a4  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41ca8   4a8  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41cac   4ac  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41cb0   4b0  488b058dfbffff       REX.W movq rax,[rip+0xfffffb8d]   |
|   0x3ce639a41cb7   4b7  ffd0                 call rax                          |
|   0x3ce639a41cb9   4b9  458da1005e6202       leal r12,[r9+0x2625e00]         <-/
|   0x3ce639a41cc0   4c0  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41cc6   4c6  4d3be2               REX.W cmpq r12,r10
|   0x3ce639a41cc9   4c9  761d                 jna 0x3ce639a41ce8  <+0x4e8> (T) -\
|   0x3ce639a41ccb   4cb  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41cd0   4d0  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41cd3   4d3  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41cd7   4d7  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41cdb   4db  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41cdf   4df  488b055efbffff       REX.W movq rax,[rip+0xfffffb5e]   |
|   0x3ce639a41ce6   4e6  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = c[idx]                          |
|   0x3ce639a41ce8   4e8  c4a17a6f041a         vmovdqu xmm0,[rdx+r11*1]        <-/
|                                         xmm2 = b[idx]
|   0x3ce639a41cee   4ee  c4a17a6f1422         vmovdqu xmm2,[rdx+r12*1]
|
|   0x3ce639a41cf4   4f4  458d9900122707       leal r11,[r9+0x7271200]
|   0x3ce639a41cfb   4fb  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41d01   501  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41d04   504  761d                 jna 0x3ce639a41d23  <+0x523> (T) -\
|   0x3ce639a41d06   506  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41d0b   50b  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41d0e   50e  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41d12   512  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41d16   516  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41d1a   51a  488b0523fbffff       REX.W movq rax,[rip+0xfffffb23]   |
|   0x3ce639a41d21   521  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = xmm0*xmm2                       |
|   0x3ce639a41d23   523  c4e27940c2           vpmulld xmm0,xmm0,xmm2          <-/
|   0x3ce639a41d28   528  458da110b8c404       leal r12,[r9+0x4c4b810]
|   0x3ce639a41d2f   52f  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41d35   535  4d3be2               REX.W cmpq r12,r10
|   0x3ce639a41d38   538  761d                 jna 0x3ce639a41d57  <+0x557> (T) -\
|   0x3ce639a41d3a   53a  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41d3f   53f  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41d42   542  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41d46   546  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41d4a   54a  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41d4e   54e  488b05effaffff       REX.W movq rax,[rip+0xfffffaef]   |
|   0x3ce639a41d55   555  ffd0                 call rax                          |
|                                                                                |
|                                         a[idx] = xmm0                          |
|   0x3ce639a41d57   557  c4a17a7f041a         vmovdqu [rdx+r11*1],xmm0        <-/
|
|   0x3ce639a41d5d   55d  458d99105e6202       leal r11,[r9+0x2625e10]
|   0x3ce639a41d64   564  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41d6a   56a  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41d6d   56d  761d                 jna 0x3ce639a41d8c  <+0x58c> (T) -\
|   0x3ce639a41d6f   56f  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41d74   574  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41d77   577  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41d7b   57b  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41d7f   57f  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41d83   583  488b05bafaffff       REX.W movq rax,[rip+0xfffffaba]   |
|   0x3ce639a41d8a   58a  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = c[idx]                          |
|   0x3ce639a41d8c   58c  c4a17a6f0422         vmovdqu xmm0,[rdx+r12*1]        <-/
|                                         xmm2 = b[idx]
|   0x3ce639a41d92   592  c4a17a6f141a         vmovdqu xmm2,[rdx+r11*1]
|
|   0x3ce639a41d98   598  458d9910122707       leal r11,[r9+0x7271210]
|   0x3ce639a41d9f   59f  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41da5   5a5  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41da8   5a8  761d                 jna 0x3ce639a41dc7  <+0x5c7> (T) -\
|   0x3ce639a41daa   5aa  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41daf   5af  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41db2   5b2  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41db6   5b6  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41dba   5ba  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41dbe   5be  488b057ffaffff       REX.W movq rax,[rip+0xfffffa7f]   |
|   0x3ce639a41dc5   5c5  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = xmm0*xmm2                       |
|   0x3ce639a41dc7   5c7  c4e27940c2           vpmulld xmm0,xmm0,xmm2          <-/
|                                         a[idx] = xmm0
|   0x3ce639a41dcc   5cc  c4a17a7f041a         vmovdqu [rdx+r11*1],xmm0
|
|                                         count += 32
|   0x3ce639a41dd2   5d2  4183c120             addl r9,0x20
|                                         if (count == 0) (wraparound)
|   0x3ce639a41dd6   5d6  0f84a4020000         jz 0x3ce639a42080  <+0x880> (NT)
|
|   0x3ce639a41ddc   5dc  458d9900b8c404       leal r11,[r9+0x4c4b800]
|   0x3ce639a41de3   5e3  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41de9   5e9  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41dec   5ec  761d                 jna 0x3ce639a41e0b  <+0x60b> (T) -\
|   0x3ce639a41dee   5ee  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41df3   5f3  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41df6   5f6  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41dfa   5fa  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41dfe   5fe  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41e02   602  488b053bfaffff       REX.W movq rax,[rip+0xfffffa3b]   |
|   0x3ce639a41e09   609  ffd0                 call rax                          |
|   0x3ce639a41e0b   60b  458da1005e6202       leal r12,[r9+0x2625e00]         <-/
|   0x3ce639a41e12   612  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41e18   618  4d3be2               REX.W cmpq r12,r10
|   0x3ce639a41e1b   61b  761d                 jna 0x3ce639a41e3a  <+0x63a> (T) -\
|   0x3ce639a41e1d   61d  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41e22   622  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41e25   625  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41e29   629  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41e2d   62d  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41e31   631  488b050cfaffff       REX.W movq rax,[rip+0xfffffa0c]   |
|   0x3ce639a41e38   638  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = c[idx]                          |
|   0x3ce639a41e3a   63a  c4a17a6f041a         vmovdqu xmm0,[rdx+r11*1]        <-/
|                                         xmm2 = b[idx]
|   0x3ce639a41e40   640  c4a17a6f1422         vmovdqu xmm2,[rdx+r12*1]
|
|   0x3ce639a41e46   646  458d9900122707       leal r11,[r9+0x7271200]
|   0x3ce639a41e4d   64d  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41e53   653  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41e56   656  761d                 jna 0x3ce639a41e75  <+0x675> (T) -\
|   0x3ce639a41e58   658  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41e5d   65d  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41e60   660  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41e64   664  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41e68   668  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41e6c   66c  488b05d1f9ffff       REX.W movq rax,[rip+0xfffff9d1]   |
|   0x3ce639a41e73   673  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = xmm0*xmm2                       |
|   0x3ce639a41e75   675  c4e27940c2           vpmulld xmm0,xmm0,xmm2          <-/
|
|   0x3ce639a41e7a   67a  458da110b8c404       leal r12,[r9+0x4c4b810]
|   0x3ce639a41e81   681  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41e87   687  4d3be2               REX.W cmpq r12,r10
|   0x3ce639a41e8a   68a  761d                 jna 0x3ce639a41ea9  <+0x6a9> (T) -\
|   0x3ce639a41e8c   68c  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41e91   691  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41e94   694  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41e98   698  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41e9c   69c  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41ea0   6a0  488b059df9ffff       REX.W movq rax,[rip+0xfffff99d]   |
|   0x3ce639a41ea7   6a7  ffd0                 call rax                          |
|                                                                                |
|                                         a[idx] = xmm0                          |
|   0x3ce639a41ea9   6a9  c4a17a7f041a         vmovdqu [rdx+r11*1],xmm0        <-/
|
|   0x3ce639a41eaf   6af  458d99105e6202       leal r11,[r9+0x2625e10]
|   0x3ce639a41eb6   6b6  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41ebc   6bc  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41ebf   6bf  761d                 jna 0x3ce639a41ede  <+0x6de> (T) -\
|   0x3ce639a41ec1   6c1  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41ec6   6c6  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41ec9   6c9  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41ecd   6cd  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41ed1   6d1  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41ed5   6d5  488b0568f9ffff       REX.W movq rax,[rip+0xfffff968]   |
|   0x3ce639a41edc   6dc  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = c[idx]                          |
|   0x3ce639a41ede   6de  c4a17a6f0422         vmovdqu xmm0,[rdx+r12*1]        <-/
|                                         xmm2 = b[idx]
|   0x3ce639a41ee4   6e4  c4a17a6f141a         vmovdqu xmm2,[rdx+r11*1]
|
|   0x3ce639a41eea   6ea  458d9910122707       leal r11,[r9+0x7271210]
|   0x3ce639a41ef1   6f1  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41ef7   6f7  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41efa   6fa  761d                 jna 0x3ce639a41f19  <+0x719> (T) -\
|   0x3ce639a41efc   6fc  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41f01   701  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41f04   704  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41f08   708  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41f0c   70c  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41f10   710  488b052df9ffff       REX.W movq rax,[rip+0xfffff92d]   |
|   0x3ce639a41f17   717  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = xmm0*xmm2                       |
|   0x3ce639a41f19   719  c4e27940c2           vpmulld xmm0,xmm0,xmm2          <-/
|                                         a[idx] = xmm0
|   0x3ce639a41f1e   71e  c4a17a7f041a         vmovdqu [rdx+r11*1],xmm0
|
|                                         count += 32
|   0x3ce639a41f24   724  4183c120             addl r9,0x20
|                                         if (count == 0) (wraparound)
|   0x3ce639a41f28   728  0f8452010000         jz 0x3ce639a42080  <+0x880> (NT)
|
|   0x3ce639a41f2e   72e  458d9900b8c404       leal r11,[r9+0x4c4b800]
|   0x3ce639a41f35   735  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41f3b   73b  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41f3e   73e  761d                 jna 0x3ce639a41f5d  <+0x75d> (T) -\
|   0x3ce639a41f40   740  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41f45   745  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41f48   748  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41f4c   74c  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41f50   750  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41f54   754  488b05e9f8ffff       REX.W movq rax,[rip+0xfffff8e9]   |
|   0x3ce639a41f5b   75b  ffd0                 call rax                          |
|   0x3ce639a41f5d   75d  458da1005e6202       leal r12,[r9+0x2625e00]         <-/
|   0x3ce639a41f64   764  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41f6a   76a  4d3be2               REX.W cmpq r12,r10
|   0x3ce639a41f6d   76d  761d                 jna 0x3ce639a41f8c  <+0x78c> (T) -\
|   0x3ce639a41f6f   76f  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41f74   774  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41f77   777  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41f7b   77b  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41f7f   77f  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41f83   783  488b05baf8ffff       REX.W movq rax,[rip+0xfffff8ba]   |
|   0x3ce639a41f8a   78a  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = c[idx]                          |
|   0x3ce639a41f8c   78c  c4a17a6f041a         vmovdqu xmm0,[rdx+r11*1]        <-/
|                                         xmm2 = b[idx]
|   0x3ce639a41f92   792  c4a17a6f1422         vmovdqu xmm2,[rdx+r12*1]
|
|   0x3ce639a41f98   798  458d9900122707       leal r11,[r9+0x7271200]
|   0x3ce639a41f9f   79f  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41fa5   7a5  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a41fa8   7a8  761d                 jna 0x3ce639a41fc7  <+0x7c7> (T) -\
|   0x3ce639a41faa   7aa  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41faf   7af  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41fb2   7b2  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41fb6   7b6  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41fba   7ba  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41fbe   7be  488b057ff8ffff       REX.W movq rax,[rip+0xfffff87f]   |
|   0x3ce639a41fc5   7c5  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = xmm0*xmm2                       |
|   0x3ce639a41fc7   7c7  c4e27940c2           vpmulld xmm0,xmm0,xmm2          <-/
|
|   0x3ce639a41fcc   7cc  458da110b8c404       leal r12,[r9+0x4c4b810]
|   0x3ce639a41fd3   7d3  41baffffffff         movl r10,0xffffffff
|   0x3ce639a41fd9   7d9  4d3be2               REX.W cmpq r12,r10
|   0x3ce639a41fdc   7dc  761d                 jna 0x3ce639a41ffb  <+0x7fb> (T) -\
|   0x3ce639a41fde   7de  bf01000000           movl rdi,0x1                      |
|   0x3ce639a41fe3   7e3  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a41fe6   7e6  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a41fea   7ea  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a41fee   7ee  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a41ff2   7f2  488b054bf8ffff       REX.W movq rax,[rip+0xfffff84b]   |
|   0x3ce639a41ff9   7f9  ffd0                 call rax                          |
|                                                                                |
|                                         a[idx] = xmm0                          |
|   0x3ce639a41ffb   7fb  c4a17a7f041a         vmovdqu [rdx+r11*1],xmm0        <-/
|
|   0x3ce639a42001   801  458d99105e6202       leal r11,[r9+0x2625e10]
|   0x3ce639a42008   808  41baffffffff         movl r10,0xffffffff
|   0x3ce639a4200e   80e  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a42011   811  761d                 jna 0x3ce639a42030  <+0x830> (T) -\
|   0x3ce639a42013   813  bf01000000           movl rdi,0x1                      |
|   0x3ce639a42018   818  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a4201b   81b  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a4201f   81f  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a42023   823  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a42027   827  488b0516f8ffff       REX.W movq rax,[rip+0xfffff816]   |
|   0x3ce639a4202e   82e  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = c[idx]                          |
|   0x3ce639a42030   830  c4a17a6f0422         vmovdqu xmm0,[rdx+r12*1]        <-/
|                                         xmm2 = b[idx]
|   0x3ce639a42036   836  c4a17a6f141a         vmovdqu xmm2,[rdx+r11*1]
|
|   0x3ce639a4203c   83c  458d9910122707       leal r11,[r9+0x7271210]
|   0x3ce639a42043   843  41baffffffff         movl r10,0xffffffff
|   0x3ce639a42049   849  4d3bda               REX.W cmpq r11,r10
|   0x3ce639a4204c   84c  761d                 jna 0x3ce639a4206b  <+0x86b> (T) -\
|   0x3ce639a4204e   84e  bf01000000           movl rdi,0x1                      |
|   0x3ce639a42053   853  4989e2               REX.W movq r10,rsp                |
|   0x3ce639a42056   856  4883ec08             REX.W subq rsp,0x8                |
|   0x3ce639a4205a   85a  4883e4f0             REX.W andq rsp,0xf0               |
|   0x3ce639a4205e   85e  4c891424             REX.W movq [rsp],r10              |
|   0x3ce639a42062   862  488b05dbf7ffff       REX.W movq rax,[rip+0xfffff7db]   |
|   0x3ce639a42069   869  ffd0                 call rax                          |
|                                                                                |
|                                         xmm0 = xmm0*xmm2                       |
|   0x3ce639a4206b   86b  c4e27940c2           vpmulld xmm0,xmm0,xmm2          <-/
|                                         a[idx] = xmm0
|   0x3ce639a42070   870  c4a17a7f041a         vmovdqu [rdx+r11*1],xmm0
|
|                                         count += 32
|   0x3ce639a42076   876  4183c120             addl r9,0x20
|                                         if (count != 0)
\-- 0x3ce639a4207a   87a  0f8500fcffff         jnz 0x3ce639a41c80  <+0x480> (T)
[snip]

Surprisingly (or not), the code follows exactly the same 'shape' as before: function inlined, loop unrolled with 6 multiplications, but with an important difference: SIMD finally! Note that this code uses AVX-encoded instructions on 128-bit xmm registers: each vpmulld multiplies four 32-bit integers at once, so the 6 vpmulld instructions add up to 24 multiplications per iteration of the loop. This should certainly bring some performance gain... right?

Interestingly, the x86_64 code is not an identical copy of the WASM version: the WebAssembly version performs only 2 multiplications per iteration!
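To make the "4 at a time, 24 per iteration" arithmetic concrete, here is a minimal sketch in plain JavaScript that emulates what a single vpmulld does and then repeats it six times, as the unrolled loop does. The names `vpmulld` (as a JS function) and `unrolledIteration` are made up for illustration, and the `a[i] = b[i] * c[i]` semantics are inferred from the annotations in the listing, so treat it as an approximation of the generated code rather than a faithful model.

```javascript
// Emulates a single `vpmulld xmm0,xmm0,xmm2`: four independent 32-bit
// integer multiplications, each keeping only the low 32 bits of the product.
function vpmulld(x, y) {
  const out = new Int32Array(4);
  for (let lane = 0; lane < 4; lane++) {
    out[lane] = Math.imul(x[lane], y[lane]);
  }
  return out;
}

// Hypothetical stand-in for one pass of the unrolled loop body:
// six 4-lane multiplies, i.e. a[i] = b[i] * c[i] for 24 consecutive
// elements starting at `base`. Returns the number of scalar multiplications.
function unrolledIteration(a, b, c, base) {
  let scalarMuls = 0;
  for (let v = 0; v < 6; v++) {
    const i = base + v * 4;
    a.set(vpmulld(c.subarray(i, i + 4), b.subarray(i, i + 4)), i);
    scalarMuls += 4;
  }
  return scalarMuls;
}

const b = Int32Array.from({ length: 24 }, (_, i) => i + 1); // 1..24
const c = new Int32Array(24).fill(3);
const a = new Int32Array(24);
console.log(unrolledIteration(a, b, c, 0), a[23]); // prints: 24 72
```

`Math.imul` is used because it keeps only the low 32 bits of each product, which is the same wraparound behavior vpmulld has on each lane.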

Some Numbers...

I'm sure you're tired of reading x86_64 assembly, and most of you probably want numbers. After all, can the V8 JIT really compete with (or come close to) compiled languages?

| Description | Time (ms) |
|---|---|
| add.js (JIT disabled, --jitless) | 39719.07 |
| GCC/mul.c + -O0 | 2495.13 |
| add.js (with JIT) | 2248.63 |
| load.js+mul_wasm.c (WebAsm) + -O0 | 2298.35 |
| load.js+mul_wasm.c (WebAsm) + -O3 | 1825.97 |
| GCC/mul.c + -O3 | 1083.70 |
| GCC/mul.c + -O3 + -march=native | 1065.57 |
| load.js+mul_wasm.c (WebAsm) + -O3 + -msimd128 | 1064.27 |

First of all, the difference between JIT and non-JIT times is impressive: roughly 17.7 times faster with the JIT enabled (39719.07 ms vs. 2248.63 ms). Google's efforts to make V8 fast really seem to pay off, making browsers capable of things previously only possible in desktop environments.

Second, JITed JS (and WebAssembly) times are close to those of unoptimized C code, even slightly ahead of it, which is... curious. I honestly expected a bit more, considering most developers write only in JS, but okay.

Third, enabling '-O3' in WebAssembly really starts to show some potential, pulling further ahead of unoptimized C code (1825.97 ms vs. 2495.13 ms). Finally, I had a pleasant surprise when enabling auto-vectorization with the -msimd128 flag: performance was virtually identical to that of the code produced by GCC at the highest optimization level (1064.27 ms vs. 1065.57 ms). Clearly, Clang+TurboFan did an excellent job in this last scenario.
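For anyone who prefers ratios to raw milliseconds, the small sketch below recomputes the comparisons from the table. The timings are copied from the table above; the object keys and the `speedup` helper are shorthand invented for this example only.

```javascript
// Timings in milliseconds, copied from the table above. The keys are
// shorthand labels chosen for this sketch, not names used in the article.
const timesMs = {
  jsJitless: 39719.07,
  gccO0: 2495.13,
  jsJit: 2248.63,
  wasmO0: 2298.35,
  wasmO3: 1825.97,
  gccO3: 1083.70,
  gccO3Native: 1065.57,
  wasmO3Simd: 1064.27,
};

// How many times faster `candidate` is than `baseline`
// (values above 1 mean the candidate wins).
function speedup(baseline, candidate) {
  return timesMs[baseline] / timesMs[candidate];
}

// Print every entry relative to the slowest one (JS with --jitless).
for (const name of Object.keys(timesMs)) {
  console.log(name.padEnd(12), speedup("jsJitless", name).toFixed(2) + "x");
}
```

Read this way, JITed JS comes out roughly 17.7 times faster than the jitless run, and the SIMD-enabled WebAssembly build lands within a fraction of a percent of GCC with -O3 -march=native.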

Final Thoughts

Honestly, I am quite happy with the results and everything I learned during the process. It was both fun and exhausting, and I never thought I would be able to set breakpoints and single-step debug JIT-generated code in a runtime engine! There is still much to learn, as I still cannot fully understand the JIT code generated by V8.

I also want to emphasize that this text should not be taken too seriously: it's just a quick dive into two lines of JS code, and no definitive conclusions can be drawn from it. There are countless scenarios where the generated code could be entirely different. What I'm trying to bring to the table is a different angle: debugging and analyzing the ASM code generated by the JIT, something I honestly haven't come across much in benchmarks, which usually stick to tables and graphs.

If you know of other interesting materials about V8 and comparative analyses, feel free to let me know =).
