@lpereira
Created May 14, 2015 10:57
Lwan template interpreter
| New | Old |
| --- | --- |
| `movsx esi,BYTE PTR [r14+0x8]` | `movsx esi,BYTE PTR [r13+0x8]` |
| `mov rdi,r12` | `mov rdi,rbp` |
| `add rbx,0x1` | `add r13,0x18` |
| `call 0x412fd0 <strbuf_append_char>` | `call 0x413ed0 <strbuf_append_char>` |
| `mov rax,QWORD PTR [r15+rbx*8]` | `mov eax,DWORD PTR [r13+0x0]` |
| `jmp QWORD PTR [rax*8+0x416440]` | `mov rax,QWORD PTR [rax*8+0x4166a0]` |
| | `jmp rax` |
| C: | C: |
| `pc++;` | `chunk++;` |
| `goto *dispatch_table[ops[pc]];` | `goto *dispatch_table[chunk->action];` |
`struct chunk` is 0x18 bytes. `ops` is an array of all the `action` fields from a `struct chunk` array,
tightly packed together. The two versions are pretty much equivalent, except that in the old one GCC
loads the next address into `rax` and only then jumps there; it doesn't fuse the read with the indirect
jump, as it does in the new one. Why?
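
For anyone who wants to reproduce the comparison, here is a minimal sketch of the two dispatch styles using GCC's labels-as-values extension. It is not the Lwan code: the `struct chunk` layout, the `ch` field, the two actions, and the function names `run_chunks` and `run_ops` are all made up for illustration, and `ops` is assumed to be a plain array of `enum action`. Only the two `goto *dispatch_table[...]` forms are taken from the snippet above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical actions; the real interpreter has more. */
enum action { ACTION_APPEND_CHAR, ACTION_LAST };

/* Hypothetical layout, not the real 0x18-byte struct. */
struct chunk {
    enum action action;
    char ch;
    void *data;
};

/* "Old" style: walk the chunk array and dispatch on chunk->action. */
size_t run_chunks(const struct chunk *chunk, char *out, size_t cap)
{
    static const void *const dispatch_table[] = {
        [ACTION_APPEND_CHAR] = &&action_append_char,
        [ACTION_LAST] = &&action_last,
    };
    size_t len = 0;

    assert(chunk != NULL && out != NULL);
    goto *dispatch_table[chunk->action];

action_append_char:
    if (len < cap)
        out[len++] = chunk->ch;
    chunk++;
    goto *dispatch_table[chunk->action];

action_last:
    return len;
}

/* "New" style: dispatch on a separate, packed array of actions. */
size_t run_ops(const enum action *ops, const struct chunk *chunks,
               char *out, size_t cap)
{
    static const void *const dispatch_table[] = {
        [ACTION_APPEND_CHAR] = &&action_append_char,
        [ACTION_LAST] = &&action_last,
    };
    size_t len = 0, pc = 0;

    assert(ops != NULL && chunks != NULL && out != NULL);
    goto *dispatch_table[ops[pc]];

action_append_char:
    if (len < cap)
        out[len++] = chunks[pc].ch;
    pc++;
    goto *dispatch_table[ops[pc]];

action_last:
    return len;
}
```

Compiling this with optimizations enabled and disassembling both functions should show how a given GCC version lowers each `goto *`; whether it matches the listings above will depend on the compiler version and flags.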
@ricbit

ricbit commented May 14, 2015

Does it make a difference? Both forms may well be the same once they are converted to micro-ops on your target processor. The second has the slight disadvantage of clobbering rax, but the compiler may have realized that won't matter after register renaming, and it may be that the second form uses shorter opcodes (and thus puts less pressure on the code cache).

I can't tell for sure without knowing the target architecture.

@lpereira
Author

@ricbit: I didn't measure; I've no idea if there's any difference, but I was puzzled anyway. The version on the left is 8 bytes shorter than the version on the right. Architecture is x86-64; my machine is a Sandy Bridge Core i7.

The C code for this snippet is here.
