@rednaxelafx
Created December 30, 2010 05:21
A code snippet to show some of the relationship between JVM/HotSpot's and Dalvik's interpreters.
Java source code:
k = i + j;
May compile to Java bytecode:
iload_0
iload_1
iadd
istore_2
And may turn into Dalvik VM code:
add-int v2, v1, v0
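(For reference, the bytecode above can be reproduced from a complete method. The class and method names
below are made up for illustration; compiling the class with javac and running javap -c on it shows the
iload_0 / iload_1 / iadd / istore_2 sequence, while the exact Dalvik register numbers depend on how dx
assigns registers.)

// Hypothetical example class; only the body of add() matters here.
class AddExample {
    static int add(int i, int j) {   // i -> local slot 0, j -> slot 1
        int k = i + j;               // k -> slot 2, hence istore_2
        return k;                    // javap also shows iload_2, ireturn here
    }
}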
Compare the HotSpot Client VM's interpreter in JDK 6u18 with Dalvik's interpreter in Android 2.0, both on x86.
To execute the code above, the traces obtained by unrolling the interpreters' fetch-dispatch-execute loops
are:
HotSpot's interpreter (client mode default config):
;;-------------iload_0-------------
mov eax, dword ptr [edi]
movzx ebx, byte ptr [esi + 1]
inc esi
jmp dword ptr [ebx*4 + 6DB188C8]
;;-------------iload_1-------------
push eax
mov eax, dword ptr [edi - 4]
movzx ebx, byte ptr [esi + 1]
inc esi
jmp dword ptr [ebx*4 + 6DB188C8]
;;--------------iadd---------------
pop edx
add eax, edx
movzx ebx, byte ptr [esi + 1]
inc esi
jmp dword ptr [ebx*4 + 6DB188C8]
;;------------istore_2-------------
mov dword ptr [edi - 8], eax
movzx ebx, byte ptr [esi + 1]
inc esi
jmp dword ptr [ebx*4 + 6DB19CC8]
Dalvik's interpreter:
;;------------add-int--------------
movzx eax, byte ptr [edx + 2]
movzx ecx, byte ptr [edx + 3]
mov eax, dword ptr [esi + eax*4]
add eax, dword ptr [esi + ecx*4]
movzx ecx, bh
movzx ebx, word ptr [edx + 4]
lea edx, dword ptr [edx + 4]
mov dword ptr [esi + ecx*4], eax
movzx eax, bl ; GOTO_NEXT "computed next" version
sal eax, $$$handler_size_bits
add eax, edi
jmp eax
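Both traces end with the same pattern: load the next opcode, then jump through a handler table (HotSpot)
or to a computed handler address (Dalvik). As a rough illustration of the fetch-dispatch-execute split,
and nothing more, here is a toy stack-based interpreter for the four bytecodes above, written in Java with
a plain switch; the real interpreters are hand-written/generated assembly and replicate the dispatch tail
into every handler instead of looping.

// Toy sketch only: a stack-based interpreter for iload_0/iload_1/iadd/istore_2.
// It only illustrates where the fetch and dispatch overhead sits relative to
// the "execute" work; it is not how HotSpot or Dalvik are implemented.
class StackInterpSketch {
    static final int ILOAD_0 = 0, ILOAD_1 = 1, IADD = 2, ISTORE_2 = 3, HALT = 4;

    static void run(int[] code, int[] locals) {
        int[] stack = new int[8];
        int sp = 0, pc = 0;
        while (true) {
            int op = code[pc++];                                      // fetch
            switch (op) {                                             // dispatch
                case ILOAD_0:  stack[sp++] = locals[0]; break;        // execute
                case ILOAD_1:  stack[sp++] = locals[1]; break;
                case IADD:     { int b = stack[--sp], a = stack[--sp];
                                 stack[sp++] = a + b; break; }
                case ISTORE_2: locals[2] = stack[--sp]; break;
                case HALT:     return;
            }
        }
    }

    public static void main(String[] args) {
        int[] locals = {3, 4, 0};
        run(new int[] {ILOAD_0, ILOAD_1, IADD, ISTORE_2, HALT}, locals);
        System.out.println(locals[2]);                                // prints 7
    }
}

In this shape, the code[pc++] fetch and the switch dispatch are paid once per bytecode; that per-bytecode
cost is exactly what gets stripped off below.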
If we strip off the fetch/dispatch part from the two code traces above, we'll get:
HotSpot:
;;-------------iload_0-------------
mov eax, dword ptr [edi]
;;-------------iload_1-------------
push eax
mov eax, dword ptr [edi - 4]
;;--------------iadd---------------
pop edx
add eax, edx
;;------------istore_2-------------
mov dword ptr [edi - 8], eax
Dalvik:
;;------------add-int--------------
movzx eax, byte ptr [edx + 2]
movzx ecx, byte ptr [edx + 3]
mov eax, dword ptr [esi + 4*eax]
add eax, dword ptr [esi + 4*ecx]
movzx ecx, bh
mov dword ptr [esi + 4*ecx], eax
Now we can see that in this example, counting only the instructions that carry out the user code's
original semantics, both HotSpot's and Dalvik's interpreters use 6 x86 instructions.
This means HotSpot doesn't lose performance in the "execution" part just because the JVM spec defines a
stack-based instruction set. With one-level top-of-stack caching, HotSpot can still make efficient use of
machine registers during interpretation, in spite of the fact that it's emulating a stack-based abstract
machine.
On the other hand, Dalvik's interpreter (on x86) keeps all of its "virtual registers" in the stack frame,
which lives in memory and is therefore slower to access than HotSpot's cached TOS (top-of-stack) value.
Of course, Dalvik could tune its interpreter further to squeeze out more performance, but with so few
registers available on x86 that's going to be hard; it would be easier with more free registers, as on
x86-64 or a RISC processor.
But because the JVM needs more bytecode instructions than Dalvik to do the same work, the fetch/dispatch
part makes HotSpot's interpreter pay more interpretation overhead than Dalvik's.
------------------------------------------------------------------------------------------------
It's also interesting to look at Sun JDK 1.1.8's interpreter. Running the example shown above, and again
counting just the "execution" part, we'd get:
;;-------------iload_0-------------
mov ebx, dword ptr [ebp]
;;-------------iload_1-------------
mov ecx, dword ptr [ebp + 4]
;;--------------iadd---------------
add ebx, ecx
;;------------istore_2-------------
mov dword ptr [ebp + 8], ebx
That's 2 memory reads and 1 memory write, exactly what you'd get if the example were written in C and
compiled without optimization, which is not bad for an interpreter. This is again the effect of top-of-stack
caching, here in its multi-state form: in this trace, up to two stack values are held in registers (ebx and
ecx) at once.
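As a final sketch, the same idea extended to multiple cache states: dispatch is conceptually indexed by
(number of cached stack values, opcode), so every handler knows statically whether its operands are already
in registers. Only the (state, opcode) pairs needed for this example are written out; this is an assumed
illustration of the technique, not JDK 1.1.8's actual code.

// Toy sketch of multi-state TOS caching with a two-register cache (r0 and r1
// standing in for ebx and ecx in the trace above). A complete interpreter
// would also need handlers that spill to / refill from a memory stack once
// more than two values are live.
class MultiStateTosSketch {
    static final int ILOAD_0 = 0, ILOAD_1 = 1, IADD = 2, ISTORE_2 = 3, HALT = 4;

    static void run(int[] code, int[] locals) {
        int cached = 0;                                  // how many stack slots live in r0/r1
        int r0 = 0, r1 = 0;
        int pc = 0;
        while (true) {
            int op = code[pc++];
            switch (cached * 8 + op) {                   // dispatch on (state, opcode)
                case 0 * 8 + ILOAD_0:  r0 = locals[0];   cached = 1; break;
                case 1 * 8 + ILOAD_1:  r1 = locals[1];   cached = 2; break;
                case 2 * 8 + IADD:     r0 = r0 + r1;     cached = 1; break;
                case 1 * 8 + ISTORE_2: locals[2] = r0;   cached = 0; break;
                case 0 * 8 + HALT:     return;
                default: throw new AssertionError("pair not sketched: " + cached + "/" + op);
            }
        }
    }
}

Running run(new int[] {ILOAD_0, ILOAD_1, IADD, ISTORE_2, HALT}, new int[] {3, 4, 0}) leaves 7 in locals[2],
and the four handlers touch the locals array exactly three times (two reads, one write), matching the shape
of the 1.1.8 trace.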