tiqwab/openjdk_interpreter.md

## openjdk_interpreter.md

      
    Raw
  

              openjdk_interpreter.md
            
          
    OpenJDK の interpreter とは


Java byte code の interpreter
(Java) メソッドはまず interpreter で実行される
(条件を満たすと) JIT コンパイルで生成されたネイティブコードを実行する

簡単な Java byte code 例
class Example1 {
    public static int add() {
        int x = 1;
        int y = 2;
        return x + y;
    }
}
$ javac Example1.java
$ javap -c -v Example1.class
  ...
  public static int add();
    descriptor: ()I
    flags: ACC_PUBLIC, ACC_STATIC
    Code:
      stack=2, locals=2, args_size=0
         0: iconst_1 // (int) 1 をスタックに push
         1: istore_0 // スタックから pop してローカル変数 0 番へ
         2: iconst_2 // (int) 2 を push
         3: istore_1 // pop してローカル変数 1 番へ
         4: iload_0  // ローカル変数 0 番 を push
         5: iload_1  // ローカル変数 1 番を push
         6: iadd     // スタックから val2, val1 と pop して (val1 + val2) を push
         7: ireturn  // スタック先頭の値を返す
      LineNumberTable:
        line 10: 0
        line 11: 2
        line 12: 4
  ...

(Hotspot VMの基本構造を理解する の図9 の引用入れたい)
2 種類の Java byte code interpreter

OpenJDK の RuntimeOverview#Interpreter に記載されているように、2 種類の実装が存在。

Cpp Interpreter
Template Interpreter

Cpp Interpreter

CppInterpreter とは OpenJDK 内の interpreter 実装の 1 つで、上の overview で

a classic switch-statement loop

を採用した実装。
ソースを見てみるとまさに switch 文によるループで Java byte code を処理しているのがわかる。
// in bytecodeInterpreter.cpp
// CppInterpreter 内で BytecodeInterpreter::run を使用
//
// 一部省略 & 単純化
void
BytecodeInterpreter::run(interpreterState istate) {
    ...
    register address pc = istate->bcp(); // bcp = byte code pointer
    ...
    while (1) {
        opcode = *pc;

        switch (opcode)
        {
        ...
        CASE(_i2f):       /* convert top of stack int to float */
            SET_STACK_FLOAT(VMint2Float(STACK_INT(-1)), -1);
            UPDATE_PC_AND_CONTINUE(1);
        ...
        CASE(_ireturn):
            {
            // Allow a safepoint before returning to frame manager.
            SAFEPOINT;

            goto handle_return;
            }
        }
        ...
    }
    ...
    handle_return: {
        ...
    }
}
CppInterpreter は処理が直感的ですが開発用ビルドでフラグを有効にしないと使用されません。
通常より性能に優れた Template Interpreter が使用されます。
Template Interpreter


各 Java byte code について専用のネイティブコードを (JVM 起動直後に) 生成する
生成したネイティブコードの間を飛び回る感じになる

arraylength の template 例 (x86_64 環境)
# 'java' comes from development build
$ java -XX:+PrintInterpreter Hello | tee interpreter.log

interpreter.log を見ると interpreter 用に生成されたアセンブリが見られる。
例えば arraylength (配列オブジェクトの長さを取得する) は以下のようになる。
arraylength  190 arraylength  [0x00007f6d3903c8a0, 0x00007f6d3903c8c0]  32 bytes

  # (必要なら) operand の取得
  0x00007f6d3903c8a0: pop    %rax

  # メインの処理
  0x00007f6d3903c8a1: mov    0xc(%rax),%eax       # 0xc(%rax) に配列オブジェクトならば長さが格納されている

  # 次の Java byte code を読む準備
  0x00007f6d3903c8a4: movzbl 0x1(%r13),%ebx       # ebx レジスタに次の Java byte code を保存
  0x00007f6d3903c8a9: inc    %r13                 # bcp (byte code pointer) の increment
  0x00007f6d3903c8ac: movabs $0x7f6d4ed7bd60,%r10 # r10 レジスタに dispatch table のベースアドレスを保存
  0x00007f6d3903c8b6: jmpq   *(%r10,%rbx,8)       # 次の Java byte code 用のコードへ jmp
  0x00007f6d3903c8ba: nop
  0x00007f6d3903c8bb: nop
  0x00007f6d3903c8bc: nop
  0x00007f6d3903c8bd: nop
  0x00007f6d3903c8be: nop
  0x00007f6d3903c8bf: nop

コメントで書いたように生成されるアセンブリを見るとメインの処理の前後は大体似たようなパターンが見られる。
(適当な図)
OpenJDK のソースとしては

Template
TemplateTable
DispatchTable

あたりが関係するコードになるはず。
Template

in src/hotspot/share/interpreter/templateTable.hpp

interpreter 用 native code の generator

 class Template {
   ...
   
   typedef void (*generator)(int arg);
 
   int       _flags;        // describes interpreter template properties (bcp unknown)
   TosState  _tos_in;      // tos cache state before template execution
   TosState  _tos_out;     // tos cache state after  template execution
   generator _gen;         // template code generator
   int       _arg;         // argument for template code generator
   
   ...
 }
TemplateTable

in src/hotspot/share/interpreter/templateTable.hpp

Template をまとめたもの
各 Java byte code についてどのような native code を生成するかは TemplateTable の initialize メソッドから追える

JVM 起動時にこのメソッドが呼ばれる


 void TemplateTable::initialize() {
   ...
   
   //                                 interpr. templates
   // Java spec bytecodes             ubcp|disp|clvm|iswd  in    out   generator             argument
   
     def(Bytecodes::_arraylength    , ____|____|____|____, atos, itos, arraylength         ,  _           );
     
   ...
 }
各 generator は CPU 依存なので src/hotspot/cpu/<arch>/templateTable_<arch>.cpp のようなソース上に存在する。例えば x86 の場合 src/hotspot/cpu/x86/templateTable_x86.cpp。
__ の実態は InterpreterMacroAssembler と呼ばれるもので動的にネイティブコードを生成するために使用している。
void TemplateTable::arraylength() {
  transition(atos, itos);
  __ null_check(rax, arrayOopDesc::length_offset_in_bytes());
  __ movl(rax, Address(rax, arrayOopDesc::length_offset_in_bytes()));
}
この中には上のアセンブリでいう「メインの処理」しかない。 __movl がそれ。
(transition はただの assetion で、null_check はデフォルト何もしないっぽい)
DispatchTable

in src/hotspot/share/interpreter/templateInterpreter.hpp

各 Java byte code interpreter 用に生成したコードのエントリポイント (のアドレス) を管理

256 (< Java byte code の種類) x tos (Top Of Stack) 10 のアドレスを _table に持つ


set_entry でエントリポイントの設定
table_for で tos 毎のテーブルのベースアドレスを取得

 class DispatchTable {
  public:
   enum { length = 1 << BitsPerByte }; // an entry point for each byte value
  public:
   address _table[number_of_states][length]; // dispatch tables, indexed by tosca and bytecode

   ...

   void       set_entry(int i, EntryPoint& entry);     // set entry point for a given bytecode i
   address*   table_for(TosState state)          { return _table[state]; }
 }
このテーブルの初期化は ``TemplateInterpreterGenerator::set_entry_points_for_all_bytes` で行われる。
tos (Top Of Stack) について


各 Java byte code 処理前後でスタックの先頭値の型として期待するもの

JVM はスタックマシンなので各命令の入力、出力の型ともいえる
ここでは入力側の tos を in tos, 出力側を out tos と呼ぶ


Java byte code が決まれば in tos, out tos も決まる

例えば arraylength なら in tos = atos, out tos = itos

配列オブジェクト (への参照) を受け取って、その長さ (int) を返す


tos による最適化 (の考え方)


連続する Java byte code 間の計算値の受け渡しをスタックで行う

直感的
もっと早くしたい？


値の受け渡しをスタックではなくレジスタで行う

前の byte code の out tos と次の byte code の in tos が一致すればいい


tos による最適化 (の実装)


使用するレジスタ (x86_64 の場合)

rax for long, int, short, char ,byte, and boolean
xmm0 for double and float


Java byte code を処理したあとの出力値をレジスタに置く
自分の out tos に応じて異なるアドレスへ jmp

アドレスは次の Java byte code と自分の out tos を元に DispatchTable から求める


次の Java byte code の in tos と一致している場合、レジスタに置いた値をそのまま使用して処理を進める
一致していない場合、その値を自分は使わないということなのでとりあえずスタックに置き直しておく (あとの Java byte code で使用するはず)

例えば上の arraylength に対する DispatchTable の持つアドレスは

_table[atos][190] には 0x00007f6d3903c8a1

直前の Java byte code の処理で配列オブジェクトの参照値をレジスタに置いているはずなのでそのまま使う


_table[vtos][190] には 0x00007f6d3903c8a0

直前の Java byte code では何もスタックに積んでいない。ただそれより前の Java byte code の処理で配列オブジェクトの参照値をスタックに積んだはずなのでそれを pop する


それ以外はエラー (にしていたはず)

arraylength  190 arraylength  [0x00007f6d3903c8a0, 0x00007f6d3903c8c0]  32 bytes
  0x00007f6d3903c8a0: pop    %rax
  0x00007f6d3903c8a1: mov    0xc(%rax),%eax       # 0xc(%rax) に配列オブジェクトならば長さが格納されている
  0x00007f6d3903c8a4: movzbl 0x1(%r13),%ebx       # ebx レジスタに次の Java byte code を保存
  0x00007f6d3903c8a9: inc    %r13                 # bcp (byte code pointer) の increment
  0x00007f6d3903c8ac: movabs $0x7f6d4ed7bd60,%r10 # r10 レジスタに dispatch table のベースアドレスを保存
  0x00007f6d3903c8b6: jmpq   *(%r10,%rbx,8)       # 次の Java byte code 用のコードへ jmp
  0x00007f6d3903c8ba: nop
  0x00007f6d3903c8bb: nop
  0x00007f6d3903c8bc: nop
  0x00007f6d3903c8bd: nop
  0x00007f6d3903c8be: nop
  0x00007f6d3903c8bf: nop

(gdb で見るなら TemplateInterpreter::_normal_table._table[8][190] とかで)
この部分の native code を生成するのは以下の処理。
void TemplateInterpreterGenerator::set_short_entry_points(Template* t, address& bep, address& cep, address& sep, address& aep, address& iep, address& lep, address& fep, address& dep, address& vep) {
  assert(t->is_valid(), "template must exist");
  switch (t->tos_in()) {
    case btos:
    case ztos:
    case ctos:
    case stos:
      ShouldNotReachHere();  // btos/ctos/stos should use itos.
      break;
    case atos: vep = __ pc(); __ pop(atos); aep = __ pc(); generate_and_dispatch(t); break;
    case itos: vep = __ pc(); __ pop(itos); iep = __ pc(); generate_and_dispatch(t); break;
    case ltos: vep = __ pc(); __ pop(ltos); lep = __ pc(); generate_and_dispatch(t); break;
    case ftos: vep = __ pc(); __ pop(ftos); fep = __ pc(); generate_and_dispatch(t); break;
    case dtos: vep = __ pc(); __ pop(dtos); dep = __ pc(); generate_and_dispatch(t); break;
    case vtos: set_vtos_entry_points(t, bep, cep, sep, aep, iep, lep, fep, dep, vep);     break;
    default  : ShouldNotReachHere();                                                 break;
  }
}
これを見ると結局は以下の 4 パターンに一般化できる。

Xtos to Xtos (Xtos は vtos 以外のいずれか)

レジスタで受け渡された値を使って処理を進める。最も効率的


vtos to Xtos

レジスタでの受け渡しに失敗 (jmp 前のテンプレートでは tos を用意しないので)
スタックに値があるはずなのでそれを pop してから処理を開始する。


Xtos to vtos

レジスタでの受け渡しに失敗 (jmp 先のテンプレートでは tos を使用しないので)
後続の Java byte code の処理で使用する可能性があるので、レジスタで渡された値をスタックに入れ直してから処理を進める


vtos to vtos

レジスタでの受け渡しをしない
実際のソースではパターンとしては Xtos to vtos と同じとみなしている (スタックに入れ直さないだけ)