Performance of using default methods to compile Scala trait methods

In Scala 2.12, bodies of methods defined in traits will be compiled to default methods in the interface classfile. In short, we have the following bytecode formats for concrete trait methods (a rough Java rendering follows the list):

  • 2.11.x: trait method bodies are in static methods in the trait's T$impl class. Classes extending a trait get a virtual method that implements the abstract method in the interface and forwards to the static implementation method.
  • 2.12.0-M4: trait method bodies are in (non-static) interface default methods; subclasses get a virtual method (overriding the default method) that forwards to that default method using invokespecial (a super call).
  • 33e7106: in most cases, no more forwarders are generated in subclasses as they are not needed: the JVM will resolve the correct method. Concrete trait methods are invoked either using invokeinterface (if the static receiver type is the trait) or invokevirtual (if the static receiver type is the subclass).
  • 2.12.0-M5: trait method bodies are emitted in static methods in the interface classfile. The default methods forward to the static methods.
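
To make the formats concrete, here is a rough Java rendering of how a trait like trait T { def f: Int = 1 } would be encoded. This is my own sketch based on the descriptions above (the name f$ for the M5 static method is a simplification, and the variants obviously cannot coexist in one compilation unit):

// 2.11.x: static impl method in T$impl, virtual forwarder in the class
interface T { int f(); }
abstract class T$impl { static int f(T $this) { return 1; } }
class C implements T { public int f() { return T$impl.f(this); } }

// 2.12.0-M4: default method, forwarder performing a super call (invokespecial)
interface T { default int f() { return 1; } }
class C implements T { public int f() { return T.super.f(); } }

// 33e7106: default method, no forwarder; the JVM resolves the call
interface T { default int f() { return 1; } }
class C implements T { }

// 2.12.0-M5: static method in the interface, default method forwards to it
interface T {
    static int f$(T $this) { return 1; }
    default int f() { return f$(this); }
}
class C implements T { }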

Recently we observed that 33e7106 causes a 20% slowdown of the Scala compiler (tested by compiling the sources of better-files). Since we are still lacking a proper performance regression testing infrastructure, the slowdown was only discovered later and pinned down using git-bisect.

First observation: the slowdown is not due to additional logic introduced by the patch, but the change in the bytecode of the compiler itself. This can be verified easily: a compiler built from revision 33e7106 using its parent (b932419) as STARR has no slowdown. Building it with itself as STARR, the resulting compiler runs slower.

This means that any Scala application using concrete trait methods is likely to be affected by this problem.

This post logs our attempts to find the root cause of the slowdown.

Some details on the HotSpot compiler

This section explains some details of the HotSpot optimizer. It assembles information from various sources and from our own observations; it might contain mistakes and misunderstandings. It is certainly simplified and incomplete. More details are available in the linked resources.

JITing

My recommended reference for this first section is the talk "JVM Mechanics" by Doug Hawkins (video, slides).

First of all, JVM 8 uses two JIT compilers: C1 and C2. C1 is fast but performs only basic optimizations; in particular, it does not perform speculative optimizations based on profiling (frequency of branches, type profiles at callsites). C2 is profile-guided and speculative, but slower.

The JVM starts by interpreting the program. It only compiles methods that are either called often enough or that have long enough loops. There are two counters for each method:

  • the number of times it is invoked, and
  • the number of times a backwards branch is executed.

The decision to compile a method is based on these counters. A simplified, typical scenario (ignoring backwards branches): after 2000 invocations a method gets compiled by C1; after 15000 invocations it gets re-compiled by C2 (see this answer on SO for more details). Note that the C1-generated assembly is instrumented to update the two counters (and also to collect other profiling data that will be used by the C2 optimizer). After a method is compiled, new invocations of the method use the newly generated assembly.
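
As an illustration, here is a minimal sketch of my own (not one of the benchmarks discussed later) whose hot method crosses both thresholds. Running it with -XX:+PrintCompilation should show square being compiled first by C1 and later re-compiled by C2:

public class Warmup {
    static int square(int x) { return x * x; }

    public static void main(String[] args) {
        int sum = 0;
        // 20000 invocations: enough to cross both the C1 and the C2 threshold
        for (int i = 0; i < 20_000; i++) {
            sum += square(i);
        }
        System.out.println(sum);
    }
}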

The above works well for a method that is invoked many times, but what happens to a long-running method that is invoked only once, but has a long loop? The decision to compile this method is taken when the counter of backwards branches passes a threshold. Once compilation is done, the JVM performs a so-called on-stack replacement (OSR): the stack frame of the running method is modified as necessary and execution continues using the new assembly.

An OSR / loop compilation of a method is always tied to a specific loop: the entry point of the generated assembly is at the end of the loop (locations are referred to by their index in the JVM bytecode, the "bytecode index" or bci). If there are multiple hot loops within a method, the same method may get multiple OSR-compiled versions. More details on this can be found in this post by Krystal Mok, which explains the many details of the -XX:+PrintCompilation output.
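
A minimal sketch (again my own) that triggers an OSR compilation: main is invoked only once, but its loop is hot. In the -XX:+PrintCompilation output, OSR compilations are marked with a % and show the bci of the loop:

public class OsrDemo {
    public static void main(String[] args) {
        long sum = 0;
        // the backwards-branch counter of this loop triggers a compilation;
        // execution switches to the compiled code via on-stack replacement
        for (int i = 0; i < 1_000_000; i++) {
            sum += i;
        }
        System.out.println(sum);
    }
}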

Inlining

For this section my reference is Aleksey Shipilёv's extensive post The Black Magic of (Java) Method Dispatch.

Inlining is fundamental because it acts as an enabler for most other optimizations. As Aleksey says in the conclusion: "inlining actually broadens the scope of other optimizations, and that alone is, in many cases, enough reason to inline".

Both C1 and C2 perform inlining. The policy for deciding whether to inline a method is non-trivial and uses several heuristics (implemented in bytecodeInfo.cpp, in the methods should_inline, should_not_inline and try_to_inline). A simplified summary:

  • Trivial methods (at most 6 bytes of bytecode by default, MaxTrivialSize) are always inlined.
  • Methods up to 35 bytes (MaxInlineSize) are inlined once they have been invoked at least 250 times (MinInliningThreshold).
  • Methods up to 325 bytes (FreqInlineSize) are inlined if the callsite is "hot" (or "frequent"), which means it is invoked more than 20 times (no command-line flag in release versions) per one invocation of the caller method.
  • The inlining depth is limited (9 by default, MaxInlineLevel).
  • No inlining is performed if the method containing the callsite (the caller) is already very large.

The procedure is the same for C1 and C2; it uses the invocation counter that is also used for compilation decisions (see the previous section).
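
The inliner's decisions can be observed directly. A minimal sketch of my own (the names are made up), to be run with -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining:

public class InlineDemo {
    // 4 bytes of bytecode, below MaxTrivialSize: should always be inlined
    static int tiny(int x) { return x + 1; }

    public static void main(String[] args) {
        int sum = 0;
        for (int i = 0; i < 100_000; i++) {
            sum += tiny(i);
        }
        System.out.println(sum);
    }
}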

Inlining virtual methods

In C1, a method can only be inlined if it can be statically resolved. This is the case for static and private methods and for constructors, but also for virtual methods that are never overridden. The JVM has full knowledge of the code of the program it is executing: if a method is virtual and could in principle be overridden in a subclass, but no such subclass has been loaded (so far), an invocation of the method can only resolve to that single definition.

The process of analyzing the hierarchy of classes currently loaded in the VM is called "class hierarchy analysis" (CHA). Both C1 and C2 use the information computed by CHA to inline calls to virtual methods that are not overridden.

When the JVM loads a new class, a virtual method that CHA previously reported as not overridden may gain an override. All assembly code that relied on the now-invalid assumption is discarded, and the corresponding methods are executed again by the bytecode interpreter. This process is called deoptimization. If a deoptimized method is currently being executed, control is passed back to the interpreter by on-stack replacement.
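
A hypothetical sketch of this mechanism (class and method names are mine): while only A is loaded, CHA reports a single target for f, so the call can be inlined. Merely loading B invalidates that assumption, which shows up as "made not entrant" lines in the -XX:+PrintCompilation output:

class A { int f() { return 1; } }
class B extends A { @Override int f() { return 2; } }

public class ChaDemo {
    static int call(A a) { return a.f(); }

    public static void main(String[] args) throws Exception {
        A a = new A();
        int sum = 0;
        for (int i = 0; i < 100_000; i++) sum += call(a);  // warm up; f is inlined via CHA
        Class.forName("B");  // loading B (no instance needed) invalidates the assumption
        for (int i = 0; i < 100_000; i++) sum += call(a);
        System.out.println(sum);
    }
}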

In addition to using CHA, the C2 compiler performs speculative inlining of virtual methods based on the type profiles gathered by the interpreter and by the C1-generated assembly. If the receiver type at a callsite is always the same (the callsite is "monomorphic"), the method is inlined. The generated assembly contains a type test to validate the assumption; if the assumption breaks, the method is deoptimized.

C2 will also inline bi-morphic callsites: the code of both callees is inlined, and a type test is used to branch to the correct one (or to bail out). Finally, if the type profile shows a clear bias towards a specific receiver type (for example 90%), that type's method is inlined and virtual dispatch is used for the remaining cases (shown in Aleksey's post).

If a callsite has 3+ receiver types without a clear bias, C2 does not inline and an ordinary method lookup is performed at runtime.
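
A sketch of the three cases (hypothetical types, modeled on the examples in Aleksey's post):

interface Shape { int area(); }
class Square implements Shape { public int area() { return 4; } }
class Circle implements Shape { public int area() { return 3; } }
class Triangle implements Shape { public int area() { return 2; } }

public class MorphismDemo {
    // if only Square ever reaches this callsite it is monomorphic (inlined with
    // a type guard); Square and Circle make it bimorphic (both inlined behind a
    // type test); all three without a dominant type make it megamorphic
    // (virtual dispatch, no inlining)
    static int callsite(Shape s) { return s.area(); }

    public static void main(String[] args) {
        Shape[] shapes = { new Square(), new Circle(), new Triangle() };
        int sum = 0;
        for (int i = 0; i < 300_000; i++) {
            sum += callsite(shapes[i % 3]);  // three receiver types: megamorphic
        }
        System.out.println(sum);
    }
}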

Note that C2 performs other speculative optimizations besides profile-based inlining, for example profile-based branch elimination.

Understanding the performance regression

With the above knowledge at hand (I wish I had had it when I started), we can try to identify what causes the slowdown of eliminating forwarder methods.

Call performance

In a first step we measured the call performance of the various trait encodings.

The first benchmark CallPerformance has roughly the following structure:

interface I {
    default int addDefault(int a, int b) { return a + b; }

    static int addStatic(int a, int b) { return a + b; }
    default int addDefaultStatic(int a, int b) { return addStatic(a, b); }

    default int addForwarded(int a, int b) { return a + b; }

    int addInherited(int a, int b);

    int addVirtual(int a, int b);
}

static abstract class A implements I {
    public int addInherited(int a, int b) { return a + b; }
}

static class C1 extends A implements I {
    public int addForwarded(int a, int b) { return I.super.addForwarded(a, b); }

    public int addVirtual(int a, int b) { return a + b; }
}

There are identical copies of C1 (C2, ...). The example encodes the following formats (we don't test the 2.11.x format):

  • addDefault for 33e7106
  • addDefaultStatic for 2.12.0-M5
  • addForwarded for 2.12.0-M4

The methods addInherited and addVirtual don't represent trait method encodings; they are there for comparison. We test all encodings both at a monomorphic callsite (the receiver is always C1) and at a polymorphic one; a rough rendering of the setup follows.
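
A rough JMH-style rendering of the measurement setup (my own simplification; the actual CallPerformance benchmark is more thorough):

import org.openjdk.jmh.annotations.*;

@State(Scope.Thread)
public class CallSketch {
    I mono = new C1();
    // four receiver types make the polymorphic callsite megamorphic
    I[] poly = { new C1(), new C2(), new C3(), new C4() };
    int i = 0;

    @Benchmark
    public int monomorphic() { return mono.addDefault(1, 2); }

    @Benchmark
    public int polymorphic() {
        i = (i + 1) & 3;
        return poly[i].addDefault(1, 2);
    }
}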

Monomorphic case

In the monomorphic case all trait encodings are inlined and perform the same (there are tiny differences; if you are interested, check Aleksey's blog post).

If we annotate all methods with JMH's DONT_INLINE directive (see below), encodings with a forwarder (either an M4-style forwarder invoking the trait default method, or an upcoming M5-style default method forwarding to the static method) are a bit slower; in other words, a default method without a forwarder is faster. The penalty is similar for both kinds of forwarder.
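
In JMH, that directive is expressed with the CompilerControl annotation; a minimal sketch (the class and method are made up):

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.CompilerControl;

public class NoInlineSketch {
    @CompilerControl(CompilerControl.Mode.DONT_INLINE)
    static int add(int a, int b) { return a + b; }

    @Benchmark
    public int callNotInlined() { return add(1, 2); }  // measures the call overhead itself
}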

Polymorphic case

If the callsite is polymorphic:

  • The M4 encoding (addForwarded) is slow because the forwarder cannot be inlined. This is the known issue of trait methods leading to megamorphic callsites that exists in Scala 2.11.x and older.
  • The 33e7106 (addDefault) and M5 (addDefaultStatic) encodings are also slow: the default method is not inlined (checked with -XX:+PrintInlining and by comparing with a method marked DONT_INLINE). We will explore this in detail later.

For comparison, an invocation of addInherited is inlined and is therefore much faster. So an inherited virtual method is not treated the same way as an inherited default method. The next section goes into the details of why this is the case.

Note: this cannot be the reason why the 33e7106 encoding causes a 20% performance regression. We found that addDefault is slower than it could be in the polymorphic case, but it is not slower than the M4 encoding.

CHA and default methods

The reason addDefault is not inlined in the previous example, while addInherited is, has to do with CHA: in fact, CHA is disabled altogether for default methods. This is logged in the JVM bugtracker under JDK-8036580. It was disabled in order to fix JDK-8036100, which led to the wrong method being inlined. (It was @retronym who initially suggested these tickets could be relevant.)

The reason addInherited is inlined is that the VM knows (from CHA) that the method is not overridden in any of the loaded classes. This is tested in the InliningCHA benchmark.

The first benchmark measures a megamorphic call to addInherited, just like in the previous section. This call is inlined. The second benchmark performs the exact same operation but first makes sure that a new subclass CX, which overrides addInherited, is loaded. CHA then no longer returns a single target for the method, and the call is not inlined. Note that no instance of CX is ever created; loading the class is enough (see the sketch below).
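
A hypothetical rendering of that setup (my own sketch, reusing I and A from the benchmark above):

import org.openjdk.jmh.annotations.*;

// overrides addInherited, so CHA can no longer prove a single implementation
class CX extends A implements I {
    @Override public int addInherited(int a, int b) { return a - b; }
    public int addVirtual(int a, int b) { return a - b; }  // required: abstract in I
}

@State(Scope.Benchmark)
public class LoadCXState {
    @Setup
    public void loadCX() throws ClassNotFoundException {
        Class.forName("CX");  // load only; no CX instance is ever created
    }
}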

This seems to be a shortcoming of C2's inliner implementation: based on the type profiling data, C2 knows that the only types reaching the callsite are C1, C2, C3 and C4. Using CHA, it could in principle find out that there is a single implementation of addInherited among those types.

Method lookup in classes implementing many interfaces

We are still searching for the reason why 33e7106 caused a performance regression. Martin Thompson notes in a blog post (dated 2012):

I have observed that when a class implements multiple interfaces, with multiple methods, performance can degrade significantly because the method dispatch involves a linear search of method list

We can reproduce this in the benchmark (InterfaceManyMembers). The basic example is the following:

interface I1 { default int a1 ... }
interface I2 { default int b1 ... ; default int b2 ... }
...

class A1 implements I1 { }
class A2 implements I2 { }
...

class B1 implements I1 { }
class B2 implements I1, I2 { }
...

In the benchmark, every class (A1, A2, B1, ...) exists in four copies to make sure the callsite is megamorphic. We measure how much time an invocation of one default method takes:

  • The number of default methods in an interface does not matter, so A1.a1 and A2.b1 perform the same.
  • The number of implemented interfaces matters, there's a penalty for every additional interface. So B1.a1 is faster than B2.b1, etc.

Adding an overriding forwarder method to the subclasses does not change this result; the slowdown per additional interface remains. So this does not seem to be the reason for the performance regression either.

Back to CHA

Googling a little bit more about the performance of default methods, I found a relevant post on SO containing a nice benchmark.

I simplified the example into the benchmark (NoCHAPreventsOptimization), which is relatively small:

interface I {
    int getV();
    default int accessDefault() { return getV(); }
}

abstract class A implements I {
    public int accessVirtual() { return getV(); }
    public int accessForward() { return I.super.accessDefault(); }
}

class C extends A implements I {
    public int v = 0;
    public int getV() { return v; }
}

The benchmark shows that c.v = x; c.accessDefault() is 3x slower than c.v = x; c.accessVirtual() or c.v = x; c.accessForward().

As noted in the comments on the StackOverflow thread, everything is inlined in all three benchmarks, so the difference is not due to inlining. We can observe that the assembly generated for the accessDefault case is less optimized than in the other cases; this is visible in the output of JMH's -prof perfasm feature, which includes the assembly of the hottest regions.

In fact, the assembly for the accessVirtual and accessForward cases is identical.

One answer on the SO thread suggests that the lack of CHA in the default method case prevents eliminating a type guard, which in turn prevents optimizations of the field write and read. Somebody with more experience in reading assembly than me could certainly verify that.

I did not do any further research to find out what kind of optimizations depend on CHA, or if it is really the lack of CHA that causes the code not to be optimized properly. For my convenience, let's say that's beyond the scope of this post. If you have any insights or references on this topic, please forward them to me!

It seems that missing CHA, and the optimizations it would otherwise enable, is the most likely source of the slowdown we observe when running the Scala compiler.

Summary

We found a few interesting limitations in the JVM optimizer:

  • Because CHA is not supported for default methods, a megamorphic callsite to a default method is never inlined even if the method is not overridden at all.
  • Interface method lookup slows down with the number of interfaces a class implements.
  • While monomorphic calls to default methods are inlined, the lack of CHA has negative effects on other optimizations.

References

Besides the post already mentioned, Aleksey Shipilёv's blog is an excellent resource on Java and JVM internals.

The talk "JVM Mechanics" by Doug Hawkins was also mentioned above (video, slides), it is a great overview on the JIT, inliner and optimizer. For an overview I can also recommend a longer blog post by René van Wijk and a shorter one by Mark Price focussing on the JIT compilers.

The JVM has an excessive number of flags for logging and tweaking:

  • Some flags are documented here
  • Many others are not documented; run java -XX:+PrintFlagsFinal to get a list of all flags

Some flags used in the examples of this post:

  • -XX:TieredStopAtLevel=1 to disable C2
  • -XX:+PrintCompilation logs methods being compiled (and deoptimized)
  • -XX:+PrintInlining logs callsites being inlined (or not), best used together with the above

JITWatch is a GUI tool that helps in understanding what the JIT is doing (I haven't tried it yet).

A thread on the hotspot-compiler-dev mailing list discusses why CHA is disabled for interfaces. It seems to describe the situation before default methods were a common thing.

A gist by Krystal Mok explaining many details of the -XX:+PrintCompilation output and other details of the JIT process.

The glossary on the HotSpot wiki contains some useful nomenclature.
