Using every tool in the box to squeeze out some more compiler performance.
JMC / JFR resources:
- http://docs.oracle.com/javacomponents/jmc-5-5/jfr-runtime-guide/about.htm
- http://hirt.se/blog/?p=364
- http://hirt.se/blog/?p=370
- http://www.slideshare.net/RogerBrinkley/java-mission-control-in-java-se?next_slideshow=1
Generate a profile with:

```
qscalac -J-Xmx2G -J-XX:+UnlockCommercialFeatures -J-XX:+UnlockDiagnosticVMOptions -J-XX:+DebugNonSafepoints -J-XX:+FlightRecorder -J-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,stackdepth=1024,loglevel=debug,settings=profile,dumponexitpath=/tmp/dumponexit.jfr -d /tmp $(find src/compiler/ -name '*.scala')
```
You can replace `profile` with `myprofile` to pick up settings from `$JDK_HOME/jre/lib/jfr/myprofile.jfc`. Export that file from the template editor in `jmc`. `jmc` is also used as the profile analyzer.
https://gist.github.com/rednaxelafx/1165804#file_notes.md
I managed to build OpenJDK on Mac OS Yosemite with:

```
cd /code
hg clone http://hg.openjdk.java.net/jdk9/jdk9
cd jdk9
bash get_source.sh
bash configure --with-debug-level=fastdebug --with-target-bits=64 --disable-zip-debug-info
make CONF=macosx-x86_64-normal-server-fastdebug
./build/macosx-x86_64-normal-server-fastdebug/jdk/bin/javac -version
```

This enables use of `PrintOptoAssembly` and friends to see the generated machine code.
Example:

```
% export FILES=$(find src/scalap -name '*.scala')
% export FILES_SPACE_SEP=$(for f in $FILES; do printf "$f "; done)
% export M=*TypeMaps\$TypeMap.mapOver
% export JAVA_HOME=/code/jdk9/build/macosx-x86_64-normal-server-fastdebug/jdk
% (for i in {1..20}; do echo $FILES_SPACE_SEP; done) | \
    build/quick/bin/scalac -J-XX:+TieredCompilation -J-Xmx4G \
    -J-XX:+UnlockDiagnosticVMOptions \
    -J-XX:+LogCompilation -J-XX:LogFile=hotspot.log \
    -J-XX:CompileCommand=option,$M,PrintNMethods \
    -J-XX:CompileCommand=option,$M,PrintOptoAssembly \
    -J-XX:+DebugNonSafepoints -d /tmp -Xresident
```
JITWatch seems useful for processing and visualizing the log output, but I haven't got it working all that well yet.
Oracle uses JMH to run javac and Java performance tests. We've had trouble in the past dealing with the large variance of Scala compiler run times, but we should keep trying.
One new idea is to use warmup code that exercises all the important parts of the compiler without performing a lot of work. My idea is to create a single source file that exercises each phase (e.g. has an inner class, a lambda, a value class, etc.) and use it as an initial round of warmup. We could then switch over to the real code under test.
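A single warmup source along these lines might look like the following sketch. All names here are invented; the only goal is to give each major phase (patmat, uncurry, specialization, erasure, extension methods, mixin, delambdafy) something small to chew on:

```scala
// Warmup.scala -- hypothetical warmup input touching many compiler phases.
class Warmup {
  class Inner { def f = 1 }          // inner class: explicitouter, lambdalift, flatten
  val fn: Int => Int = x => x + 1    // lambda: uncurry, delambdafy
  def pat(x: Any): Int = x match {   // pattern match: patmat
    case i: Int => i
    case _      => 0
  }
  def spec[@specialized T](t: T): T = t   // specialization
}
class Meters(val n: Int) extends AnyVal   // value class: extension methods, erasure
trait Mixin { def g = 2 }                 // concrete trait member: mixin composition
object Warmup extends Mixin
```

Compiling this repeatedly before the real workload should get the hot paths of each phase JIT-compiled without the run-to-run variance of a large codebase.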
Oracle plans to open source a large part of their OpenJDK performance tests. When this happens, we should see which patterns are applicable to scalac benchmarking. It will also be useful to run head-to-head tests so we can quantify the performance gap for comparable source files.
I understand that a useful pattern is to have a variety of focused tests: running the compiler only up to the typer phase, using sources that don't incur work in many phases, or directly testing subtyping or member lookup.
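For the typer-only variant, `scalac` already supports halting the pipeline after a named phase, which could form the basis of such a benchmark (a sketch; the source path is illustrative):

```
# Measure only parse/namer/typer by stopping the pipeline after typer:
scalac -Ystop-after:typer -d /tmp $(find src/compiler -name '*.scala')

# List the phase names usable as stopping points:
scalac -Xshow-phases
```

This avoids attributing backend or erasure costs to a benchmark that is meant to isolate type checking.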
I'm adding the benchmarks to: https://github.com/retronym/scala-jmh-suite
TODO:
- Write more useful benchmarks
- Figure out the best combination of warmup/iteration/jvm runs
- Start running the benchmarks on a dedicated box (or boxes)
- Run suite against each tag of Scala, persist and visualize the results
- Allow the suite to be run for a specific commit.
I gathered some interesting data from this tool in my previous round of optimization. Installation is a little tricky IIRC. I'll go through it again and document it here. Java Performance (Hunt) offers some help in using the tool.
TODO: what can we learn from hardware counters? How do `scalac` and `javac` compare for processor cache friendliness? We might be able to get similar data from OSPS.
As they show up from profiling...
- Hoist outer references in hot methods
  - If this is a win, we could do it automatically in the optimizer
- Avoid boxing in settings lookups like `isDeveloper`, regex in `isScala211`.
  - Cache settings in `PerRunSettings`
  - Add a `boolValue` method to `Setting` to avoid boxing through the generic `value`.
- `isPrimitiveValueClass` currently calls `List#contains`. This incurs the `BoxesRunTime.equals` penalty.
  - MAYBE: Create a `def contains[A <: AnyRef](as: List[A], a: A): Boolean` to avoid this cost
  - MAYBE: Check for `isSubclass(AnyValClass)` before looking through the list.
  - MAYBE: Use a different data structure
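A sketch of that hypothetical `contains` helper, using reference equality (`eq`) so that no `BoxesRunTime.equals` (or any `equals`) dispatch is involved. This changes the semantics from `==` to `eq`, which is only safe for interned values like symbols:

```scala
import scala.annotation.tailrec

// Hypothetical helper: membership test on a List of AnyRefs via `eq`.
// The `||` right operand is a tail position, so @tailrec applies.
@tailrec
def containsRef[A <: AnyRef](as: List[A], a: A): Boolean = as match {
  case head :: tail => (head eq a) || containsRef(tail, a)
  case _            => false
}
```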
- Replace more uses of regular maps with `AnyRefMap`.
  - `originalOwner`
  - `isNumericSubClass`
- Optimize `Symbol#isError`/`Symbol#isImplicit`. Can we avoid looking up the current phase's flag mask for flags that are visible in all phases?
- `Symbol#javaBinaryName` et al intern the full name of a class symbol in the name table. `GenBCode` caches the results, but Java generic signature generation in `Erasure` doesn't. Might be simpler/faster to directly create the `String`, rather than the `Name`, and to cache it transparently to callers of `javaBinaryName`.
- `TypeMap#mapOver` is a hot method.
  - Can we do anything to make it more JIT friendly? Processing of `TypeRef` is likely to be the hottest; try breaking this into a nested method so that the JIT could inline `mapConserve`.
  - Is the order of `instanceOf` checks optimal?
- The scheme used in `freshExistentialName` seems very wasteful; each new existential has a distinct name. The creation of these names for `AsSeenFrom#captureThis` accounts for 0.8% of compilation time in one of my runs!
  - Revisit https://github.com/scala/scala/commit/321338da04a6ca9bcc9b77ae663ed27f26a67d85 to understand how it fixed SI-1939.
  - Try using `TypeMap`-local existential ids, rather than `SymbolTable`-global ones.
- `Symbol#originalInfo` shows 0.9% of compilation time, all of which comes from `Symbol#isMonomorphicType`, called in turn by `NoArgsTypeRef#isHigherKinded`.
  - Try caching `isMonomorphicType`.
- Use a higher initial capacity for the `uniques` `WeakHashSet`. Resizing this accounted for 0.7% of run time, although that figure may be higher due to the resident nature of the compiler in the benchmark.
- `transformStats` contains `stats mapConserve f filter (_ != EmptyTree)`. The filter call is a significant source of memory pressure: its eager allocation of `List.newBuilder` is shown as 5% of allocations (by size) in the Thread Local Allocation Buffer (TLAB).
  - Fuse the `mapConserve` and `filter` in a custom function to avoid allocating a `ListBuffer`
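One possible shape for that fused operation, as a generic sketch rather than the real patch; in `transformStats` it would be instantiated with `keep = (_ != EmptyTree)`. A production version would also want the structure-sharing ("conserve") behaviour and stack safety of the real `mapConserve`, which this simple recursion omits:

```scala
// Sketch: map and filter a list in one pass, allocating only the result
// cons cells -- no intermediate ListBuffer or List.newBuilder.
def mapFilter[A](xs: List[A])(f: A => A)(keep: A => Boolean): List[A] =
  xs match {
    case head :: tail =>
      val fx   = f(head)
      val rest = mapFilter(tail)(f)(keep)
      if (keep(fx)) fx :: rest else rest
    case _ => Nil
  }
```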
- Reduce memory pressure
  - `sym.typeParams zip args collect { case (tp, arg) if tp.isSpecialized => arg }` in `specializedTypeVars`
  - `val (keys, vals) = env.toList.unzip` in `normalizeMember` (throwaway list creation)
  - Hot calls to `List#apply[A](A*)` (varargs boxing and construction of a list is expensive, 2% of TLAB memory pressure)
  - `lookupInScope` in `Context` need not create a throwaway list: adapt callers to use the iterator directly.
  - Optimize `TraversableOnce#toList` with `if (isEmpty) Nil else to[List]`
  - Try writing `List#mapConserve` without indirection through (and allocation of) `ListBuffer`
- We pay for a lot of `char[]` allocation in Java generic signature generation in `classSig`. We could probably cache this on the `ClassSymbol` so we don't pay the cost each time a signature refers to a class. A stepping stone might be to allocate the `StringBuilder` with a larger initial capacity.
- Work around SI-9257 in `UnCurry` by introducing a local val to capture the pattern binder. I think the same bug affects `SpecializedTypes#transform1`.
- `specializedTypeVars` is hot, and uses `SetLike#++`, which internally is allocation heavy for `ObjectRef` (via `foldLeft`). Find a faster approach.
  - MAYBE: modify `TraversableOnce#foldLeft` with a fast path for empty collections.
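The empty-collection fast path could be as simple as the sketch below (illustrative wrapper, not the real library change; note `isEmpty` is non-destructive even on iterators, since it only calls `hasNext`):

```scala
// Sketch: skip foldLeft's loop machinery entirely when there is nothing
// to fold, returning the zero value directly.
def foldLeftFast[A, B](xs: TraversableOnce[A], z: B)(op: (B, A) => B): B =
  if (xs.isEmpty) z        // fast path: no iterator loop, no op invocation
  else xs.foldLeft(z)(op)
```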
- Cache `specializableTypes.map(_.typeSymbol)` in `specializedOn`.
- `ContextReporter#hasErrors` should return false if `_errorBuffer eq null`, rather than creating an empty buffer and asking if it is empty!
- Address apparent overuse of `perRunCaches`. Why should we register maps created for each compilation unit (e.g. `private val liftedDefs = perRunCaches.newMap[Symbol, ListBuffer[Tree]]()` in `class Flattener`)? We end up with a long list of expired weak references in `caches` to clean up.
I've profiled this style of head-to-head test before, but it is a good one to come back to. Script from: https://twitter.com/codemonkeyism/status/591128993186897920
```
% time ~/scala/2.10.4/bin/scalac -d /tmp scala/*.scala
real    0m15.002s
user    0m34.264s
sys     0m0.920s
% time ~/scala/2.11.6/bin/scalac -d /tmp scala/*.scala
real    0m12.195s
user    0m26.461s
sys     0m0.789s
% time javac -d /tmp java/*.java
real    0m2.175s
user    0m6.087s
sys     0m0.436s
```
```
topic/trait-defaults /code/scala2/sandbox/perf cat gen.sh
#!/bin/bash
rm -rf scala
rm -rf java
mkdir scala
mkdir java
for i in `seq 1 10`;
do
  for j in `seq 1 10`;
  do
    for k in `seq 1 20`;
    do
      file="C${i}${j}${k}"
      file2="C${i}${j}"
      echo "class $file { }" > scala/$file.scala
      echo "class $file { }" > java/$file.java
    done
  done
done
```
UPDATE: Turns out the bottleneck there is in our bash script processing the humongous command line! We can get the source file list into the compiler via other means, such as via stdin to the resident compiler. That gives us performance in the same ballpark as Java: https://gist.github.com/retronym/3be212b836de6537cc1b.
```
% date; time (for i in {1..1}; do echo ../scala2/sandbox/perf/scala/*.scala; done) | ~/scala/2.11.6/bin/scalac -J-Xmx2G -d /tmp -Xresident 2>&1 | perl -pne 'print scalar(localtime()), " ";'
Thu Apr 23 20:53:40 AEST 2015
Thu Apr 23 20:53:40 2015
Thu Apr 23 20:53:45 2015 nsc>
Thu Apr 23 20:53:45 2015 nsc>
real    0m5.538s
user    0m19.122s
sys     0m0.766s
% date; time (for i in {1..10}; do echo ../scala2/sandbox/perf/scala/*.scala; done) | ~/scala/2.11.6/bin/scalac -J-Xmx2G -d /tmp -Xresident 2>&1 | perl -pne 'print scalar(localtime()), " ";'
Thu Apr 23 20:53:54 AEST 2015
Thu Apr 23 20:53:54 2015
Thu Apr 23 20:53:59 2015 nsc>
Thu Apr 23 20:54:01 2015 nsc>
Thu Apr 23 20:54:02 2015 nsc>
Thu Apr 23 20:54:03 2015 nsc>
Thu Apr 23 20:54:04 2015 nsc>
Thu Apr 23 20:54:05 2015 nsc>
Thu Apr 23 20:54:06 2015 nsc>
Thu Apr 23 20:54:07 2015 nsc>
Thu Apr 23 20:54:08 2015 nsc>
Thu Apr 23 20:54:09 2015 nsc>
Thu Apr 23 20:54:09 2015 nsc>
real    0m14.942s
user    0m46.607s
sys     0m2.546s
```
I looked into the cache friendliness of both scalac and javac, using Linux performance tools to look at counters for cache misses. I don't have the raw data handy, but I remember we had a similar rate of L0 and L1 cache misses to javac. My conclusion at the time was that cache misses are not the place to look for performance improvements.