Using every tool in the box to squeeze out some more compiler performance.
JMC / JFR resources:
- http://docs.oracle.com/javacomponents/jmc-5-5/jfr-runtime-guide/about.htm
- http://hirt.se/blog/?p=364
- http://hirt.se/blog/?p=370
- http://www.slideshare.net/RogerBrinkley/java-mission-control-in-java-se?next_slideshow=1
Generate a profile with:

```
qscalac -J-Xmx2G -J-XX:+UnlockCommercialFeatures -J-XX:+UnlockDiagnosticVMOptions -J-XX:+DebugNonSafepoints -J-XX:+FlightRecorder -J-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,stackdepth=1024,loglevel=debug,settings=profile,dumponexitpath=/tmp/dumponexit.jfr -d /tmp $(find src/compiler/ -name '*.scala')
```
You can replace `profile` with `myprofile` to pick up settings from `$JDK_HOME/jre/lib/jfr/myprofile.jfc`. Export that file from the template editor in `jmc`. `jmc` is also used as the profile analyzer.
https://gist.github.com/rednaxelafx/1165804#file_notes.md
I managed to build OpenJDK on Mac OS Yosemite with:

```
cd /code
hg clone http://hg.openjdk.java.net/jdk9/jdk9
cd jdk9
bash get_source.sh
bash configure --with-debug-level=fastdebug --with-target-bits=64 --disable-zip-debug-info
make CONF=macosx-x86_64-normal-server-fastdebug
./build/macosx-x86_64-normal-server-fastdebug/jdk/bin/javac -version
```

This enables use of `PrintOptoAssembly` and friends to see the generated machine code.
Example:

```
% export FILES=$(find src/scalap -name '*.scala')
% export FILES_SPACE_SEP=$(for f in $FILES; do printf "$f "; done)
% export M=*TypeMaps\$TypeMap.mapOver
% export JAVA_HOME=/code/jdk9/build/macosx-x86_64-normal-server-fastdebug/jdk
% (for i in {1..20}; do echo $FILES_SPACE_SEP; done) | \
    build/quick/bin/scalac -J-XX:+TieredCompilation -J-Xmx4G \
    -J-XX:+UnlockDiagnosticVMOptions \
    -J-XX:+LogCompilation -J-XX:LogFile=hotspot.log \
    -J-XX:CompileCommand=option,$M,PrintNMethods \
    -J-XX:CompileCommand=option,$M,PrintOptoAssembly \
    -J-XX:+DebugNonSafepoints -d /tmp -Xresident
```
JITWatch seems useful for processing and visualizing the log output, but I haven't got it working all that well yet.
Oracle uses JMH to run javac and Java performance tests. We've had trouble in the past dealing with the large variance of Scala compiler run times, but we should keep trying.
One new idea is to use warmup code that exercises all the important parts of the compiler without performing a lot of work. My idea is to create a single source file that exercises each phase (e.g. has an inner class, a lambda, a value class, etc.) and use it as an initial round of warmup. We could then switch over to the real code under test.
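A single warmup source along these lines might look like the following sketch. All names here are invented; the only goal is to give each major phase (patmat, uncurry, specialization, erasure, extension methods, mixin, delambdafy) something small to chew on:

```scala
// Warmup.scala -- hypothetical warmup input touching many compiler phases.
class Warmup {
  class Inner { def f = 1 }          // inner class: explicitouter, lambdalift, flatten
  val fn: Int => Int = x => x + 1    // lambda: uncurry, delambdafy
  def pat(x: Any): Int = x match {   // pattern match: patmat
    case i: Int => i
    case _      => 0
  }
  def spec[@specialized T](t: T): T = t   // specialization
}
class Meters(val n: Int) extends AnyVal   // value class: extension methods, erasure
trait Mixin { def g = 2 }                 // concrete trait member: mixin composition
object Warmup extends Mixin
```

Compiling this repeatedly before the real workload should get the hot paths of each phase JIT-compiled without the run-to-run variance of a large codebase.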
Oracle plans to open source a large part of their OpenJDK performance tests. When this happens, we should see which patterns are applicable to scalac benchmarking. It will also be useful to run head-to-head tests so we can quantify the performance gap for comparable source files.
I understand that a useful pattern is to have a variety of focused tests: running the compiler only up to the typer phase, using sources that don't incur work in many phases, or directly testing subtyping or member lookup.
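For the typer-only variant, `scalac` already supports halting the pipeline after a named phase, which could form the basis of such a benchmark (a sketch; the source path is illustrative):

```
# Measure only parse/namer/typer by stopping the pipeline after typer:
scalac -Ystop-after:typer -d /tmp $(find src/compiler -name '*.scala')

# List the phase names usable as stopping points:
scalac -Xshow-phases
```

This avoids attributing backend or erasure costs to a benchmark that is meant to isolate type checking.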
I'm adding the benchmarks to: https://github.com/retronym/scala-jmh-suite
TODO:
- Write more useful benchmarks
- Figure out the best combination of warmup/iteration/jvm runs
- Start running the benchmarks on a dedicated box (or boxes)
- Run suite against each tag of Scala, persist and visualize the results
- Allow the suite to be run for a specific commit.
I gathered some interesting data from this tool in my previous round of optimization. Installation is a little tricky IIRC. I'll go through it again and document it here. Java Performance (Hunt) offers some help in using the tool.
TODO: what can we learn from hardware counters? How do `scalac` and `javac` compare for processor cache friendliness? We might be able to get similar data from OSPS.
As they show up from profiling...
- Hoist outer references in hot methods
  - If this is a win, we could do it automatically in the optimizer
- Avoid boxing in settings lookups like `isDeveloper`, regex in `isScala211`.
  - Cache settings in `PerRunSettings`
  - Add a `boolValue` method to `Setting` to avoid boxing through the generic `value`.
- `isPrimitiveValueClass` currently calls `List#contains`. This incurs the `BoxesRunTime.equals` penalty.
  - MAYBE: Create a `def contains[A <: AnyRef](as: List[A], a: A): Boolean` to avoid this cost
  - MAYBE: Check for `isSubclass(AnyValClass)` before looking through the list.
  - MAYBE: Use a different data structure
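A sketch of that hypothetical `contains` helper, using reference equality (`eq`) so that no `BoxesRunTime.equals` (or any `equals`) dispatch is involved. This changes the semantics from `==` to `eq`, which is only safe for interned values like symbols:

```scala
import scala.annotation.tailrec

// Hypothetical helper: membership test on a List of AnyRefs via `eq`.
// The `||` right operand is a tail position, so @tailrec applies.
@tailrec
def containsRef[A <: AnyRef](as: List[A], a: A): Boolean = as match {
  case head :: tail => (head eq a) || containsRef(tail, a)
  case _            => false
}
```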
- Replace more uses of regular maps with `AnyRefMap`.
  - `originalOwner`
  - `isNumericSubClass`
- Optimize `Symbol#isError`/`Symbol#isImplicit`. Can we avoid looking up the current phase's flag mask for flags that are visible in all phases?
- `Symbol#javaBinaryName` et al intern the full name of a class symbol in the name table. `GenBCode` caches the results, but Java generic signature generation in `Erasure` doesn't. Might be simpler/faster to directly create the `String`, rather than the `Name`, and to cache it transparently to callers of `javaBinaryName`.
- `TypeMap#mapOver` is a hot method.
  - Can we do anything to make it more JIT friendly? Processing of `TypeRef` is likely to be the hottest; try breaking this into a nested method so that the JIT could inline `mapConserve`.
  - Is the order of `instanceOf` checks optimal?
- The scheme used in `freshExistentialName` seems very wasteful; each new existential has a distinct name. The creation of these names for `AsSeenFrom#captureThis` accounts for 0.8% of compilation time in one of my runs!
  - Revisit https://github.com/scala/scala/commit/321338da04a6ca9bcc9b77ae663ed27f26a67d85 to understand how it fixed SI-1939.
  - Try using `TypeMap`-local existential ids, rather than `SymbolTable`-global ones.
- `Symbol#originalInfo` shows 0.9% of compilation time, all of which comes from `Symbol#isMonomorphicType`, called in turn by `NoArgsTypeRef#isHigherKinded`.
  - Try caching `isMonomorphicType`.
- Use a higher initial capacity for the `uniques` `WeakHashSet`. Resizing this accounted for 0.7% of run time, although that figure may be higher due to the resident nature of the compiler in the benchmark.
- `transformStats` contains `stats mapConserve f filter (_ != EmptyTree)`. The filter call is a significant source of memory pressure: its eager allocation of `List.newBuilder` is shown as 5% of allocations (by size) in the Thread Local Allocation Buffer (TLAB).
  - Fuse the `mapConserve` and `filter` in a custom function to avoid allocating a `ListBuffer`
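One possible shape for that fused operation, as a generic sketch rather than the real patch; in `transformStats` it would be instantiated with `keep = (_ != EmptyTree)`. A production version would also want the structure-sharing ("conserve") behaviour and stack safety of the real `mapConserve`, which this simple recursion omits:

```scala
// Sketch: map and filter a list in one pass, allocating only the result
// cons cells -- no intermediate ListBuffer or List.newBuilder.
def mapFilter[A](xs: List[A])(f: A => A)(keep: A => Boolean): List[A] =
  xs match {
    case head :: tail =>
      val fx   = f(head)
      val rest = mapFilter(tail)(f)(keep)
      if (keep(fx)) fx :: rest else rest
    case _ => Nil
  }
```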
- Reduce memory pressure
  - `sym.typeParams zip args collect { case (tp, arg) if tp.isSpecialized => arg }` in `specializedTypeVars`
  - `val (keys, vals) = env.toList.unzip` in `normalizeMember` (throwaway list creation)
  - Hot calls to `List#apply[A](A*)` (varargs boxing and construction of a list is expensive, 2% of TLAB memory pressure)
  - `lookupInScope` in `Context` need not create a throwaway list: adapt callers to use the iterator directly.
  - Optimize `TraversableOnce#toList` with `if (isEmpty) Nil else to[List]`
  - Try writing `List#mapConserve` without indirection through (and allocation of) `ListBuffer`
- We pay for a lot of `char[]` allocation in Java generic signature generation in `classSig`. We could probably cache this on the `ClassSymbol` so we don't pay the cost each time a signature refers to a class. A stepping stone might be to allocate the `StringBuilder` with a larger initial capacity.
- Work around SI-9257 in `UnCurry` by introducing a local val to capture the pattern binder. I think the same bug affects `SpecializedTypes#transform1`.
- `specializedTypeVars` is hot, and uses `SetLike#++`, which internally is allocation heavy for `ObjectRef` (via `foldLeft`). Find a faster approach.
  - MAYBE: modify `TraversableOnce#foldLeft` with a fast path for empty collections.
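The empty-collection fast path could be as simple as the sketch below (illustrative wrapper, not the real library change; note `isEmpty` is non-destructive even on iterators, since it only calls `hasNext`):

```scala
// Sketch: skip foldLeft's loop machinery entirely when there is nothing
// to fold, returning the zero value directly.
def foldLeftFast[A, B](xs: TraversableOnce[A], z: B)(op: (B, A) => B): B =
  if (xs.isEmpty) z        // fast path: no iterator loop, no op invocation
  else xs.foldLeft(z)(op)
```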
- Cache `specializableTypes.map(_.typeSymbol)` in `specializedOn`.
- `ContextReporter#hasErrors` should return false if `_errorBuffer eq null`, rather than creating an empty buffer and asking if it is empty!
- Address apparent overuse of `perRunCaches`. Why should we register maps created for each compilation unit (e.g. `private val liftedDefs = perRunCaches.newMap[Symbol, ListBuffer[Tree]]()` in `class Flattener`)? We end up with a long list of expired weak references in `caches` to clean up.
I've profiled this style of head-to-head test before, but it is a good one to come back to. Script from: https://twitter.com/codemonkeyism/status/591128993186897920
```
% time ~/scala/2.10.4/bin/scalac -d /tmp scala/*.scala
real    0m15.002s
user    0m34.264s
sys     0m0.920s
% time ~/scala/2.11.6/bin/scalac -d /tmp scala/*.scala
real    0m12.195s
user    0m26.461s
sys     0m0.789s
% time javac -d /tmp java/*.java
real    0m2.175s
user    0m6.087s
sys     0m0.436s
```
```
topic/trait-defaults /code/scala2/sandbox/perf cat gen.sh
#!/bin/bash
rm -rf scala
rm -rf java
mkdir scala
mkdir java
for i in `seq 1 10`;
do
  for j in `seq 1 10`;
  do
    for k in `seq 1 20`;
    do
      file="C${i}${j}${k}"
      file2="C${i}${j}"
      echo "class $file { }" > scala/$file.scala
      echo "class $file { }" > java/$file.java
    done
  done
done
```
UPDATE: Turns out the bottleneck there is in our bash script processing the humongous command line! We can get the source file list into the compiler via other means, such as via stdin to the resident compiler. That gives us performance in the same ballpark as Java: https://gist.github.com/retronym/3be212b836de6537cc1b.
```
% date; time (for i in {1..1}; do echo ../scala2/sandbox/perf/scala/*.scala; done) | ~/scala/2.11.6/bin/scalac -J-Xmx2G -d /tmp -Xresident 2>&1 | perl -pne 'print scalar(localtime()), " ";'
Thu Apr 23 20:53:40 AEST 2015
Thu Apr 23 20:53:40 2015
Thu Apr 23 20:53:45 2015 nsc>
Thu Apr 23 20:53:45 2015 nsc>
real    0m5.538s
user    0m19.122s
sys     0m0.766s
% date; time (for i in {1..10}; do echo ../scala2/sandbox/perf/scala/*.scala; done) | ~/scala/2.11.6/bin/scalac -J-Xmx2G -d /tmp -Xresident 2>&1 | perl -pne 'print scalar(localtime()), " ";'
Thu Apr 23 20:53:54 AEST 2015
Thu Apr 23 20:53:54 2015
Thu Apr 23 20:53:59 2015 nsc>
Thu Apr 23 20:54:01 2015 nsc>
Thu Apr 23 20:54:02 2015 nsc>
Thu Apr 23 20:54:03 2015 nsc>
Thu Apr 23 20:54:04 2015 nsc>
Thu Apr 23 20:54:05 2015 nsc>
Thu Apr 23 20:54:06 2015 nsc>
Thu Apr 23 20:54:07 2015 nsc>
Thu Apr 23 20:54:08 2015 nsc>
Thu Apr 23 20:54:09 2015 nsc>
Thu Apr 23 20:54:09 2015 nsc>
real    0m14.942s
user    0m46.607s
sys     0m2.546s
```
I looked into the cache friendliness of both scalac and javac, using Linux performance tools to look at counters for cache misses. I don't have the raw data handy, but I remember we had a similar rate of L0 and L1 cache misses to javac. My conclusion at the time was that cache misses are not the place to look for performance improvements.