
@retronym
Last active June 11, 2017 04:27
Compiler Profiling Notes

Tuning Scala[c]

Using every tool in the box to squeeze out some more compiler performance.

Tooling

Java Flight Recorder

JMC / JFR resources:

Generate a profile with:

qscalac -J-Xmx2G -J-XX:+UnlockCommercialFeatures -J-XX:+UnlockDiagnosticVMOptions -J-XX:+DebugNonSafepoints -J-XX:+FlightRecorder -J-XX:FlightRecorderOptions=defaultrecording=true,dumponexit=true,stackdepth=1024,loglevel=debug,settings=profile,dumponexitpath=/tmp/dumponexit.jfr -d /tmp $(find src/compiler/ -name '*.scala')

You can replace profile with myprofile to pick up settings from $JDK_HOME/jre/lib/jfr/myprofile.jfc; export that file from the template editor in jmc. jmc also serves as the profile analyzer.

-XX:+PrintCompilation

https://gist.github.com/rednaxelafx/1165804#file_notes.md

I managed to build OpenJDK on Mac OS Yosemite with:

cd /code
hg clone http://hg.openjdk.java.net/jdk9/jdk9
cd jdk9
bash get_source.sh
bash configure --with-debug-level=fastdebug --with-target-bits=64 --disable-zip-debug-info
make CONF=macosx-x86_64-normal-server-fastdebug
./build/macosx-x86_64-normal-server-fastdebug/jdk/bin/javac -version

This enables use of PrintOptoAssembly and friends to see the generated machine code.

Example:

% export FILES=$(find src/scalap -name '*.scala')
% export FILES_SPACE_SEP=$(for f in $FILES; do printf "$f "; done)
% export M='*TypeMaps$TypeMap.mapOver'
% export JAVA_HOME=/code/jdk9/build/macosx-x86_64-normal-server-fastdebug/jdk

% (for i in {1..20}; do echo $FILES_SPACE_SEP; done) |                       \
  build/quick/bin/scalac -J-XX:+TieredCompilation -J-Xmx4G                   \
                       -J-XX:+UnlockDiagnosticVMOptions                      \
                       -J-XX:+LogCompilation -J-XX:LogFile=hotspot.log       \
                       -J-XX:CompileCommand=option,$M,PrintNMethods          \
                       -J-XX:CompileCommand=option,$M,PrintOptoAssembly      \
                       -J-XX:+DebugNonSafepoints -d /tmp -Xresident

JITWatch seems useful for processing and visualizing the log output, but I haven't got it working all that well yet.

JMH

Oracle uses JMH to run javac and java performance tests. We've had trouble in the past dealing with the large variance in Scala compiler runtimes, but we should keep trying.

One new idea is to have the warmup code exercise all the important parts of the compiler without performing a lot of work. My idea is to create a single source file that exercises each phase (e.g. has an inner class, a lambda, a value class, etc.) and use this as an initial round of warmup. We could then switch over to the real code under test.
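A minimal sketch of what such a warmup unit could look like; all the names here (WarmupProbe, Meters, and so on) are hypothetical, not taken from any existing suite:

```scala
// A hypothetical warmup compilation unit: one file that touches many phases.
object WarmupProbe {
  // Inner class: exercises explicitouter / lambdalift.
  class Outer { class Inner { def id: Int = 42 } }

  // Value class: exercises extmethods / erasure / posterasure.
  class Meters(val value: Double) extends AnyVal {
    def +(that: Meters): Meters = new Meters(value + that.value)
  }

  // Lambda: exercises uncurry / delambdafy.
  val double: Int => Int = x => x * 2

  // Pattern match: exercises patmat.
  def head(xs: List[Int]): Int = xs match {
    case h :: _ => h
    case Nil    => 0
  }

  def run(): Int = {
    val o = new Outer
    val i = new o.Inner
    i.id + double(head(List(1, 2, 3))) + (new Meters(1.0) + new Meters(2.0)).value.toInt
  }
}
```

Compiling this file once per warmup round should touch each of the named phases without generating much work for any of them.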

Oracle plans to open-source a large part of their OpenJDK performance tests. When this happens, we should see which patterns are applicable to scalac benchmarking. It will also be useful to run head-to-head tests so we can quantify the performance gap for comparable source files.

I understand that a useful pattern is to have a variety of focussed tests: running the compiler only up to the typer phase; using sources that don't incur work in many phases; or directly testing subtyping or member lookup.

I'm adding the benchmarks to: https://github.com/retronym/scala-jmh-suite

TODO:

  • Write more useful benchmarks
  • Figure out the best combination of warmup/iteration/jvm runs
  • Start running the benchmarks on a dedicated box (or boxes)
  • Run suite against each tag of Scala, persist and visualize the results
  • Allow the suite to be run for a specific commit.

Oracle Solaris Performance Studio (OSPS)

I gathered some interesting data from this tool in my previous round of optimization. Installation is a little tricky IIRC. I'll go through it again and document it here. Java Performance (Hunt) offers some help in using the tool.

Intel VTune

TODO: what can we learn from hardware counters? How do scalac and javac compare for processor cache friendliness? We might be able to get similar data from OSPS.

Optimization ideas

As they show up from profiling...

  • hoist outer references in hot methods
  • Avoid boxing in settings lookups like isDeveloper, regex in isScala211.
    • caching settings in PerRunSettings
    • add a boolValue method to Setting to avoid boxing through the generic value.
  • isPrimitiveValueClass currently calls List#contains. This incurs the BoxesRuntime.equals penalty.
    • MAYBE: Create a def contains[A <: AnyRef](as: List[A], a: A): Boolean to avoid this cost
    • MAYBE: Check for isSubclass(AnyValClass) before looking through the list.
    • MAYBE: Use a different data structure
  • Replace more uses of regular maps with AnyRefMap.
    • originalOwner
    • isNumericSubClass
  • Optimize Symbol#isError / Symbol#isImplicit. Can we avoid looking up the current phase's flag mask for flags that are visible in all phases?
  • Symbol#javaBinaryName et al intern the full name of a class symbol in the name table. GenBCode caches the results, but Java generic signature generation in Erasure doesn't. Might be simpler/faster to directly create the String, rather than the Name, and to cache it transparently to callers of javaBinaryName.
  • TypeMap#mapOver is a hot method.
    • Can we do anything to make it more JIT friendly? Processing of TypeRef is likely to be the hottest, try breaking this into a nested method so that JIT could inline mapConserve.
    • Is the order of instanceOf checks optimal?
  • The scheme used in freshExistentialName seems very wasteful; each new existential has a distinct name. The creation of these names for AsSeenFrom#captureThis accounts for 0.8% of compilation time in one of my runs!
    • Revisit https://github.com/scala/scala/commit/321338da04a6ca9bcc9b77ae663ed27f26a67d85 to understand how it fixed SI-1939.
    • Try using TypeMap local existential-ids, rather than SymbolTable global ones.
    • Symbol#originalInfo shows 0.9% of compilation time, all of which comes from Symbol#isMonomorphicType, called in turn by NoArgsTypeRef#isHigherKinded.
      • Try caching isMonomorphicType.
    • Use a higher initial capacity for the uniques WeakHashSet. Resizing this accounted for 0.7% of run time, although that figure may be higher due to the resident nature of the compiler in the benchmark.
  • transformStats contains stats mapConserve f filter (_ != EmptyTree). The filter call is a significant source of memory pressure: its eager allocation of List.newBuilder shows up as 5% of allocations (by size) in the Thread Local Allocation Buffer (TLAB).
    • Fuse the mapConserve and filter in a custom function to avoid allocating a ListBuffer
  • Reduce memory pressure
    • sym.typeParams zip args collect { case (tp, arg) if tp.isSpecialized => arg } in specializedTypeVars
    • val (keys, vals) = env.toList.unzip in normalizeMember (throwaway list creation)
    • Hot calls to List#apply[A](A*) (varargs boxing and construction of a list is expensive, 2% of TLAB memory pressure)
    • lookupInScope in Context need not create a throwaway list: adapt callers to use the iterator directly.
    • Optimize TraversableOnce#toList with if (isEmpty) Nil else to[List]
    • Try writing List#mapConserve without indirection through (and allocation of) ListBuffer
  • We pay for a lot of char[] allocation in Java generic signature generation in classSig. We could probably cache this on the ClassSymbol so we don't pay the cost each time a signature refers to a class. A stepping stone might be to allocate the StringBuilder with a larger initial capacity.
  • Work around SI-9257 in UnCurry by introducing a local val to capture the pattern binder. I think the same bug affects SpecializedTypes#transform1.
  • specializedTypeVars is hot, and uses SetLike#++ which is internally allocation-heavy, creating an ObjectRef (via foldLeft). Find a faster approach.
    • MAYBE: modify TraversableOnce#foldLeft with a fast path for empty collections.
  • Cache specializableTypes.map(_.typeSymbol) in specializedOn.
  • ContextReporter#hasErrors should return false if _errorBuffer eq null, rather than creating an empty buffer and asking if it is empty!
  • Address apparent overuse of perRunCaches. Why should we register maps created for each compilation unit (e.g. private val liftedDefs = perRunCaches.newMap[Symbol, ListBuffer[Tree]]() in class Flattener)? We end up with a long list of expired weak references in caches to clean up.
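For the isPrimitiveValueClass item above, a reference-equality membership test could look like the sketch below. containsRef is a name I'm inventing here, not an existing compiler method; it is only valid when the elements are canonical references, as symbols in the compiler are.

```scala
import scala.annotation.tailrec

object ListUtil {
  // Membership test using reference equality (eq): skips the
  // BoxesRunTime.equals dispatch of List#contains and allocates nothing.
  @tailrec
  def containsRef[A <: AnyRef](as: List[A], a: A): Boolean =
    if (as.isEmpty) false
    else if (as.head eq a) true
    else containsRef(as.tail, a)
}
```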
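The transformStats item suggests fusing mapConserve with the filter. A rough sketch of such a fusion (my own illustration, not the actual compiler code): it makes a single pass, reuses the original list when nothing changed, and avoids filter's second traversal and its eagerly allocated builder.

```scala
import scala.collection.mutable.ListBuffer

object TreeOps {
  // Fused equivalent of `xs mapConserve f filter keep`:
  // one traversal; returns `xs` itself when no element was mapped or dropped.
  def mapFilterConserve[A <: AnyRef](xs: List[A])(f: A => A)(keep: A => Boolean): List[A] = {
    var changed = false
    val buf = new ListBuffer[A]
    var rest = xs
    while (rest.nonEmpty) {
      val x  = rest.head
      val fx = f(x)
      if (fx ne x) changed = true
      if (keep(fx)) buf += fx else changed = true
      rest = rest.tail
    }
    if (changed) buf.toList else xs
  }
}
```

This still uses one buffer in the changed case, but the common all-conserved case allocates only the buffer and returns the original list, preserving mapConserve's structure sharing.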
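The hasErrors item can be sketched like this; the names mirror the idea, not the actual ContextReporter implementation:

```scala
import scala.collection.mutable.ListBuffer

object ReporterSketch {
  private var _errorBuffer: ListBuffer[String] = null

  // Lazily created, as in the idea above.
  private def errorBuffer: ListBuffer[String] = {
    if (_errorBuffer == null) _errorBuffer = ListBuffer.empty
    _errorBuffer
  }

  def error(msg: String): Unit = errorBuffer += msg

  // Short-circuit on a never-created buffer instead of allocating
  // an empty one just to discover it is empty.
  def hasErrors: Boolean = (_errorBuffer ne null) && _errorBuffer.nonEmpty
}
```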

Comparative Benchmarks

I've profiled this style of head-to-head test before, but it is a good one to come back to. Script from: https://twitter.com/codemonkeyism/status/591128993186897920

% time ~/scala/2.10.4/bin/scalac -d /tmp scala/*.scala

real	0m15.002s
user	0m34.264s
sys	0m0.920s
% time ~/scala/2.11.6/bin/scalac -d /tmp scala/*.scala

real	0m12.195s
user	0m26.461s
sys	0m0.789s

% time javac -d /tmp java/*.java

real	0m2.175s
user	0m6.087s
sys	0m0.436s
topic/trait-defaults /code/scala2/sandbox/perf cat gen.sh
#!/bin/bash

rm -rf scala
rm -rf java
mkdir scala
mkdir java
for i in `seq 1 10`;
do
   for j in `seq 1 10`;
   do
      for k in `seq 1 20`;
      do
         file="C${i}${j}${k}"
         file2="C${i}${j}"
         echo "class $file { }" > scala/$file.scala
         echo "class $file { }" > java/$file.java
      done
   done
done

UPDATE Turns out the bottleneck there is our bash script processing the humongous command line! We can get the source file list to the compiler via other means, such as via stdin to the resident compiler. That gives us performance in the same ballpark as Java: https://gist.github.com/retronym/3be212b836de6537cc1b.

% date; time (for i in {1..1}; do echo ../scala2/sandbox/perf/scala/*.scala; done) | ~/scala/2.11.6/bin/scalac -J-Xmx2G -d /tmp -Xresident 2>&1 | perl -pne 'print scalar(localtime()), " ";'
Thu Apr 23 20:53:40 AEST 2015
Thu Apr 23 20:53:40 2015
Thu Apr 23 20:53:45 2015 nsc>
Thu Apr 23 20:53:45 2015 nsc>
real	0m5.538s
user	0m19.122s
sys	0m0.766s

% date; time (for i in {1..10}; do echo ../scala2/sandbox/perf/scala/*.scala; done) | ~/scala/2.11.6/bin/scalac -J-Xmx2G -d /tmp -Xresident 2>&1 | perl -pne 'print scalar(localtime()), " ";'
Thu Apr 23 20:53:54 AEST 2015
Thu Apr 23 20:53:54 2015
Thu Apr 23 20:53:59 2015 nsc>
Thu Apr 23 20:54:01 2015 nsc>
Thu Apr 23 20:54:02 2015 nsc>
Thu Apr 23 20:54:03 2015 nsc>
Thu Apr 23 20:54:04 2015 nsc>
Thu Apr 23 20:54:05 2015 nsc>
Thu Apr 23 20:54:06 2015 nsc>
Thu Apr 23 20:54:07 2015 nsc>
Thu Apr 23 20:54:08 2015 nsc>
Thu Apr 23 20:54:09 2015 nsc>
Thu Apr 23 20:54:09 2015 nsc>
real	0m14.942s
user	0m46.607s
sys	0m2.546s
@gkossakowski

I looked into cache friendliness of both scalac and javac. I used Linux performance tools to look at counters for cache misses. I don't have the raw data handy, but I remember we had a similar rate of L0 and L1 cache misses to javac. My conclusion at the time was that cache misses are not the place to look for performance improvements.

@gkossakowski

On the other hand, it would be interesting to get data on how many branch misses we get compared to javac. This would give us an idea of whether we break CPU instruction pipelining more than javac does (I find this plausible). See http://www.brendangregg.com/perf.html#CPUstatistics

@retronym (Author)

My initial investigation of collecting perf data for scalac and javac shows similar Instructions per Cycle (IPC), branch frequency, and branch-miss frequency. We could get more scientific by compiling semantically identical source code on both sides, but I don't see a glaring difference here so far.

============================================================
Scalac

 Performance counter stats for 'sbt compile clean compile clean compile clean compile clean compile':

     783751.529160      task-clock (msec)         #    4.130 CPUs utilized
           103,017      context-switches          #    0.131 K/sec
             1,455      cpu-migrations            #    0.002 K/sec
           203,648      page-faults               #    0.260 K/sec
   922,163,230,415      cycles                    #    1.177 GHz                     [50.02%]
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   467,619,440,966      instructions              #    0.51  insns per cycle         [75.01%]
    86,850,263,721      branches                  #  110.814 M/sec                   [74.98%]
     6,056,320,354      branch-misses             #    6.97% of all branches         [75.00%]

     189.757272569 seconds time elapsed


real    3m9.770s
user    12m56.348s
sys 0m8.269s

============================================================
Javac

[success] Total time: 9 s, completed Apr 17, 2015 2:01:34 AM

 Performance counter stats for 'sbt clean compile clean compile clean compile clean compile clean compile clean compile clean compile clean compile clean compile':

     419575.622624      task-clock (msec)         #    3.907 CPUs utilized
            96,935      context-switches          #    0.231 K/sec
             3,203      cpu-migrations            #    0.008 K/sec
           611,765      page-faults               #    0.001 M/sec
   493,620,046,324      cycles                    #    1.176 GHz                     [50.03%]
   <not supported>      stalled-cycles-frontend
   <not supported>      stalled-cycles-backend
   259,725,513,291      instructions              #    0.53  insns per cycle         [75.03%]
    48,592,680,422      branches                  #  115.814 M/sec                   [74.98%]
     3,679,105,438      branch-misses             #    7.57% of all branches         [74.99%]

     107.379859122 seconds time elapsed

@JohnReedLOL

Excuse me. I want to profile the overall compile time of a file with/without explicit types and with/without implicit parameters.

I would like to be able to do time scalac foo.scala, but I have never done something like this before and the instructions look kind of daunting.
