Qiuzhuang/spark_flame_graphs.md

## spark_flame_graphs.md

      
    Raw
  

              spark_flame_graphs.md
            
          
    Generating Flame Graphs for Apache Spark

Flame graphs are a nifty debugging tool to determine where CPU time is being spent.  Using the Java Flight recorder, you can do this for Java processes without adding significant runtime overhead.
When are flame graphs useful?

Shivaram Venkataraman and I have found these flame recordings to be useful for diagnosing coarse-grained performance problems. We started using them at the suggestion of Josh Rosen, who quickly made one for the Spark scheduler when we were talking to him about why the scheduler caps out at a throughput of a few thousand tasks per second.  Josh generated a graph similar to the one below, which illustrates that a significant amount of time is spent in serialization (if you click in the top right hand corner and search for "serialize", you can see that 78.6% of the sampled CPU time was spent in serialization).  We used this insight to speed up the scheduler by switching to use Kryo instead of the Java serializer.

Installing Oracle JDK

The flight recorder is only available in the Hotspot JVM, so if you're using OpenJDK, you'll need to replace it with the Oracle JDK (version 7u40 or higher).  These instructions describe how to replace Open JDK on Red Hat.
To see which JDK you're using, run:
java -version

If you have OpenJDK, first remove it:
yum -y remove java*

Now if you try java -version again, it should say "No such file or directory".
Next, download the Oracle JDK and use the Red Hat package manager to install it (the extra wget flags are necessary to avoid an error saying you need to accept Oracle's license):
wget --no-check-certificate --no-cookies --header "Cookie: oraclelicense=accept-securebackup-cookie" http://download.oracle.com/otn-pub/java/jdk/7u60-b19/jdk-7u60-linux-x64.rpm
rpm -ivh jdk-7u60-linux-x64.rpm

You may also need to fix JAVA_HOME, which Spark uses, to point to the correct location.  If you're using EC2 instances launched by the spark-ec2 script, the easiest way to do this is to create a symlink from the expected JAVA_HOME to the new installation:
ln -s /usr/java/jdk1.7.0_60/ /usr/lib/jvm/java-1.7.0

You may also need to add $JAVA_HOME/bin to your path so that Java commands like jps work correctly (export PATH=$PATH:$JAVA_HOME/bin).
If you're running on a cluster of machines, you'll need to update Java on all of the machines.  The easiest way to do this, if you're using the spark-ec2 scripts, is to put the 4 commands above into a file (e.g., upgrade_java.sh), copy that file to all of the machines on the cluster:
/root/spark-ec2/copy-dir /root/upgrade_java.sh

and then use the helpful slaves.sh script to run the script on all of the workers:
/root/spark/sbin/slaves.sh upgrade_java.sh

Next, you'll need to re-compile Spark, copy the updated Spark to all of the machines:
/root/spark-ec2/copy-dir --delete /root/spark

and re-start Spark:
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh

Enabling the flight recorder

To enable the flight recorder, you need to add -XX:+UnlockCommercialFeatures -XX:+FlightRecorder when you start the JVM.  For Spark, you can do this by adding the following to spark-defaults.conf:
# Enable flight recording on the driver.
spark.driver.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

# Enable flight recording on the executors.
spark.executor.extraJavaOptions -XX:+UnlockCommercialFeatures -XX:+FlightRecorder

You can do this after you've started the master, but you need to do it before starting your applicaton.
Collecting a flight recording

To collect a flight recording from your running Spark job, you'll first need to figure out the process ID.  If you want to collect a flight recording from an executor, for example, run jps on the worker machine:
$ jps
18977 Worker
19163 CoarseGrainedExecutorBackend
5465 DataNode
19224 Jps

You want the ID of CoarseGrainedExecutorBackend (this will have a different name if you're using YARN or Mesos -- look for the process name that includes ExecutorBackend).  Then use jcmd to create a flight recording (the example command below creates a 10 second recording).  Note that this runs asynchronously, so the command will return immediately, but the file won't be available for 10 seconds.
$ jcmd 19163 JFR.start filename=spark.jfr duration=10s
19163:
Started recording 1. The result will be written to:

/root/spark/work/app-20160628234739-0001/1/spark.jfr

Troubleshooting

When you run jcmd, if you get an error that says:
java.lang.IllegalArgumentException: Unknown diagnostic command

it probably means that the extra Java options didn't get passed to the JVM correctly.  To see all of the valid commands for a particular process, you can use jcmd with the help command:
$ jcmd 19163 help
The following commands are available:
VM.native_memory
VM.commercial_features
ManagementAgent.stop
ManagementAgent.start_local
ManagementAgent.start
Thread.print
GC.class_histogram
GC.heap_dump
GC.run_finalization
GC.run
VM.uptime
VM.flags
VM.system_properties
VM.command_line
VM.version
help

In the output above, JFR.start isn't listed as a valid command.  You'll notice, from the output above, that VM.flags is a valid command; you can use this to check what what options got passed to the JVM:
$ jcmd 19163 VM.flags

Double check that the output of the above command includes the Java options that you added above.  In Spark, you can also check that the Java options got added correctly by clicking the "Environment" tab in the UI and checking the value of the spark.driver.extraJavaOptions and spark.executor.extraJavaOptions variables.
Generating a flame graph

First, if you created the recording on a remote machine, copy the file back to your laptop (this will make it easier to view the flame graph).  Next, follow the instructions at this GitHub repo to generate a FlameGraph.
In order to run the install-mc-jars.sh script to install the flame graph tools, you'll need to have the JAVA_HOME environment variable set correctly.  On mac, you can figure out where Java is installed with:
/usr/libexec/java_home -v 1.7

Set JAVA_HOME to whatever was returned by that command before running the install script.  The GitHub repository linked above describes all necessary remaining steps to create a flame graph.
Limitations

While flame graphs can be useful for spotting big performance issues, we've found them to be less useful for fine-grained performance issues.  After fixing some of the major issues with the scheduler, for example, we found the flame graph to be minimally useful, because it doesn't convey anything about the critical path, and many of the later scheduler performance issues we found were subtle issues along the critical path.
Also, flame graphs only display time that the JVM spent using the CPU. They will not show time spent blocked waiting on I/O, nor will they display time spent in native code.  This means that, for example, they're not useful for diagnosing when network bandwidth to ship tasks to workers becomes a bottleneck.
References and Credits

Essentially all of the information on this page came from Josh Rosen and Eric Liang, who started using flame graphs for Spark and then showed them to me.
The Flame Graph tool was written by Brendan Gregg, and is described (including a video about how to interpret them) on his blog.  He has a blog post on how to generate flame graphs that include CPU time spent in the kernel and in system libraries (in addition to the CPU time spent in Java) here. He also wrote an ACMQueue article about flame graphs.
Isuru Perera wrote the tool to generate a Flame Graph from Java Flight Recordings, which is described in his blog.
Marcus Hirt's blog has some useful writeups about using the Java flight recorder and Java mission control -- e.g., http://hirt.se/blog/?p=364.  There's also a great writeup about Java Mission Control on Takipi's blog.
This page describes in more detail how to remove OpenJDK and install Oracle JDK.