@trevnorris
Last active December 9, 2017 00:39
Here's a rundown of everything I use to do performance analysis in Node.

Introduction

Here is a collection of tips and tricks I've picked up about doing performance analysis on Node. Included is a build script that should get a base install of Ubuntu fully functional and ready for all the things we'll be going through.

The script pulls a lot of code from the latest master of each repository, so it's possible that something may fail, but to date I haven't had any issues.

First, go ahead and run the script. Then go take a nice long break; it'll probably be a while before everything finishes up. Once that's done, go ahead and read on. There's plenty to cover. Don't forget: if running the script on a multi-core machine, pass in COMPILE_CPUS=N for faster compilation.
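For example (assuming you've saved the script as build.sh, a name picked purely for illustration), you could kick it off with:

$ COMPILE_CPUS=4 ./build.sh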

Note: Both the script and instructions assume they're run as root.

perf

Probably my favorite tool in the tool belt is perf. Now with new support from V8 it's possible to do some pretty cool things.

Basics

Let's start with some basics. First we need to tell the kernel not to hide its pointers so we can properly gather all metrics. Do this with the following command:

$ sysctl kernel/kptr_restrict=0
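That setting doesn't survive a reboot. If you want it to stick around, something along these lines should work on a stock Ubuntu setup:

$ echo 'kernel.kptr_restrict = 0' >> /etc/sysctl.conf
$ sysctl -p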

With that taken care of, we'll run our first simple benchmark from within ~/sources/node/:

$ perf record -e cycles:u -- ./node benchmark/http_simple.js

Note: -e cycles:u tells perf we only want to record CPU cycles spent in user space, i.e. the cycles relevant to our application (e.g. time the kernel spends polling won't be recorded).

And from another terminal window run the following:

$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'

Kill the node process once complete and run:

$ perf report

Here's some example output:

3.92%  node  node     [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
3.91%  node  node     [.] void v8::Utf8WriterVisitor::Visit...
2.30%  node  node     [.] v8::internal::JSObject::MigrateToMap...
1.51%  node  node     [.] http_parser_execute
1.45%  node  node     [.] v8::internal::Map::LookupDescriptor...
1.43%  node  node     [.] int v8::internal::LinearSearch...
0.92%  node  node     [.] v8::internal::JSReceiver::LocalLookup...

Not very interesting on its own, but at least things are running. perf record takes samples of data at given intervals and reports what function is being executed at that time.
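If you want to play with how often those samples are taken, perf record accepts -F to set the sampling frequency (samples per second), for example:

$ perf record -e cycles:u -F 997 -- ./node benchmark/http_simple.js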

Call Graphs

A call graph is a much more useful form of output. Luckily that's simple enough to get with the additional option -g:

$ perf record -e cycles:u -g -- ./node benchmark/http_simple.js

When ab is done running, you'll probably notice that a lot more data has been written out. Run perf report to see why:

-    3.93%  node  node                 [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
   - v8::internal::ConsString* v8::internal::String::VisitFlat...
      - 99.55% v8::String::Utf8Length()...
           node::StringBytes::Size...
           node::Buffer::ByteLength...
         - 0x79437453ae9
            - 99.93% 0x7943746b35c
               + 0x79437306f75
-    3.80%  node  node                 [.] void v8::Utf8WriterVisitor::Visit...
   - void v8::Utf8WriterVisitor::Visit...
      - 99.66% v8::internal::ConsString* v8::internal::String::VisitFlat...
           v8::RecursivelySerializeToUtf8...
         - v8::String::WriteUtf8...
            - 98.67% node::StringBytes::Write...
               - 78.06% void node::StreamWrap::WriteStringImpl...
                  + 97.33% 0x79437453ae9

Now there's a complete call graph of each sample. This is much more useful, but still not nearly enough to troubleshoot our performance needs. Above you'll see raw memory addresses in the call graph (e.g. 0x79437453ae9). These are the memory locations of the JIT'd JS functions that were run, but perf can't decipher them without additional information.

JS Integration

Luckily V8 now includes that support. Run the following:

$ perf record -e cycles:u -g -- ./node --perf-basic-prof benchmark/http_simple.js

After this is done running, look in the /tmp directory. You should see a file that looks like perf-<pid>.map. This is the file perf uses to decipher what each address means.
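Each line in that map file is just a hex start address, a hex code length, and the symbol name, so a line will look roughly like this (addresses made up for illustration):

3ef41900 398 LazyCompile:*OutgoingMessage.write _http_outgoing.js:405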

Note: There's currently a bug in perf report that may cause the tui (default) interface to fail. If that happens run perf report --stdio, or if you're on a desktop you can also run perf report --gtk. Here's some example output:

# Overhead  Command        Shared Object
# ........  .......  ...................  ................................................................
#
     4.15%     node  node                 [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
               |
               --- v8::internal::ConsString* v8::internal::String::VisitFlat...
                  |
                  |--99.15%-- v8::String::Utf8Length() const
                  |          node::StringBytes::Size...
                  |          node::Buffer::ByteLength...
                  |          |
                  |          |--99.81%-- Stub:CallApiFunctionStub
                  |          |          |
                  |          |          |--99.88%-- LazyCompile:*OutgoingMessage.write _http_outgoing.js:405

The stdio output is much more verbose, but still helpful. And, as you can see on the bottom line, one of the originating JS function calls now shows up. This is getting much more interesting.

Probing our Problems

perf has the additional ability to place probes on individual native functions. The most useful functions to probe are those from libuv, because they serve as the entry/exit point for all I/O. By placing a probe on libuv functions we can trace right back to the JS call site.

To see the user API, open the file ~/sources/node/deps/uv/include/uv.h. Also, the following snippet will list every public uv_* function included in the binary:

$ perf probe -x ./node -F | egrep "^uv_[^_]"

Note: Not all of these can have a probe placed on them. This is because of the function inlining the compiler does: while the API call exists, the actual call to the function itself may not.
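For example, to check whether a specific function (say uv_write()) survived inlining and can actually be probed:

$ perf probe -x ./node -F | egrep "^uv_write$"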

Almost all I/O that occurs in libuv is some sort of stream, which means the function uv_write() will cover almost all output. The exception is file system operations: because these need to be performed as blocking syscalls (though they're treated asynchronously thanks to the thread pool), file system operations have their own set of I/O functions. To get a quick list, run the following:

$ perf probe -x ./node -F | egrep "^uv_fs_"

Adding and Using a Probe

To start we'll add a probe to uv_write() as the most generic way of tracing all stream based I/O. This is done using the following command:

$ perf probe -x ./node -a uv_write
Added new events:
  probe_node:uv_write  (on uv_write in ./node/out/Release/node)

You can now use it in all perf tools, such as:

        perf record -e probe_node:uv_write -aR sleep 1

Now we can create a call graph for all calls to that probe:

$ perf record -i -g -e probe_node:uv_write -- ./node --perf-basic-prof ./benchmark/http_simple.js

Now go ahead and smack the server around a bit:

$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'

After that's complete, go ahead and kill the server and analyze the output the same as before with perf report:

-  100.00%  node  node  [.] uv_write
   - uv_write
   - node::StreamWrap::Writev(v8::FunctionCallbackInfo<v8::Value> const&)
      - 96.92% Stub:CallApiFunctionStub
         - LazyCompile:*Socket._writeGeneric net.js:613
            - 16.77% LazyCompile:*clearBuffer _stream_writable.js:341
                 LazyCompile:*OutgoingMessage.end _http_outgoing.js:486
                 Builtin:ArgumentsAdaptorTrampoline
                 LazyCompile:* /var/projects/node/benchmark/http_simple.js:25

Here we can see the call made in http_simple.js.

The demonstration here has been kept simple to outline the basic usage, but as the projects become more complex (e.g. having module dependencies, etc.) this will be much more informative to see where time is spent.
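One housekeeping note: when you're done with a probe it can be removed again with perf probe -d, e.g.:

$ perf probe -d probe_node:uv_write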

Flame Graphs

At times, having a visual representation that gives an overall picture of how the application is performing can be very helpful. This is where flame graphs come in handy.
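The build script below installs Brendan Gregg's FlameGraph scripts into ~/sources/FlameGraph, so as a rough sketch, turning the perf.data from a run recorded with -g and --perf-basic-prof into an SVG looks something like this:

$ perf script > out.stacks
$ ~/sources/FlameGraph/stackcollapse-perf.pl out.stacks > out.folded
$ ~/sources/FlameGraph/flamegraph.pl out.folded > ~/www/flame.svg

Since the script also symlinks ~/www to the nginx document root, the resulting SVG can then be viewed straight from a browser.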

The Build Script

#!/bin/bash
# Variables used later in the script. Can be overridden on the command line.
# Pass in the number of CPUs used to compile.
: ${COMPILE_CPUS:=1}
: ${LLVM_VERSION:=3.5}
# This script assumes it's run as root, and has only been tested on Ubuntu.
# First get everything else up-to-date.
apt-get update
apt-get upgrade -y
# If there's a kernel upgrade then run the following to upgrade and restart
# the machine. Then resume the script.
#apt-get dist-upgrade
#shutdown -r now
# First are the basic build tools needed.
apt-get install -y make build-essential elfutils libelf-dev git llvm-3.5 \
clang-3.5
# Build items for perf (not sure all of them are actually needed)
apt-get install -y flex bison libunwind8 libunwind8-dev libaudit-dev libdw-dev \
binutils-dev libnuma-dev libslang2-dev asciidoc libc6-dev-i386 libgtk2.0-dev \
libperl-dev python-dev
# Convenience packages.
# nginx is for easy viewing of generated flame graphs (placed in ~/www)
# apache2-utils is for use of ab for simple HTTP benchmarking
apt-get install -y tmux nginx apache2-utils
# For lldb.
apt-get install -y swig libedit-dev doxygen graphviz libcloog-isl-dev libisl-dev
# Building with clang is so much faster.
echo 'export CC=clang' > .bash_aliases
echo 'export CXX=clang++' >> .bash_aliases
echo 'export GYP_DEFINES="clang=1"' >> .bash_aliases
# Collect all core dumps.
echo 'ulimit -c unlimited' >> .bash_aliases
. .bash_aliases
# Make the default nginx dir writable.
chmod 777 /usr/share/nginx/html
ln -s /usr/share/nginx/html www
# Now to build and install all the fun stuff.
cd ~
mkdir sources
cd sources
wget https://github.com/joyent/node/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv node-master node
cd node
# Configure to also build debug mode, enable gdbjit support and enable
# output of disassembly for IRHydra.
./configure --gdb --without-snapshot --debug -- -Dv8_enable_disassembler=1
make -j${COMPILE_CPUS}
make install
cd ~/sources
# Package to generate flame graphs.
wget https://github.com/brendangregg/FlameGraph/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv FlameGraph-master FlameGraph
cd ~/sources
# Grab latest kernel and build perf. This will need to be updated routinely.
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.15.6.tar.gz
tar xvf linux-3.15.6.tar.gz
rm linux-3.15.6.tar.gz
cd linux-3.15.6/tools/perf/
make -j${COMPILE_CPUS} -f Makefile.perf install prefix=/usr
cd ~/sources
# Now build the latest lldb w/ JIT support. This won't be necessary once
# lldb-3.6 is out.
wget https://github.com/llvm-mirror/llvm/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv llvm-master llvm
cd llvm/tools
wget https://github.com/llvm-mirror/clang/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv clang-master clang
wget https://github.com/llvm-mirror/lldb/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv lldb-master lldb
cd ../projects
# Use this version of compiler-rt to prevent build errors.
wget https://github.com/llvm-mirror/compiler-rt/archive/release_35.tar.gz
tar xvf release_35.tar.gz
rm release_35.tar.gz
mv compiler-rt-release_35/ compiler-rt
cd ..
mkdir build
cd build
# These options are pretty close to what Ubuntu uses to build.
../configure \
--disable-assertions \
--enable-shared \
--enable-optimized \
--with-optimize-option=' -g -O2' \
--enable-pic \
--enable-libffi \
--with-ocaml-libdir=/usr/lib/ocaml/llvm-${LLVM_VERSION} \
--with-binutils-include=/usr/include \
--with-cloog --with-isl \
--enable-shared
# Just going to build the entire freaking thing. This'll take a while.
make -j${COMPILE_CPUS}
# Before installing, remove clang and llvm
apt-get remove --purge llvm-3.5 clang-3.5
apt-get autoremove
make install