@trevnorris
Last active December 9, 2017 00:39
Here's a rundown of everything I use to do performance analysis in Node.

Introduction

Here is a collection of tips and tricks I've picked up about doing performance analysis on Node. Included is a build script that should get a base install of Ubuntu fully functional and ready for all the things we'll be going through.

The script pulls a lot of code from the latest master of each repository, so it's possible that something may fail, but to date I haven't had any issues.

First, go ahead and run the script. Then go take a nice long break; it'll probably be a while before everything finishes up. Once that's done, go ahead and read on. There's plenty to cover. Don't forget: if running the script on a multi-core machine, pass in COMPILE_CPUS=N for faster compilation.
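For example (assuming you've saved the script as build.sh, a name picked purely for illustration), you could kick it off with:

$ COMPILE_CPUS=4 ./build.sh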

Note: Both the script and instructions assume they're run as root.

perf

Probably my favorite tool in the tool belt is perf. Now with new support from V8 it's possible to do some pretty cool things.

Basics

Let's start with some basics. First we need to tell the kernel not to hide its pointers so we can properly gather all metrics. Do this with the following command:

$ sysctl kernel/kptr_restrict=0
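That setting doesn't survive a reboot. If you want it to stick around, something along these lines should work on a stock Ubuntu setup:

$ echo 'kernel.kptr_restrict = 0' >> /etc/sysctl.conf
$ sysctl -p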

With that taken care of, we'll run our first simple benchmark from within ~/sources/node/:

$ perf record -e cycles:u -- ./node benchmark/http_simple.js

Note: -e cycles:u tells perf we only want to record CPU cycles spent in user space, i.e. the cycles relevant to our application (e.g. time the kernel spends polling won't be recorded).

And from another terminal window run the following:

$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'

Kill the node process once complete and run:

$ perf report

Here's some example output:

3.92%  node  node     [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
3.91%  node  node     [.] void v8::Utf8WriterVisitor::Visit...
2.30%  node  node     [.] v8::internal::JSObject::MigrateToMap...
1.51%  node  node     [.] http_parser_execute
1.45%  node  node     [.] v8::internal::Map::LookupDescriptor...
1.43%  node  node     [.] int v8::internal::LinearSearch...
0.92%  node  node     [.] v8::internal::JSReceiver::LocalLookup...

Not very interesting on its own, but at least things are running. perf record takes samples of data at given intervals and reports what function is being executed at that time.
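If you want to play with how often those samples are taken, perf record accepts -F to set the sampling frequency (samples per second), for example:

$ perf record -e cycles:u -F 997 -- ./node benchmark/http_simple.js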

Call Graphs

A call graph is a much more useful form of output. Luckily that's simple enough to get with the additional option -g:

$ perf record -e cycles:u -g -- ./node benchmark/http_simple.js

When ab is done running, you'll probably notice that a lot more data has been written out. Run perf report to see why:

-    3.93%  node  node                 [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
   - v8::internal::ConsString* v8::internal::String::VisitFlat...
      - 99.55% v8::String::Utf8Length()...
           node::StringBytes::Size...
           node::Buffer::ByteLength...
         - 0x79437453ae9
            - 99.93% 0x7943746b35c
               + 0x79437306f75
-    3.80%  node  node                 [.] void v8::Utf8WriterVisitor::Visit...
   - void v8::Utf8WriterVisitor::Visit...
      - 99.66% v8::internal::ConsString* v8::internal::String::VisitFlat...
           v8::RecursivelySerializeToUtf8...
         - v8::String::WriteUtf8...
            - 98.67% node::StringBytes::Write...
               - 78.06% void node::StreamWrap::WriteStringImpl...
                  + 97.33% 0x79437453ae9

Now there's a complete call graph of each sample. This is much more useful, but still not nearly enough to troubleshoot our performance needs. Above you'll see raw memory addresses in the call graph (e.g. 0x79437453ae9). These are the memory locations of the JIT'd JS functions that were run, but perf can't decipher them without additional information.

JS Integration

Luckily V8 now includes that support. Run the following:

$ perf record -e cycles:u -g -- ./node --perf-basic-prof benchmark/http_simple.js

After this is done running, look in the /tmp directory. You should see a file that looks like perf-<pid>.map. This is the file perf uses to decipher what each address means.
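Each line in that map file is just a hex start address, a hex code length, and the symbol name, so a line will look roughly like this (addresses made up for illustration):

3ef41900 398 LazyCompile:*OutgoingMessage.write _http_outgoing.js:405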

Note: There's currently a bug in perf report that may cause the tui (default) interface to fail. If that happens run perf report --stdio, or if you're on a desktop you can also run perf report --gtk. Here's some example output:

# Overhead  Command        Shared Object
# ........  .......  ...................  ................................................................
#
     4.15%     node  node                 [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
               |
               --- v8::internal::ConsString* v8::internal::String::VisitFlat...
                  |
                  |--99.15%-- v8::String::Utf8Length() const
                  |          node::StringBytes::Size...
                  |          node::Buffer::ByteLength...
                  |          |
                  |          |--99.81%-- Stub:CallApiFunctionStub
                  |          |          |
                  |          |          |--99.88%-- LazyCompile:*OutgoingMessage.write _http_outgoing.js:405

The stdio output is much more verbose, but still helpful. And, as you can see on the bottom line, one of the originating JS function calls now shows up. This is getting much more interesting.

Probing our Problems

perf has the additional ability to place probes on individual native functions. The most useful functions to probe are those from libuv, because they serve as the entry/exit point for all I/O. By placing a probe on libuv functions we can trace right back to the JS call site.

To see the user API, open the file ~/sources/node/deps/uv/include/uv.h. Also, the following snippet will list every public uv_* function included in the binary:

$ perf probe -x ./node -F | egrep "^uv_[^_]"

Note: Not all of these can have a probe placed on them. This is because of the function inlining the compiler does: while the API call exists, the actual call to the function itself may not.
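For example, to check whether a specific function (say uv_write()) survived inlining and can actually be probed:

$ perf probe -x ./node -F | egrep "^uv_write$"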

Almost all I/O that occurs in libuv is some sort of stream, which means the function uv_write() will cover almost all output. The exception is file system operations: because these need to be performed as blocking syscalls (though they're treated asynchronously thanks to the thread pool), file system operations have their own set of I/O functions. To get a quick list, run the following:

$ perf probe -x ./node -F | egrep "^uv_fs_"

Adding and Using a Probe

To start we'll add a probe to uv_write() as the most generic way of tracing all stream based I/O. This is done using the following command:

$ perf probe -x ./node -a uv_write
Added new events:
  probe_node:uv_write  (on uv_write in ./node/out/Release/node)

You can now use it in all perf tools, such as:

        perf record -e probe_node:uv_write -aR sleep 1

Now we can create a call graph for all calls to that probe:

$ perf record -i -g -e probe_node:uv_write -- ./node --perf-basic-prof ./benchmark/http_simple.js

Now go ahead and smack the server around a bit:

$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'

After that's complete, go ahead and kill the server and analyze the output the same as before with perf report:

-  100.00%  node  node  [.] uv_write
   - uv_write
   - node::StreamWrap::Writev(v8::FunctionCallbackInfo<v8::Value> const&)
      - 96.92% Stub:CallApiFunctionStub
         - LazyCompile:*Socket._writeGeneric net.js:613
            - 16.77% LazyCompile:*clearBuffer _stream_writable.js:341
                 LazyCompile:*OutgoingMessage.end _http_outgoing.js:486
                 Builtin:ArgumentsAdaptorTrampoline
                 LazyCompile:* /var/projects/node/benchmark/http_simple.js:25

Here we can see the call made in http_simple.js.

The demonstration here has been kept simple to outline the basic usage, but as the projects become more complex (e.g. having module dependencies, etc.) this will be much more informative to see where time is spent.
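One housekeeping note: when you're done with a probe it can be removed again with perf probe -d, e.g.:

$ perf probe -d probe_node:uv_write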

Flame Graphs

At times, having a visual representation that gives an overall picture of how the application is performing can be very helpful. This is where flame graphs come in handy.
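The build script below installs Brendan Gregg's FlameGraph scripts into ~/sources/FlameGraph, so as a rough sketch, turning the perf.data from a run recorded with -g and --perf-basic-prof into an SVG looks something like this:

$ perf script > out.stacks
$ ~/sources/FlameGraph/stackcollapse-perf.pl out.stacks > out.folded
$ ~/sources/FlameGraph/flamegraph.pl out.folded > ~/www/flame.svg

Since the script also symlinks ~/www to the nginx document root, the resulting SVG can then be viewed straight from a browser.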

The Build Script

#!/bin/bash
# Variables used later in the script. Can be overridden on the command line.
# Pass in the number of CPUs used to compile.
: ${COMPILE_CPUS:=1}
: ${LLVM_VERSION:=3.5}
# This script assumes it's run as root, and has only been tested on Ubuntu.
# First get everything else up-to-date.
apt-get update
apt-get upgrade -y
# If there's a kernel upgrade then run the following to upgrade and restart
# the machine. Then resume the script.
#apt-get dist-upgrade
#shutdown -r now
# First are the basic build tools needed.
apt-get install -y make build-essential elfutils libelf-dev git llvm-3.5 \
clang-3.5
# Build items for perf (not sure all of them are actually needed)
apt-get install -y flex bison libunwind8 libunwind8-dev libaudit-dev libdw-dev \
binutils-dev libnuma-dev libslang2-dev asciidoc libc6-dev-i386 libgtk2.0-dev \
libperl-dev python-dev
# Convenience packages.
# nginx is for easy viewing of generated flame graphs (placed in ~/www)
# apache2-utils is for use of ab for simple HTTP benchmarking
apt-get install -y tmux nginx apache2-utils
# For lldb.
apt-get install -y swig libedit-dev doxygen graphviz libcloog-isl-dev libisl-dev
# Building with clang is so much faster.
echo 'export CC=clang' > .bash_aliases
echo 'export CXX=clang++' >> .bash_aliases
echo 'export GYP_DEFINES="clang=1"' >> .bash_aliases
# Collect all core dumps.
echo 'ulimit -c unlimited' >> .bash_aliases
. .bash_aliases
# Make the default nginx dir writable.
chmod 777 /usr/share/nginx/html
ln -s /usr/share/nginx/html www
# Now to build and install all the fun stuff.
cd ~
mkdir sources
cd sources
wget https://github.com/joyent/node/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv node-master node
cd node
# Configure to also build debug mode, enable gdbjit support and enable
# output of disassembly for IRHydra.
./configure --gdb --without-snapshot --debug -- -Dv8_enable_disassembler=1
make -j${COMPILE_CPUS}
make install
cd ~/sources
# Package to generate flame graphs.
wget https://github.com/brendangregg/FlameGraph/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv FlameGraph-master FlameGraph
cd ~/sources
# Grab latest kernel and build perf. This will need to be updated routinely.
wget https://www.kernel.org/pub/linux/kernel/v3.x/linux-3.15.6.tar.gz
tar xvf linux-3.15.6.tar.gz
rm linux-3.15.6.tar.gz
cd linux-3.15.6/tools/perf/
make -j${COMPILE_CPUS} -f Makefile.perf install prefix=/usr
cd ~/sources
# Now build the latest lldb w/ JIT support. This won't be necessary once
# lldb-3.6 is out.
wget https://github.com/llvm-mirror/llvm/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv llvm-master llvm
cd llvm/tools
wget https://github.com/llvm-mirror/clang/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv clang-master clang
wget https://github.com/llvm-mirror/lldb/archive/master.tar.gz
tar xvf master.tar.gz
rm master.tar.gz
mv lldb-master lldb
cd ../projects
# Use this version of compiler-rt to prevent build errors.
wget https://github.com/llvm-mirror/compiler-rt/archive/release_35.tar.gz
tar xvf release_35.tar.gz
rm release_35.tar.gz
mv compiler-rt-release_35/ compiler-rt
cd ..
mkdir build
cd build
# These options are pretty close to what Ubuntu uses to build.
../configure \
--disable-assertions \
--enable-shared \
--enable-optimized \
--with-optimize-option=' -g -O2' \
--enable-pic \
--enable-libffi \
--with-ocaml-libdir=/usr/lib/ocaml/llvm-${LLVM_VERSION} \
--with-binutils-include=/usr/include \
--with-cloog --with-isl \
--enable-shared
# Just going to build the entire freaking thing. This'll take a while.
make -j${COMPILE_CPUS}
# Before installing, remove clang and llvm
apt-get remove --purge llvm-3.5 clang-3.5
apt-get autoremove
make install