Here is a collection of tips and tricks I've picked up about doing performance analysis on Node. Included is a build script that should get a base install of Ubuntu fully functional and ready for all the things we'll be going through.
The script pulls a lot of code from the latest master
of each repository. So
it's possible that something may fail, but to date I haven't had any issues.
First, go ahead and run the script. Then go take a nice long break. It'll
probably be a while before everything finishes up. Once that's done go ahead
and read on. There's plenty to cover. Don't forget, if running the script on
a multi-core machine pass in COMPILE_CPUS=N
for faster compilations.
Note: Both the script and instructions assume they're run as root
.
Probably my favorite tool in the tool belt is perf
. Now with new support from
V8 it's possible to do some pretty cool things.
Let's start with some basics. First we need to tell the kernel not to hide its pointers. So we can properly gather all metrics. Do this with the following command:
$ sysctl kernel/kptr_restrict=0
With that taken care of, we'll run our first simple benchmark:
~/source/node/
:
$ perf record -e cycles:u -- ./node benchmark/http_simple.js
Note: -e cycles:u
tells perf
we only want to record CPU cycles relevant to
our application (e.g. don't record time spent polling).
And from another terminal window run the following:
$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'
Kill the node
process once complete and run:
$ perf report
Here's some example output:
3.92% node node [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
3.91% node node [.] void v8::Utf8WriterVisitor::Visit...
2.30% node node [.] v8::internal::JSObject::MigrateToMap...
1.51% node node [.] http_parser_execute
1.45% node node [.] v8::internal::Map::LookupDescriptor...
1.43% node node [.] int v8::internal::LinearSearch...
0.92% node node [.] v8::internal::JSReceiver::LocalLookup...
Not very interesting on its own, but at least things are running. perf record
takes samples of data at given intervals and reports what function is being
executed at that time.
Much more useful output is a call graph. Luckily that's simple enough to do
with the additional option -g
:
$ perf record -e cycles:u -g -- ./node benchmark/http_simple.js
When ab
is done running you'll probably notice that a lot more data has been
written out. Run perf report
to see why:
- 3.93% node node [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
- v8::internal::ConsString* v8::internal::String::VisitFlat...
- 99.55% v8::String::Utf8Length()...
node::StringBytes::Size...
node::Buffer::ByteLength...
- 0x79437453ae9
- 99.93% 0x7943746b35c
+ 0x79437306f75
- 3.80% node node [.] void v8::Utf8WriterVisitor::Visit...
- void v8::Utf8WriterVisitor::Visit...
- 99.66% v8::internal::ConsString* v8::internal::String::VisitFlat...
v8::RecursivelySerializeToUtf8...
- v8::String::WriteUtf8...
- 98.67% node::StringBytes::Write...
- 78.06% void node::StreamWrap::WriteStringImpl...
+ 97.33% 0x79437453ae9
Now there's a complete call graph of each sample. This is much more useful, but
not near enough to help us troubleshoot out performance needs. Above you'll see
raw memory addresses in the call graph (e.g. 0x79437453ae9
). This is the
memory location of the JIT'd JS function that was run, but perf
can't
decipher these without additional information.
Luckily V8 now includes that support. Run the following:
$ perf record -e cycles:u -g -- ./node --perf-basic-prof benchmark/http_simple.js
After this is done running, look in the /tmp
directory. You should see a file
that looks like perf-<pid>.map
. This is the file perf
uses to decipher what
each address means.
Note: There's currently a bug in perf report
that may cause the tui
(default) interface to fail. If that happens run perf report --stdio
, or if
you're on a desktop you can also run perf report --gtk
. Here's some example
output:
# Overhead Command Shared Object
# ........ ....... ................... ................................................................
#
4.15% node node [.] v8::internal::ConsString* v8::internal::String::VisitFlat...
|
--- v8::internal::ConsString* v8::internal::String::VisitFlat...
|
|--99.15%-- v8::String::Utf8Length() const
| node::StringBytes::Size...
| node::Buffer::ByteLength...
| |
| |--99.81%-- Stub:CallApiFunctionStub
| | |
| | |--99.88%-- LazyCompile:*OutgoingMessage.write _http_outgoing.js:405
The stdio
output is much more verbose, but still helpful. And, as you can see
on the bottom line, there's now one of the originating JS function calls. This
is getting much more interesting.
perf
has the additional ability to place probes on individual native
functions. The most useful functions to probe are those from libuv
. This is
because they serve as the entry/exit point for all I/O. By placing probe on
libuv
functions we can trace right back to the JS call site.
To see the user API open the file ~/node/deps/uv/include/uv.h
. Also, this
following snippet will spew the complete contents of everything included in the
binary:
$ perf probe -x ./node -F | egrep "^uv_[^_]"
Note: Not all of these can have a probe placed on them. This is because of function inlining the compiler does. So while the API call exists, the actual call to the function itself does not.
Most all I/O that occurs in libuv
is some sort of stream. Which means the
function uv_write()
will cover almost all output. Except for file system
operations. Because these operations need to be performed as a blocking
syscall (though treated asynchronously because of the thread pool) file system
operations have their own I/O functions. To get a quick list, run the following:
$ perf probe -x ./node -F | egrep "^uv_fs_"
To start we'll add a probe to uv_write()
as the most generic way of tracing
all stream based I/O. This is done using the following command:
$ perf probe -x ./node -a uv_write
Added new events:
probe_node:uv_write (on uv_write in ./node/out/Release/node)
You can now use it in all perf tools, such as:
perf record -e probe_node:uv_write -aR sleep 1
Now we can create a call graph for all calls to that probe:
$ perf record -i -g -e probe_node:uv_write -- ./node --perf-basic-prof ./benchmark/http_simple.js
Now go ahead and smack the server around a bit:
$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'
After that's complete go ahead and kill the server and analyze the output the
same as before with perf report
:
- 100.00% node node [.] uv_write
- uv_write
- node::StreamWrap::Writev(v8::FunctionCallbackInfo<v8::Value> const&)
- 96.92% Stub:CallApiFunctionStub
- LazyCompile:*Socket._writeGeneric net.js:613
- 16.77% LazyCompile:*clearBuffer _stream_writable.js:341
LazyCompile:*OutgoingMessage.end _http_outgoing.js:486
Builtin:ArgumentsAdaptorTrampoline
LazyCompile:* /var/projects/node/benchmark/http_simple.js:25
Here we can see the call made in http_simple.js
.
The demonstration here has been kept simple to outline the basic usage, but as the projects become more complex (e.g. having module dependencies, etc.) this will be much more informative to see where time is spent.
At times having a visual representation that gives an overall picture of how the application is performing can be very helpful. This is when flame graphs come in handy.