Here is a collection of tips and tricks I've picked up about doing performance analysis on Node. Included is a build script that should get a base install of Ubuntu fully functional and ready for all the things we'll be going through.
The script pulls a lot of code from the latest
master of each repository. So
it's possible that something may fail, but to date I haven't had any issues.
First, go ahead and run the script. Then go take a nice long break. It'll
probably be a while before everything finishes up. Once that's done go ahead
and read on. There's plenty to cover. Don't forget, if running the script on
a multi-core machine pass in
COMPILE_CPUS=N for faster compilations.
Note: Both the script and instructions assume they're run as
Probably my favorite tool in the tool belt is
perf. Now with new support from
V8 it's possible to do some pretty cool things.
Let's start with some basics. First we need to tell the kernel not to hide its pointers. So we can properly gather all metrics. Do this with the following command:
$ sysctl kernel/kptr_restrict=0
With that taken care of, we'll run our first simple benchmark:
$ perf record -e cycles:u -- ./node benchmark/http_simple.js
-e cycles:u tells
perf we only want to record CPU cycles relevant to
our application (e.g. don't record time spent polling).
And from another terminal window run the following:
$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'
node process once complete and run:
$ perf report
Here's some example output:
3.92% node node [.] v8::internal::ConsString* v8::internal::String::VisitFlat... 3.91% node node [.] void v8::Utf8WriterVisitor::Visit... 2.30% node node [.] v8::internal::JSObject::MigrateToMap... 1.51% node node [.] http_parser_execute 1.45% node node [.] v8::internal::Map::LookupDescriptor... 1.43% node node [.] int v8::internal::LinearSearch... 0.92% node node [.] v8::internal::JSReceiver::LocalLookup...
Not very interesting on its own, but at least things are running.
takes samples of data at given intervals and reports what function is being
executed at that time.
Much more useful output is a call graph. Luckily that's simple enough to do
with the additional option
$ perf record -e cycles:u -g -- ./node benchmark/http_simple.js
ab is done running you'll probably notice that a lot more data has been
written out. Run
perf report to see why:
- 3.93% node node [.] v8::internal::ConsString* v8::internal::String::VisitFlat... - v8::internal::ConsString* v8::internal::String::VisitFlat... - 99.55% v8::String::Utf8Length()... node::StringBytes::Size... node::Buffer::ByteLength... - 0x79437453ae9 - 99.93% 0x7943746b35c + 0x79437306f75 - 3.80% node node [.] void v8::Utf8WriterVisitor::Visit... - void v8::Utf8WriterVisitor::Visit... - 99.66% v8::internal::ConsString* v8::internal::String::VisitFlat... v8::RecursivelySerializeToUtf8... - v8::String::WriteUtf8... - 98.67% node::StringBytes::Write... - 78.06% void node::StreamWrap::WriteStringImpl... + 97.33% 0x79437453ae9
Now there's a complete call graph of each sample. This is much more useful, but
not near enough to help us troubleshoot out performance needs. Above you'll see
raw memory addresses in the call graph (e.g.
0x79437453ae9). This is the
memory location of the JIT'd JS function that was run, but
decipher these without additional information.
Luckily V8 now includes that support. Run the following:
$ perf record -e cycles:u -g -- ./node --perf-basic-prof benchmark/http_simple.js
After this is done running, look in the
/tmp directory. You should see a file
that looks like
perf-<pid>.map. This is the file
perf uses to decipher what
each address means.
Note: There's currently a bug in
perf report that may cause the
(default) interface to fail. If that happens run
perf report --stdio, or if
you're on a desktop you can also run
perf report --gtk. Here's some example
# Overhead Command Shared Object # ........ ....... ................... ................................................................ # 4.15% node node [.] v8::internal::ConsString* v8::internal::String::VisitFlat... | --- v8::internal::ConsString* v8::internal::String::VisitFlat... | |--99.15%-- v8::String::Utf8Length() const | node::StringBytes::Size... | node::Buffer::ByteLength... | | | |--99.81%-- Stub:CallApiFunctionStub | | | | | |--99.88%-- LazyCompile:*OutgoingMessage.write _http_outgoing.js:405
stdio output is much more verbose, but still helpful. And, as you can see
on the bottom line, there's now one of the originating JS function calls. This
is getting much more interesting.
Probing our Problems
perf has the additional ability to place probes on individual native
functions. The most useful functions to probe are those from
libuv. This is
because they serve as the entry/exit point for all I/O. By placing probe on
libuv functions we can trace right back to the JS call site.
To see the user API open the file
~/node/deps/uv/include/uv.h. Also, this
following snippet will spew the complete contents of everything included in the
$ perf probe -x ./node -F | egrep "^uv_[^_]"
Note: Not all of these can have a probe placed on them. This is because of function inlining the compiler does. So while the API call exists, the actual call to the function itself does not.
Most all I/O that occurs in
libuv is some sort of stream. Which means the
uv_write() will cover almost all output. Except for file system
operations. Because these operations need to be performed as a blocking
syscall (though treated asynchronously because of the thread pool) file system
operations have their own I/O functions. To get a quick list, run the following:
$ perf probe -x ./node -F | egrep "^uv_fs_"
Adding and Using a Probe
To start we'll add a probe to
uv_write() as the most generic way of tracing
all stream based I/O. This is done using the following command:
$ perf probe -x ./node -a uv_write Added new events: probe_node:uv_write (on uv_write in ./node/out/Release/node) You can now use it in all perf tools, such as: perf record -e probe_node:uv_write -aR sleep 1
Now we can create a call graph for all calls to that probe:
$ perf record -i -g -e probe_node:uv_write -- ./node --perf-basic-prof ./benchmark/http_simple.js
Now go ahead and smack the server around a bit:
$ ab -n 100000 -c 8 'http://localhost:8000/bytes/4096/4'
After that's complete go ahead and kill the server and analyze the output the
same as before with
- 100.00% node node [.] uv_write - uv_write - node::StreamWrap::Writev(v8::FunctionCallbackInfo<v8::Value> const&) - 96.92% Stub:CallApiFunctionStub - LazyCompile:*Socket._writeGeneric net.js:613 - 16.77% LazyCompile:*clearBuffer _stream_writable.js:341 LazyCompile:*OutgoingMessage.end _http_outgoing.js:486 Builtin:ArgumentsAdaptorTrampoline LazyCompile:* /var/projects/node/benchmark/http_simple.js:25
Here we can see the call made in
The demonstration here has been kept simple to outline the basic usage, but as the projects become more complex (e.g. having module dependencies, etc.) this will be much more informative to see where time is spent.
At times having a visual representation that gives an overall picture of how the application is performing can be very helpful. This is when flame graphs come in handy.