@brendangregg
Created May 13, 2012 02:42
node.js fs.readFile analysis
Here's some more perf analysis: what was the limiter for node.js in the original fs.readFile benchmark,
causing the large performance difference?
Examining thread microstates using prstat (SmartOS):
# prstat -mLcp 92432 1
Please wait...
[...]
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
92432 root 90 10 0.0 0.0 0.0 0.0 0.0 0.0 7 9 42K 0 node/1
One thread is 100% on-CPU (USR + SYS), which is likely to be the limiter (this alone isn't enough information
to confirm it, but given the node.js architecture it is very likely). That thread is spending 90% of its
time in user-land. Some straightforward profiling with DTrace can tell us what it's doing.
# dtrace -n 'profile-99 /pid == $1/ { @[jstack(100, 8000)] = count(); }' 92432
dtrace: description 'profile-99 ' matched 1 probe
^C
[...]
node`_ZN2v88internal19FlexibleBodyVisitorINS0_20StaticMarkingVisitorENS0_10FixedArray14BodyDescriptorEvE5VisitEPNS0_3MapEPNS0_10HeapObjectE+0x5b
node`_ZN2v88internal20MarkCompactCollector17EmptyMarkingStackEv+0x6f
node`_ZN2v88internal13GlobalHandles18IterateStrongRootsEPNS0_13ObjectVisitorE+0x51
node`_ZN2v88internal4Heap18IterateStrongRootsEPNS0_13ObjectVisitorENS0_9VisitModeE+0x139
node`_ZN2v88internal20MarkCompactCollector15MarkLiveObjectsEv+0x83
node`_ZN2v88internal20MarkCompactCollector14CollectGarbageEv+0x24
node`_ZN2v88internal4Heap11MarkCompactEPNS0_8GCTracerE+0x92
node`_ZN2v88internal4Heap24PerformGarbageCollectionENS0_16GarbageCollectorEPNS0_8GCTracerE+0x33b
node`_ZN2v88internal4Heap14CollectGarbageENS0_15AllocationSpaceENS0_16GarbageCollectorE+0x23c
node`_ZN2v88internal7Factory11NewJSObjectENS0_6HandleINS0_10JSFunctionEEENS0_13PretenureFlagE+0x90
node`_ZN2v88internal17Runtime_NewObjectENS0_9ArgumentsEPNS0_7IsolateE+0x1a2
<< internal code >>
<< constructor >>
allocNewPool at fs.js position 53610
(anon) as ReadStream._read at fs.js position 57236
(anon) as fs.ReadStream at fs.js position 56236
<< internal >>
<< entry >>
node`_ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPPPS4_Pb+0x1d5
node`_ZN2v88internal9Execution4CallENS0_6HandleINS0_6ObjectEEES4_iPPPS3_Pbb+0x57
node`_ZN2v88Function4CallENS_6HandleINS_6ObjectEEEiPNS1_INS_5ValueEEE+0xd5
node`_ZN4nodeL5AfterEP7uv_fs_s+0x14d
node`uv__fs_after+0x271
node`eio_finish+0x28
node`etp_poll+0x130
node`eio_poll+0xb
node`uv_eio_do_poll+0x31
node`uv__idle+0x29
node`ev_invoke_pending+0x71
node`ev_run+0x452
node`uv_run+0x17
node`_ZN4node5StartEiPPc+0xa6
node`main+0x1b
node`_start+0x83
15
[...]
I've included one of the common stacks. It shows node.js is spending its time in garbage collection, along
with some other interesting details I'll get to in a moment.
Quantifying garbage collection time, per second:
# dtrace -n 'node$target:::gc-start { self->start = timestamp; }
node$target:::gc-done /self->start/ { @["ms"] = sum(timestamp - self->start); self->start = 0; }
tick-1s { normalize(@, 1000000); printa("GC: %@d ms", @); trunc(@); }' -p 92432
dtrace: description 'node$target:::gc-start ' matched 3 probes
CPU ID FUNCTION:NAME
12 64930 :tick-1s GC: 649 ms
12 64930 :tick-1s GC: 155 ms
12 64930 :tick-1s GC: 115 ms
12 64930 :tick-1s GC: 551 ms
12 64930 :tick-1s GC: 552 ms
12 64930 :tick-1s GC: 132 ms
12 64930 :tick-1s GC: 219 ms
12 64930 :tick-1s GC: 633 ms
12 64930 :tick-1s GC: 264 ms
The time in GC is indeed significant, and varies from second to second. Using both sampling and tracing,
I measured the GC time to be between 32% and 54% of runtime. Since this is caused by the allocation
of fs objects (e.g., the "allocNewPool at fs.js" line in the above stack), there is additional overhead from
object creation as well.
This means this is "a" problem, and one that may lead to up to a 2x improvement, should the GC/object-creation
time be eliminated. There may well be other problems in how the fs package actually operates, leading to
further wins on top of that.
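For context, here's a minimal repro sketch of the kind of workload that exercises this path. It isn't the
benchmark harness used above; the stack shows the fs.ReadStream path (ReadStream._read -> allocNewPool), so
the sketch drives that path with fs.createReadStream, and the file path and concurrency are placeholders:
// Hypothetical repro sketch: stream the same file repeatedly and count
// completions per second, driving ReadStream._read/allocNewPool.
var fs = require('fs');
var file = process.argv[2] || '/etc/passwd';   // placeholder path
var completed = 0;

function streamOnce() {
  var rs = fs.createReadStream(file);
  rs.on('data', function (chunk) {});          // discard the data
  rs.on('error', function (err) { throw err; });
  rs.on('end', function () {
    completed++;
    streamOnce();                              // start the next read
  });
}

for (var i = 0; i < 10; i++) streamOnce();     // a few concurrent streams

setInterval(function () {
  console.log('reads/sec: ' + completed);
  completed = 0;
}, 1000);
Each stream slices its data chunks out of the shared pool, so a high read rate turns directly into Buffer
allocation and GC pressure.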
I've heard it was already known that the fs package doesn't perform well. I'm guessing it wasn't fixed
sooner as it hadn't really shown up in the real world. I've DTrace'd production node.js many times, and
haven't come across this before myself.
I'll take a quick look at the code behind the stack line shown earlier, "allocNewPool at fs.js" (lib/fs.js):
[...]
38 var kPoolSize = 40 * 1024;
[...]
1014 function allocNewPool() {
1015 pool = new Buffer(kPoolSize);
1016 pool.used = 0;
1017 }
[...]
1096 ReadStream.prototype._read = function() {
[...]
1102 if (!pool || pool.length - pool.used < kMinPoolSpace) {
1103 // discard the old pool. Can't add to the free list because
1104 // users might have refernces to slices on it.
1105 pool = null;
1106 allocNewPool();
1107 }
[...]
Line 38 sets the buffer size to 40k. Setting it to 1k improved performance (for this benchmark, on my system)
by about 30%.
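For clarity, the change tested was just the constant at line 38 of lib/fs.js:
var kPoolSize = 1 * 1024;   // was: 40 * 1024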
Before, with kPoolSize @ 40k:
134986 Rate: count/sec: 5298.013245033113 Average rate: 4726.416072777918
138006 Rate: count/sec: 5298.013245033113 Average rate: 4738.92439459154
141026 Rate: count/sec: 5298.013245033113 Average rate: 4750.8969977167335
144046 Rate: count/sec: 0.0 Average rate: 4651.291948405371
147066 Rate: count/sec: 5298.013245033113 Average rate: 4664.572368868399
150086 Rate: count/sec: 5298.013245033113 Average rate: 4677.3183374865075
153106 Rate: count/sec: 5298.013245033113 Average rate: 4689.561480281635
It's averaging around 4700/sec (last column; the count/sec column is erratic due to how the vert.x RateCounter
is updated).
Now with 1k:
193475 Rate: count/sec: 5298.013245033113 Average rate: 6109.316449153637
196495 Rate: count/sec: 5298.013245033113 Average rate: 6096.847248021578
199515 Rate: count/sec: 5298.013245033113 Average rate: 6084.755532165502
202535 Rate: count/sec: 10596.026490066226 Average rate: 6152.023107117288
205555 Rate: count/sec: 5298.013245033113 Average rate: 6139.4760526379805
208575 Rate: count/sec: 5298.013245033113 Average rate: 6127.292340884574
211595 Rate: count/sec: 5298.013245033113 Average rate: 6115.456414376521
The average is now about 6100/sec: roughly a 30% improvement over the ~4700/sec baseline (6100 / 4700 ≈ 1.3).
Someone will need to look at fs.js carefully to see what the best way to improve it is (which may include a
different value for kPoolSize, or a different strategy altogether; one illustration is sketched below). That
is, if this is considered a priority to fix, based on real-world usage.
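As one illustration only (not a proposed patch), the pool size could be made tunable, so different values can
be tried without editing lib/fs.js; NODE_FS_POOL_SIZE is a hypothetical name:
var kPoolSize = parseInt(process.env.NODE_FS_POOL_SIZE, 10) || 40 * 1024;   // falls back to the current default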
...
Some were wondering if the main issue was a difference in FS I/O latency. That's not the case on my system
(ZFS), which can easily be confirmed using DTrace:
node.js:
# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
ns
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@ 20410
2048 |@@@@@@@@@@@@@@@ 16156
4096 |@@@@@ 5475
8192 | 231
16384 | 9
32768 | 2
65536 | 0
The output shows the distribution of FS read latency, in nanoseconds.
vert.x:
# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
ns
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@ 56674
2048 |@@@@@@@@@@@@@@@@ 48668
4096 |@@@@ 12820
8192 | 555
16384 | 21
32768 | 2
65536 | 0