@brendangregg
Created May 13, 2012 02:42
node.js fs.readFile analysis
Here's some more perf analysis: what was the limiter for node.js in the original fs.readFile benchmark,
causing the large performance difference?
Examining thread microstates using prstat (SmartOS):
# prstat -mLcp 92432 1
Please wait...
[...]
PID USERNAME USR SYS TRP TFL DFL LCK SLP LAT VCX ICX SCL SIG PROCESS/LWPID
92432 root 90 10 0.0 0.0 0.0 0.0 0.0 0.0 7 9 42K 0 node/1
One thread is 100% on-CPU (USR + SYS), which is likely to be the limiter (this alone isn't enough information
to confirm it, but given the node.js architecture it is very likely). That thread is spending 90% of its
time in user-land. Some straightforward profiling with DTrace can tell us what it's doing.
# dtrace -n 'profile-99 /pid == $1/ { @[jstack(100, 8000)] = count(); }' 92432
dtrace: description 'profile-99 ' matched 1 probe
^C
[...]
node`_ZN2v88internal19FlexibleBodyVisitorINS0_20StaticMarkingVisitorENS0_10FixedArray14BodyDescriptorEvE5VisitEPNS0_3MapEPNS0_10HeapObjectE+0x5b
node`_ZN2v88internal20MarkCompactCollector17EmptyMarkingStackEv+0x6f
node`_ZN2v88internal13GlobalHandles18IterateStrongRootsEPNS0_13ObjectVisitorE+0x51
node`_ZN2v88internal4Heap18IterateStrongRootsEPNS0_13ObjectVisitorENS0_9VisitModeE+0x139
node`_ZN2v88internal20MarkCompactCollector15MarkLiveObjectsEv+0x83
node`_ZN2v88internal20MarkCompactCollector14CollectGarbageEv+0x24
node`_ZN2v88internal4Heap11MarkCompactEPNS0_8GCTracerE+0x92
node`_ZN2v88internal4Heap24PerformGarbageCollectionENS0_16GarbageCollectorEPNS0_8GCTracerE+0x33b
node`_ZN2v88internal4Heap14CollectGarbageENS0_15AllocationSpaceENS0_16GarbageCollectorE+0x23c
node`_ZN2v88internal7Factory11NewJSObjectENS0_6HandleINS0_10JSFunctionEEENS0_13PretenureFlagE+0x90
node`_ZN2v88internal17Runtime_NewObjectENS0_9ArgumentsEPNS0_7IsolateE+0x1a2
<< internal code >>
<< constructor >>
allocNewPool at fs.js position 53610
(anon) as ReadStream._read at fs.js position 57236
(anon) as fs.ReadStream at fs.js position 56236
<< internal >>
<< entry >>
node`_ZN2v88internalL6InvokeEbNS0_6HandleINS0_10JSFunctionEEENS1_INS0_6ObjectEEEiPPPS4_Pb+0x1d5
node`_ZN2v88internal9Execution4CallENS0_6HandleINS0_6ObjectEEES4_iPPPS3_Pbb+0x57
node`_ZN2v88Function4CallENS_6HandleINS_6ObjectEEEiPNS1_INS_5ValueEEE+0xd5
node`_ZN4nodeL5AfterEP7uv_fs_s+0x14d
node`uv__fs_after+0x271
node`eio_finish+0x28
node`etp_poll+0x130
node`eio_poll+0xb
node`uv_eio_do_poll+0x31
node`uv__idle+0x29
node`ev_invoke_pending+0x71
node`ev_run+0x452
node`uv_run+0x17
node`_ZN4node5StartEiPPc+0xa6
node`main+0x1b
node`_start+0x83
15
[...]
I've included one of the common stacks. It shows node.js is spending its time in garbage collection, along
with some other interesting details I'll get to in a moment.
Quantifying garbage collection time, per second:
# dtrace -n 'node$target:::gc-start { self->start = timestamp; }
node$target:::gc-done /self->start/ { @["ms"] = sum(timestamp - self->start); self->start = 0; }
tick-1s { normalize(@, 1000000); printa("GC: %@d ms", @); trunc(@); }' -p 92432
dtrace: description 'node$target:::gc-start ' matched 3 probes
CPU ID FUNCTION:NAME
12 64930 :tick-1s GC: 649 ms
12 64930 :tick-1s GC: 155 ms
12 64930 :tick-1s GC: 115 ms
12 64930 :tick-1s GC: 551 ms
12 64930 :tick-1s GC: 552 ms
12 64930 :tick-1s GC: 132 ms
12 64930 :tick-1s GC: 219 ms
12 64930 :tick-1s GC: 633 ms
12 64930 :tick-1s GC: 264 ms
The time in GC is indeed significant, and varies from second to second. Using both sampling and tracing,
I measured the GC time to be between 32% and 54% of runtime. Since this is caused by the allocation
of fs objects (e.g., the "allocNewPool at fs.js" line in the above stack), there is additional overhead from
object creation as well.
This means this is "a" problem, and one that may lead to up to a 2x improvement, should the GC/object-creation
time be eliminated. There may well be other problems in how the fs package actually operates, leading to
further wins on top of that.
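For context, here's a minimal repro sketch of the kind of workload that exercises this path. It isn't the
benchmark harness used above; the stack shows the fs.ReadStream path (ReadStream._read -> allocNewPool), so
the sketch drives that path with fs.createReadStream, and the file path and concurrency are placeholders:
// Hypothetical repro sketch: stream the same file repeatedly and count
// completions per second, driving ReadStream._read/allocNewPool.
var fs = require('fs');
var file = process.argv[2] || '/etc/passwd';   // placeholder path
var completed = 0;

function streamOnce() {
  var rs = fs.createReadStream(file);
  rs.on('data', function (chunk) {});          // discard the data
  rs.on('error', function (err) { throw err; });
  rs.on('end', function () {
    completed++;
    streamOnce();                              // start the next read
  });
}

for (var i = 0; i < 10; i++) streamOnce();     // a few concurrent streams

setInterval(function () {
  console.log('reads/sec: ' + completed);
  completed = 0;
}, 1000);
Each stream slices its data chunks out of the shared pool, so a high read rate turns directly into Buffer
allocation and GC pressure.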
I've heard it was already known that the fs package doesn't perform well. I'm guessing it wasn't fixed
sooner as it hadn't really shown up in the real world. I've DTrace'd production node.js many times, and
haven't come across this before myself.
I'll take a quick look at the code behind the stack line shown earlier, "allocNewPool at fs.js" (lib/fs.js):
[...]
38 var kPoolSize = 40 * 1024;
[...]
1014 function allocNewPool() {
1015 pool = new Buffer(kPoolSize);
1016 pool.used = 0;
1017 }
[...]
1096 ReadStream.prototype._read = function() {
[...]
1102 if (!pool || pool.length - pool.used < kMinPoolSpace) {
1103 // discard the old pool. Can't add to the free list because
1104 // users might have refernces to slices on it.
1105 pool = null;
1106 allocNewPool();
1107 }
[...]
Line 38 sets the buffer size to 40k. Setting it to 1k improved performance (for this benchmark, on my system)
by about 30%.
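For clarity, the change tested was just the constant at line 38 of lib/fs.js:
var kPoolSize = 1 * 1024;   // was: 40 * 1024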
Before, with kPoolSize @ 40k:
134986 Rate: count/sec: 5298.013245033113 Average rate: 4726.416072777918
138006 Rate: count/sec: 5298.013245033113 Average rate: 4738.92439459154
141026 Rate: count/sec: 5298.013245033113 Average rate: 4750.8969977167335
144046 Rate: count/sec: 0.0 Average rate: 4651.291948405371
147066 Rate: count/sec: 5298.013245033113 Average rate: 4664.572368868399
150086 Rate: count/sec: 5298.013245033113 Average rate: 4677.3183374865075
153106 Rate: count/sec: 5298.013245033113 Average rate: 4689.561480281635
It's averaging around 4700/sec (last column; the count/sec column is erratic due to how the vert.x RateCounter
is updated).
Now with 1k:
193475 Rate: count/sec: 5298.013245033113 Average rate: 6109.316449153637
196495 Rate: count/sec: 5298.013245033113 Average rate: 6096.847248021578
199515 Rate: count/sec: 5298.013245033113 Average rate: 6084.755532165502
202535 Rate: count/sec: 10596.026490066226 Average rate: 6152.023107117288
205555 Rate: count/sec: 5298.013245033113 Average rate: 6139.4760526379805
208575 Rate: count/sec: 5298.013245033113 Average rate: 6127.292340884574
211595 Rate: count/sec: 5298.013245033113 Average rate: 6115.456414376521
The average is now about 6100/sec: roughly a 30% improvement over the ~4700/sec baseline (6100 / 4700 ≈ 1.3).
Someone will need to look at fs.js carefully to see what the best way to improve it is (which may include a
different value for kPoolSize, or a different strategy altogether; one illustration is sketched below). That
is, if this is considered a priority to fix, based on real-world usage.
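As one illustration only (not a proposed patch), the pool size could be made tunable, so different values can
be tried without editing lib/fs.js; NODE_FS_POOL_SIZE is a hypothetical name:
var kPoolSize = parseInt(process.env.NODE_FS_POOL_SIZE, 10) || 40 * 1024;   // falls back to the current default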
...
Some were wondering if the main issue was a difference in FS I/O latency. That's not the case on my system
(ZFS), which can easily be confirmed using DTrace:
node.js:
# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
ns
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@ 20410
2048 |@@@@@@@@@@@@@@@ 16156
4096 |@@@@@ 5475
8192 | 231
16384 | 9
32768 | 2
65536 | 0
The output shows the distribution of FS read latency, in nanoseconds.
vert.x:
# dtrace -n 'fbt::zfs_read:entry /pid == $1/ { self->start = timestamp; }
fbt::zfs_read:return /self->start/ { @["ns"] = quantize(timestamp - self->start); self->start = 0; }' 22900
dtrace: description 'fbt::zfs_read:entry ' matched 2 probes
^C
ns
value ------------- Distribution ------------- count
512 | 0
1024 |@@@@@@@@@@@@@@@@@@@ 56674
2048 |@@@@@@@@@@@@@@@@ 48668
4096 |@@@@ 12820
8192 | 555
16384 | 21
32768 | 2
65536 | 0