# FastPipe
After many hours of trial and error, self-optimizing streams and the like, I have finally found what might well be the fastest and most efficient way to copy an image from a given source to a given target.
There were several constraints that had to be as optimal as possible:
- Speed
- Memory consumption
- CPU utilization
Of course, this comes with the classic trade-off problem by default: improving one of these usually comes at the cost of the others.
## Previous Approaches
Across the previous approaches described below, figures in the following ranges were observed:
- Memory: up to ~600MB
- CPU: 20-70%
### Node Core Streams
Depending on the `highWaterMark`, these are either very CPU-bound or memory-bound. High `highWaterMark`s result in low CPU usage, but an extreme number of `Buffer` allocations and larger stream-internal buffer pools, which in turn sends the GC on a rampage. Low `highWaterMark`s result in very high CPU usage and lower memory consumption (also depending on how many buffers can be sliced off of the internal pre-allocated buffer pool), though the memory load is still very volatile due to the GC.
### Hand-rolled (core based) Streams
Several different hand-rolled streams were written and tested:
**Self-optimizing block size & `highWaterMark`:** Performed quite badly, since it is based on the core streams (so the same issues apply), and it never makes up the time it takes to find a good block size.
**Custom write coupled to a readable stream:** The idea was to `.read(blockSize)` on demand, which turned out to be very inefficient, as it resulted in a high number of smaller reads and internal buffer growth. It was also surprisingly slow.
**Custom readable & writable with fixed block sizes:** Made things a lot faster, but suffered from immense memory consumption and GC issues. A pre-allocated `SlowBuffer` approach was also tried with this one, which failed horribly: core streams buffer internally even when hitting the `highWaterMark` prematurely, leading to buffer re-use and thus garbage data being written.
## Current Approach
Simple functions all the way down; no `this`. The reason is to avoid having to bind callbacks to contexts, which would just make things unnecessarily complicated here.
The basic setup is to open & stat the source & target, determine a good block size (multiplied by a factor that experimentally turned out to be the sweet spot), then proceed to sequentially read and write buffers of the given block size. This allows for a one-off allocation of a `SlowBuffer` and GC-free re-use of that buffer.
All of that brings the memory consumption down to the lowest level possible, and also keeps the CPU utilization to a minimum, as most time is spent in the lower levels (reading & writing).
With the following v8 options also passed to the process, very low resource consumption has been observed:
```
node --max-executable-size=8 --max-old-space-size=16 --max-semi-space-size=1
```
While flashing a 4.3 GB image to an SD-Card, the following value ranges have been observed:
- Memory: 16-22MB
- CPU: 2-4%
- Throughput: 23.5 MB/s (it's a slow one)
Note that the fluctuations in memory consumption are likely mostly due to peripheral code, i.e. logging, speed measurements and other modules.
While copying said image to an external HDD (magnetic) connected via USB 2.0:
- Memory: 16-22MB
- CPU: 12-18%
- Throughput: 108 MB/s
## To Do
- Better benchmarks with more devices
- Emit each block to support alter/read-before-write (could also be easily done with support for streaming sources)
- Support streams as sources
- Improve API
- Get rid of `async`
## Usage
```js
var fastPipe = require( './fast-pipe' )

var source = {
  path: '/path/to/source.img',
  flags: 'r',
}

// NOTE: Make sure you use the raw disk descriptor,
// otherwise it's going to be a lot slower
var target = {
  path: '/dev/rdisk4',
  flags: 'w',
}

var emitter = fastPipe( source, target, {
  // Optional, defaults to block size determined by `fs.stat()`
  blockSize: 512 * 1024,
})

emitter.once( 'error', ( error ) => { /* handle the error */ })
emitter.once( 'end', () => { /* copy finished */ })
```