Node's streams are designed to tackle the problem of asynchronous data spread out over time — that data might "happen" zero to many times, and it may terminate either in an errored state or a successful state.
No single dimension of how streams work is particularly hard to model mentally, but taken all together the system quickly becomes hard to model accurately. Nevertheless, streams are a useful pattern; this document will attempt to note where liberal abstractions may safely be made.
Streams, by necessity, operate on an asynchronous substrate: events, promises, or callbacks. Node's streams use `EventEmitter` events as their substrate. WHATWG streams operate on `Promise`s.
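A rough sketch of that difference in substrate (the file name and the WHATWG readable handed to the function are made up for illustration):

```js
const fs = require('fs')

// Node stream: the EventEmitter substrate. Data, completion, and failure
// all arrive as events on the stream instance.
const file = fs.createReadStream('example.txt') // hypothetical file
file.on('data', (chunk) => console.log('read %d bytes', chunk.length))
file.on('end', () => console.log('done'))
file.on('error', (err) => console.error('failed', err))

// WHATWG stream: the Promise substrate. Each read() resolves a promise
// carrying { value, done }.
async function drain (whatwgReadable) {
  const reader = whatwgReadable.getReader()
  while (true) {
    const { value, done } = await reader.read()
    if (done) break
    console.log('read %d bytes', value.length)
  }
}
```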
The APIs available on streams serve many masters. There are, roughly speaking, two sets of APIs available:

- `.read`
- `.write`
- `.end`

and:

- `.push`
- `.pipe`

It is not wise to mix the two APIs. The first API of a stream should only be used if the stream is not currently, and will never be, piped to another stream. The second API may be used for any stream in a pipeline.
You can cook an egg on a car's engine, or you can drive the car on the highway. These are both public APIs of a car. It is unwise to attempt to use both at once.
This document will concentrate on the second API, since most of the power and complexity of streams are contained in those two methods.
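A minimal sketch of that second API in action (the output file name is made up): a readable pushes chunks into its internal buffer with `.push`, and `.pipe` moves them downstream.

```js
const { Readable } = require('stream')
const fs = require('fs')

// A readable that produces three chunks via .push(), then signals EOF.
const source = new Readable({
  read () {
    this.push('one\n')
    this.push('two\n')
    this.push('three\n')
    this.push(null) // null means "no more data"
  }
})

// .pipe() wires the source to a destination and manages backpressure for us.
source.pipe(fs.createWriteStream('output.txt'))
```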
`highWaterMark` exists to mask the effects of latency. Backpressure exists to deal with bandwidth. If those terms don't make sense to you, read on.
Backpressure is the mechanism by which a destination stream may communicate to a source stream that it is currently busy and should not be sent any more information.
This mechanism is essential to the composability of streams. Without it, there is no guarantee that a pipeline would handle differential data rates appropriately: the author of each link in the pipeline would have to get the (often tricky) specifics of buffering right, and communicate that buffered status to every other stream in the pipeline.
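A sketch of what this buys you (the file name is made up): if the client on the other end of the response reads slowly, `pipe` pauses the file read rather than buffering the whole file in memory.

```js
const fs = require('fs')
const http = require('http')

// The disk read is fast; the client may be slow. pipe() propagates the
// response stream's backpressure to the file read, pausing and resuming it
// as the client catches up.
http.createServer((req, res) => {
  fs.createReadStream('big-file.iso').pipe(res)
}).listen(8000)
```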
The story you get out of the box with Node streams does not end there.
One of the primary goals of Node is to be fast. Many of Node's streams are backed by resources that are operated by turning them "on" and turning them "off." You can think of them like a rusty outdoor spigot: when they're on, they constantly flow; when they're off, nothing comes out.
Turning a rusty spigot on or off takes considerable amounts of time.
Node's streams optimize for this: they attempt to minimize the number of times a resource will be turned "off" due to backpressure. A stream that has been told to "back off" will continue to fill an internal buffer of items until it reaches its `highWaterMark`. This happens on both the readable and writable sides of streams.
A writable stream that receives a write while it's currently processing other data will attempt to pause if `highWaterMark` is 0. If its `highWaterMark` is greater than 0, it buffers the write as a "write request."
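You can observe this from the outside: `write` returns `false` once the buffered write requests pass `highWaterMark`, and `'drain'` fires once they have been flushed. A sketch, with a made-up destination file:

```js
const fs = require('fs')

const dest = fs.createWriteStream('dump.bin', { highWaterMark: 16 * 1024 })

function writeChunk (chunk) {
  // false means "buffered past highWaterMark — please back off."
  const ok = dest.write(chunk)
  if (!ok) {
    dest.once('drain', () => {
      // the buffered write requests have been flushed; safe to resume writing
    })
  }
  return ok
}
```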
When a readable stream receives word from a destination writable that it should back off from producing data, it won't stop pulling data from its underlying resource until it hits `highWaterMark`. This mitigates the effects of stream latency: a stream that consumes or produces data at the same rate as other points in the pipeline, but offset in time by some delay.
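On the readable side this looks roughly like the sketch below; the resource and its `take` method are hypothetical. `push` returns `false` once the internal buffer reaches `highWaterMark`, and the machinery calls `_read` again when a consumer drains it.

```js
const { Readable } = require('stream')

// A readable backed by a hypothetical resource with a take(size) method.
class SpigotStream extends Readable {
  constructor (resource) {
    super({ highWaterMark: 64 * 1024 }) // buffer up to ~64KiB before backing off
    this.resource = resource
  }

  _read (size) {
    const chunk = this.resource.take(size) // hypothetical underlying resource
    if (chunk === null) {
      this.push(null) // resource exhausted: signal end-of-stream
      return
    }
    // push() returns false once the buffer reaches highWaterMark; the stream
    // machinery will call _read() again when a consumer drains it.
    this.push(chunk)
  }
}
```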
Understanding these concepts will help you understand how important events in a stream's lifecycle are communicated.
As mentioned before, streams operate on a substrate made up of another asynchronous primitive; you could design your own streams using nothing but node-style callbacks (it's been done before!).
Node's streams are based on EventEmitters by way of inheritance. Anyone with a reference to an EventEmitter instance may attach listeners for a given topic on the emitter, or they may emit an event on a topic with some associated data. Early versions of streams encouraged authors to directly use EventEmitter machinery. Unfortunately, this was like putting the on-switch of a garbage disposer inside the drain with the garbage disposer: convenient for the person installing the garbage disposer, okay for folks who were savvy to the official garbage disposer activation tools, and very easy to get disastrously wrong for everyone else.
Even in the newest outings of the Stream API, there are some vestiges of this Little Shop of Horrors garbage disposal system. As such, you should avoid calling `emit` on any stream instance, whether you control it or not. The pipeline machinery is making use of that mechanism, and it's very easy to jam that machinery up in novel (and more importantly, hard to diagnose!) ways by emitting your own events.
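In practice that means: attach listeners, don't emit. Or better, reach for `stream.pipeline`, which wires up the error and completion events for you. A sketch with made-up file names:

```js
const fs = require('fs')
const zlib = require('zlib')
const { pipeline } = require('stream')

// pipeline() attaches the right listeners internally and tears everything
// down on failure — no hand-rolled emit('error', ...) required.
pipeline(
  fs.createReadStream('access.log'),
  zlib.createGzip(),
  fs.createWriteStream('access.log.gz'),
  (err) => {
    if (err) console.error('pipeline failed:', err)
    else console.log('pipeline succeeded')
  }
)
```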
Here are the event topics that are part of streams:
| topics | PIPE | READABLE | WRITABLE | USER CAN EMIT |
| --- | --- | --- | --- | --- |
| data | ✅ | ✅ | | |
| end | ✅ | ✅ | | |
| pause | ✅ | | | |
| resume | ✅ | | | |
| readable | ✅ | ✅ | | |
| drain | ✅ | ✅ | | |
| finish | ✅ | ✅ | | |
| prefinish | ✅ | | | |
| pipe | ✅ | | | |
| unpipe | ✅ | | | |
| close | ✅ | | | |
| error | ✅ | ✅ | | |