@chrisdickinson
Created July 30, 2015 03:02

Streams: Here is What is Supposed to Happen

Node's streams are designed to tackle the problem of asynchronous data spread out over time — that data might "happen" zero to many times, and it may terminate either in an errored state or a successful state.

No single dimension of how streams work is particularly hard to mentally model, but taken all together it quickly becomes hard to accurately model the entire system. Nevertheless, they are a useful pattern; this document will attempt to note where liberal abstractions may be safely made.

Streams, by necessity, operate on an asynchronous substrate: events, promises, or callbacks. Node's streams use EventEmitter events as a substrate. WHATWG streams operate on Promises.

The APIs available on streams serve many masters. There are, roughly speaking, two sets of APIs available:

  • .read
  • .write
  • .end

and:

  • .push
  • .pipe

It is not wise to mix the two APIs. The first API of a stream should only be used if the stream is not currently, and will never be, piped to another stream. The second API may be used for any stream in a pipeline.

You can cook an egg on a car's engine, or you can drive the car on the highway. These are both public APIs of a car. It is unwise to attempt to use both at once.

This document will concentrate on the second API, since most of the power and complexity of streams are contained in those two methods.
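As a minimal sketch of the second API: a readable that supplies items with .push, connected to a writable with .pipe. The names `source`, `sink`, `items`, `received`, and `finished` are made up for this illustration, and object mode is assumed only to keep the items easy to inspect.

```javascript
const { Readable, Writable } = require('stream');

// Hypothetical data for the sketch.
const items = ['one', 'two', 'three'];

const source = new Readable({
  objectMode: true,
  read() {
    // push() hands one item to the stream machinery; pushing null signals the end.
    this.push(items.length > 0 ? items.shift() : null);
  }
});

const received = [];
const sink = new Writable({
  objectMode: true,
  write(chunk, encoding, callback) {
    received.push(chunk);
    callback(); // tell the machinery this chunk has been handled
  }
});

// Resolves with everything the sink saw once the pipeline has finished.
const finished = new Promise((resolve, reject) => {
  sink.on('finish', () => resolve(received));
  sink.on('error', reject);
});

// pipe() does the reading, writing, and ending on our behalf.
source.pipe(sink);
```

Note that nothing here calls .read, .write, or .end directly; the pipe drives both streams.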

What you need to know about backpressure and watermarks

highWaterMark exists to mask the effects of latency. Backpressure exists to deal with bandwidth. If those terms don't make sense to you, read on.

Backpressure is the mechanism by which a destination stream may communicate to a source stream that it is currently busy and should not be sent any more information.

This mechanism is essential to the composability of streams. Without it, there is no guarantee that a pipeline would handle differential data rates appropriately: the author of each link in the pipeline would have to get the (often tricky) specifics of buffering and communicating the buffered status to every other stream in the pipeline.
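To make the differential-rate case concrete, here is a rough sketch that pipes a source that can produce instantly into a sink that takes a turn of the event loop per item. `fastSource`, `slowSink`, `TOTAL`, `HIGH_WATER_MARK`, and `peakBuffered` are names invented for this example, and the numbers are arbitrary. It records how many items sit in the two internal buffers at any one time.

```javascript
const { Readable, Writable } = require('stream');

const TOTAL = 200;           // arbitrary number of items for the sketch
const HIGH_WATER_MARK = 4;   // arbitrary, small on purpose

let next = 0;
const fastSource = new Readable({
  objectMode: true,
  highWaterMark: HIGH_WATER_MARK,
  read() {
    // Can always produce immediately; null ends the stream.
    this.push(next < TOTAL ? next++ : null);
  }
});

let peakBuffered = 0;
let receivedCount = 0;
const slowSink = new Writable({
  objectMode: true,
  highWaterMark: HIGH_WATER_MARK,
  write(chunk, encoding, callback) {
    receivedCount += 1;
    // Items waiting in the source's buffer plus items queued in the sink.
    const inFlight = fastSource.readableLength + slowSink.writableLength;
    peakBuffered = Math.max(peakBuffered, inFlight);
    setImmediate(callback); // simulate a slower consumer
  }
});

const pipelineDone = new Promise((resolve, reject) => {
  slowSink.on('finish', () => resolve({ receivedCount, peakBuffered }));
  slowSink.on('error', reject);
});

fastSource.pipe(slowSink);
```

Neither stream contains any buffering logic of its own, yet the number of items in flight should stay far below `TOTAL`: the sink's "busy" signal travels back through the pipe and the source stops being asked for more.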

The story you get out of the box with Node streams does not end there.

One of the primary goals of Node is to be fast. Many of Node's streams are backed by resources that operate by turning them "on" and turning them "off." You can think of them like a rusty outdoor spigot. When they're on, they constantly flow; when they're off, nothing comes out.

Turning a rusty spigot on or off takes considerable amounts of time.

Node's streams optimize for this: they attempt to minimize the number of times a resource will be turned "off" due to backpressure. A stream that has been told to "back off" will continue to fill up an internal buffer of items until it's reached a highWaterMark of contents. This happens on both the readable and writable side of streams.

A writable stream that receives a write while it's currently processing other data will attempt to pause if highWaterMark is 0. If its highWaterMark is greater than 0, it buffers the write as a "write request."
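The following sketch pokes at this from the outside, using .write directly on streams that are never piped. `makeSlowWritable` is a hypothetical helper, and object mode is assumed so that each write counts as one item against highWaterMark. The return value of .write is the "please back off" signal.

```javascript
const { Writable } = require('stream');

// Hypothetical helper: a writable that takes a turn of the event loop per item.
function makeSlowWritable(highWaterMark) {
  return new Writable({
    objectMode: true,
    highWaterMark,
    write(chunk, encoding, callback) {
      setImmediate(callback);
    }
  });
}

// With a highWaterMark of 3, writes are accepted and queued until the
// queue reaches 3 items; only then does write() report "back off".
const buffered = makeSlowWritable(3);
const bufferedResults = ['a', 'b', 'c'].map((item) => buffered.write(item));
const pendingCount = buffered.writableLength;

// With a highWaterMark of 0, the very first write already reports "back off".
const unbuffered = makeSlowWritable(0);
const unbufferedResult = unbuffered.write('a');

console.log(bufferedResults, pendingCount, unbufferedResult);
// prints: [ true, true, false ] 3 false
```

A false return does not mean the write was dropped; the item is still queued, and the caller is merely being asked to wait.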

When a readable stream receives word from a destination writable that it should back off from producing data, it won't stop pulling data from its underlying resource until it hits highWaterMark.
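On the readable side, the return value of .push is how the implementation learns that the buffer has reached highWaterMark. A small sketch, assuming object mode and a made-up `source` with nothing consuming it:

```javascript
const { Readable } = require('stream');

const source = new Readable({
  objectMode: true,
  highWaterMark: 2,
  read() {} // no-op: this sketch pushes by hand instead
});

// Nothing is consuming `source`, so every pushed item lands in its buffer.
const first = source.push('a');  // buffer holds 1 item, below highWaterMark
const second = source.push('b'); // buffer holds 2 items, highWaterMark reached
const third = source.push('c');  // still accepted, but the answer stays "stop"
const bufferedCount = source.readableLength;

console.log(first, second, third, bufferedCount);
// prints: true false false 3
```

A well-behaved stream stops pulling from its underlying resource once .push returns false, and resumes when its read implementation is called again.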

This mitigates the effects of stream latency: that is, a stream that can consume or produce data at the same rate as other points in the pipeline, but offset in time by a given delay.

Understanding these concepts will help you understand how important events in a stream's lifecycle are communicated.

Events and methods

As mentioned before, streams operate on a substrate made up of another asynchronous primitive; you could design your own streams using nothing but node-style callbacks (it's been done before!).

Node's streams are based on EventEmitters by way of inheritance. Anyone with a reference to an EventEmitter instance may attach listeners for a given topic on the emitter, or they may emit an event on a topic with some associated data. Early versions of streams encouraged authors to directly use EventEmitter machinery. Unfortunately, this was like putting the on-switch of a garbage disposer inside the drain with the garbage disposer: convenient for the person installing the garbage disposer, okay for folks who were savvy to the official garbage disposer activation tools, and very easy to get disastrously wrong for everyone else.
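For readers who have not used EventEmitter directly, here is a brief sketch of the listener and emit machinery on a plain emitter, plus a check that a stream really is one. The topic name `greeting` and the variable names are invented for the example.

```javascript
const { EventEmitter } = require('events');
const { Readable } = require('stream');

const emitter = new EventEmitter();
const seen = [];

// Anyone holding a reference can listen on a topic...
emitter.on('greeting', (name) => seen.push(name));

// ...and anyone holding a reference can emit on that topic.
emitter.emit('greeting', 'world');

// Streams inherit all of this machinery.
const stream = new Readable({ read() {} });
const streamIsEmitter = stream instanceof EventEmitter;

console.log(seen, streamIsEmitter);
// prints: [ 'world' ] true
```

That openness is exactly the hazard: on a stream, the same emit call is available to everyone, including code that has no business triggering pipeline events.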

Even in the newest outings of the Stream API, there are some vestiges of this Little Shop of Horrors garbage disposal system. As such, you should avoid calling emit on any stream instance, whether you control it or not. The pipeline machinery is making use of that mechanism, and it's very easy to jam that machinery up in novel (and more importantly, hard to diagnose!) ways by emitting your own events.

Here are the event topics that are part of streams:

topics PIPE READABLE WRITABLE USER CAN EMIT
data
end
pause
resume
readable
drain
finish
prefinish
pipe
unpipe
close
error