Exploring generators and iteration through zip()

Win zip()

I don't often use zip(), but by coincidence this week I ran into it a few times, using both Python and JavaScript. In the former, it's a built-in, but in the latter it's typically provided by a library like D3. And it struck me as kind of a fun warm-up challenge. How would you write this function in modern JavaScript?

Well, let's see how D3 does it. Oh, it's a wrapper around transpose(), here we go...

import min from "./min.js";

function length(d) {
  return d.length;
}

export default function transpose(matrix) {
  if (!(n = matrix.length)) return [];
  for (var i = -1, m = min(matrix, length), transpose = new Array(m); ++i < m;) {
    for (var j = -1, n, row = transpose[i] = new Array(n); ++j < n;) {
      row[j] = matrix[j][i];
    }
  }
  return transpose;
}

WHAT.

(Side note: I'm not a fan of D3, in part because I think it encourages a culture of copy/paste coding instead of building understanding and good code habits. And I believe that's evident even in the source, which is almost uniformly written in a style that looks like someone fed it through a C++ obfuscator. I would never tell a new developer to learn by reading D3's source, and that's a shame.)

Never mind, then. We'll write zip() ourselves, starting with a naive implementation and then working toward something fancier.

The basics

A zip operation means taking multiple arrays, and combining them into a single array of arrays, such that each index contains all the items at that index in the original lists. For example:

var a = [1, 2, 3];
var b = ["a", "b", "c"];

zip(a, b);
// [[1, "a"], [2, "b"], [3, "c"]]

The basic way to implement this is to find the length of the longest array, then loop through all the indexes, pushing each merged slice into an output array. Apart from the spread operator and a few other bits of newer syntax, this is basically how you'd have written it in old-school 2010-era JavaScript.

function zip(...arrays) {
  var length = Math.max(...arrays.map(a => a.length));
  var output = [];
  
  for (var i = 0; i < length; i++) {
    // grab the items at index i across arrays
    var merged = [];
    for (var a of arrays) {
      merged.push(a[i]);
    }
    // add the slice to the output
    output.push(merged);
  }
  return output;
}

Rewrite #1: Generator

Our basic function is cool but it requires us to build the array as a complete structure before returning it. If the lists we want to zip are very large, that might have memory implications. It also means that we might do a lot of work that we don't necessarily want to do if we're only actually interested in the first n items of the array.

We can make our function lazy, meaning that it only zips together items as they're requested, by using JavaScript's iteration protocol, which is a standard way for something to declare "you can loop over me" (it's how for...of loops work). And the easiest way to do that is with a generator function:

function* zip(...arrays) {
  var length = Math.max(...arrays.map(a => a.length));
  
  for (var i = 0; i < length; i++) {
    var merged = [];
    for (var a of arrays) {
      merged.push(a[i]);
    }
    yield merged;
  }
}

We declare a generator by adding a * between the function keyword and the name. JavaScript isn't picky about the exact placement -- you can write either function* zip or function *zip -- but I prefer the former because it's easier to read the name that way.

Inside our generator, we can yield, which is kind of like a return but for a single item -- it technically pauses function execution, to be resumed later. A generator can't just return one value the way a normal function would, since it needs to provide multiple values over time. Instead, the result of calling a generator function like this is an iterator object, which has a next() method. Each time you call next() on that iterator, the VM resumes the generator until it either yields a new value or returns for good. So using it manually would look something like this:

var a = [1, 2, 3];
var b = ["a", "b", "c"];

var iter = zip(a, b);
iter.next(); // { value: [1, "a"], done: false }
iter.next(); // { value: [2, "b"], done: false }
iter.next(); // { value: [3, "c"], done: false }
iter.next(); // { value: undefined, done: true }

Luckily, most of the time you don't use an iterator directly like this. Instead, you use for...of loops, which will unpack the values for you, or the spread operator to convert it into an array.

for (var pair of zip(a, b)) {
  console.log(pair); // we'll see each pair here
}

// or we can get all values
var zipped = [...zip(a, b)];

As I said, the advantage of iterators like this is that they're lazy. For example, let's write a take() function, which is a standard way to ask for the first n items of an iterable like an array.

function* take(iterable, count) {
  // get an iterator for whatever we fed to take()
  var iterator = iterable[Symbol.iterator]();
  // get the first n items
  for (var i = 0; i < count; i++) {
    var item = iterator.next();
    if (item.done) return;
    yield item.value;
  }
}
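
As a quick check that take() really does work on anything iterable, here it is pulling the first two values out of a Set (my own throwaway example, not from the original text):

var firstTwo = [...take(new Set(["a", "b", "c", "d"]), 2)];
// ["a", "b"]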

If an object in JavaScript supports the iteration protocol, it'll provide an iterator function at [Symbol.iterator]. Generator functions do this for us automatically. Strings, Arrays, Maps, and Sets also have their own iterators this way, so they could also be used with take(). And since our generator version of zip() is lazy, if we modify it to log out the index, we'll see that it only loops through as many items as we take(), instead of processing the whole list.

var a = [1, 2, 3, 4, 5, 6, 7, 8, 9, 0];
var b = "abcdefg";

function* zip(...arrays) {
  var length = Math.max(...arrays.map(a => a.length));
  
  for (var i = 0; i < length; i++) {
    // log out the index:
    console.log(i);
    var merged = [];
    for (var a of arrays) {
      merged.push(a[i]);
    }
    yield merged;
  }
}

var tooken = take(zip(a, b), 3);
// logs out 0, 1, 2 from inside zip()
console.log(...tooken);
// [1, "a"], [2, "b"], [3, "c"]

That's kind of neat! It's not such a big deal with something like zip(), but if the operation we wanted to perform on a large dataset was extremely expensive, like hashing or a complicated analysis step, we could use this to only do that work for the items we're actually going to use.
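
To make that concrete, here's a rough sketch -- the lazyMap name and the fake "expensive" function are mine, not part of the original -- of a mapping generator that only does the costly work for items something downstream actually consumes:

function* lazyMap(iterable, fn) {
  for (var item of iterable) {
    yield fn(item);
  }
}

var expensive = n => {
  console.log("working hard on", n); // stand-in for hashing or heavy analysis
  return n * n;
};

var firstThree = [...take(lazyMap([1, 2, 3, 4, 5, 6], expensive), 3)];
// logs "working hard on" 1, 2, and 3 -- items 4 through 6 are never touched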

Here's another interesting result of using lazy computation in generators: if you use destructuring in JS, it effectively works like take(), in that it will only iterate through as many items as you request in the assignment. So using the zip() implementation above, this code will only log out 0 and 1 from inside zip(), because our destructuring only uses the first two items from the iterable.

var [pair1, pair2] = zip(a, b);

Okay, now pull on your skinny jeans and get on your fixed-gear bike, we're getting really fancy.

Rewrite #2: Iteration protocol compliance

Our take() function has some interesting properties: it keeps track of its output length, but it doesn't actually care how long the inputs are. Instead, it just uses the iteration protocol, and the input iterator tells it when there are no more items remaining by returning { done: true }. That means that you can take() from iterable objects that may or may not have a predetermined length.

For example, we might have a generator that gives us exponential backoff values to prevent collisions on a network -- it just keeps yielding bigger and bigger numbers forever if you keep asking for them, and at some point you either exceed your retry limit or the network clears up (there's a sketch of that below). Alternately, we could have a generator that yields values as long as it keeps getting heads on a flipped coin:

function* rosencrantz() {
  while (true) {
    if (Math.random() > .5) return;
    yield "heads!";
  }
}

for (var flip of rosencrantz()) {
  console.log(flip); // this will continue until tails
}
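
And for the record, the exponential-backoff generator mentioned above might look something like this (a sketch with made-up starting values, not code from any particular library):

function* backoff(start = 100) {
  var delay = start;
  while (true) {
    yield delay;
    delay *= 2; // it will happily double forever; the caller decides when to give up
  }
}

// take() from earlier puts a cap on how many values we actually compute
console.log([...take(backoff(), 4)]);
// [100, 200, 400, 800]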

Of course, in addition to opening the way to existential dread, relying on the iteration protocol has another advantage, which is that it works even on objects that keep their bounds somewhere other than .length -- Maps and Sets have .size instead. Let's change zip() so that it fully relies on the protocol instead of hard-coding the length check.

function* zip(...arrays) {
  var iterators = arrays.map(a => a[Symbol.iterator]());
  while (true) {
    var nexts = iterators.map(i => i.next());
    var done = nexts.every(n => n.done);
    if (done) return;
    var values = nexts.map(n => n.value);
    yield values;
  }
}

Our new generator gets an iterator from each object you pass in, and then repeatedly asks those iterators for their next values. When all of them say that they're done, it returns; otherwise it lazily yields out the available values. You can pass in strings, arrays, Maps, Sets, or user-defined objects -- as long as they obey the protocol by exposing [Symbol.iterator], our new version of zip() will handle them.
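
For instance (to pick an arbitrary pairing), this works with no special casing, because Sets and strings both expose [Symbol.iterator]:

var letters = new Set(["x", "y", "z"]);
console.log([...zip(letters, "abc")]);
// [["x", "a"], ["y", "b"], ["z", "c"]]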

Note that this version will run until all inputs are exhausted -- if you pass it a generator that runs forever, it'll happily freeze your browser tab. However, it's an easy tweak to make it run only as long as the shortest input: just change every to some in the done-ness check, so that any iterator signalling completion will return from the function.
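
Spelled out (under a name I've made up for the occasion), that variant looks like this, and behaves the way Python's zip() does by stopping at the shortest input:

function* zipShortest(...iterables) {
  var iterators = iterables.map(it => it[Symbol.iterator]());
  while (true) {
    var nexts = iterators.map(i => i.next());
    // any exhausted input ends the whole zip
    if (nexts.some(n => n.done)) return;
    yield nexts.map(n => n.value);
  }
}

console.log([...zipShortest([1, 2, 3, 4], "ab")]);
// [[1, "a"], [2, "b"]]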

YYZ

Although I know it makes me seem like a crank, this is what I like about writing my own utility code instead of relying on libraries and micromodules. Even relatively simple problems are a chance to learn something and improve our skills. Something as simple as zip() takes us on a tour of iteration, generator functions, and lazy evaluation.

We also start to see how these things can combine with each other to enable expressive new syntax and performance techniques. Once you have a lazy zip() and take(), you might start thinking about other inputs that can be consumed as a list of values, like lines of text or values in a CSV. And you might imagine places where you could add iterator protocol support to your own objects, thus letting them play in that same space.
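
As a sketch of that last idea (the Range class here is my own invention, not something from the post or from D3), a user-defined object can opt into the protocol by implementing [Symbol.iterator] as a generator method:

class Range {
  constructor(start, end) {
    this.start = start;
    this.end = end;
  }

  *[Symbol.iterator]() {
    for (var i = this.start; i < this.end; i++) {
      yield i;
    }
  }
}

// now it works anywhere an iterable does, including our zip() and take()
console.log([...zip(new Range(1, 4), "abc")]);
// [[1, "a"], [2, "b"], [3, "c"]]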

And now we know we should never read the D3 source. Valuable lessons all around.
