@davidbrochart
Created December 14, 2020 17:44

xtensor-zarr: xtensor is going to the cloud!

xtensor now stands as a well-established ND-array computing library in the C++ ecosystem. It has many things going for it:

  • a NumPy-like API (which has become a de-facto standard),
  • a lazy evaluation system, which avoids allocating temporary results,
  • it comes with batteries included, thanks to a bunch of extensions (xsimd, xtensor-io, xtensor-fftw, xtensor-blas, etc.),
  • it has bindings for major languages used in scientific computing (Python, Julia and R).
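To give a quick taste of the NumPy-like API and the lazy evaluation system, here is a small example using xtensor's xarray; the expression a + b allocates nothing until it is assigned:

#include <iostream>
#include <xtensor/xarray.hpp>
#include <xtensor/xio.hpp>

int main()
{
    xt::xarray<double> a = {{1., 2.}, {3., 4.}};
    xt::xarray<double> b = {{10., 20.}, {30., 40.}};
    auto expr = a + b;            // lazy: no temporary is allocated here
    xt::xarray<double> c = expr;  // the computation runs on assignment
    std::cout << c << std::endl;  // {{11., 22.}, {33., 44.}}
}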

However, it is pretty much limited to a local, in-memory execution model, which means that your data has to fit in memory. While this might be enough for a wide range of applications, it becomes a limitation when processing big data, such as satellite imagery or high-resolution model outputs. Python has developed an entire ecosystem of libraries to tackle this issue, including Dask, xarray and Zarr. They are based on a simple concept: if your data doesn't fit in memory, chunk it up into smaller pieces! Today we are pleased to announce that xtensor also supports chunked arrays, as well as the ever more popular Zarr storage format.

Chunked arrays

A chunked array can conceptually be thought of as an array of arrays. The inner arrays are called chunks, and they are held in an outer array. The inner and outer arrays share the same dimensionality, but of course not the same shape. For instance, consider a non-chunked 2D array of shape (10000, 10000). Chunking it along both dimensions with a chunk shape of (1000, 1000) results in 100 chunks (10 along each dimension), each taking up 100 times less memory than the original array, which makes them much more tractable.
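To make the arithmetic concrete, here is a minimal, self-contained sketch of the index mapping: an element of the big array lives in the chunk given by integer division of its index by the chunk shape, at the offset given by the remainder.

#include <cstddef>
#include <iostream>

int main()
{
    const std::size_t chunk_shape[2] = {1000, 1000};
    const std::size_t element[2] = {3500, 4200};  // an element of the (10000, 10000) array

    for (int d = 0; d < 2; ++d)
    {
        std::size_t chunk_index = element[d] / chunk_shape[d];  // which chunk
        std::size_t local_index = element[d] % chunk_shape[d];  // where inside that chunk
        std::cout << "dim " << d << ": chunk " << chunk_index
                  << ", offset " << local_index << std::endl;
    }
    // element (3500, 4200) lives in chunk (3, 4), at local offset (500, 200)
}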

Chunking arrays is most useful when the chunks are stored on a persistent storage system, such as a hard drive or in the cloud. But because these systems are much slower than main memory, you want to access them as little as possible. In xtensor, we have implemented a chunk manager that is able to hold several chunks in memory at the same time. When a chunk is first requested for reading, it is loaded into a memory pool, making further accesses faster. When the pool of chunks is full, a chunk is offloaded to the storage system to make room.
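The following toy chunk pool illustrates the idea (the names and the eviction policy are hypothetical, not xtensor's actual implementation): chunks are loaded on first access, and one resident chunk is evicted when the pool is full.

#include <cstddef>
#include <map>
#include <vector>

class chunk_pool
{
public:
    explicit chunk_pool(std::size_t capacity) : m_capacity(capacity) {}

    std::vector<double>& get(std::size_t chunk_index)
    {
        auto it = m_chunks.find(chunk_index);
        if (it != m_chunks.end())
            return it->second;  // already in memory: fast path
        if (m_chunks.size() == m_capacity)
            m_chunks.erase(m_chunks.begin());  // full: offload a chunk (storage write stubbed out)
        // first access: load from storage (stubbed as zeros here)
        return m_chunks.emplace(chunk_index, std::vector<double>(1000)).first->second;
    }

private:
    std::size_t m_capacity;
    std::map<std::size_t, std::vector<double>> m_chunks;
};

int main()
{
    chunk_pool pool(4);    // at most 4 chunks resident at once
    pool.get(0)[0] = 1.0;  // loads chunk 0 into the pool
    pool.get(7);           // loads chunk 7, evicting a chunk if needed
}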

We realized that this mechanism of loading/offloading an array could also be useful outside the context of chunked arrays, so we abstracted it out into a new type of array, called a file array.

File arrays

A file array is a file-backed, cached array. It exists both in main memory and on a storage system, and the two are automatically kept in sync: when created, the array is loaded into memory from storage (if it already exists), and when destroyed, the in-memory array is offloaded to storage. The offloading can also be triggered manually if necessary (a small sketch follows the list below).

File arrays have two abstractions:

  • an IO handler: where to store the data? We currently support IO handlers for the local file system, Google Cloud Storage, and AWS S3.
  • a file format: how to store the data? We currently support formats in raw binary, GZip and Blosc.
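As a rough illustration of these semantics, here is a self-contained sketch of a file array for the simplest combination, the local file system with raw binary (the class name and layout are ours, not the xtensor API):

#include <cstddef>
#include <fstream>
#include <string>
#include <vector>

class file_array
{
public:
    explicit file_array(std::string path, std::size_t size)
        : m_path(std::move(path)), m_data(size, 0.0)
    {
        // load from storage if the file already exists
        std::ifstream in(m_path, std::ios::binary);
        if (in)
            in.read(reinterpret_cast<char*>(m_data.data()),
                    static_cast<std::streamsize>(m_data.size() * sizeof(double)));
    }

    ~file_array() { flush(); }  // keep the storage in sync on destruction

    void flush()  // offloading can also be triggered manually
    {
        std::ofstream out(m_path, std::ios::binary);
        out.write(reinterpret_cast<const char*>(m_data.data()),
                  static_cast<std::streamsize>(m_data.size() * sizeof(double)));
    }

    double& operator[](std::size_t i) { return m_data[i]; }

private:
    std::string m_path;
    std::vector<double> m_data;
};

int main()
{
    file_array a("chunk_3.4.bin", 1000);  // loaded from disk if present
    a[0] = 42.0;
}  // a is offloaded to disk here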

Going back to chunked arrays, they can be of two types:

  • in-memory chunked arrays: not so useful at first sight, since they don't seem to solve our initial problem of data being too big to hold in memory. But chunking arrays also makes it possible to process each chunk separately: if you have a mathematical operation or an algorithm that can be broken down into independent tasks, you can process the chunks in parallel (see the sketch after this list). Something we will come back to later in this blog post.
  • stored chunked arrays: they are made up of a pool of file arrays, and make it possible to process bigger-than-memory data. Because several chunks can be held in memory at the same time, they can also be processed simultaneously. Again, more on that later.
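To illustrate the in-memory case, here is a small sketch using standard C++17 parallel algorithms (with stand-in data, not the xtensor chunked-array API): since each chunk is an independent task, the chunks can be processed in parallel.

#include <algorithm>
#include <execution>
#include <vector>

int main()
{
    // tiny stand-in chunks to keep the sketch runnable; real chunks would be ND-arrays
    std::vector<std::vector<double>> chunks(100, std::vector<double>(16, 1.0));

    // each chunk is independent, so the work can be spread over threads
    std::for_each(std::execution::par, chunks.begin(), chunks.end(),
                  [](std::vector<double>& chunk)
                  {
                      for (double& x : chunk)
                          x *= 2.0;  // some per-chunk computation
                  });
}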

Zarr arrays

In a stored chunked array, each file array has an associated path that can be constructed from the chunk index. For instance, if a file array currently holds the chunk at index (3, 4), we can have it point to a file at some_path/3.4, and have all the chunks stored under a common directory. It turns out this simple idea is at the core of the Zarr format, and the way we designed chunked arrays can directly be used as a building block for supporting Zarr in xtensor.
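Building such a path from a chunk index takes only a few lines of code; here is a small sketch ('.' is Zarr's default chunk key separator):

#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

std::string chunk_path(const std::string& base, const std::vector<std::size_t>& index)
{
    std::ostringstream oss;
    oss << base << '/';
    for (std::size_t d = 0; d < index.size(); ++d)
    {
        if (d != 0)
            oss << '.';  // separate the index along each dimension
        oss << index[d];
    }
    return oss.str();
}

int main()
{
    std::cout << chunk_path("some_path", {3, 4}) << std::endl;  // prints some_path/3.4
}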

The new xtensor-zarr library implements the Zarr format. Besides chunk storage, Zarr also specifies some metadata which describes the chunked array: the array shape, the chunk shape, the data type, the storage format, etc. This makes it possible to discover the array's characteristics by reading the metadata before reading the array data. It also brings us to a pretty big challenge for a C++ library: the ability to dynamically create an array at run-time, without knowing its data type at compile-time. This goes against the template-based xtensor model of parameterizing tensor types and operations with the data type. There are solutions to this problem though, known as type erasure and dynamic multiple dispatch, and this is what we came up with in xtensor, through a new (seemingly) untyped array that we called zarray (which you can see as the contraction of "Zarr" and "array"). The new zarray library implements this type of array as well as a new dynamic expression system. We will dive deeper into zarray in an upcoming blog post.
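To give an idea of what type erasure looks like here (a toy sketch, not zarray's actual design), an untyped handle can own a typed container behind a virtual interface, choosing the element type at run-time from the dtype found in the Zarr metadata:

#include <cstddef>
#include <cstdint>
#include <iostream>
#include <memory>
#include <string>
#include <vector>

struct array_interface  // the type-erased interface
{
    virtual ~array_interface() = default;
    virtual std::size_t size() const = 0;
};

template <class T>
struct typed_array : array_interface  // the typed implementation
{
    std::vector<T> data;
    explicit typed_array(std::size_t n) : data(n) {}
    std::size_t size() const override { return data.size(); }
};

class untyped_array
{
public:
    // dispatch on a dtype string read from the Zarr metadata at run-time
    untyped_array(const std::string& dtype, std::size_t n)
    {
        if (dtype == "<f8")
            m_impl = std::make_unique<typed_array<double>>(n);
        else if (dtype == "<i4")
            m_impl = std::make_unique<typed_array<std::int32_t>>(n);
        // other dtypes omitted for brevity
    }

    std::size_t size() const { return m_impl->size(); }

private:
    std::unique_ptr<array_interface> m_impl;
};

int main()
{
    untyped_array a("<f8", 100);  // the data type is only known at run-time
    std::cout << a.size() << std::endl;
}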

The roadmap

xtensor-zarr opens new doors for scientific computing in C++. By making it possible to access data stored in the cloud, a whole new class of applications can now be built on top of xtensor-zarr. Since the xtensor stack aims to fill the gap between the C++ and Python scientific ecosystems, let's enumerate the key Python libraries, their C++ counterparts where they already exist, and, where they don't, what's left to be done:

Python   C++            Description of the C++ library
NumPy    xtensor        An implementation of ND-arrays with broadcasting and lazy computing.
Zarr     xtensor-zarr   An implementation of the Zarr core protocol based on xtensor.
xarray   xframe         An implementation of multi-dimensional labeled arrays and dataframes based on xtensor.
Dask     N/A            (see below)

Dask has no equivalent in the xtensor stack at the moment. But let's look deeper.

Dask (and especially Dask Array) makes it possible to scale up computations by processing chunks in parallel using a scheduler. This scheduler can be local (using multi-threading or multi-processing) or distributed (on a cluster). While xtensor-zarr cannot be compared to Dask, its tight integration with xtensor gives it computational capabilities that zarr-python lacks. Indeed, zarr-python is only concerned with storage, and immediately delegates to NumPy as soon as you access (a part of) a chunked array. In short, operating directly on zarr-python arrays is not possible:

import zarr

# a and b are zarr-python arrays
a = zarr.ones((1000, 1000), chunks=(100, 100))
b = zarr.ones((1000, 1000), chunks=(100, 100))

a + b        # not possible: zarr arrays don't implement arithmetic
a[:] + b[:]  # possible: slicing converts them to in-memory NumPy arrays

While this is possible in xtensor-zarr:

// a and b are xtensor-zarr arrays
a + b;  // this is possible, lazily evaluated, and
        // on a per-chunk basis

The reason is that xtensor-zarr arrays inherit the same functionality as "regular" xtensor arrays, which means they get a lot for free: views, lazy evaluation, etc. This also means that they come with some kind of implicit scheduler: operations are performed iteratively on each element of an array, chunk by chunk. This is actually comparable to Dask's single-threaded, sequential scheduler (which is only used for debugging or profiling).

Now, what's missing in the xtensor stack to offer a Dask-like experience? We believe that the taskflow C++ library would be a great fit: its task scheduling system will allow us to process chunks in parallel, and it will also make it possible to scale up to a distributed cluster (a sketch of the idea follows). If you are as excited as we are about the new possibilities this will offer, and would like to contribute to or fund this project, please get in touch!
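As a sketch of how this could look (assuming Taskflow's documented tf::Taskflow / tf::Executor API; the per-chunk work is a stand-in), each chunk becomes a task handed to the scheduler:

#include <cstddef>
#include <taskflow/taskflow.hpp>
#include <vector>

int main()
{
    std::vector<double> chunk_results(100, 0.0);
    tf::Executor executor;  // runs tasks on a thread pool
    tf::Taskflow taskflow;

    // one task per chunk; independent tasks are scheduled in parallel
    for (std::size_t i = 0; i < chunk_results.size(); ++i)
    {
        taskflow.emplace([i, &chunk_results]()
        {
            chunk_results[i] = static_cast<double>(i) * 2.0;  // stand-in for per-chunk work
        });
    }

    executor.run(taskflow).wait();
}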

Acknowledgements

This work was made possible thanks to a grant from CZI.
