davidbrochart / blog.md (secret gist, created Apr 18, 2019)


Continuously extending Zarr datasets

Pangeo is growing its dataset catalog. The project's preferred format for storing data in the cloud is Zarr. As new needs appear, such as datasets that extend through time, uploading these datasets to the cloud becomes a real challenge. In this post, we will show how to play with Zarr to append to an archive as new data becomes available.

The problem with live data

Earth observation data which originates from e.g. satellite-based remote sensing is produced continuously, usually with a latency that depends on the amount of processing that is required to generate something useful for the end user. When storing this kind of data, we obviously don't want to create a new archive from scratch each time new data is produced, but instead append the new data to the same archive. If this is big data, we might not even want to stage the whole dataset on our local hard drive before uploading it to the cloud, but rather directly stream it there. The nice thing about Zarr is that the simplicity of its store file structure allows us to hack around and address this kind of issue. Recent improvements to Xarray will also ease this process.

Download the data

Let's take TRMM 3B42RT as an example dataset (near-real-time, satellite-based precipitation estimates from NASA). It is a precipitation array covering latitudes 60°N to 60°S at 0.25° spatial and 3-hour temporal resolution, from March 2000 to the present. It is a good example of a rather obscure binary format hidden behind a raw FTP server.

Files are organized on the server in a particular way that is specific to this dataset, so we must have some prior knowledge of the directory structure in order to fetch them. The following function uses the aria2 utility to download files in parallel.

https://gist.github.com/fad080059ba622226d800fd630f2b844
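The gist itself is not inlined here, but the idea can be sketched as follows. The server path and file naming scheme are hypothetical placeholders (the real ones are dataset-specific, as noted above); `aria2c`'s `-d`, `-i` and `-j` options select the download directory, a URL list file, and the number of parallel downloads:

```python
import subprocess
import tempfile
from datetime import datetime

# Hypothetical server path and file naming; the real layout is specific
# to the dataset and must be known in advance.
BASE_URL = "ftp://example.nasa.gov/pub/merged/3B42RT"

def file_url(t: datetime) -> str:
    # One file per 3-hour timestamp, named after that timestamp.
    return f"{BASE_URL}/3B42RT.{t:%Y%m%d%H}.bin"

def download(times, dest="data"):
    # Write one URL per line to a list file and let aria2 fetch them
    # in parallel (-j controls the number of concurrent downloads).
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f:
        f.write("\n".join(file_url(t) for t in times))
    subprocess.run(["aria2c", "-d", dest, "-i", f.name, "-j", "8"], check=True)
```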

Create an Xarray Dataset

In order to create an Xarray Dataset from the downloaded files, we must know how to decode the content of the files (the binary layout of the data, its shape, type, etc.). The following function does just that:

https://gist.github.com/d002878cccd72fb6e44fe46b3953b997
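As the decoding gist is not inlined here either, here is a minimal sketch. The binary layout is an assumption (a flat big-endian float32 array; the real files carry a header and a scale factor that the original function handles), but the 480 × 1440 grid matches the dataset description above:

```python
import numpy as np
import xarray as xr

NLAT, NLON = 480, 1440  # 60°N to 60°S at 0.25° resolution

def grid_coords():
    # Cell-centered latitudes (north to south) and longitudes.
    lat = np.arange(59.875, -60, -0.25)
    lon = np.arange(0.125, 360, 0.25)
    return lat, lon

def read_3b42rt(path, time):
    # Assumed layout: a flat big-endian float32 array; the real files
    # have a header and scaling that the original gist takes care of.
    data = np.fromfile(path, dtype=">f4").reshape(NLAT, NLON)
    lat, lon = grid_coords()
    da = xr.DataArray(data, dims=("lat", "lon"),
                      coords={"lat": lat, "lon": lon})
    return xr.Dataset({"precipitation": da.expand_dims(time=[time])})
```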

Now we can have a nice representation of (a part of) our dataset:

https://gist.github.com/19cb231cf49428fb6faf623c28188a09

https://gist.github.com/9ed4b3d28783afe856d41cb0ad05faf8

And plot e.g. the accumulated precipitation:

https://gist.github.com/579a76767668f8f70a147a164ff35afb

(figure: map of accumulated precipitation)

Store the Dataset to local Zarr

This is where things start to get a bit tricky. Because the Zarr archive will be uploaded to the cloud, it must already be chunked sensibly. Every read from cloud storage carries an overhead of roughly 100 ms, and to amortize it, chunks should be larger than 10 MiB. On the other hand, if several chunks are to fit comfortably in memory so that they can be processed in parallel, they must not be too big either. On today's machines, chunks of about 100 MiB are advised. For our dataset, one time step is 480 × 1440 float32 values ≈ 2.6 MiB, so we can concatenate 100 / 2.6 ≈ 40 dates into one chunk. The Zarr archive will be created with that chunk size.
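The back-of-the-envelope chunk sizing above can be checked directly:

```python
nlat, nlon = 480, 1440
slice_bytes = nlat * nlon * 4          # one float32 time step ≈ 2.6 MiB
target_bytes = 100 * 1024 ** 2         # aim for ~100 MiB per chunk
steps_per_chunk = target_bytes // slice_bytes  # 37, rounded to ~40 above
```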

Also, Xarray will choose an encoding for each variable when creating the Zarr archive. The most notable one is for the time variable, which will look something like this (content of the .zattrs file):

https://gist.github.com/64b25e009222940fadb42c8ae022e723

It means that the time coordinate will actually be encoded as integers counting the "hours since 2000-03-01 12:00:00". When we create new Zarr archives for new batches of data, we must keep the original encodings. The create_zarr function takes care of all that:

https://gist.github.com/6912b9f417cc94dbe86f6588412017cf

Upload the Zarr to the cloud

The first time the Zarr is created, it contains the very beginning of our dataset, so it must be uploaded as is to the cloud. But as we download more data, we only want to upload the new data. That's where Zarr's clear and simple implementation of data and metadata as separate files comes in handy: as long as the data is not accessed, we can delete the data files without corrupting the archive. We can then append to the "empty" Zarr (which is still valid and appears to contain the previous dataset), and upload only the necessary files to the cloud.

One thing to keep in mind is that some coordinates (here lat and lon) are not affected by the append operation. Only the time coordinate and the DataArrays that depend on the time dimension (here precipitation) need to be extended. We can also see that there will be a problem with the time coordinate: its chunks will hold 40 values each. That was the intention for the precipitation variable, but because time is a 1-D array, such chunks are much too small. So we empty the time variable of its data for now; it will be uploaded later with the right chunks.

https://gist.github.com/0aab58d607fe62ab46dd67dad8e19783
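A sketch of the "emptying" step, assuming a Zarr v2 directory store, where the metadata files (.zgroup, .zarray, .zattrs) are the only dot-files and chunk files have plain names like `0.0.0`:

```python
import os

def clear_data_files(store):
    # Delete chunk files but keep the metadata (.zgroup, .zarray,
    # .zattrs), so the store stays valid and still *appears* to
    # contain the previous data.
    for root, _, files in os.walk(store):
        for name in files:
            if not name.startswith("."):
                os.remove(os.path.join(root, name))
```

After appending locally with `ds.to_zarr(store, append_dim="time")`, only the files that did not exist before the append need to be uploaded.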

Repeat

Now that we have all the pieces, it is just a matter of putting them together in a loop. We take care of the time coordinate by uploading it in one chunk at the end.

The following code can resume an interrupted upload, so you can wait for new data to appear on the FTP server and simply launch the script again:

https://gist.github.com/9a6f7d36dfdb81c7bdc8d2214919efe6
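The batching logic of such a loop can be sketched as follows. The helper below is illustrative (the original script also tracks what was already uploaded); it only yields full 40-step batches, which is what keeps every chunk in the store at its intended size:

```python
from datetime import datetime, timedelta

STEP = timedelta(hours=3)   # 3-hourly data
CHUNK = 40                  # time steps per Zarr chunk

def next_batches(last_done, now):
    """Yield CHUNK-sized lists of timestamps still missing after last_done."""
    t = last_done + STEP
    batch = []
    while t <= now:
        batch.append(t)
        if len(batch) == CHUNK:
            yield batch
            batch = []
        t += STEP
    # An incomplete trailing batch is deliberately not yielded, so every
    # chunk written to the store has the full size; it will be picked up
    # on the next run, once enough new files have appeared.
```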

Conclusion

This post showed how to stream data directly from a provider to Pangeo's data store. It actually serves two purposes:

  • for data that is produced continuously, we hacked around the Zarr store format to efficiently append to an existing dataset;
  • for data that is bigger than your hard drive, we staged only part of the dataset locally while the cloud stores the totality.