Continuously extending Zarr datasets
Pangeo is growing its dataset catalog. The project's preferred format for storing data in the cloud is Zarr. As new needs appear, such as datasets that extend through time, uploading these datasets to the cloud becomes a real challenge. In this post, we will show how we can play with Zarr to append to an archive as new data becomes available.
The problem with live data
Earth observation data which originates from e.g. satellite-based remote sensing is produced continuously, usually with a latency that depends on the amount of processing that is required to generate something useful for the end user. When storing this kind of data, we obviously don't want to create a new archive from scratch each time new data is produced, but instead append the new data to the same archive. If this is big data, we might not even want to stage the whole dataset on our local hard drive before uploading it to the cloud, but rather directly stream it there. The nice thing about Zarr is that the simplicity of its store file structure allows us to hack around and address this kind of issue. Recent improvements to Xarray will also ease this process.
Download the data
Let's take TRMM 3B42RT as an example dataset (near real time, satellite-based precipitation estimates from NASA). It is a precipitation array covering latitudes 60°N to 60°S at 0.25° spatial and 3-hour temporal resolution, from March 2000 to the present. It's a good example of a rather obscure binary format hidden behind a raw FTP server.
Files are organized on the server in a particular way that is specific to this dataset, so we must have some prior knowledge of the directory structure in order to fetch them. The following function uses the aria2 utility to download files in parallel.
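As a sketch of what such a function could look like (the directory layout and file naming below are placeholders, not the server's real structure), we can generate the list of 3-hourly file URLs and hand it to aria2c:

```python
import subprocess
from datetime import datetime, timedelta

def get_urls(start, end, base_url):
    """Yield one URL per 3-hour granule between start and end (inclusive).
    The path and file naming here stand in for the dataset's real layout."""
    t = start
    while t <= end:
        yield f"{base_url}/{t:%Y%m}/3B42RT.{t:%Y%m%d%H}.bin"
        t += timedelta(hours=3)

def download(start, end, base_url, out_dir="data"):
    """Download all granules in parallel with the aria2 utility."""
    with open("urls.txt", "w") as f:
        f.write("\n".join(get_urls(start, end, base_url)))
    # -j: number of parallel downloads, -d: output directory, -i: input list
    subprocess.run(["aria2c", "-j", "8", "-d", out_dir, "-i", "urls.txt"],
                   check=True)
```

Generating the URL list separately from the transfer makes it easy to diff against what has already been downloaded.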
Create an Xarray Dataset
In order to create an Xarray Dataset from the downloaded files, we must know how to decode the content of the files (the binary layout of the data, its shape, type, etc.). The following function does just that:
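A minimal decoder might look like the following; the exact binary layout (header, byte order, fill values) is an assumption here, reduced to a bare flat float32 grid for illustration:

```python
import numpy as np
import xarray as xr

def decode(path, time):
    """Read one 3B42RT-style granule into a one-time-step xarray Dataset.
    Assumes a flat big-endian float32 array of shape (480, 1440) covering
    60N-60S at 0.25 degrees; the real files also carry a header and fill
    values that would need handling."""
    raw = np.fromfile(path, dtype=">f4").reshape(480, 1440)
    lat = np.arange(59.875, -60, -0.25)   # 480 values, north to south
    lon = np.arange(0.125, 360, 0.25)     # 1440 values
    da = xr.DataArray(raw[np.newaxis].astype("f4"),
                      dims=("time", "lat", "lon"),
                      coords={"time": [time], "lat": lat, "lon": lon},
                      name="precipitation")
    return da.to_dataset()
```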
Now we can have a nice representation of (a part of) our dataset:
And plot e.g. the accumulated precipitation:
Store the Dataset to local Zarr
This is where things start to get a bit tricky. Because the Zarr archive will be uploaded to the cloud, it must already be chunked reasonably. There is a ~100 ms overhead associated with every read from cloud storage. To amortize this overhead, chunks must be bigger than 10 MiB. If we want several chunks to fit comfortably in memory so that they can be processed in parallel, they must not be too big either. With today's machines, 100 MiB chunks are advised. This means that for our dataset, we can concatenate 100 / (480 * 1440 * 4 / 1024 / 1024) ≈ 38 dates into one chunk, which we round to 40. The Zarr will be created with that chunk size.
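The arithmetic, spelled out (pure bookkeeping, with no assumptions beyond the 480 × 1440 float32 grid):

```python
bytes_per_slice = 480 * 1440 * 4            # one float32 time slice
slice_mib = bytes_per_slice / 1024 / 1024   # ~2.64 MiB per 3-hour slice
times_per_chunk = round(100 / slice_mib)    # ~38 slices for a ~100 MiB chunk
# rounded to 40 in the rest of the post, e.g. ds.chunk({"time": 40})
```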
Also, Xarray will choose encodings for each variable when creating the Zarr archive. The most special one is for the time variable, whose .zattrs file will contain something like "units": "hours since 2000-03-01 12:00:00". It means that the time coordinate will actually be encoded as an integer representing the number of hours since that reference date. When we create new Zarr archives for new datasets, we must keep the original encodings. The create_zarr function takes care of all that:
Upload the Zarr to the cloud
The first time the Zarr is created, it contains the very beginning of our dataset, so it must be uploaded as is to the cloud. But as we download more data, we only want to upload the new data. That's where Zarr's clear and simple separation of data and metadata into distinct files comes in handy: as long as the data is not accessed, we can delete the data files without corrupting the archive. We can then append to the "empty" Zarr (which is still valid and appears to contain the previous dataset), and upload only the necessary files to the cloud.
One thing to keep in mind is that some coordinates (here lat and lon) won't be affected by the append operation. Only the time coordinate and the DataArray which depends on the time dimension (here precipitation) need to be extended. Also, we can see that there will be a problem with the time coordinate: its chunks will have a size of 40. That was the intention for the precipitation variable, but because time is a 1-D array, such chunks are much too small (a 40-element integer chunk is only a few hundred bytes). So we empty the time variable of its data for now; it will be uploaded later with the right chunks.
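Because a Zarr store is just a directory tree where dotfiles (.zarray, .zattrs, ...) hold the metadata and plain-named files hold the chunks, emptying it can be a simple directory walk. A sketch (empty_zarr and its keep list are hypothetical names):

```python
import os

def empty_zarr(path, keep=("lat", "lon")):
    """Delete the chunk files of a local Zarr store while keeping every
    metadata file (names starting with '.'), so the store still appears
    to contain the previous data and can be appended to.
    Variables listed in `keep` are left untouched."""
    for var in os.listdir(path):
        var_dir = os.path.join(path, var)
        if var.startswith(".") or var in keep or not os.path.isdir(var_dir):
            continue
        for chunk in os.listdir(var_dir):
            if not chunk.startswith("."):   # keep .zarray / .zattrs
                os.remove(os.path.join(var_dir, chunk))
```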
Now that we have all the pieces, it is just a matter of putting them together in a loop. We take care of the time coordinate by uploading it in one chunk at the end.
The following code allows us to resume an upload, so that you can wait for new data to appear on the FTP server and launch the script again:
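A sketch of that driver logic (the function names, the gsutil upload command, and the batching are all assumptions layered on the pieces above):

```python
import subprocess

def missing_times(existing, available):
    """Timestamps present on the server but not yet in the archive."""
    done = set(existing)
    return [t for t in available if t not in done]

def batches(times, size=40):
    """Split the remaining timestamps into chunk-sized batches."""
    return [times[i:i + size] for i in range(0, len(times), size)]

def sync(local_zarr, remote):
    """Upload only new or changed files to cloud storage."""
    # -m: parallel transfers; rsync compares local and remote file lists
    subprocess.run(["gsutil", "-m", "rsync", "-r", local_zarr, remote],
                   check=True)
```

Each batch would then be downloaded, decoded, appended to the local store, emptied of its data files, and synced; the time coordinate is written and synced once, in a single chunk, at the very end.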
This post showed how to stream data directly from a provider to Pangeo's data store. It actually serves two purposes:
- for data that is produced continuously, we hacked around the Zarr data store format to efficiently append to an existing dataset.
- for data that is bigger than your hard drive, we stage only a part of the dataset locally and let the cloud store it in its entirety.