davidbrochart/blog.md Secret

## blog.md

      
    Raw
  

              blog.md
            
          
    Continuously extending Zarr datasets

Pangeo is growing its dataset catalog. The preferred format of the project for
storing data in the cloud is Zarr. As new needs appear, such as having datasets
which extend with time, uploading these datasets to the cloud becomes a real
challenge. In this post, we will show how we can play with Zarr to append to an
archive as new data becomes available.
The problem with live data

Earth observation data which originates from e.g. satellite-based remote sensing
is produced continuously, usually with a latency that depends on the amount of
processing that is required to generate something useful for the end user. When
storing this kind of data, we obviously don't want to create a new archive from
scratch each time new data is produced, but instead append the new data to the
same archive. If this is big data, we might not even want to stage the whole
dataset on our local hard drive before uploading it to the cloud, but rather
directly stream it there. The nice thing about Zarr is that the simplicity of
its store file structure allows us to hack around and address this kind of
issue. Recent improvements to Xarray will also ease this process.
Download the data

Let's take TRMM 3B42RT as an
example dataset (near real time, satellite-based precipitation estimates from
NASA). It is a precipitation array ranging from latitudes 60°N-S with resolution
0.25°, 3-hour, from March 2000 to present.  It's a good example of a rather
obscure binary format, hidden behind a raw FTP server.
Files are organized on the server in a particular way that is specific to this
dataset, so we must have some prior knowledge of the directory structure in
order to fetch them. The following function uses the aria2 utility to download
files in parallel.
https://gist.github.com/fad080059ba622226d800fd630f2b844
Create an Xarray Dataset

In order to create an Xarray Dataset from the downloaded files, we must know how
to decode the content of the files (the binary layout of the data, its shape,
type, etc.). The following function does just that:
https://gist.github.com/d002878cccd72fb6e44fe46b3953b997
Now we can have a nice representation of (a part of) our dataset:
https://gist.github.com/19cb231cf49428fb6faf623c28188a09
https://gist.github.com/9ed4b3d28783afe856d41cb0ad05faf8
And plot e.g. the accumulated precipitation:
https://gist.github.com/579a76767668f8f70a147a164ff35afb

Store the Dataset to local Zarr

This is where things start to get a bit tricky. Because the Zarr archive will be
uploaded to the cloud, it must already be chunked reasonably. There is a ~100 ms
overhead associated with every read from cloud storage. To amortize this
overhead, chunks must be bigger than 10 MiB. If we want to have several chunks
fit comfortably in memory so that they can be processed in parallel, they must
not be too big either. With today's machines, 100 MiB chunks are adviced. This
means that for our dataset, we can concatenate 100 / (480 * 1440 * 4 / 1024 /
1024) ~ 40 dates into one chunk. The Zarr will be created with that chunk size.
Also, Xarray will choose some encodings for each variable when creating the Zarr
archive. The most special one is for the time variable, which will look
something like that (content of the .zattrs file):
https://gist.github.com/64b25e009222940fadb42c8ae022e723
It means that the time coordinate will actually be encoded as an integer
representing the number of "hours since 2000-03-01 12:00:00". When we create new
Zarr archives for new datasets, we must keep the original encodings. The
create_zarr function takes care of all that:
https://gist.github.com/6912b9f417cc94dbe86f6588412017cf
Upload the Zarr to the cloud

The first time the Zarr is created, it contains the very beginning of our
dataset, so it must be uploaded as is to the cloud. But as we download more
data, we only want to upload the new data. That's where the clear and simple
implementation of data and metadata as separate files in Zarr comes handy: as
long as the data is not accessed, we can delete the data files without
corrupting the archive. We can then append to the "empty" Zarr (but still valid
and appearing to contain the previous dataset), and upload only the necessary
files to the cloud.
One thing to keep in mind is that some coordinates (here lat and lon) won't be
affected by the append operation. Only the time coordinate and the DataArray
which depends on the time dimension (here precipitation) need to be extended.
Also, we can see that there will be a problem with the time coordinate: its
chunks will have a size of 40. That was the intention for the precipitation
variable, but because the time variable is a 1-D array, it will be much too
small. So we empty the time variable of its data for now, and it will be
uploaded later with the right chunks.
https://gist.github.com/0aab58d607fe62ab46dd67dad8e19783
Repeat

Now that we have all the pieces, it is just a matter of putting them together in
a loop. We take care of the time coordinate by uploading in one chunk at the
end.
The following code allows to resume an upload, so that you can wait for
new data to appear on the FTP server and launch the script again:
https://gist.github.com/9a6f7d36dfdb81c7bdc8d2214919efe6
Conclusion

This post showed how to stream data directly from a provider to Pangeo's data
store. It actually serves two purposes:

for data that is produced continuously, we hacked around the Zarr data store
format to efficiently append to an existing dataset.
for data that is bigger than your hard drive, we only stage a part of the
dataset locally and have the cloud store the totality.