@ivirshup
Created October 18, 2023 18:14
@NikosAlexandris

My use case is unequal chunking shapes across data within the same series of products. For example, the chunking shape of:

  • NetCDF 20200101 is [1, 2600, 2600] which can look like [1] [1, 2, 3, 4, .., 2600] [1, 2, 3, 4, .., 2600]
  • NetCDF 20230101 is [1, 1300, 1300] which can look like [1] [1, 2, .., 1300] [1, 2, .., 1300]

These obviously cannot be combined with the current implementation. What I thought might be possible is to virtually expand the chunking shapes as required (in this example, expand the second NetCDF file) so that they all match, without the need to rechunk all the data beforehand. Specifically, let them virtually share the same chunking shape [1, 2600, 2600], something like the following:

  • NetCDF 20200101 : [1] [1, 2, 3, 4, .., 2600] [1, 2, 3, 4, .., 2600]
  • NetCDF 20230101 : [1] [1, 2, .., 1300, ----] [1, 2, .., 1300, ----]

Hence, the latter two would fit and could be combined, right?
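To illustrate the mismatch, here is a rough sketch with dask arrays standing in for the two files (only the chunk sizes above are real; the array shapes and everything else are hypothetical):

import dask.array as da

# Same array shape, different chunk shapes, mirroring the two NetCDF files
a = da.zeros((365, 2600, 2600), chunks=(1, 2600, 2600))
b = da.zeros((365, 2600, 2600), chunks=(1, 1300, 1300))
print(a.chunks[1:])  # ((2600,), (2600,))           -> 1 x 1 spatial chunks
print(b.chunks[1:])  # ((1300, 1300), (1300, 1300)) -> 2 x 2 spatial chunks

# "Virtually" declaring the coarser grid for the second array lines the
# chunk grids up lazily, without rewriting any data on disk
b_virtual = b.rechunk((1, 2600, 2600))
print(b_virtual.chunks == a.chunks)  # True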

@martindurant

@martindurant, would you agree with this assessment?

Yes, I agree with the diagrams' conclusion.

Let them virtually share the same chunking shape [1, 2600, 2600], something like the following:
Hence, the latter two would fit and could be combined, right?

This does sound doable, but not as we currently have things designed. There is some prior art in https://github.com/d70-t/preffs, which concatenates multiple buffers at the byte level for each chunk. Again, as in https://github.com/orgs/zarr-developers/discussions/52#discussioncomment-7435293, this is something that could be done in the kerchunk/referenceFS layer, but we need to be mindful of where the limit of our scope should be.

You can do this currently with dask.array reading from zarr(-like) stores, where you can declare any chunksize over the input zarr(s). I suppose that if you construct a dask.array this way, plus a set of concat/stack calls, you can eventually create a valid xarray object.
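A minimal sketch of that idea, assuming two zarr(-like) stores holding arrays of the same shape but chunked differently on disk (the store names, the declared chunk size and the dimension names are illustrative only):

import dask.array as da
import xarray as xr

# Declare a common chunking on read; dask maps the requested chunks onto
# whatever chunk grid each store actually uses on disk
a = da.from_zarr("sarah3_2020.zarr", chunks=(1, 2600, 2600))
b = da.from_zarr("sarah3_2023.zarr", chunks=(1, 2600, 2600))

# Concatenate along the leading (time-like) axis and wrap in xarray
combined = da.concatenate([a, b], axis=0)
arr = xr.DataArray(combined, dims=("time", "lat", "lon"))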

@NikosAlexandris

Thank you @martindurant. Here is some argumentation for why such a "feature" would be useful.

Everyone has specific needs and eventually we all make mistakes. The same applies to DWD in producing their SARAH3 products -- my best guess is that their workflow does not treat the complete time series archive all at once, and for a good reason: adding a year of data to an archive should not require reproducing the complete archive. Nonetheless, deciding on a "good" chunking shape does not seem to be an easy decision. There are numerous articles and a few different algorithms out there for determining an optimal chunking shape (and overall layout); I have in mind, for example, the algorithm by Russ Rew. Likely, at some step of the SARAH3 data production workflow, a suggested chunking shape is taken as the best one? Or something similar happens between the production of the archive data and what they call interim or operational data, all part of the same product yet storing observations from the past versus the near-past/present. This ends up as a mixture of different chunking shapes within the same time series. I wouldn't be surprised if more of these big data archives "suffer" from the same difficulties regarding a good chunking strategy. Rechunking is probably the right way to go, yet it is so costly in time and resources. Having an option to work with such "mixed" chunkings would make it considerably faster to run some tests with the data.

Well, this is the best argument I can come up with at the moment.

@NikosAlexandris

Well, it's all "motivated" already in https://github.com/orgs/zarr-developers/discussions/52#discussioncomment-7435293 :-)

@tinaok

tinaok commented Nov 17, 2023

Hello,
I'm trying to apply your method to NetCDF files.

!pip install git+https://github.com/ivirshup/kerchunk.git@concat-varchunks


# Create example files, split along the lon dimension
import xarray as xr
ds = xr.tutorial.load_dataset('air_temperature')
ds.isel(lon=slice(0,5)).to_netcdf('lon0.nc')
ds.isel(lon=slice(5,10)).to_netcdf('lon1.nc')
ds.isel(lon=slice(10,13)).to_netcdf('lon2.nc')


# Gather the example files written above (sorted for a deterministic order)
import glob
file_paths = sorted(glob.glob("lon*.nc"))
file_paths = file_paths[0:2]

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

# Build a single-file kerchunk reference set for each NetCDF (HDF5) file
def translate(file):
    url = "file://" + file
    with fsspec.open(url) as inf:
        h5chunks = SingleHdf5ToZarr(inf, url, inline_threshold=100)
        return h5chunks.translate()

result = [translate(file) for file in file_paths]

from kerchunk.combine import MultiZarrToZarr

# Combine the single-file references along lon and open the combined view
mzz = MultiZarrToZarr(
    result,
    concat_dims=["lon"],
)
a = mzz.translate()
xr.open_dataset(a, engine="kerchunk", chunks={})

I tried to apply your function

concatenate_zarr_csr_arrays(result)

But I get an error:

AttributeError: 'dict' object has no attribute 'attrs'

What should I do if I want to combine not zarr files but NetCDF files using your approach?

Thank you very much for your help.
