@ivirshup
Created October 18, 2023 18:14
[The gist's file could not be displayed.]
@NikosAlexandris

NikosAlexandris commented Oct 30, 2023

Thank you @ivirshup!

Continuing from fsspec/kerchunk#374 (comment) :

So, indptr holds offsets into the indices and data arrays. If ind stands for index, what is ptr? Pointer(s)? Also, it took me a while to track down what CSR stands for (compressed sparse row). Maybe in some future documentation it would be useful to spell out all the acronyms.
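For my own understanding, here is a minimal scipy.sparse sketch (not from the gist itself) of how data, indices and indptr relate in CSR:

import numpy as np
from scipy import sparse

dense = np.array([
    [1, 0, 2],
    [0, 0, 3],
    [4, 5, 0],
])
csr = sparse.csr_matrix(dense)

print(csr.data)     # [1 2 3 4 5]  non-zero values, stored row by row
print(csr.indices)  # [0 2 2 0 1]  column index of each stored value
print(csr.indptr)   # [0 2 3 5]    row i occupies data[indptr[i]:indptr[i+1]]

If I read this right, indptr is short for "index pointer": entry i points at where row i starts in the other two arrays.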

I am trying to adapt your example to my use case, which, I think, is an easier one. I work with series of netCDF files with varying chunking layouts (by mistake, as far as I can tell, and not out of the producer's intention). For example, some years are chunked [1, 1300, 1300] and others [1, 2600, 2600], where the spatial resolution is 2600 by 2600. I still need to do some reading to understand the "critical" part I need to use: it is the merge_vars() function, right?

@ivirshup

Ah, I misunderstood what you were asking about. This example is very specific to sparse arrays.

I'm actually not sure if our proposal here would make your case possible, unless those netCDFs are uncompressed. Our proposal allows irregular chunking across individual dimensions. So for kerchunk that means if we have arrays like:

[image: example arrays with differing chunking along axis 0]

We can combine those across axis 0 to get something like:

[image: the combined array after concatenating along axis 0]

But we can't represent the array created by concatenating these along axis 1, e.g.:

[image: concatenation along axis 1, which cannot be represented]
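To make the "chunk boundaries must form a grid" idea concrete, here is a small dask.array analogy (dask is just an illustration here, not the proposal itself):

import dask.array as da
import numpy as np

# Chunk sizes may vary within a dimension (2+3 rows, 4+1 columns),
# but they must form a grid: every chunk in a given column has the same width.
x = da.from_array(np.arange(25).reshape(5, 5), chunks=((2, 3), (4, 1)))
y = da.from_array(np.arange(10).reshape(2, 5), chunks=((2,), (4, 1)))

# Concatenating along axis 0 works even though the axis-0 chunk sizes differ,
# because the axis-1 chunking lines up:
z = da.concatenate([x, y], axis=0)
print(z.chunks)  # ((2, 3, 2), (4, 1))

Concatenating along axis 1 when the axis-0 chunkings do not line up would require splitting stored chunks, which dask can do in memory but a pure reference layer cannot, since each referenced chunk is a single (usually compressed) blob.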

@martindurant, would you agree with this assessment?

@NikosAlexandris

@ivirshup Great visualisation == Clearest explanation! Thank you.

@NikosAlexandris

My use case is unequal chunk shapes across data within the same series of products. For example, the chunk shape of:

  • NetCDF 20200101 is [1, 2600, 2600] which can look like [1] [1, 2, 3, 4, .., 2600] [1, 2, 3, 4, .., 2600]
  • NetCDF 20230101 is [1, 1300, 1300] which can look like [1] [1, 2, .., 1300] [1, 2, .., 1300]

These obviously cannot be combined with the current implementation. What I thought might be possible is to virtually expand the chunk shapes as required (in this example, expand the second NetCDF file) so that they all match, without having to rechunk all the data beforehand. Specifically, let them virtually have the same chunk shape [1, 2600, 2600], something like the following:

  • NetCDF 20200101 : [1] [1, 2, 3, 4, .., 2600] [1, 2, 3, 4, .., 2600]
  • NetCDF 20230101 : [1] [1, 2, .., 1300, ----] [1, 2, .., 1300, ----]

Hence, the latter two line up and can be combined?
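To make the idea concrete, here is a rough numpy-level sketch of what that "virtual expansion" would amount to (read_small_chunk is a hypothetical stand-in for whatever the reference layer would fetch; this is not an existing feature):

import numpy as np

# Hypothetical stand-in: fetch one stored [1, 1300, 1300] chunk of the
# finer-chunked file at time t, chunk-row i, chunk-column j.
def read_small_chunk(t, i, j):
    return np.zeros((1, 1300, 1300), dtype="float32")

# One logical [1, 2600, 2600] chunk would be stitched from the 2x2 block
# of stored chunks that cover it:
big_chunk = np.block([
    [read_small_chunk(0, 0, 0)[0], read_small_chunk(0, 0, 1)[0]],
    [read_small_chunk(0, 1, 0)[0], read_small_chunk(0, 1, 1)[0]],
])[None, ...]
assert big_chunk.shape == (1, 2600, 2600)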

@martindurant

> @martindurant, would you agree with this assessment?

Yes, I agree with the diagrams' conclusion.

> Specifically, let them virtually have the same chunk shape [1, 2600, 2600], something like the following:
> Hence, the latter two line up and can be combined?

This does sound doable, but not as we currently have things designed. There is some prior art in https://github.com/d70-t/preffs , which concatenates multiple buffers at the byte level for each chunk. Again, as in https://github.com/orgs/zarr-developers/discussions/52#discussioncomment-7435293 , this is something that could be done in the kerchunk/referenceFS layer, but we need to be mindful of where the limit of our scope should be.

You can do this currently with dask.array reading from zarr(-like) storage, where you can declare any chunksize over the input zarr(s). I suppose that if you construct a dask.array this way and apply a set of concat/stack calls, you can eventually create a valid xarray dataset.
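A rough sketch of that dask route (the reference JSON names, the variable name "SIS", and the dimension names are placeholders; coordinate handling is glossed over):

import dask.array as da
import fsspec
import xarray as xr
import zarr

def open_year(ref_json):
    # Open a kerchunk reference file as a zarr store via fsspec
    mapper = fsspec.get_mapper("reference://", fo=ref_json)
    z = zarr.open(mapper, mode="r")["SIS"]  # variable name is a placeholder
    # Declare the chunking we want dask to use, regardless of how the
    # underlying file is chunked on disk:
    return da.from_array(z, chunks=(1, 2600, 2600))

darr = da.concatenate(
    [open_year("refs_2020.json"), open_year("refs_2023.json")], axis=0
)
ds = xr.DataArray(darr, dims=("time", "y", "x"), name="SIS").to_dataset()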

@NikosAlexandris

Thank you @martindurant. Here is some argument for why such a "feature" would be useful.

Everyone has specific needs, and eventually we all make mistakes. The same applies to DWD in producing their SARAH3 products. My best guess is that their workflow does not treat the complete time series archive all at once, and for a good reason: adding a year of data to an archive should not require reproducing the complete archive. Nonetheless, deciding on a "good" chunking shape does not seem to be easy. There are numerous articles and a few different algorithms out there for determining an optimal chunking shape (and overall layout); I have in mind, for example, the algorithm by Russ Rew. Likely, at some step of the SARAH3 production workflow, a suggested chunking shape is adopted as the best one? Or something similar happens between the production of the archive data and what they call the interim or operational data, all part of the same product yet storing observations from the past vs. the near-past/present. This ends up producing a mixture of different chunking shapes within the same time series. I wouldn't be surprised if more of these big data archives "suffer" from the same difficulties regarding a good chunking strategy. Rechunking is probably the right way to go, yet it is so costly in time and resources. Having an option to work with such "mixed" chunkings would make it much faster to run some tests with the data.

Well, this is the best argument I can come up with at the moment.

@NikosAlexandris


Well, it's all "motivated" already in https://github.com/orgs/zarr-developers/discussions/52#discussioncomment-7435293 :-)

@tinaok

tinaok commented Nov 17, 2023

Hello,
I'm trying to apply your method to NetCDF files.

!pip install git+https://github.com/ivirshup/kerchunk.git@concat-varchunks

# Create example files, split along lon
import xarray as xr
ds = xr.tutorial.load_dataset('air_temperature')
ds.isel(lon=slice(0, 5)).to_netcdf('lon0.nc')
ds.isel(lon=slice(5, 10)).to_netcdf('lon1.nc')
ds.isel(lon=slice(10, 13)).to_netcdf('lon2.nc')

# Collect the files (only the first two for this test)
import glob
file_paths = sorted(glob.glob('lon*.nc'))[0:2]

# Build single-file kerchunk references
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

def translate_dask(file):
    url = "file://" + file
    with fsspec.open(url) as inf:
        h5chunks = SingleHdf5ToZarr(inf, url, inline_threshold=100)
        return h5chunks.translate()

result = [translate_dask(file) for file in file_paths]

# Combine the references along lon and open the result
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    result,
    concat_dims=["lon"],
)
a = mzz.translate()
xr.open_dataset(a, engine='kerchunk', chunks={})

I tried to apply your function

concatenate_zarr_csr_arrays(result)

But I get this error:

AttributeError: 'dict' object has no attribute 'attrs'

What should I do if I want to combine not zarr files but NetCDF files using your approach?

Thank you very much for your help.
