@ivirshup
Created October 18, 2023 18:14
@NikosAlexandris

Thank you @martindurant. Here is some argument for why such a "feature" would be useful.

Everyone has specific needs and eventually we all make mistakes. The same applies to DWD in producing their SARAH3 products. My best guess is that their workflow does not treat the complete time series archive all at once, and for a good reason: adding a year of data to an archive should not require reproducing the whole archive. Nonetheless, deciding on a "good" chunking shape is not an easy decision, it seems. There are numerous articles and a few different algorithms for determining an optimal chunking shape (and overall layout); I have in mind, for example, the algorithm by Russ Rew.

Likely, at some step of the SARAH3 production workflow a suggested chunking shape is adopted as the best one, or something differs between the production of the archive data and what they call the interim or operational data, which are all part of the same product yet store observations from the past versus the near-past/present. The result is a mixture of different chunking shapes within the same time series. I wouldn't be surprised if more of these big data archives "suffer" from the same difficulties with finding a good chunking strategy. Rechunking is probably the right way to go, yet it is costly in time and resources. Having an option to work with such "mixed" chunkings would make it much faster to run some tests with the data.
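
To illustrate how such a mixture can arise (a purely hypothetical sketch, not the actual SARAH3 workflow; variable name, shapes and chunk sizes are made up): the same variable written in two batches with different chunk shapes already gives a "mixed" archive.

# Hypothetical illustration: one variable, two files, two chunk shapes
import numpy as np
import xarray as xr

ds = xr.Dataset({"field": (("time", "lat", "lon"), np.zeros((48, 100, 100)))})
# the "archive" part, chunked one way
ds.isel(time=slice(0, 24)).to_netcdf(
    "archive.nc", encoding={"field": {"chunksizes": (1, 100, 100)}}
)
# the later "interim/operational" part, chunked another way
ds.isel(time=slice(24, 48)).to_netcdf(
    "interim.nc", encoding={"field": {"chunksizes": (24, 10, 10)}}
)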

Well, this is the best argument I can come up with at the moment.

@NikosAlexandris

Well, it's all "motivated" already in https://github.com/orgs/zarr-developers/discussions/52#discussioncomment-7435293 :-)

@tinaok

tinaok commented Nov 17, 2023

Hello,
I'm trying to apply your method to NetCDF files.

!pip install git+https://github.com/ivirshup/kerchunk.git@concat-varchunks


# Create example files split along the lon dimension
import xarray as xr

ds = xr.tutorial.load_dataset('air_temperature')
ds.isel(lon=slice(0, 5)).to_netcdf('lon0.nc')
ds.isel(lon=slice(5, 10)).to_netcdf('lon1.nc')
ds.isel(lon=slice(10, 13)).to_netcdf('lon2.nc')


# Collect the example files (paths are specific to my machine)
import glob

dir_url = "/Users/todaka/"
file_pattern = "/data_xios/lon*.nc"
file_paths = sorted(glob.glob(dir_url + file_pattern))
file_paths = file_paths[0:2]  # only combine the first two files

# Generate kerchunk references for each NetCDF file
# (runs serially, despite the function name)
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

def translate_dask(file):
    url = "file://" + file
    with fsspec.open(url) as inf:
        h5chunks = SingleHdf5ToZarr(inf, url, inline_threshold=100)
        return h5chunks.translate()

result = [translate_dask(file) for file in file_paths]

# Combine the per-file references along lon and open the result
from kerchunk.combine import MultiZarrToZarr

mzz = MultiZarrToZarr(
    result,
    concat_dims=["lon"],
)
a = mzz.translate()
xr.open_dataset(a, engine='kerchunk', chunks={})
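
As an aside, the helper above is called translate_dask but, as written, it translates the files serially. If the intent was to parallelize the per-file translation with dask, a minimal sketch could look like the following (assumes dask is installed; result_parallel is a hypothetical name, the other names come from the code above):

# Sketch (not from the original gist): parallelize the per-file translation
# with dask.delayed instead of the serial list comprehension above.
import dask

tasks = [dask.delayed(translate_dask)(file) for file in file_paths]
result_parallel = list(dask.compute(*tasks))  # same structure as `result`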

I tried to apply your function

concatenate_zarr_csr_arrays(result)

But I get an error:

AttributeError: 'dict' object has no attribute 'attrs'

What should I do if I want to combine NetCDF files, rather than Zarr stores, using your approach?
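
One hedged guess, in case it helps: the traceback suggests that concatenate_zarr_csr_arrays expects objects that carry .attrs (for example zarr arrays or groups), not the raw kerchunk reference dicts returned by translate(). A minimal sketch of wrapping each reference set as a zarr group via fsspec's "reference" filesystem (the helper name and the commented-out call are hypothetical):

# Hypothetical workaround sketch: open each kerchunk reference set as a zarr
# group (zarr objects have .attrs) before calling the gist's function.
import fsspec
import zarr

def refs_to_group(refs):
    fs = fsspec.filesystem("reference", fo=refs)
    return zarr.open(fs.get_mapper(""), mode="r")

groups = [refs_to_group(r) for r in result]
# e.g. pass the per-file arrays for one variable instead of the dicts:
# concatenate_zarr_csr_arrays([g["air"] for g in groups])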

Thank you very much for your help.
