@jnhansen
Created August 3, 2018 18:59
Auto-merge xarray datasets along multiple dimensions
import glob
import xarray as xr
import itertools
import numpy as np


def auto_merge(datasets):
    """
    Automatically merge a split xarray Dataset. This is designed to behave like
    `xarray.open_mfdataset`, except it supports concatenation along multiple
    dimensions.

    Parameters
    ----------
    datasets : str or list of str or list of xarray.Dataset
        Either a glob expression or list of paths as you would pass to
        xarray.open_mfdataset, or a list of xarray datasets. If a list of
        datasets is passed, you should make sure that they are represented
        as dask arrays to avoid reading the whole dataset into memory.

    Returns
    -------
    xarray.Dataset
        The merged dataset.
    """
    # Treat `datasets` as a glob expression
    if isinstance(datasets, str):
        datasets = glob.glob(datasets)

    # Treat `datasets` as a list of file paths
    if isinstance(datasets[0], str):
        # Pass chunks={} to ensure the dataset is read as a dask array
        datasets = [xr.open_dataset(path, chunks={}) for path in datasets]

    def _combine_along_last_dim(datasets):
        merged = []

        # Determine the dimension along which the dataset is split
        split_dims = [d for d in datasets[0].dims if
                      len(np.unique([ds[d].values[0] for ds in datasets])) > 1]

        # Concatenate along one of the split dimensions
        concat_dim = split_dims[-1]

        # Group along the remaining dimensions and concatenate within each
        # group.
        sorted_ds = sorted(datasets, key=lambda ds: tuple(ds[d].values[0]
                                                          for d in split_dims))
        for _, group in itertools.groupby(
                sorted_ds,
                key=lambda ds: tuple(ds[d].values[0] for d in split_dims[:-1])
        ):
            merged.append(xr.auto_combine(group, concat_dim=concat_dim))

        return merged

    merged = datasets
    while len(merged) > 1:
        merged = _combine_along_last_dim(merged)

    return merged[0]
@jnhansen
Author

jnhansen commented Aug 3, 2018

I make the following assumptions (which are reasonable for my use case):

  • the data variables in each part are identical
  • equality of the first element of two coordinate arrays is sufficient to assume equality of the two coordinate arrays
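To make the second assumption concrete, here is a minimal sketch using plain NumPy arrays as stand-ins for coordinate arrays. The helper names `first_values_match` and `coords_match` are hypothetical and not part of the gist; they only illustrate a case where the first-element shortcut and a full comparison disagree.

```python
import numpy as np


def first_values_match(a, b):
    """Shortcut used in the gist: compare only the first coordinate value."""
    return bool(a[0] == b[0])


def coords_match(a, b):
    """Stricter alternative: compare the whole coordinate arrays."""
    return bool(np.array_equal(a, b))


# Hypothetical x-coordinates of three tiles
x_tile_a = np.array([0.0, 1.0, 2.0])
x_tile_b = np.array([0.0, 1.0, 2.0])  # same grid as tile a
x_tile_c = np.array([0.0, 2.0, 4.0])  # same start, different spacing

print(first_values_match(x_tile_a, x_tile_c), coords_match(x_tile_a, x_tile_c))
```

Tiles `a` and `c` would be grouped together by the shortcut even though their grids differ, so the function is only safe when all parts come from one regularly split dataset.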

@FaustinCarter

This is great! It would be good to add pass-through **kwargs for sending extra options to open_dataset (specifically autoclose=True, for when you want to open a couple thousand files...)!
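One way the suggested pass-through could look is sketched below. The helper name `open_parts` and the parameter name `open_kwargs` are assumptions made for illustration; the gist itself does not define them. Whether a given option such as `autoclose` is accepted depends on the installed xarray version, so treat the keyword names as something to check against your release.

```python
import glob

import xarray as xr


def open_parts(datasets, **open_kwargs):
    """Open the parts of a split dataset, forwarding extra options.

    `open_kwargs` (hypothetical name) is passed unchanged to
    xarray.open_dataset for every file.
    """
    # Treat `datasets` as a glob expression
    if isinstance(datasets, str):
        datasets = sorted(glob.glob(datasets))

    # Treat `datasets` as a list of file paths
    if datasets and isinstance(datasets[0], str):
        # Default to chunks={} so the data is read as dask arrays,
        # unless the caller supplies their own chunking
        open_kwargs.setdefault("chunks", {})
        return [xr.open_dataset(path, **open_kwargs) for path in datasets]

    # Already a list of datasets: nothing to open
    return list(datasets)
```

`auto_merge` could then accept `**kwargs` and call `open_parts(datasets, **kwargs)` in place of its two `isinstance` branches.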

@ShukhratSh

Thank you very much for writing this function. I tried to run it but got the error "AttributeError: module 'xarray' has no attribute 'auto_combine'". I am using xarray 0.19.0. I want to merge the tiles of a time-series xarray dataset along the x and y dimensions. Thanks

@jnhansen
Author

Hi @ShukhratSh, this function was written for an older version of xarray. In the newer versions, combining datasets along multiple dimensions is supported natively through xarray.combine_by_coords().
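As a rough illustration of that native route, the sketch below splits a small synthetic dataset into four x/y tiles and reassembles them with `xarray.combine_by_coords()`. The variable name `temperature` and the coordinate values are made up for the example; it assumes each tile carries monotonic coordinate values along the split dimensions.

```python
import numpy as np
import xarray as xr

# A small synthetic dataset standing in for the full, unsplit data
full = xr.Dataset(
    {"temperature": (("y", "x"), np.arange(16.0).reshape(4, 4))},
    coords={"y": [0, 1, 2, 3], "x": [10, 20, 30, 40]},
)

# Split into four tiles along both y and x
tiles = [
    full.isel(y=slice(0, 2), x=slice(0, 2)),
    full.isel(y=slice(0, 2), x=slice(2, 4)),
    full.isel(y=slice(2, 4), x=slice(0, 2)),
    full.isel(y=slice(2, 4), x=slice(2, 4)),
]

# The tile order does not matter: placement is inferred from coordinates
combined = xr.combine_by_coords(tiles[::-1])
```

For tiles stored as files, `xarray.open_mfdataset(paths, combine="by_coords")` is the corresponding entry point.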
