Before diving too deeply into the various friction points when working with archives of earth observation data in xarray, let's look at a more optimal case from the earth systems world. In the notebook here we demonstrate how, by using zarr's consolidated metadata option to access dimension and chunk reference information, a massive dataset's dimensions and variables can be loaded extremely quickly. With this consolidated metadata available to reference chunks on disk, we can leverage xarray's dask integration to use normal xarray operations to lazily load chunks in parallel and perform our calculations using dask's blocked algorithm implementations. Gravy.
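To make that comparison concrete, here's roughly what that access pattern looks like (the store URL and variable name are made up for illustration):

```python
# A rough sketch of the consolidated-metadata access pattern described above.
# The store URL and variable name are made up for illustration.
import xarray as xr

ds = xr.open_zarr(
    "gs://some-bucket/big-climate-dataset.zarr",  # hypothetical store
    consolidated=True,  # one small read fetches all dimension/chunk metadata
)
# Nothing but metadata has been read so far; dask loads chunks in parallel
# only when the computation is actually triggered.
monthly_mean = ds["air_temperature"].groupby("time.month").mean().compute()
```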
But the earth observation story is more complicated... Not everything lives in standardized file containers, and more importantly, our grid coordinate systems are "all over the map" :] Here are some of the current challenges.
- Consolidated metadata. Arrays of STAC items using the projection extension actually provide a rough facsimile of zarr's consolidated metadata for referencing bytes in cloud storage. When referencing COGs this is an even closer approximation, as we can use windowed reads to access smaller chunks even for large file containers. Consistent STAC metadata is key for providing lazy chunked access to large EO data archives.
- CRS representation. xarray's current representational model does not provide an explicit approach for storing CRS information for dimension coordinates for use with an index. One of xarray's most valuable features is alignment on index coordinates, but its utility is limited with disparate CRSs due to the requirement for pre-processing into common coordinate systems. Several projects have used bolt-on approaches to manage CRS information, but each has limitations: rioxarray relies on rasterio and adds convenience methods that require geospatial knowledge, while geoxarray paused development due to technical difficulties. There are some exciting new developments on this front, though; more on this later.
- GDAL. A nightmarish mishmash of formats for EO data has proliferated over the years. Luckily we have GDAL to provide us a performant, consistent access model for these formats. But despite its robustness and flexibility, GDAL/Rasterio have two issues which make them less than ideal for use in large parallelized read scenarios with Dask. The first is the thread safety restriction that GDAL Dataset handles cannot be shared between threads. This issue is especially relevant with dask, as in many cases it needs to opaquely support underlying `ThreadPoolExecutor` models and a variety of multi-processing models (which in turn might use threads internally), leading to a whole host of hard-to-reproduce and interesting failure scenarios (see the sketch after this list). I'll let @vincentsarago weigh in with more info here. Gabe Joseph from Coiled has been providing some lower-level thread management in the excellent stackstac, but there may be other options to consider here, as there is overhead in managing multiple dataset handles. Probably more important, and closer to @vincentsarago's heart, is the lack of true asyncio mechanisms in GDAL/Rasterio. Though this is a bit of a moot point without a non-GDAL I/O library given the blocking issues at the GDAL level; see here for a more in-depth discussion.
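To illustrate the thread safety point, here is the usual workaround pattern of opening a fresh handle inside each task rather than sharing one across threads. This is a sketch of the general idea, not stackstac's actual implementation; the path and window are illustrative:

```python
# Sketch of the standard workaround: never share a GDAL/rasterio dataset handle
# across threads; each task opens (and pays for) its own handle instead.
import dask
import rasterio
from rasterio.windows import Window

def read_chunk(path, window):
    # A fresh handle per task keeps us inside GDAL's thread-safety rules,
    # at the cost of repeated open/metadata overhead.
    with rasterio.open(path) as src:
        return src.read(1, window=window)

chunk = dask.delayed(read_chunk)("s3://bucket/scene.tif", Window(0, 0, 512, 512))
```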
So this is a long list of problems without much talk about solutions. Where do we want to go? I initially wrote some of these notes a few weeks ago, but I got scooped by some smarter folks in this discussion. My vision is that a scientist can:
- Use a list of STAC items with the projection extension and an xarray convenience method to create an xarray dataset with concatenated spatial dimensions (see the stackstac sketch after this list).
- Seamlessly use a pure-asyncio I/O library like aiocogeo to support dynamic reads in parallel dask workers.
- Use automated CRS-aware coordinate index alignment via native xarray resampling that users do not have to be explicitly aware of.
- Use CRS-aware indexes for selection and for use with core xarray operations.
- Potentially have the ability to use a classic non-asyncio driver such as rasterio to save analysis data in EO formats, as `rioxarray` does, but with blocked parallel algorithms constructed in a dask graph.
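For a taste of the first bullet, stackstac already gets close today. A rough sketch, where `items` is a list of STAC items and the asset key, EPSG code, and resolution are illustrative:

```python
# Sketch: a lazy dask-backed DataArray from STAC items with the projection
# extension, roughly what stackstac does today.
import stackstac

stack = stackstac.stack(
    items,
    assets=["B04"],   # hypothetical asset key
    epsg=32633,       # common CRS to align everything into
    resolution=10,
)
# `stack` is a lazy xarray.DataArray with (time, band, y, x) dimensions
```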
I think we are at a very exciting spot right now, and there are new developments that can facilitate this. I would like to see us focus our efforts and thinking around the slightly longer game of helping these developments mature, so in the end the community can use a consistent mental model and set of tools. I also feel that, given the nature of the issues, it might be better to focus on smaller modular tooling than on more monolithic solutions like rioxarray going forward. What is happening and needs to happen?
- Most vital is the current index refactoring project occurring in xarray that will facilitate the option of CRS-aware indexes. Read this and this for further context. I don't have the skills to support the `xarray` efforts, but I do think we can assist @djhoese as he works to implement CRS-aware indexes and grid resampling. I would love us to coordinate more with him on this.
- Work to battle-harden `aiocogeo` and make it usable across a variety of authentication and storage schemes, like GDAL can. I think in the near term all of this work should focus on building a pure non-Rasterio I/O implementation specifically for COGs, while we use fallback Rasterio/GDAL I/O operations for legacy formats and continue to push the community to COGify the world.
- Investigate helping Microsoft coordinate funding for this work. We at DevSeed are probably not the best ones to actually implement this code, but I think this approach is key for making the Planetary Computer vision of seamless, hyper-large-scale analysis of EO data a frictionless reality for scientists.
- Focus the community. There is a lot of uncertainty around the path forward in this space at the moment. I would love to see someone smarter than me take this distillation and turn it into something more cohesive that could be used to guide development planning and work going forward.
Thank you so much for putting this together. This was a good read and a good summary of some of the things going on. I have a lot of thoughts (maybe too many) on all of this, as you've pointed out, but I'll try to summarize some things related to what you've said and some other things I think are missing. Normally in other GitHub issues I've filtered what I've said as being off-topic, but I'm not sure I should do that here 😜. I'm not sure how to structure these thoughts, but I'll try to avoid complete stream of consciousness.
## Challenges
### File formats
An important project I work on, related to reading EO files that don't always fit into the nice neat bounds of a COG or even a GDAL-compatible format, is Satpy. We try our best to provide an easy way to load data from these formats as xarray DataArrays, and then provide interfaces for common tasks like compositing into RGBs, various types of corrections/modifications/enhancements, resampling, and saving to formats like GeoTIFF and NetCDF.
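For a sense of what that looks like, a minimal sketch of typical Satpy usage (the reader name and file path are just illustrative):

```python
# A minimal sketch of typical Satpy usage; reader name and file path are
# illustrative.
from satpy import Scene

scn = Scene(filenames=["OR_ABI-L1b-RadF-M6C01_G16_example.nc"], reader="abi_l1b")
scn.load(["C01"])   # load channel 1
data = scn["C01"]   # an xarray.DataArray; geolocation lives in data.attrs["area"]
```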
### Geolocation and CRS
Side note: rioxarray has actually switched to depending on pyproj CRS objects for its in-xarray CRS representation.
Something we deal with in Satpy and pyresample a lot is swath data or data that is not strictly on a uniformly spaced grid. A GDAL-compatible case is GCPs that define a limited set of points that describe the overall geolocation for a raster. Other cases are those like the VIIRS and MODIS instruments that are not uniformly spaced and require full 2D longitude and latitude arrays to be accurate. Data like this can't be efficiently represented in a COG unless it is resampled to a grid first.
Also, the grids we usually deal with in our space (Satpy and Pyresample are maintained by the Pytroll community) are not defined by any authority like EPSG. Not only is the CRS for these data "custom", but the extents and per-pixel spatial resolution may differ from case to case.
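For concreteness, here's how pyresample represents swath geolocation with full 2D coordinate arrays (the coordinate values below are fabricated):

```python
# How pyresample represents swath geolocation with full 2D lon/lat arrays.
import numpy as np
from pyresample.geometry import SwathDefinition

lons = np.linspace(-95.0, -85.0, 10 * 3200).reshape(10, 3200)  # fake 2D lons
lats = np.linspace(25.0, 35.0, 10 * 3200).reshape(10, 3200)    # fake 2D lats
swath = SwathDefinition(lons=lons, lats=lats)
```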
## Where Do We Go From Here
### geoxarray
So re-reading some of the discussions you've linked to reinforces that I think geoxarray is the library that is the connecting piece. If not in implementation, then at least in spirit/purpose. I just recently merged a PR in geoxarray to do very very (very very) basic dimension handling/renaming to try to automatically (or manually) know which dimensions are spatial (x, y, vertical) and time. I've been working on documentation for a bit now, but next I plan on re-implementing the CRS storage solution from rioxarray. One thing that popped into my head when reading your post above is that I still don't necessarily like how this CRS/grid_mapping looks to a new user, but right now it is what it has to be.
On top of the ideal case of gridded data with a CRS + 1D `x` coordinate + 1D `y` coordinate, I want geoxarray to handle the other cases I listed above (2D lon + 2D lat, GCPs, etc.). I want geoxarray to provide this information easily.

### Resampling, Pyresample 2.0, and Geometry objects
We are actively designing, discussing, and implementing a Pyresample 2.0. You can see the concepts as we create them in this milestone: https://github.com/pytroll/pyresample/milestone/3. One of the big ones we've been working on is "Resampler" objects. Another is "Geometry" objects (see below). I'd like geoxarray to provide a `.geo.resample` interface to pyresample's resamplers as well as those algorithms that are in GDAL and other libraries.

However, I'm not sure exactly how users will want to use this. For us in pyresample-land, we typically have a "geometry" object that represents a CRS + extents (read: bounding box) + size. This specific case is currently called an "AreaDefinition". When we want to resample data to be on the same coordinate system, we resample to the same AreaDefinition. So something like:
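```python
# Sketch only: `.geo.resample` is the proposed interface, not an existing API,
# and the grid parameters are made up. `data1`/`data2` are assumed DataArrays.
from pyresample.geometry import AreaDefinition

target_area = AreaDefinition(
    "my_grid", "hypothetical LCC grid", "my_grid",
    {"proj": "lcc", "lat_0": 25.0, "lat_1": 25.0, "lon_0": -95.0, "ellps": "WGS84"},
    1500, 1000,                                       # width, height in pixels
    (-2_500_000, -1_500_000, 2_500_000, 1_500_000),   # extent in projection meters
)

data1_grid = data1.geo.resample(target_area)
data2_grid = data2.geo.resample(target_area)
result = data1_grid - data2_grid  # aligned on the same coordinates now
```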
But maybe other communities will want to do:
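```python
# Same proposed accessor, still a sketch: resample one array directly onto
# another's geometry, inferring the target CRS/extent/shape from it.
data2_on_data1 = data2.geo.resample(data1)
result = data1 - data2_on_data1
```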
And still others, especially based on your mention of it above, would want to use `.sel` and index into the data at specific coordinates. This is the hardest use case for me to wrap my head around, as it seems like a special case of my resampling examples above and is generally limited by the "index" and `.sel` interfaces. The `.sel` interface hides what you really want to do, or what you might want to do, which in its most basic case is nearest-neighbor "resampling". Here resampling would be indexing of pixels of interest + aggregation/merging of those pixels (nearest neighbor, bilinear, etc.).
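A sketch of what that selection-style usage might look like (this is standard xarray `.sel` syntax; the CRS-aware index behind it is the missing piece):

```python
# Standard xarray syntax; the CRS-aware index behind it is the missing piece.
# Coordinates are illustrative projection meters; `data` is an assumed DataArray.
value = data.sel(x=-1_234_500.0, y=4_567_800.0, method="nearest")
```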
### I/O in Geoxarray

I generally haven't considered doing any I/O in geoxarray beyond wrapping xarray's `open_dataset` to parse out CRS/dimension information, and maybe a `to_netcdf` that handles the CF conversion for CRS/coordinate information. Other than that, I guess it could wrap other libraries if needed to make things easier, but if those other libraries use geoxarray then maybe that doesn't make sense. Tough call.
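If it did wrap `open_dataset`, a minimal sketch of the CRS-parsing half might look like this (the function name and attrs-based CRS storage are stand-ins, not a settled design):

```python
# Placeholder sketch of the CRS-parsing half of a wrapped open_dataset.
import xarray as xr
from pyproj import CRS

def open_geo_dataset(path, **kwargs):
    ds = xr.open_dataset(path, **kwargs)
    for var in ds.data_vars.values():
        gm_name = var.attrs.get("grid_mapping")  # CF convention pointer
        if gm_name and gm_name in ds:
            # pyproj can build a CRS straight from CF grid-mapping attributes
            ds.attrs["crs"] = CRS.from_cf(ds[gm_name].attrs)
            break
    return ds
```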
### Anything else for geoxarray?

What else should geoxarray do?