@sharkinsspatial
Last active February 10, 2023 01:06

Holy grail

Before diving too deeply into the various friction points when working with archives of earth observation data in xarray, let's look at a more optimal case from the earth systems world. The notebook here demonstrates how, by using zarr's consolidated metadata option to access dimension and chunk reference information, a massive dataset's dimensions and variables can be loaded extremely quickly. With this consolidated metadata available to reference chunks on disk, we can leverage xarray's dask integration and use normal xarray operations to lazily load chunks in parallel and perform our calculations with dask's blocked algorithm implementations. Gravy.
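For concreteness, here is a minimal sketch of that pattern (hedged: the bucket URL and the temperature/time variable names are placeholders, not a real dataset):

import fsspec
import xarray as xr

# fsspec exposes the object store as the key/value mapping zarr expects.
store = fsspec.get_mapper("s3://example-bucket/example-dataset.zarr", anon=True)

# consolidated=True reads a single .zmetadata key instead of touching every
# array's metadata, so the full set of dimensions/variables loads almost instantly.
ds = xr.open_zarr(store, consolidated=True)

# Nothing is read yet; each chunk is a dask task that only runs when needed.
monthly_mean = ds["temperature"].groupby("time.month").mean()
result = monthly_mean.compute()  # chunks are fetched and reduced in parallel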

Challenges

But the earth observation story is more complicated... Not everything lives in standardized file containers and, more importantly, our grid coordinate systems are "all over the map" :] Here are some of the current challenges.

  1. Consolidated metadata. Arrays of STAC items using the projection extension actually provide a rough facsimile of zarr's consolidated metadata for referencing bytes in cloud storage. When referencing COGs this is an even closer approximation, as we can use windowed reads to access smaller chunks even for large file containers. Consistent STAC metadata is key for providing lazy chunked access to large EO data archives.

  2. CRS Representation. xarray's current representational model does not provide an explicit approach for storing CRS information on dimension coordinates for use with an index. One of xarray's most valuable features is alignment on index coordinates, but its utility is limited with disparate CRSs due to the requirement for pre-processing into common coordinate systems. Several projects have used bolt-on approaches to manage CRS information, but each has limitations: rioxarray relies on rasterio and adds convenience methods that require geospatial knowledge, while geoxarray paused development due to technical difficulties. There are some exciting new developments on this front, though; more on this later.

  3. GDAL. There is a nightmarish mishmash of formats for EO data that have proliferated over the years. Luckily we have GDAL to provide a performant, consistent access model for these formats. But despite its robustness and flexibility, GDAL/Rasterio have two issues that make them less than ideal for use in large parallelized read scenarios with dask. The first is the thread-safety restriction that GDAL dataset handles cannot be shared between threads (a common workaround is sketched after this list). This issue is especially relevant with dask, as in many cases it needs to opaquely support underlying ThreadPoolExecutor models and a variety of multi-processing models (which in turn might use threads internally), leading to a whole host of interesting, hard-to-reproduce failure scenarios. I'll let @vincentsarago weigh in with more info here. Gabe Joseph from Coiled has been providing some lower level thread management in the excellent stackstac, but there may be other options to consider here, as there is overhead in managing multiple dataset handles. Probably more important, and closer to @vincentsarago's heart, is the lack of true asyncio mechanisms in GDAL/Rasterio. Though this is a bit of a moot point without a non-GDAL I/O library given the blocking issues at the GDAL level; see here for a more in-depth discussion.
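For context on the thread-safety point, a minimal sketch of the usual workaround (placeholder URL and window sizes; this is one common pattern, not the stackstac implementation):

import dask
import rasterio
from rasterio.windows import Window

@dask.delayed
def read_window(url, window):
    # Open a fresh GDAL dataset handle inside the task so no handle is ever
    # shared between threads or processes; the cost is repeated opens.
    with rasterio.open(url) as src:
        return src.read(1, window=window)

url = "https://example.com/some-cog.tif"
tasks = [read_window(url, Window(col_off, 0, 512, 512)) for col_off in range(0, 2048, 512)]
blocks = dask.compute(*tasks)  # windows are read in parallel, one handle per task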

Where Do We Go From Here

So this is a long list of problems with not much talk about solutions. Where do we want to go? I initially wrote some of these notes a few weeks ago, but I got scooped by some smarter folks in this discussion. My vision is that a scientist can

  1. Use a list of STAC items with the projection extension and an xarray convenience method to create an xarray dataset with concatenated spatial dimensions (see the sketch after this list).
  2. Seamlessly use a pure asyncio I/O library like aiocogeo to support dynamic reads in parallel dask workers.
  3. Use automated, CRS aware coordinate index alignment via native xarray resampling that users do not have to be explicitly aware of.
  4. Use CRS aware indexes for selection and use with core xarray operations.
  5. Potentially have the ability to use a classic non-asyncio driver such as rasterio to save analysis data in EO formats, as rioxarray does, but with blocked parallel algorithms that could be constructed in a dask graph.
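As a concrete illustration of points 1 and 2, a hedged sketch using libraries already mentioned above (the STAC API URL, collection, bbox and asset names are placeholders):

import pystac_client
import stackstac

catalog = pystac_client.Client.open("https://example.com/stac/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[-105.3, 39.9, -105.1, 40.1],
    datetime="2021-06-01/2021-06-30",
).item_collection()

# stackstac reads the projection extension metadata on each item to build a
# lazy (time, band, y, x) DataArray backed by dask, without touching any pixels.
da = stackstac.stack(items, assets=["B04", "B08"], resolution=10)
ndvi = (da.sel(band="B08") - da.sel(band="B04")) / (da.sel(band="B08") + da.sel(band="B04"))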

I think we are at a very exciting spot right now and there are new developments that can facilitate this. I would like to see us focus our efforts and thinking around the slightly longer game of helping these developments mature, so that in the end the community is using a consistent mental model and set of tools. I also feel that, given the nature of the issues, it might be better to focus on smaller, modular tooling rather than more monolithic solutions like rioxarray going forward. What is happening and what needs to happen?

  1. Most vital is the current index refactoring project occurring in xarray that will facilitate the option of CRS aware indexes. Read this and this for further context. I don't have the skills to support the xarray efforts but I do think we can assist @djhoese as he works to implement CRS aware indexes and grid resampling. I would love us to coordinate more with him on this.

  2. Work to battle-harden aiocogeo and make it usable across the variety of authentication and storage schemes that GDAL supports. I think in the near term all of this work should focus on building a pure, non-Rasterio I/O implementation specifically for COGs, while we use fallback Rasterio/GDAL I/O operations for legacy formats and continue to push the community to COGify the world.

  3. Investigate helping Microsoft coordinate funding for this work. We at DevSeed are probably not the best ones to actually implement this code, but I think this approach is key to making the Planetary Computer vision of seamless, hyper-large-scale analysis of EO data a frictionless reality for scientists.

  4. Focus the community. There is a lot of uncertainty around the path forward in this space at the moment. I would love to see someone smarter than me take this distillation and turn it into something more cohesive that could be used to guide development planning and work going forward.

@djhoese commented Jun 2, 2021

Thank you so much for putting this together. This was a good read and a good summary of some things going on. I have a lot of thoughts (maybe too many) on all of this, as you've pointed out, but I'll try to summarize some things related to what you've said and some other things that I think are missing. Normally in other github issues I've filtered what I've said as being off topic, but I'm not sure I should do that here 😜. I'm not sure how to structure these thoughts, but I'll try to avoid complete stream of consciousness.

Challenges

File formats

An important project I work on related to reading EO files that don't always fit into the nice neat bounds of a COG or even a GDAL-compatible format is Satpy. We try our best to provide an easy way to load data from these formats as xarray DataArrays and then provide interfaces for common tasks like compositing into RGBs, various types of corrections/modifications/enhancements, resampling, and saving to formats like geotiff and netcdf.
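For readers unfamiliar with it, a rough sketch of that workflow (the reader name, file path and channel are placeholders for whatever sensor you have):

from satpy import Scene

scn = Scene(filenames=["/data/abi_l1b_example.nc"], reader="abi_l1b")
scn.load(["C01"])                     # each loaded product is a dask-backed xarray DataArray
c01 = scn["C01"]                      # geolocation/CRS metadata rides along in .attrs["area"]
scn.save_datasets(writer="geotiff")   # or resample first, then write geotiff/netcdf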

Geolocation and CRS

Side note: rioxarray has actually switched to depending on pyproj CRS objects in the in-xarray object representation.
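For reference, the rioxarray accessor pattern being referred to looks roughly like this (the file path is a placeholder):

import rioxarray  # registers the .rio accessor on xarray objects

da = rioxarray.open_rasterio("example-cog.tif")
print(da.rio.crs)                      # CRS parsed from the file
da = da.rio.write_crs("EPSG:32633")    # attach/replace CRS on the object
da_4326 = da.rio.reproject("EPSG:4326")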

Something we deal with in Satpy and pyresample a lot is swath data or data that is not strictly on a uniformly spaced grid. A GDAL-compatible case is GCPs that define a limited set of points that describe the overall geolocation for a raster. Other cases are those like the VIIRS and MODIS instruments that are not uniformly spaced and require full 2D longitude and latitude arrays to be accurate. Data like this can't be efficiently represented in a COG unless it is resampled to a grid first.
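In pyresample terms that swath case looks something like the following (synthetic coordinates, just to show the shape of the problem):

import numpy as np
from pyresample.geometry import SwathDefinition

# Every pixel carries its own lon/lat; no affine transform + CRS can describe it,
# which is why this kind of data can't go straight into a COG without resampling.
lons = np.linspace(-120, -110, 500 * 400).reshape(500, 400)
lats = np.linspace(30, 40, 500 * 400).reshape(500, 400)
swath = SwathDefinition(lons=lons, lats=lats)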

Also, the grids we usually deal with in our space (Satpy and Pyresample are maintained by the Pytroll community) are not defined by any authority like EPSG. Not only is the CRS for these data "custom", but the extents and per-pixel spatial resolution may differ from case to case.

Where Do We Go From Here

geoxarray

So re-reading some of the discussions you've linked to reinforces my sense that geoxarray is the library that is the connecting piece, if not in implementation, then at least in spirit/purpose. I just recently merged a PR in geoxarray to do very very (very very) basic dimension handling/renaming to try to automatically (or manually) know which dimensions are spatial (x, y, vertical) and time. I've been working on documentation for a bit now, but next I plan on re-implementing the CRS storage solution from rioxarray. One thing that popped into my head when reading your post above is that I still don't necessarily like how this CRS/grid_mapping looks to a new user, but right now it is what it has to be.

On top of the ideal case of gridded data with a CRS + 1D x coordinate + 1D y coordinate, I want geoxarray to handle the other cases I listed above (2D lon + 2D lat, GCPs, etc). I want geoxarray to provide this information easily.

Resampling, Pyresample 2.0, and Geometry objects

We are actively designing, discussing, and implementing a Pyresample 2.0. You can see the concepts as we create them in this milestone: https://github.com/pytroll/pyresample/milestone/3. One of the big ones we've been working on is "Resampler" objects. Another one is "Geometry" objects (see below). I'd like geoxarray to provide a .geo.resample interface to pyresample's resamplers as well as those algorithms that are in GDAL and other libraries.

However, I'm not sure exactly how users will want to use this. For us in pyresample-land, we typically have a "geometry" object that represents a CRS + extents (read: bounding box) + size. This specific case is currently called an "AreaDefinition". When we want to resample data to be on the same coordinate system we resample to the same AreaDefinition. So something like:

new_data = resample(orig_data, area_def, **kwargs)

But maybe other communities will want to do:

new_data = resample(orig_data, some_other_xarray_object, **kwargs)

And still others, especially based on your mention of it above, would want to use .sel and index into the data at specific coordinates. This is the hardest use case for me to wrap my head around, as it seems like a special case of my resampling examples above and is generally limited by the "index" and .sel interfaces. The .sel interface hides what you really want to do, or what you might want to do, which in its most basic case is nearest neighbor "resampling". Here resampling would be indexing of the pixels of interest + aggregation/merging of those pixels (nearest neighbor, bilinear, etc.).
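To make the AreaDefinition pattern above concrete, a hedged sketch with pyresample's classic nearest-neighbour resampler (all grid parameters and data are made up):

import numpy as np
from pyresample.geometry import AreaDefinition, SwathDefinition
from pyresample.kd_tree import resample_nearest

# Target grid: CRS + extents + size, i.e. an AreaDefinition.
area_def = AreaDefinition(
    "example_grid", "Example 1 km LAEA grid", "example_grid",
    {"proj": "laea", "lat_0": 45, "lon_0": -100, "units": "m"},
    1000, 1000,                          # width, height in pixels
    (-500000, -500000, 500000, 500000),  # extent in projection metres
)

# Source data on a swath (2D lon/lat), as in the VIIRS/MODIS case above.
lons = np.random.uniform(-110, -90, (800, 800))
lats = np.random.uniform(40, 50, (800, 800))
data = np.random.rand(800, 800)
swath = SwathDefinition(lons=lons, lats=lats)

# The resample(orig_data, area_def, **kwargs) call from the example above:
new_data = resample_nearest(swath, data, area_def, radius_of_influence=5000)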

I/O in Geoxarray

I generally haven't considered doing any I/O in geoxarray beyond wrapping xarray's open_dataset to parse out CRS/dimension information and maybe a to_netcdf that handles the CF conversion for CRS/coordinate information. Other than that, I guess it could wrap other libraries if needed to make things easier, but if those other libraries use geoxarray then maybe that doesn't make sense. Tough call.

Anything else for geoxarray?

What else should geoxarray do?

@geospatial-jeff

@sharkinsspatial Thanks for writing this up! I have plenty of other ideas bouncing around about many of the technologies you've mentioned, but I'm going to focus mostly on the perspective of optimized COG readers (and how STAC/dask can help support this). I've been playing around with COG readers for a while now, starting with rio-tiler (actually how I first ran into @vincentsarago), which ultimately led to writing "truly asynchronous" readers like aiocogeo.

Why Asyncio

I ultimately went with asyncio because it's 3-4x more performant than GDAL with respect to overall throughput. There are several reasons why, and some other benefits to this:

  • async/await is a better concurrency model for IO-bound work (reading COGs is just a bunch of HTTP range requests). Async doesn't make the code faster (one async request takes about the same time as one sync request), but it lets you push to higher overall throughput with lower hardware utilization than the sync model, which means it's not only faster at scale but also WAY cheaper. Async/await has very good economies of scale: as you scale up, the serving cost decreases.
  • Writing an almost python-native reader was really important to me (not just because it supports asyncio). The geospatial community in general has a lot of great python tooling: Dask has a bunch of libraries, and the STAC community is using python for most of their tooling as well. GDAL supports a ton of different levers for configuring how it behaves, but when you are calling the code through rasterio for a use case where you inherently care about optimizing the IO-bound work, it's tough to configure GDAL to perform optimally. There are several things in particular that are hard to configure with GDAL, but not so difficult to expose through a python-native reader:
    • Cache configuration. GDAL's caching support is actually quite good, but it's entirely local, which doesn't translate well into horizontally scaled services, which most people are using nowadays (and what distributed dask uses). Distributed header caching has HUGE performance benefits when reading COGs, as it halves the number of HTTP requests required to read a block in the optimized use case (2 -> 1 request).
    • Connection management. Connection pooling and concurrency limits are available through mechanisms like asyncio.Semaphore or aiohttp.TCPConnector. Snipping slow or stale connections from the connection pool can help a lot too.
    • Asyncio provides a ton of different ways to manage how HTTP requests are performed (both patterns are sketched after this list). asyncio.gather runs all the coroutines concurrently, blocks until they are all done, and returns the results in the same order as the inputs. asyncio.as_completed returns an iterator that yields results as they finish, which helps reduce the impact of slow reads, which are bound to happen no matter how fast the reader is; you can use this to build just-in-time data pipelines which are very efficient to run.
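A hedged sketch of those two patterns with aiohttp (the URL and byte ranges are placeholders; this is not aiocogeo's internal code):

import asyncio
import aiohttp

async def fetch_range(session, sem, url, start, end):
    # The semaphore bounds in-flight requests; the TCPConnector below bounds
    # the connection pool, the two knobs mentioned above.
    async with sem:
        async with session.get(url, headers={"Range": f"bytes={start}-{end}"}) as resp:
            return await resp.read()

async def read_blocks(url, ranges):
    sem = asyncio.Semaphore(16)
    async with aiohttp.ClientSession(connector=aiohttp.TCPConnector(limit=32)) as session:
        tasks = [fetch_range(session, sem, url, s, e) for s, e in ranges]
        # gather: wait for everything, results come back in input order.
        return await asyncio.gather(*tasks)
        # as_completed alternative: handle each block as soon as it arrives,
        # so one slow read doesn't stall the rest of the pipeline:
        # for fut in asyncio.as_completed(tasks):
        #     block = await fut
        #     process(block)

ranges = [(i * 65536, (i + 1) * 65536 - 1) for i in range(8)]
blocks = asyncio.run(read_blocks("https://example.com/some-cog.tif", ranges))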

How STAC/COG/xarray can help

There are a couple ways they can help make I/O more optimized for a library like aiocogeo.

  1. Those who are maintaining STAC catalogs of COGs should ALWAYS use the file extension, which in particular contains the file:header_size field, allowing the reader to always read/cache the COG header in a single HTTP request (see the sketch after this list).

  2. Aligning the chunk size and spatial extent of the xarray to the block size and spatial extent of the image (or group of images) is the best way to optimize I/O because each chunk in the array translates to a single HTTP request from a single coroutine. I have to read up more on the current efforts to support CRS in xarray, but I'm assuming improved support would help with this as we can use xarray to do coordinate transformations while the reader just does I/O bound work. Building tooling around converting from STAC to xarrays optimized for this alignment will be critical for scaling. Note this does require the underlying data to be structured in a way that makes this possible (more on this below).

  3. Data storage and partitioning is really important. A directory of strip-aligned GTiffs sitting on an FTP server isn't going to be fast to access regardless of how fast the reader is. Aligning datasets to well known grids (ex. MGRS) and implementing proper data partitioning to maximize storage-specific rate limits (ex. aws earth landsat-8) is a great way to increase the scalability of the data. Aligning the internal blocks of each COG to a well known tiling scheme (think OGC TMS) is very scalable and should work nicely with chunked datasets (because the grid you are "chunking" against is already known). Aligning internal blocks to a well known grid also lets you do cool things like dynamically serve data without ever reading a geokey.
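A small sketch of point 1, reading the header in one shot using the file extension's file:header_size field (the item URL and asset key are placeholders):

import asyncio
import aiohttp
import pystac

async def read_cog_header(item_url, asset_key="cog"):
    item = pystac.Item.from_file(item_url)
    asset = item.assets[asset_key]
    # file:header_size tells us how many leading bytes hold all of the TIFF IFDs,
    # so the whole header can be read (and cached) with a single range request.
    header_size = asset.extra_fields["file:header_size"]
    async with aiohttp.ClientSession() as session:
        headers = {"Range": f"bytes=0-{header_size - 1}"}
        async with session.get(asset.href, headers=headers) as resp:
            return await resp.read()

header_bytes = asyncio.run(read_cog_header("https://example.com/items/example-item.json"))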

Aiocogeo Next Steps

I learned a lot writing aiocogeo, and I don't think it provides what we need in its current state; it has a lot of issues which I don't care to elaborate on in this thread because I think they're a little out of scope. I am working on a refactor which should be public soonish, so instead I'll focus on what I'm trying to do differently:

  1. Define a functional, lower-level interface for reading COGs in an asynchronous fashion. All of the functions are pure (no side effects, and cacheable), which makes the library more configurable and easier to adapt to all of the COG variations one might see in the wild. The object-oriented interface presented by the current aiocogeo (inspired by rio-tiler, which is quite abstracted since, in essence, it's calling GDAL!) is nice, but it is too strongly coupled with the low-level HTTP requests, which forces the library to compromise between providing a super optimized COG reader and supporting many different flavors of COG. This low-level functional interface makes it easier to support writing COGs as well!

  2. Don't abstract too early. I made the mistake of (1) trying to support the same configuration options provided by GDAL in the same way and (2) adding too many custom filesystems (that I wrote) too early. The aiocogeo refactor currently only works with S3 to keep the problem constrained, and when it does support multiple filesystems I'd like to use fsspec, which supports several async HTTP filesystems. It will also only include certain functionality if it adds value to the library, not for the sake of reaching feature parity with GDAL (an impossible task).

  3. In the current aiocogeo implementation, CPU-bound and IO-bound work are inherently coupled. The interfaces that define IO-bound code (ex. sending a GET request) are the same ones that also do CPU-bound work like decompression. These interfaces should be decoupled so we spend less time context switching.

  4. Aiocogeo is currently very opinionated in how it performs HTTP requests. The lower-level functional interface should return coroutines instead of image bytes, giving the caller more control over how those requests are performed (and more room to optimize!). We can wrap up the lower level code in an abstracted interface that allows less technical users to use the library, but the lower-level code will still be available for people to implement as they need.

  5. Metatiling (merging consecutive range requests) is a really powerful way of increasing overall throughput (particularly for cloud object stores like S3) by reducing the number of HTTP requests required to read the same amount of imagery (a simplified sketch follows this list). Aiocogeo's current implementation is very abstracted; the refactored version will aim to have a more transparent interface for metatiling, because it's really really important to get right for maximizing scalability.
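A simplified sketch of the metatiling idea in point 5, merging adjacent (or nearly adjacent) byte ranges into fewer requests; the gap threshold is an illustrative knob, not aiocogeo's API:

def merge_ranges(ranges, max_gap=0):
    """Merge (start, end) byte ranges whose gaps are at most max_gap bytes."""
    merged = []
    for start, end in sorted(ranges):
        if merged and start - merged[-1][1] <= max_gap + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

# Four block reads collapse into two HTTP requests:
print(merge_ranges([(0, 999), (1000, 1999), (5000, 5999), (6000, 6999)]))
# [(0, 1999), (5000, 6999)]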

Quoting @sharkinsspatial above:

"I think in the near term all of this work should focus on building a pure, non-Rasterio I/O implementation specifically for COGs, while we use fallback Rasterio/GDAL I/O operations for legacy formats and continue to push the community to COGify the world."

I hope the refactored version of aiocogeo can help drive towards this!

@djhoese commented Jun 2, 2021

Just in case this group isn't aware, I started a PR for rasterio to read open file objects (like those from fsspec) into GDAL directly: rasterio/rasterio#2141. Obviously aiocogeo sounds way more performant, but this might be a step in the middle of that "COGify the world" process.
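The kind of usage that work enables looks roughly like this (hedged: exact behaviour depends on the PR, and the URL is a placeholder):

import fsspec
import rasterio
from rasterio.windows import Window

with fsspec.open("https://example.com/some-cog.tif", "rb") as f:
    # Hand the already-open file object to rasterio/GDAL instead of letting
    # GDAL manage its own VSI networking.
    with rasterio.open(f) as src:
        print(src.profile)
        block = src.read(1, window=Window(0, 0, 512, 512))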

@sharkinsspatial (Author)

Including a reference to the xarray flexible index work for tracking purposes https://github.com/pydata/xarray/blob/main/design_notes/flexible_indexes_notes.md
