Skip to content

Instantly share code, notes, and snippets.

@vincentsarago
Created February 10, 2020 03:23
Show Gist options
  • Save vincentsarago/4e99843a154287f80feca8dbacb414a5 to your computer and use it in GitHub Desktop.
Save vincentsarago/4e99843a154287f80feca8dbacb414a5 to your computer and use it in GitHub Desktop.

COG Talk - Part 4 - Enabling Spatio-temporal data processing at scale

This blog is the fourth in a series called COG Talk, which looks at ways to use Cloud Optimized GeoTIFFs to efficiently render and analyze planetary data at massive scale.

After a refresh of what COGs are in Part 1, the introduction of mosaics in Part 2, and a fun experiment in Part 3, today we are going to see how COGs can be useful for large scale spatio-temporal dataset.

Introduction slide from a talk given at GéoMTL conference (slides).

Cloud Optimized GeoTIFFs (COGs)

First, the basics. As of today, the Cloud Optimized GeoTIFF specification can be summarized as a tiny list of requirements:

  • the data has to be tiled (internally split into chunks of regular size)
  • the file has a header with the location of each tile
  • the file can have internal overview

Basically, you take a well known open format (created in the 90's), enforce good usage and internal architecture, and then have a binary file optimized for remote access. Because the header has a map to the internal tiles and the geographic information, libraries like GDAL can easily understand which tiles to fetch (using GET Range-Requests) for a given area of interest (AOI), minimizing data transfer and HTTP requests.

Like this cute raccoon, GDAL is able to take just what it needs from a COG and runs really fast.

Once the data is in this format, one of the most common and useful next steps is to visualize it as a web map. Because we can read partial parts of the data in an optimized way and, if present, obtain a preview of the high resolution data from internal overviews, we can dynamically generate web map tiles from COGs at request time.

Web maps are often based on static raster tiles stored as jpeg or png. A full set of tiles is created for each zoom level and stored in a tree-based file structure that allows users to zoom and pan a map. This approach requires you to pre-generate a tile tree consisting of millions of files. With a COG you can use internal overviews to create multiple zooms and internal tiles to stand in for map tiles, and thus, only have one file to manage for a large area. We call this process "dynamic tiling" because we access the raw data (e.g. surface reflectance or elevation) and then apply algorithms on it before creating the png/jpeg tile to display in the browser.

How to create valid COGs

Common Geographic Information System (GIS) software like QGIS supports exporting raster data to COG natively but if you want to do it programmatically you can use GDAL commands

# First add internal overviews
gdaladdo my-file.tif

# Then translate the geotiff to a COG (`TILED=YES`) and keep overviews
gdal_translate -of GTiff -co TILED=YES -co COPY_SRC_OVERVIEWS=YES -co COMPRESS=DEFLATE my-file.tif my-cog.tif

or use rio-cogeo (see COG Talk - Part 1)

rio cogeo my-file.tif my-cog.tif --cog-profile deflate

COGs everywhere

More organizations are storing their data as a COG (Landsat level 2, USGS DEM, MODIS on AWS), and while it's not the most storage-efficient format (in comparison to JPEG2000), a COG is a more user-friendly format that enables fast and cheap access to the data (see this blog for a comparison).

Having access to more data creates another kind of problem (a good one): due to the increase in data availability, we need to implement easier ways to access/process/share them at scale. Development Seed regularly advocates for open datasets, but what we love even more is to enable people to access and use the data. Take for instance, the example of opening up Landsat data. It was a big win for the open data community, but it posed several challenges using it with Earth Explorer. Libra was born as a result of needing a better tool and process for using this data, and still to this day has 3000 visitors per month 😱 !

One scene or a million!

The combination of distributed cloud services and an increase in datasets being stored as COGs, means users are now pivoting from single scene workflows to large scale processing (e.g. state, country wide). To support this, we created the mosaic-json specification, an open standard for representing metadata about sets of Cloud-Optimized GeoTIFF (see COG Talk - Part-2). With simplicity and performance in mind, the specification uses a simple quadkey based spatial index and allows overlapping scenes to create spatio-temporal mosaics.

In a dynamic tiling workflow, the mosaic-json is a simple JSON file that acts as a proxy between Web Map tile requests (using Z-X-Y Slippy map tilenames) and the list of files intersecting with this tile.

Spatio-temporal mosaic-json.

Real World example

ABoVE: Landsat-derived Annual Dominant Land Cover Across ABoVE Core Domain, 1984-2014

Ref: https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1691 Format: GeoTIFF + internal overview

ABoVE dataset footprint

The ABoVE dataset is comprised of 175 different GeoTIFFs at 30m resolution derived from Landsat (5 and 7) surface reflectance values. While it covers a pretty large area, the other interesting part of this dataset is the temporal aspect, because each file has 31 different bands, one for each year between 1984 to 2014.

The dataset is distributed as a GeoTIFF with internal overview. Sadly, they are not aligned with the Cloud Optimized GeoTIFF format because they are not internally tiled and the overviews are located at the end of the files.

$ rio cogeo validate s3://ds-satellite/raw/ABoVE_LandCover_1984_2014/ABoVE_LandCover_Simplified_Bh17v10.tif
The following errors were found:
- The file is greater than 512xH or 512xW, but is not tiled
- The offset of the main IFD should be 8 for ClassicTIFF or 16 for BigTIFF. It is 32340442 instead
- The offset of the first block of overview of index 1 should be after the one of the overview of index 2
- The offset of the first block of overview of index 0 should be after the one of the overview of index 1
- The offset of the first block of the main resolution image should be after the one of the overview of index 2
s3://ds-satellite/raw/ABoVE_LandCover_1984_2014/ABoVE_LandCover_Simplified_Bh17v10.tif is NOT a valid cloud optimized GeoTIFF

GeoTIFF → COG → mosaicJSON

To create user interfaces (UI) and tools that are responsive as possible, we need the files stored as proper Cloud Optimized GeoTIFFs.

Here are the steps we took to convert the files**:**

  1. Download the whole dataset and upload the GeoTIFFs to S3.

    $ wget {url of the zip} $ unzip Annual_Landcover_ABoVE_1691.zip $ aws s3 sync . s3://my-bucket/raw/ABoVe/

  2. Deploy cogeo-watchbot-light (a serverless stack based on AWS Lambda to run rio-cogeo at scale).

    $ git clone https://github.com/developmentseed/cogeo-watchbot-light $ cd cogeo-watchbot-light $ sls deploy --stage production --bucket my-bucket --region us-east-1

  3. Create COGs.

    create list of files to translate

    $ aws s3 ls s3://my-bucket/raw/ABoVE/ | grep "Simplified" | awk '{print "s3://my-bucket/cogs/ABoVE/"$NF}' > list_raw_files.txt

    Send jobs to the stack

    $ pip install rio-cogeo rio-tiler $ cd scripts/ $ cat list_raw_files.txt | python -m create_jobs -
    -p webp
    --co blockxsize=256
    --co blockysize=256
    --op overview_level=6
    --op overview_resampling=bilinear
    --prefix cogs/ABoVE
    --topic arn:aws:sns:us-east-1:{AWS_ACCOUNT_ID}:cogeo-watchbot-light-production-WatchbotTopic

  4. Create a MosaicJSON (using cogeo-mosaic).

    $ aws s3 ls s3://my-bucket/cogs/ABoVE/ | awk '{print "s3://my-bucket/cogs/ABoVE/"$NF}' | cogeo-mosaic create - -o mosaic.json

Explore

When the COGs and the mosaicJSON are formatted correctly we can use the cogeo-mosaic-tiler stack to create web map tiles dynamically and visualize the data in a web map.

See it live: https://cogeo.xyz/projects/ABoVE/index.html

Modified version of cogeo-mosaic-viewer UI to visualize the ABoVE dataset.

The temporal side With the introduction of the mosaicJSON specification (see COG Talk - Part 2), we released an open source python module rio-tiler-mosaic to handle the creation of tiles using multiple files. One important feature of this plugin is its ability to do pixel selection dynamically, meaning the user can choose how each pixel value is created (e.g. take the pixel value from the first image or from the last one). It also enables custom pixel selection methods:

https://gist.github.com/vincentsarago/b8c649ac876241cc05a1da3c098a9670

The pixel selection method above was specifically design for the ABoVE dataset. On each tile request the dynamic tiler using this method will return the standard deviation value for the stack of bands, enabling us to find where the land cover classification values have changed over the 31 year span.

See it live: https://cogeo.xyz/projects/ABoVE/stddev.html

Standard deviation values for the full stack of 31 values (years).

❤️ STAC + COG

DevelopmentSeed is advancing the adoption of the new STAC metadata specification (checkout our latest sat-api-pg project). Standardized and well-formatted metadata is an important step toward the democratization of remote sensed data and it also help us to create nice visualization tools.

Introducing landsatlive.live

By combining both STAC and mosaicJSONs we built a small demo called landsatlive.live (shoutout to the great landsat.live by our friends at Mapbox**).** This demo lets you visualize mosaics created dynamically using sat-api (our STAC search api) for the Landsat 8 dataset hosted on AWS.

A mosaic of Landsat 8 scenes. Web map tiles created dynamically using sat-api + mosaic-json + rio-tiler-mosaic.

Conclusion

Cloud Optimized GeoTIFF format is an efficient way of storing massive dataset, …

  • We recently refactored the cogeo-mosaic project to make it a proper python module (special thanks to Jeff Albrecht ****for his great help).

    $ pip install cogeo-mosaic

    $ cogeo-mosaic --help Usage: cogeo-mosaic [OPTIONS] COMMAND [ARGS]... cogeo_mosaic cli. Options: --version Show the version and exit. --help Show this message and exit. Commands: create Create mosaic definition from list of files footprint Create geojson from list of files overview [EXPERIMENT] Create COG overviews for a mosaic

    $ cogeo-mosaic create my-list-of-cog.txt -o mosaic.json

We released cogeo-mosaic-viewer, a cogeo-mosaic plugin to visualize mosaicJSON locally.

$ pip install cython==0.28 # python-vtzero will only compile with Cython < 0.29
$ pip install git+https://github.com/developmentseed/cogeo-mosaic-viewer.git

$ cogeo-mosaic viewer --help                 
Usage: cogeo-mosaic viewer [OPTIONS] MOSAIC_PATH
  cogeo-mosaic Viewer.
Options:
  --style [satellite|basic]  Mapbox basemap
  --port INTEGER             Webserver port (default: 8080)
  --host TEXT                Webserver host url (default: 127.0.0.1)
  --mapbox-token TOKEN       Mapbox token
  --footprint                Visualize Mosaic Footprint
  --help                     Show this message and exit.

$ cogeo-mosaic viewer mosaic.json

The Serverless stack previously integrated in cogeo-mosaic is now hosted on a separate repo: https://github.com/developmentseed/cogeo-mosaic-tiler

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment