
How I do machine learning with geospatial data

I have a couple of AI/ML projects related to mapping things, often conservation-related, using remote sensing data. Some details and packages will vary, but the process below describes how I generally approach these kinds of problems. Some of these tools I have only touched briefly, but I like them, and this is more an outline of how I would like to approach a new project than a retrospective look at my previous work.

We use AWS, so it makes sense to use datasets and services that are already hosted on AWS. The data discovery and loading part of this process would look somewhat different if we were using Azure and Planetary Computer, and very different if we were using GCP and Earth Engine.

Compute

All of my analysis will be done using Python on an AWS VM in the same region as my data on S3, probably using VSCode on SageMaker or JupyterLab. If my needs become large, I will simply increase the machine size. If my needs become very large, I will use Coiled to manage a Dask cluster on AWS.
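
If I do reach the cluster stage, spinning one up is only a few lines. This is a minimal sketch, assuming Coiled is already configured against our AWS account; the cluster name, region, and worker count are illustrative.

```python
import coiled
from dask.distributed import Client

# Keep workers in the same region as the data on S3 to avoid egress and latency
cluster = coiled.Cluster(
    name="geo-ml",        # hypothetical cluster name
    region="us-west-2",   # match the S3 bucket region
    n_workers=20,
)
client = Client(cluster)  # subsequent dask computations now run on the cluster
```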

Data discovery

I am assuming we are starting with a dataset of labels we want to predict from remote sensing imagery. These labels will be in the form of spatial vector data; my preference is to store them as geoparquet. First we need to get a sense of what data is available. I will typically do some exploration with a tool like Sentinel Hub, NASA Earthdata, or Planet Explorer to see what is available for my AOI and time period. Once I know what I want, I search for the public data using a STAC API through pystac-client. Earth on AWS lists many of the public data sources I would consider. The result of this search is a collection of STAC items, which I load into an xarray using odc-stac or stackstac. The NASA Python package earthaccess is how I would typically access and request NASA data, and results from this will also be returned as an xarray. For Planet data I would use their Python SDK, which can be used to order images and request delivery to an S3 bucket.
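
A minimal sketch of that search-and-load step, assuming Sentinel-2 L2A on the Element 84 Earth Search catalogue; the bounding box, date range, cloud-cover threshold, and chunk sizes are illustrative.

```python
import pystac_client
import odc.stac

# Search a public STAC API for items covering the AOI and time period
catalog = pystac_client.Client.open("https://earth-search.aws.element84.com/v1")
items = catalog.search(
    collections=["sentinel-2-l2a"],
    bbox=[18.3, -34.4, 18.6, -34.1],          # lon/lat AOI, illustrative
    datetime="2023-01-01/2023-12-31",
    query={"eo:cloud_cover": {"lt": 20}},
).item_collection()

# Lazily assemble the matching scenes into a dask-backed xarray
ds = odc.stac.load(
    items,
    bands=["red", "green", "blue", "nir"],
    resolution=10,
    chunks={"x": 2048, "y": 2048},
)
```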

Data loading and exploration

The starting point now is data in an S3 bucket (either my own or hosted by the data provider), typically stored as cloud-optimised GeoTIFFs but sometimes as NetCDF or Zarr. I load this cloud data into an xarray, with geospatial capabilities provided by rioxarray and distributed computation backed by dask. Now I will do a bunch of data exploration, manipulation and transformation. Data viz and plotting will be done with HoloViz tools. Any spatial vector data will be accessed and manipulated using geopandas, with data preferably stored as geoparquet. If I am working with a small amount of data (< 10 GB) I will keep it on S3 and not move it onto the disk of the VM I am using. If I am working with a medium amount of data (< 1 TB) I will copy the data to local disk to reduce I/O and latency issues. If I am working with a very large amount of data (>> 1 TB) I will reshape and reformat the data and store it in Zarr format on S3.
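
A minimal sketch of that loading step; the bucket paths, variable name, and chunk sizes are all hypothetical, and writing Zarr directly to S3 assumes s3fs is installed.

```python
import rioxarray
import geopandas as gpd

# Lazily open a cloud-optimised GeoTIFF on S3; nothing is read until computed
da = rioxarray.open_rasterio(
    "s3://my-bucket/imagery/scene.tif",      # hypothetical COG
    chunks={"x": 2048, "y": 2048},           # dask-backed chunks
)

# Labels stored as geoparquet, read straight from S3
labels = gpd.read_parquet("s3://my-bucket/labels.parquet")

# For the >> 1 TB case: rechunk and persist the cube as Zarr back on S3
ds = da.to_dataset(name="reflectance").chunk({"band": -1, "x": 1024, "y": 1024})
ds.to_zarr("s3://my-bucket/cube.zarr", mode="w")
```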

Machine learning model training

This process will be somewhat different depending on whether I am working with RGB/multispectral images from a single point in time, or with timeseries imagery and/or hyperspectral data. Timeseries and hyperspectral data have 4+ dimensions, so some libraries for deep learning on remote sensing data are not designed for them. Prior to model training, I will usually move the data to on-disk storage to reduce I/O and latency issues, unless it is very large, in which case it will be stored as Zarr on S3. If I am fitting a non-deep-learning model (e.g. Random Forest, XGBoost), I will either use basic scikit-learn functionality, or, if the data is very large, dask-ml, which provides scalable machine learning in Python using Dask. Here is an example of how I have done this before.
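
As an illustrative sketch (not the linked example), fitting a tabular model might look like this, assuming the predictors have already been sampled at the label locations into a feature matrix; the file names are hypothetical, and for very large sample tables the estimator would be swapped for a dask-ml or distributed XGBoost equivalent.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Hypothetical per-pixel features and class labels sampled at label locations
X = np.load("features.npy")   # shape (n_samples, n_features)
y = np.load("labels.npy")     # shape (n_samples,)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
clf.fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))
```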

For deep learning, if I am working with RGB/multispectral images from a single point in time, I will use torchgeo to construct PyTorch dataloaders and models. I will simply point the dataloaders at the GeoTIFFs and geoparquet spatial vector data stored on disk. To simplify matters, PyTorch Lightning will be used to train models. If I am working with timeseries or hyperspectral data, these will be stored as Zarr, and I will build a custom PyTorch dataloader using xbatcher to extract batches of data. Model training will again be done using PyTorch Lightning. All model management and logging will be done using Weights & Biases. Some examples of repos where I use this approach can be found here and here.
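
As an illustrative sketch (not one of the linked repos), a custom dataloader over a Zarr cube with xbatcher might look like this, assuming the cube has a "reflectance" variable and a co-registered "label" variable; the path, variable names, and patch size are hypothetical, and in practice the DataLoader would be wrapped in a LightningDataModule.

```python
import torch
import xarray as xr
import xbatcher

# Zarr-backed cube with predictors and rasterised labels (hypothetical names)
ds = xr.open_zarr("s3://my-bucket/cube.zarr")

# Cut the cube into fixed-size spatial patches; other dims are kept whole
bgen = xbatcher.BatchGenerator(ds, input_dims={"x": 256, "y": 256})

class ZarrPatchDataset(torch.utils.data.Dataset):
    def __init__(self, batch_generator):
        self.bgen = batch_generator

    def __len__(self):
        return len(self.bgen)

    def __getitem__(self, idx):
        patch = self.bgen[idx]
        x = torch.as_tensor(patch["reflectance"].values)  # e.g. (band, time, y, x)
        y = torch.as_tensor(patch["label"].values)        # per-pixel labels
        return x, y

loader = torch.utils.data.DataLoader(ZarrPatchDataset(bgen), batch_size=8)
```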

Inspiration

The patterns and tools I use are based primarily on recommendations from, and the excellent work of, Development Seed and Earthmover. Many of the libraries are part of the Pangeo ecosystem for working with earth science data in Python. The Development Seed blog has a nice series of posts outlining their ML stack.
