KMarkert/g4g-2022-rest-api-demo-serviceaccount.ipynb

## g4g-2022-rest-api-demo-serviceaccount.ipynb
{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "collapsed_sections": [],
      "private_outputs": true,
      "include_colab_link": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "view-in-github",
        "colab_type": "text"
      },
      "source": [
        "<a href=\"https://colab.research.google.com/gist/KMarkert/a7bc79a8367d4abd27f89ab1cc224268/g4g-2022-rest-api-demo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
      ]
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Copyright 2022 Google LLC. { display-mode: \"form\" }\n",
        "# Licensed under the Apache License, Version 2.0 (the \"License\");\n",
        "# you may not use this file except in compliance with the License.\n",
        "# You may obtain a copy of the License at\n",
        "#\n",
        "# https://www.apache.org/licenses/LICENSE-2.0\n",
        "#\n",
        "# Unless required by applicable law or agreed to in writing, software\n",
        "# distributed under the License is distributed on an \"AS IS\" BASIS,\n",
        "# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\n",
        "# See the License for the specific language governing permissions and\n",
        "# limitations under the License."
      ],
      "metadata": {
        "id": "BiKvq_m_HKFo"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "# Scaling with the Earth Engine REST API\n",
        "\n",
        "This is a demonstration notebook for using the Earth Engine REST API to .  See the complete guide for more information: https://developers.google.com/earth-engine/reference/Quickstart.\n",
        "\n",
        "Before getting started, we will need to enable a few cloud APIs and services for the cloud project you are working with:\n",
        "1. [Enable Earth Engine API here](https://console.cloud.google.com/flows/enableapi?apiid=earthengine.googleapis.com) to use cloud project with EE\n",
        "2. [Enable IAM API here](https://console.cloud.google.com/flows/enableapi?apiid=iam.googleapis.com) for managing service account and permissions\n",
        "3. [Setup storage bucket here](https://console.cloud.google.com/storage/create-bucket) to store outputs and upload to EE"
      ],
      "metadata": {
        "id": "L6UIyvAcHLme"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Getting started with authentication\n",
        "\n",
        "The first step is to login to Google Cloud and choose a project to make calls to Earth Engine. We will be using `gcloud` to create a service account and secret key for interacting with Earth Engine and the REST API."
      ],
      "metadata": {
        "id": "z54c7dRWNBS_"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# INSERT YOUR CLOUD PROJECT ID HERE\n",
        "PROJECTID = 'g4g22-demo'\n",
        "\n",
        "!gcloud auth login --project {PROJECTID}"
      ],
      "metadata": {
        "id": "CTf07fvdMnkA"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here we will create a service account for your cloud project. You can also create one [using the cloud console](https://console.cloud.google.com/iam-admin/serviceaccounts). If you already have a service account and would like to use that, then place the name but note you will get an error if it already exists."
      ],
      "metadata": {
        "id": "LduV1T11l_nZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# INSERT THE NAME OF YOUR SERVICE ACCOUNT HERE\n",
        "SA_NAME = 'kmarkert-test'\n",
        "SERVICE_ACCOUNT = f'{SA_NAME}@{PROJECTID}.iam.gserviceaccount.com'\n",
        "\n",
        "!gcloud iam service-accounts create {SA_NAME} --description=\"Demo SA for Earth Engine REST API\""
      ],
      "metadata": {
        "id": "8VYw8OLCdVYX"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Next we need to give our service account the correct [roles and permissions](https://developers.google.com/earth-engine/cloud/roles_permissions) so that we can make calls to Earth Engine with this account."
      ],
      "metadata": {
        "id": "mfMUnHQSnt1M"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!gcloud projects add-iam-policy-binding {PROJECTID} \\\n",
        "    --member='serviceAccount:'{SERVICE_ACCOUNT} --role='roles/earthengine.admin'\n",
        "\n",
        "# !gcloud projects add-iam-policy-binding {PROJECTID} \\\n",
        "#     --member='serviceAccount:'{SERVICE_ACCOUNT} --role='roles/storage.objectCreator'"
      ],
      "metadata": {
        "id": "LCMcLCc7G2jN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now that a service account has been created with the correct permissions, we will now create a secret key to authenticate with Google APIs."
      ],
      "metadata": {
        "id": "R5D4e_ocPJD-"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# INSERT YOUR SERVICE ACCOUNT HERE\n",
        "KEY = 'ee-demo-key.json'\n",
        "\n",
        "!gcloud iam service-accounts keys create {KEY} --iam-account {SERVICE_ACCOUNT}"
      ],
      "metadata": {
        "id": "jjZbfYQNyTuC"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The last step we need to do for setup is to [register the service account for use with Earth Engine](https://signup.earthengine.google.com/#!/service_accounts)."
      ],
      "metadata": {
        "id": "hwklcxWqm0OD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "import ee\n",
        "from google.auth.transport.requests import AuthorizedSession\n",
        "\n",
        "# ee.Authenticate()  #  or !earthengine authenticate --auth_mode=gcloud\n",
        "session = AuthorizedSession(ee.data.get_persistent_credentials())"
      ],
      "metadata": {
        "id": "7q4yHbZ04tmi"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Accessing and Testing your Credentials\n",
        "\n",
        "You are now ready to send your first query to the Earth Engine API. Use the private key to get credentials. Use the credentials to create an authorized session to make HTTP requests."
      ],
      "metadata": {
        "id": "oh94J398leZZ"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# get an authorized session using the service account\n",
        "from google.auth.transport.requests import AuthorizedSession\n",
        "from google.oauth2 import service_account\n",
        "\n",
        "# credentials = service_account.Credentials.from_service_account_file(KEY)\n",
        "# scoped_credentials = credentials.with_scopes(\n",
        "#     [\n",
        "#         'https://www.googleapis.com/auth/cloud-platform',\n",
        "#         'https://www.googleapis.com/auth/earthengine'\n",
        "#     ]\n",
        "# )\n",
        "\n",
        "# session = AuthorizedSession(scoped_credentials)"
      ],
      "metadata": {
        "id": "SvMI330albLo"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# send the HTTP call to the REST API\n",
        "url = 'https://earthengine.googleapis.com/v1alpha/projects/earthengine-public/assets/LANDSAT'\n",
        "\n",
        "response = session.get(url)\n",
        "\n",
        "from pprint import pprint\n",
        "import json\n",
        "pprint(json.loads(response.content))"
      ],
      "metadata": {
        "id": "-AM2oYWhnOik"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "If everything is configured correctly, running this will produce output that looks like:\n",
        "\n",
        "```\n",
        "{'id': 'LANDSAT',\n",
        " 'name': 'projects/earthengine-public/assets/LANDSAT',\n",
        " 'type': 'FOLDER'}\n",
        " ```\n",
        "\n"
      ],
      "metadata": {
        "id": "tuhs3aR3nROy"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Scaling EE Processing\n",
        "\n",
        "Now that we have everything set up for working with the REST API, imagine that we are going to create a system to detect deforestation at a yearly scale. However, our process relies on some local functionality that is not in Earth Engine. So, we will use Earth Engine to perform the large scale geoprocessing and get the results loacally using the REST API for use within our custom workflow. The results will be stored in cloud storage and create an image collection on Earth Engine that you or your users can use within the platform.\n",
        "\n",
        "This example leverages some examples and concepts from the following notebook: [Logistic regression the TensorFlow way](https://developers.google.com/earth-engine/guides/tf_examples#logistic-regression-the-tensorflow-way)"
      ],
      "metadata": {
        "id": "VtHpX1Bjnk46"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "To begin, we will use some third-party libraries to handle the requests to the REST API and geospatial data locally."
      ],
      "metadata": {
        "id": "IuZLrp5vhEwx"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "!pip install restee rioxarray rio-cogeo -q"
      ],
      "metadata": {
        "id": "EwgXoU6Fsrp1"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import ee\n",
        "import restee as ree"
      ],
      "metadata": {
        "id": "e9elkfHds6oq"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# hacky work around to get a class that works with restee\n",
        "# without having to use a service account\n",
        "# custom EE Session class that accepts an already authorized session\n",
        "class EESession_slim(ree.EESession):\n",
        "    def __init__(self, project, session):\n",
        "        self._PROJECT = project\n",
        "        self._SESSION = session\n",
        "\n",
        "\n",
        "# create an EESesssion object with the correct permissions\n",
        "ee_session = EESession_slim(PROJECTID, session)"
      ],
      "metadata": {
        "id": "DdQaE_ee5U9N"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# create an EESesssion object with the service account\n",
        "# ee_session = ree.EESession(PROJECTID, KEY)\n",
        "\n",
        "# authenticate EE with the session credentials\n",
        "# ee.Initialize(ee_session.session.credentials)\n",
        "ee.Initialize(ee_session.session.credentials, project=PROJECTID)"
      ],
      "metadata": {
        "id": "_vuoA56Dn07T"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "### Server-side Earth Engine computation\n",
        "\n",
        "We will build our deforestation prediction workflow in this example. The first step is to use Earth Engine for processing Landsat data and sampling for a training dataset using the Python API."
      ],
      "metadata": {
        "id": "-jh88uyHguly"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# get geometry information over Cambodia\n",
        "countries = ee.FeatureCollection('USDOS/LSIB_SIMPLE/2017')\n",
        "kh = countries.filter(ee.Filter.eq('country_na', 'Cambodia'))"
      ],
      "metadata": {
        "id": "s8cYcTFTEGsP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Cloud masking function.\n",
        "def ls_qa(image):\n",
        "  qaMask = image.select('QA_PIXEL').bitwiseAnd(int('111111', 2)).eq(0);\n",
        "  saturationMask = image.select('QA_RADSAT').eq(0);\n",
        "  mask = qaMask.And(saturationMask)\n",
        "  return ee.Image(\n",
        "      image.select('SR_B[2-7]') # select multispectral bands\n",
        "      .multiply(0.0000275).add(-0.2) # apply scale to reflectance\n",
        "      .updateMask(mask) # mask poor quality data\n",
        "      .copyProperties(image,[\"system:time_start\"]) # make sure has time info\n",
        "      )"
      ],
      "metadata": {
        "id": "boReA3NHFhtB"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# function to add vegetation, water, and base soil indices\n",
        "def add_indices(image):\n",
        "    ndvi = image.normalizedDifference(['SR_B5','SR_B4']).rename('ndvi')\n",
        "    mndwi = image.normalizedDifference(['SR_B3','SR_B6']).rename('mndwi')\n",
        "\n",
        "    bsi = image.expression('((B6 + B4) - (B5 + B2)) / ((B6 + B4) + (B5 + B2))',{\n",
        "        'B2': image.select('SR_B2'),\n",
        "        'B4': image.select('SR_B4'),\n",
        "        'B5': image.select('SR_B5'),\n",
        "        'B6': image.select('SR_B6'),\n",
        "    }).rename('bsi')\n",
        "\n",
        "    return image.addBands(ee.Image.cat([ndvi,mndwi,bsi]))"
      ],
      "metadata": {
        "id": "glYJcx3MFkBE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# group the preprocessing functions so it can be applied in one go\n",
        "preprocess = lambda x: add_indices(ls_qa(x))"
      ],
      "metadata": {
        "id": "hRuJfnc-FmYg"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Use Landsat 8 surface reflectance data\n",
        "l8sr = ee.ImageCollection('LANDSAT/LC08/C02/T1_L2').filterBounds(kh)"
      ],
      "metadata": {
        "id": "_BkhDLBHF69L"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Make \"before\" and \"after\" composites.\n",
        "before_composite = l8sr.filterDate(\n",
        "    '2019-01-01', '2020-01-01').map(preprocess).median()\n",
        "after_composite  = l8sr.filterDate(\n",
        "    '2021-01-01', '2021-12-31').map(preprocess).median()"
      ],
      "metadata": {
        "id": "E7R7ZikrF9AI"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# combine the composites so we have information from before and after\n",
        "stack = before_composite.addBands(after_composite)"
      ],
      "metadata": {
        "id": "iUn-8ADIF-4E"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Optional: Export image as asset\n",
        "\n",
        "#@markdown This export will take about 40 min to complete.\n",
        "#@markdown In effort to complete the demo in a reasonable time, we will load in a pre-exported asset. However, once your export is complete you can use your own.\n",
        "\n",
        "# specify the asset output\n",
        "export_image = f'projects/{PROJECTID}/assets/deforestation_demo_image'\n",
        "\n",
        "# create a batch export task\n",
        "image_task = ee.batch.Export.image.toAsset(\n",
        "  image = stack, \n",
        "  description = 'deforestation_demo_image', \n",
        "  assetId = export_image, \n",
        "  region = kh.geometry().bounds(),\n",
        "  scale = 30,\n",
        "  maxPixels = 1e10,\n",
        ")\n",
        "\n",
        "# run the task\n",
        "image_task.start()"
      ],
      "metadata": {
        "cellView": "form",
        "id": "1cLwknQFqaol"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# to speed things up for demostration purposes, we will load in a pre-exported image\n",
        "# if you would like to use your image you just exported, simply comment out the following line\n",
        "export_image = 'projects/g4g22-demo/assets/deforestation_demo_image'"
      ],
      "metadata": {
        "id": "XCx2_j_6rLlw"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# Forest loss in 2020 is what we want to predict.\n",
        "forest_img = ee.Image(\"UMD/hansen/global_forest_change_2021_v1_9\")\n",
        "loss = forest_img.select('lossyear').eq(20).rename('loss')"
      ],
      "metadata": {
        "id": "oHt47WVxGA6g"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# sample areas for deforestation\n",
        "# use stratified sampling because deforestation\n",
        "# area is small compared to non-deforested\n",
        "# note: this loads in pre-exported image\n",
        "sample = stack.addBands(loss).stratifiedSample(\n",
        "  numPoints = 200,\n",
        "  classBand = 'loss', \n",
        "  region = kh.geometry(1e3).bounds(1e3),\n",
        "  scale = 30,\n",
        "  classValues = [0,1],\n",
        "  classPoints=[200,200],\n",
        "  tileScale = 4,\n",
        "  geometries = True\n",
        ")"
      ],
      "metadata": {
        "id": "Us8v3-ZJGCyl"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "#@title Optional: Export samples as feature collection asset\n",
        "\n",
        "#@markdown This is an optional step and will store the sample results on Earth Engine as an asset.\n",
        "\n",
        "# specify export asset path\n",
        "export_features = f'projects/{PROJECTID}/assets/deforestation_demo_samples'\n",
        "\n",
        "# create a batch export task for FC\n",
        "sample_task = ee.batch.Export.table.toAsset(\n",
        "    collection = sample,\n",
        "    assetId = export_features,\n",
        ")\n",
        "\n",
        "# run task\n",
        "sample_task.start()"
      ],
      "metadata": {
        "id": "Tl5_PvXmL_-e"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "ee.batch.Task.list()"
      ],
      "metadata": {
        "id": "xqiMpd6yxFh3"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Request data to client side using REST API\n",
        "\n",
        "Now that all of our data have been preprocessed on Earth Engine, we can request the data to our local systems using the REST API. We are using a third-party package ([`restee`](https://kmarkert.github.io/restee/)) for handling the requests. This allows us to easily define areas to request data and store the data in common Python formats like [`xarray`](https://xarray.dev/) or [`geopandas`](https://geopandas.org/en/stable/) to use with our other Python processing packages."
      ],
      "metadata": {
        "id": "NWt4Az-NraQK"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# import some more packages \n",
        "import tqdm\n",
        "import numpy as np\n",
        "from concurrent.futures import ThreadPoolExecutor"
      ],
      "metadata": {
        "id": "oYOcSPOMMFr4"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Cambodia is a faily large country and Earth Engine has a quota for the volume of data that can be requested at a time (https://developers.google.com/earth-engine/reference#quota-and-limits). To circumvent the quota and request a larger area, we will set up smaller regions and request the data for those smaller areas. "
      ],
      "metadata": {
        "id": "nZhaGZRprfDs"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# specify the projection we want to work in \n",
        "# this is UTM zone 48 N over Cambodia\n",
        "crs = \"EPSG:32648\""
      ],
      "metadata": {
        "id": "Jwl3wUiPMGJP"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# get the bounding box coordinates from Cambodia\n",
        "bounds = kh.geometry(1e3).bounds(1e3).transform(crs,1e3)\n",
        "coordinates = ree.get_value(ee_session, bounds.coordinates())\n",
        "xcoords, ycoords = list(zip(*coordinates[0]))"
      ],
      "metadata": {
        "id": "QBCiXOL3MIIn"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# extract out the bounding box from the coordinates\n",
        "minx = min(xcoords)\n",
        "maxx = max(xcoords)\n",
        "miny = min(ycoords)\n",
        "maxy = max(ycoords)"
      ],
      "metadata": {
        "id": "UzWwfgZXMIoz"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "block_size = 500 # x and y pixels for each block to request\n",
        "resolution = 90 # 90m resolution\n",
        "\n",
        "# calculate the number of segments in x and y direction to create domains\n",
        "nx = int(np.ceil(((maxx - minx) / resolution) / block_size))\n",
        "ny = int(np.ceil(((maxy - miny) / resolution) / block_size))"
      ],
      "metadata": {
        "id": "J6miWpvJQR33"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here is where we create multiple smaller domains to handle REST API requests that cover all of Cambodia."
      ],
      "metadata": {
        "id": "HsF9-2KlrpRE"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# get multiple domains for the Cambodia area\n",
        "domains = []\n",
        "\n",
        "# loop over each chunk in the x and y direction\n",
        "for x in range(nx):\n",
        "    for y in range(ny):\n",
        "        # get the bounding info for block\n",
        "        block_xmin = minx + (block_size * resolution * x)\n",
        "        block_xmax = block_xmin + (resolution * block_size)\n",
        "        block_ymin = miny + (block_size * resolution * y)\n",
        "        block_ymax = block_ymin + (resolution * block_size)\n",
        "\n",
        "        # create a domain object\n",
        "        domain = ree.Domain(\n",
        "            (block_xmin, block_ymin, block_xmax, block_ymax),\n",
        "            resolution = resolution,\n",
        "            crs = crs\n",
        "        )\n",
        "\n",
        "        domains.append(domain)"
      ],
      "metadata": {
        "id": "q1or5ic4QT3C"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "The `ree.Domain` object controls the spatial information (i.e. [PixelGrid](https://developers.google.com/earth-engine/reference/rest/v1/PixelGrid)) for the REST API requests including: extent, coordinate reference system (CRS), and pixel resolution.\n",
        "\n",
        "***Note:*** *We provided the spatial coordinates for each block explicitly but we can also create domain objects from a file (i.e. `domain = ree.Domain.from_geopandas(gdf, resolution=30)`) or from an EE object directly (i.e. `domain = ree.Domain.from_ee_geometry(session, ee_geometry, resolution=30)`)*"
      ],
      "metadata": {
        "id": "vv5ffbjGrsbb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# request the processed image collection local data\n",
        "# define number of concurrent threads to use\n",
        "max_workers = 20\n",
        "\n",
        "# wrapper function to send requests concurrently\n",
        "def request_func(x):\n",
        "    return ree.img_to_xarray(ee_session, x, stack, no_data_value=0)\n",
        "\n",
        "# create a multithreading object and apply the function to request data\n",
        "with ThreadPoolExecutor(max_workers) as executor:\n",
        "    gen = executor.map(request_func, domains)\n",
        "\n",
        "    blocks = tuple(tqdm.tqdm(gen, total=len(domains), desc=f\"Block request progress\"))\n"
      ],
      "metadata": {
        "id": "0PcRgkr6QXVv"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Running concurrent requests really speeds up processing and if you would like to use more workers you should use the [High-volume API endpoint](https://developers.google.com/earth-engine/cloud/highvolume). This is where other services such as [DataFlow](https://cloud.google.com/dataflow) can come in handy to manage the parallel processing."
      ],
      "metadata": {
        "id": "6pEqlVN4sHMr"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# inspect the resulting local dataset\n",
        "# we will inspect a random block \n",
        "i = np.random.choice(len(domains)) # get random block to visualize\n",
        "print(f'Inspecting block {i}\\n')\n",
        "blocks[i]"
      ],
      "metadata": {
        "id": "2_dXQp20Qbur"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# visualize ndvi band\n",
        "blocks[i].ndvi.plot(cmap=\"Greens\", vmin=0, vmax=1, figsize=(12,10));"
      ],
      "metadata": {
        "id": "I6yHcy7jQqM7"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# request the samples as a local dataframe\n",
        "sample_gdf = ree.features_to_geodf(ee_session, sample)\n",
        "kh_gdf = ree.features_to_geodf(ee_session, kh)"
      ],
      "metadata": {
        "id": "y8A8KUY0QulU"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# display the dataframe to get an idea of data\n",
        "sample_gdf.head()"
      ],
      "metadata": {
        "id": "xgf4K53BQvv-"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# plot the sample locations as a static map\n",
        "ax = kh_gdf.plot(color='white', edgecolor='black', figsize=(10,10))\n",
        "sample_gdf.plot(ax=ax, alpha=0.6, column=\"loss\", cmap=\"PiYG_r\",vmin=-0.1,vmax=1.1);"
      ],
      "metadata": {
        "id": "puHA2gF3Q103"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Client-side data processing\n",
        "\n",
        "Now that we have our data from EE, we can begin processing locally. To detect deforestation, we will train a logistic regression model using [`scikit-learn`](https://scikit-learn.org/stable/), a very popular machine learning package in Python. This will give us a probability of deforestation for each data block that we can then upload to cloud storage and Earth Engine."
      ],
      "metadata": {
        "id": "hex5WgMRspt7"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# import our ML packages\n",
        "from sklearn import metrics\n",
        "from sklearn.linear_model import LogisticRegression\n",
        "from sklearn.model_selection import train_test_split"
      ],
      "metadata": {
        "id": "fpiGFLYrTEB2"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# get which columns we should be using to predict from our image\n",
        "pred_cols = ree.get_value(ee_session, stack.bandNames())"
      ],
      "metadata": {
        "id": "i2R50FCaTICh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# extract feature values and target\n",
        "X = sample_gdf[pred_cols]\n",
        "y = sample_gdf['loss']"
      ],
      "metadata": {
        "id": "op1yTr_ATL1N"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# split data into training and testing datasets\n",
        "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42, stratify=y)"
      ],
      "metadata": {
        "id": "4bynsV-7TPbk"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# create a logistic regression model and fit with data\n",
        "clf = LogisticRegression(random_state=0).fit(X.values, y.values)"
      ],
      "metadata": {
        "id": "kpxdRRs3TSjf"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# apply prediction on test data\n",
        "preds = clf.predict(X_test.values)"
      ],
      "metadata": {
        "id": "TpStq8wATVkE"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# check the accuracy against test observation\n",
        "acc = metrics.accuracy_score(y_test, preds)\n",
        "print(f\"Prediction accuracy: {acc:.4f}\")"
      ],
      "metadata": {
        "id": "-bnisGaSTXtb"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now that we have a model we need to apply on each block!"
      ],
      "metadata": {
        "id": "71qthMDksugT"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "#run predictions on each block\n",
        "\n",
        "block_predictions = [] # list to append results to\n",
        "\n",
        "# loop through the blocks to predict\n",
        "for block in blocks:\n",
        "    # get band info from block and \n",
        "    block_pred = block[pred_cols[0]].copy(deep=True)\n",
        "\n",
        "    # we can only predict on a 2D array so we need \n",
        "    # get the original shape of image block\n",
        "    in_shape = block_pred.shape\n",
        "    # format data for prediction\n",
        "    in_infer =(\n",
        "        block[pred_cols] #\n",
        "        .fillna(-999) # fill nan values\n",
        "        .to_array().values # convert to 3D array\n",
        "        .transpose(1,2,0) # transpose dims to [y,x,c]\n",
        "        .reshape(np.prod(in_shape),-1) # reshape to [space, c]\n",
        "    )\n",
        "\n",
        "    # run prediction and reshape to image\n",
        "    block_pred[:] = clf.predict_proba(in_infer)[:,1].reshape(in_shape)\n",
        "\n",
        "    # append prediction to list\n",
        "    block_predictions.append(block_pred.rename(\"deforestation_proba\"))\n",
        "\n",
        "# to speed this up for time critical operations we can run concurrent processes"
      ],
      "metadata": {
        "id": "B5GKAW7gTaqs"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# visualize results from prediction\n",
        "block_predictions[i].plot(vmin=0,vmax=1,figsize=(12,10));"
      ],
      "metadata": {
        "id": "fTaruL3rTfNl"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Again, this is a faily simple example of running a local algorithm on Earth Engine results. Here is where you would change to your workflow to run complicated or domain specific workflows outside of EE."
      ],
      "metadata": {
        "id": "JAedJWr-sxBg"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Save dataset as COGs\n",
        "\n",
        "Depending on the application it can be beneficial to store your data on Cloud Storage and read the files in Earth Engine from cloud storage (although this approach is less performant than standard assets when used in computations). We do this by storing the image data as a [Cloud Optimized Geotiff](https://www.cogeo.org/) and creating an image collection on Earth Engine that points to the data in cloud storage. This section illustrates how to save the data as a COG."
      ],
      "metadata": {
        "id": "9Vh4hJlDs0HC"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# import more packages to help save the data\n",
        "from pathlib import Path\n",
        "import rioxarray\n",
        "import datetime"
      ],
      "metadata": {
        "id": "UzKJqX_TTkll"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# define a local folder where to save the data\n",
        "staging_dir = Path(\"/content/staging/\")\n",
        "\n",
        "# check if the folder exists\n",
        "# if not then create it\n",
        "if not staging_dir.exists():\n",
        "    staging_dir.mkdir(parents=True)"
      ],
      "metadata": {
        "id": "Z9PEeI1ns3ck"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Here will will save the data as COGs. This file format has a very particular internal structure and this format needs to be exact so that Earth Engine can then read from cloud storage directly. See [COG Configuration Docs](https://developers.google.com/earth-engine/Earth_Engine_asset_from_cloud_geotiff#configuration) for information on how best to format COGs for use with Earth Engine. The parameters to format COGs can be found at the following documentation: https://gdal.org/drivers/raster/cog.html"
      ],
      "metadata": {
        "id": "h-hHge4Ss7fD"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# save each prediction block as a COG\n",
        "# there are many files to write so we will\n",
        "# write data concurrently with multiple threads\n",
        "\n",
        "# define a function that will save each block\n",
        "def save_cog(block):\n",
        "    # unpack parameters\n",
        "    da, i = block \n",
        "\n",
        "    # rename the coordinate dims if it is geographic coordinate \n",
        "    # system so that the rioxarray extension can inteface with\n",
        "    # the coordinates for saving data \n",
        "    if domain.crs == 'EPSG:4326':\n",
        "        da = da.rename({'lat':'y', 'lon':'x'})\n",
        "\n",
        "    # set the CRS, use the domain info to set\n",
        "    da.rio.write_crs(domains[0].crs, inplace=True)\n",
        "    \n",
        "    # save the data as a cog\n",
        "    da.astype(np.float32).rio.to_raster(\n",
        "        staging_dir / f\"deforestation_pred_block_{i}.tif\",\n",
        "        driver = \"COG\", # set the driver to be Cloud-Optimized GeoTIFF\n",
        "        windowed=True,  # rioxarray: read & write one window at a time,\n",
        "        overviews='auto', # auto generate internal overviews if not available\n",
        "        blocksize=256, # set size of tiles to 256x256\n",
        "        compress='LZW', # LZW compression, use LZW or DEFLATE\n",
        "        level = 9, # level 9 compression (highest)\n",
        "    )\n",
        "\n",
        "    return 1\n",
        "\n",
        "# multithreading functions only accept one argument\n",
        "# but we want to keep track of which block we are on\n",
        "# here we are packing the block data with an id \n",
        "in_args = [(pred_block,i) for i, pred_block in enumerate(block_predictions)]\n",
        "\n",
        "# create a multithreading object and apply the function to save cogs\n",
        "with ThreadPoolExecutor(max_workers=max_workers) as executor:\n",
        "    gen = executor.map(save_cog, in_args)\n",
        "\n",
        "    _ = tuple(tqdm.tqdm(gen, total=len(domains), desc=f\"COG write progress\"))"
      ],
      "metadata": {
        "id": "pK3zbBTis7vh"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "Again, COGs have a very specific format and it is a good idea to check and make sure we got everything correct. Here we will validate that our created files are in fact valid Cloud-Optimized GeoTIFF (but we will just test one). To do this we will use the [Cloud Optimized GeoTIFF (COG) creation and validation plugin for Rasterio](https://cogeotiff.github.io/rio-cogeo/)"
      ],
      "metadata": {
        "id": "tscg_20Bs_7A"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# get a list of file \n",
        "file_paths = list(staging_dir.glob('*.tif'))\n",
        "# and the name with no directory or extension (used later)\n",
        "file_names = [p.stem for p in file_paths]"
      ],
      "metadata": {
        "id": "7xjUEQYltAGV"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "# get the first file from the list\n",
        "first_cog = file_paths[0]\n",
        "\n",
        "# use the cogeo plugin to run the validation\n",
        "!rio cogeo validate {first_cog} --strict"
      ],
      "metadata": {
        "id": "MiTb-HMDtCkj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "***WARNING:*** *this validation command will confirm it is a valid COG without a projection and a projection is required to use with Earth Engine so you may need to double check if you are unsure if there is a projection (i.e. `gdalinfo <filename>`)*"
      ],
      "metadata": {
        "id": "B8xCT3P7tERC"
      }
    },
    {
      "cell_type": "markdown",
      "source": [
        "Now that we know the COGs we created are valid, we can move to cloud storage and create out assets!"
      ],
      "metadata": {
        "id": "LwDt7e37tKok"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# INSERT YOUR STORAGE BUCKET NAME HERE\n",
        "BUCKET_NAME = \"g4g22-demo-cogs\"\n",
        "\n",
        "# define the bucket and file pattern to move to cloud storage\n",
        "# change to your storage bucket\n",
        "BUCKET = f\"gs://{BUCKET_NAME}\"\n",
        "pattern = str(staging_dir / \"*.tif\")\n",
        "\n",
        "# run the data move\n",
        "!gsutil -m cp {pattern} {BUCKET}"
      ],
      "metadata": {
        "id": "u__4MTu1tK6W"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "markdown",
      "source": [
        "## Create COG backed asset on EE\n",
        "\n",
        "This section demonstrates how to create Earth Engine assets backed by COGs. An advantage of COG-backed assets is that the spatial and metadata fields of the image will be indexed at asset creation time, making the image more performant in collections. (In contrast, an image created through ee.Image.loadGeoTIFF and put into a collection will require a read of the GeoTiff for filtering operations on the collection.) A disadvantage of COG-backed assets is that they may be several times slower than standard assets when used in computations.\n",
        "\n",
        "To create a COG-backed asset, make a POST request to the Earth Engine CreateAsset endpoint. As shown in the following, this request must be authorized to create an asset in your user folder."
      ],
      "metadata": {
        "id": "cW7thVmlthlb"
      }
    },
    {
      "cell_type": "code",
      "source": [
        "# INSERT YOUR OUTPUT IMAGE COLLECTION NAME HERE\n",
        "collection_name = 'cog_demo_collection_test'\n",
        "\n",
        "# create a new image collection\n",
        "# this is not necessary if one already exists\n",
        "ee.data.createAsset({'type':'ImageCollection'}, f'projects/{PROJECTID}/assets/{collection_name}')"
      ],
      "metadata": {
        "id": "HQoovdLetjvj"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [
        "import re\n",
        "from pprint import pprint\n",
        "\n",
        "cog_asset_endpoint = 'https://earthengine.googleapis.com/v1alpha/projects/{}/assets?assetId={}'\n",
        "\n",
        "for name in file_names:\n",
        "    # Your new asset name relative to the project's home folder.\n",
        "    # Any parent folders or collections must already exist.\n",
        "    img_name = re.sub(\"-|:|\\.\", \"\", name)\n",
        "    asset_id = f'{collection_name}/{img_name}'\n",
        "    \n",
        "    # parse the time information from the file name\n",
        "    time_info = name.split(\"_\")[-1]\n",
        "\n",
        "    # Request body as a dictionary.\n",
        "    request = {\n",
        "    'type': 'IMAGE',\n",
        "    'gcs_location': {\n",
        "        'uris': [f'{BUCKET}/{name}.tif']\n",
        "    },\n",
        "    'properties': {\n",
        "        'source': 'my world altering process' # change to add some meaningful metadata\n",
        "    },\n",
        "    'startTime': f'2020-01-01T00:00:00Z', # can programmatically change date information\n",
        "    'endTime': f'2020-12-31T11:59:59Z',\n",
        "    }\n",
        "\n",
        "    response = session.post(\n",
        "        url = cog_asset_endpoint.format(PROJECTID, asset_id),\n",
        "        data = json.dumps(request)\n",
        "    )\n",
        "\n",
        "    break\n",
        "\n",
        "# display the response from the final request\n",
        "pprint(json.loads(response.content))\n"
      ],
      "metadata": {
        "id": "H5MmTSGptmVN"
      },
      "execution_count": null,
      "outputs": []
    },
    {
      "cell_type": "code",
      "source": [],
      "metadata": {
        "id": "AXrfvlBe4MSt"
      },
      "execution_count": null,
      "outputs": []
    }
  ]
}