@metasim
Created January 18, 2018 21:33

Creating RasterFrames

Initialization

There are a couple of setup steps necessary any time you want to work with RasterFrames. The first is to import the API symbols into scope:

import astraea.spark.rasterframes._
import org.apache.spark.sql._

Next, initialize the SparkSession, and call the withRasterFrames method on it:

implicit val spark = SparkSession.builder().
  master("local").appName("RasterFrames").
  getOrCreate().
  withRasterFrames

And, as is standard Spark SQL practice, we import additional DataFrame support:

import spark.implicits._

Now we are ready to create a RasterFrame.

Reading a GeoTIFF

The most straightforward way to create a RasterFrame is to read a GeoTIFF file using a RasterFrame DataSource designed for this purpose.

First add the following import:

import astraea.spark.rasterframes.datasource.geotiff._
import java.io.File

(This is what adds the .geotiff method to spark.read below.)

Then we use the DataFrameReader provided by spark.read to read the GeoTIFF:

val samplePath = new File("src/test/resources/LC08_RGB_Norfolk_COG.tiff")
// samplePath: java.io.File = src/test/resources/LC08_RGB_Norfolk_COG.tiff

val tiffRF = spark.read.
  geotiff.
  loadRF(samplePath.toURI)
// tiffRF: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, extent: struct<xmin: double, ymin: double ... 2 more fields> ... 4 more fields]

Let's inspect the structure of what we get back:

scala> tiffRF.printSchema()
root
 |-- spatial_key: struct (nullable = false)
 |    |-- col: integer (nullable = false)
 |    |-- row: integer (nullable = false)
 |-- extent: struct (nullable = false)
 |    |-- xmin: double (nullable = false)
 |    |-- ymin: double (nullable = false)
 |    |-- xmax: double (nullable = false)
 |    |-- ymax: double (nullable = false)
 |-- metadata: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = false)
 |-- tile_1: rf_tile (nullable = false)
 |-- tile_2: rf_tile (nullable = false)
 |-- tile_3: rf_tile (nullable = false)

As reported by Spark, RasterFrames extracts six columns from the GeoTIFF we selected. Some of these columns depend on the contents of the source data, and some are always available. Let's take a look at these in more detail.

  • spatial_key: GeoTrellis assigns a SpatialKey or a SpaceTimeKey to each tile, mapping it to the layer grid from which it came. If it has a SpaceTimeKey, RasterFrames will split it into a SpatialKey and a TemporalKey (the latter with column name temporal_key).
  • extent: The bounding box of the tile in the tile's native CRS.
  • metadata: The TIFF format header tags found in the file.
  • tile or tile_n (where n is a band number): For singleband GeoTIFF files, the tile column contains the cell data split into tiles. For multiband files, each column with the tile_ prefix contains one of the source bands, in the order in which they were stored.

See the section Inspecting a RasterFrame (below) for more details on accessing the RasterFrame's metadata.
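These standard columns can be queried like any other DataFrame columns. A minimal sketch, assuming the tiffRF frame created above and the spark.implicits._ import from the initialization step:

```scala
// Inspect the layer grid position and bounding box of each tile.
tiffRF.select($"spatial_key", $"extent").show(3)

// The TIFF header tags are carried along as an ordinary map column.
tiffRF.select($"metadata").show(1, false)
```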

Reading a GeoTrellis Layer

If your imagery is already ingested into a GeoTrellis layer, you can use the RasterFrames GeoTrellis DataSource. There are two parts to this GeoTrellis Layer support. The first is the GeoTrellis Catalog DataSource, which lists the GeoTrellis layers available at a URI. The second part is the actual RasterFrame reader for pulling a layer into a RasterFrame.

Before we show how all of this works, we need a GeoTrellis layer to work with. We can create one from the RasterFrame we constructed above.

import astraea.spark.rasterframes.datasource.geotrellis._
import java.nio.file.Files

val base = Files.createTempDirectory("rf-").toUri
val layer = Layer(base, "sample", 0)
tiffRF.write.geotrellis.asLayer(layer).save()

Now we can point our catalog reader at the base directory and see what was saved:

scala> val cat = spark.read.geotrellisCatalog(base)
cat: org.apache.spark.sql.DataFrame = [index: int, layer: struct<base: struct<uri: string>, id: struct<name: string, zoom: int>> ... 9 more fields]

scala> cat.printSchema
root
 |-- index: integer (nullable = false)
 |-- layer: struct (nullable = true)
 |    |-- base: struct (nullable = false)
 |    |    |-- uri: string (nullable = false)
 |    |-- id: struct (nullable = false)
 |    |    |-- name: string (nullable = true)
 |    |    |-- zoom: integer (nullable = false)
 |-- format: string (nullable = true)
 |-- keyClass: string (nullable = true)
 |-- path: string (nullable = true)
 |-- valueClass: string (nullable = true)
 |-- bounds: struct (nullable = true)
 |    |-- maxKey: struct (nullable = true)
 |    |    |-- col: long (nullable = true)
 |    |    |-- row: long (nullable = true)
 |    |-- minKey: struct (nullable = true)
 |    |    |-- col: long (nullable = true)
 |    |    |-- row: long (nullable = true)
 |-- cellType: string (nullable = true)
 |-- crs: string (nullable = true)
 |-- extent: struct (nullable = true)
 |    |-- xmax: double (nullable = true)
 |    |-- xmin: double (nullable = true)
 |    |-- ymax: double (nullable = true)
 |    |-- ymin: double (nullable = true)
 |-- layoutDefinition: struct (nullable = true)
 |    |-- extent: struct (nullable = true)
 |    |    |-- xmax: double (nullable = true)
 |    |    |-- xmin: double (nullable = true)
 |    |    |-- ymax: double (nullable = true)
 |    |    |-- ymin: double (nullable = true)
 |    |-- tileLayout: struct (nullable = true)
 |    |    |-- layoutCols: long (nullable = true)
 |    |    |-- layoutRows: long (nullable = true)
 |    |    |-- tileCols: long (nullable = true)
 |    |    |-- tileRows: long (nullable = true)


scala> cat.show()
+-----+--------------------+------+--------------------+--------+--------------------+-------------+--------+--------------------+--------------------+--------------------+
|index|               layer|format|            keyClass|    path|          valueClass|       bounds|cellType|                 crs|              extent|    layoutDefinition|
+-----+--------------------+------+--------------------+--------+--------------------+-------------+--------+--------------------+--------------------+--------------------+
|    0|[[file:///var/fol...|  file|geotrellis.spark....|sample/0|geotrellis.raster...|[[4,3],[0,0]]|  uint16|+proj=utm +zone=1...|[395295.0,364455....|[[395295.0,364455...|
+-----+--------------------+------+--------------------+--------+--------------------+-------------+--------+--------------------+--------------------+--------------------+

As you can see, there's a lot of information stored in each row of the catalog. Most of this is associated with how the layer is discretized. However, there may be other application-specific metadata serialized with a layer that can be used to filter the catalog entries or select a specific one. But for now, we're just going to load a RasterFrame in from the catalog using a convenience function.
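Because the catalog is an ordinary DataFrame, it can be filtered with standard operations before loading. A minimal sketch, assuming the cat DataFrame above; the column paths are taken from the schema printed earlier:

```scala
// Keep only entries for the layer named "sample" at zoom level 0,
// then load the matching entry as a RasterFrame.
val filtered = cat.where($"layer.id.name" === "sample" && $"layer.id.zoom" === 0)
val rfFiltered = filtered.select(geotrellis_layer).loadRF
```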

scala> val rfAgain = cat.select(geotrellis_layer).loadRF
rfAgain: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, extent: struct<xmin: double, ymin: double ... 2 more fields> ... 3 more fields]

scala> rfAgain.show()
+-----------+--------------------+--------------------+--------------------+--------------------+
|spatial_key|              extent|              tile_1|              tile_2|              tile_3|
+-----------+--------------------+--------------------+--------------------+--------------------+
|      [4,3]|[389127.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [4,1]|[389127.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [4,2]|[389127.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [1,0]|[370623.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [0,0]|[364455.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [3,0]|[382959.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [1,1]|[370623.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [0,1]|[364455.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [2,0]|[376791.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [2,1]|[376791.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [3,1]|[382959.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [1,2]|[370623.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [0,3]|[364455.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [0,2]|[364455.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [4,0]|[389127.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [1,3]|[370623.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [2,3]|[376791.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [3,3]|[382959.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [3,2]|[382959.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
|      [2,2]|[376791.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
+-----------+--------------------+--------------------+--------------------+--------------------+

If you already know the LayerId of the layer you want to read, you can bypass working with the catalog:

scala> val anotherRF = spark.read.geotrellis.loadRF(layer)
anotherRF: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, extent: struct<xmin: double, ymin: double ... 2 more fields> ... 3 more fields]

Using GeoTrellis APIs

If you are used to working directly with the GeoTrellis APIs, there are a number of additional ways to create a RasterFrame, as enumerated in the sections below.

First, some more imports:

import geotrellis.raster.io.geotiff.SinglebandGeoTiff
import geotrellis.spark.io._

From ProjectedRaster

The simplest mechanism for getting a RasterFrame is to use the toRF(tileCols, tileRows) extension method on ProjectedRaster.

scala> val scene = SinglebandGeoTiff("src/test/resources/L8-B8-Robinson-IL.tiff")
scene: geotrellis.raster.io.geotiff.SinglebandGeoTiff = SinglebandGeoTiff(geotrellis.raster.UShortConstantNoDataArrayTile@6b7ad1fb,Extent(431902.5, 4313647.5, 443512.5, 4321147.5),EPSG:32616,Tags(Map(AREA_OR_POINT -> POINT),List(Map())),GeoTiffOptions(geotrellis.raster.io.geotiff.Striped@7f3d4fc3,geotrellis.raster.io.geotiff.compression.DeflateCompression$@11179621,1,None))

scala> val rf = scene.projectedRaster.toRF(128, 128)
rf: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, tile: rf_tile]

scala> rf.show(5, false)
+-----------+--------------------------------------------------------+
|spatial_key|tile                                                    |
+-----------+--------------------------------------------------------+
|[0,0]      |geotrellis.raster.UShortConstantNoDataArrayTile@1e1cfae6|
|[1,1]      |geotrellis.raster.UShortConstantNoDataArrayTile@2f29e65f|
|[6,1]      |geotrellis.raster.UShortConstantNoDataArrayTile@10fdb904|
|[3,1]      |geotrellis.raster.UShortConstantNoDataArrayTile@1b970790|
|[4,2]      |geotrellis.raster.UShortConstantNoDataArrayTile@62a515ce|
+-----------+--------------------------------------------------------+
only showing top 5 rows

From TileLayerRDD

Another option is to use a GeoTrellis LayerReader to get a TileLayerRDD, for which there is also a toRF extension method.

import geotrellis.spark._
val tiledLayer: TileLayerRDD[SpatialKey] = ???
val rf = tiledLayer.toRF
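As a concrete sketch of the reader step, the layer written to base earlier in this document could be read back with GeoTrellis's FileLayerReader. This assumes a local file catalog, and the type parameters must match how the layer was written (a multiband layer would use MultibandTile instead of Tile):

```scala
import geotrellis.raster.Tile
import geotrellis.spark._
import geotrellis.spark.io.file.FileLayerReader

// FileLayerReader needs an implicit SparkContext in scope.
implicit val sc = spark.sparkContext

// Read the previously saved layer back as a TileLayerRDD...
val reader = FileLayerReader(base.getPath)
val fromLayer: TileLayerRDD[SpatialKey] =
  reader.read[SpatialKey, Tile, TileLayerMetadata[SpatialKey]](LayerId("sample", 0))

// ...and convert it to a RasterFrame.
val rfFromLayer = fromLayer.toRF
```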

Inspecting a RasterFrame

RasterFrame has a number of methods providing access to metadata about its contents.

Tile Column Names

rf.tileColumns.map(_.toString)
// res8: Seq[String] = ArraySeq(tile)

Spatial Key Column Name

rf.spatialKeyColumn.toString
// res9: String = spatial_key

Temporal Key Column

Returns an Option[Column] since not all RasterFrames have an explicit temporal dimension.

rf.temporalKeyColumn.map(_.toString)
// res10: Option[String] = None

Tile Layer Metadata

The Tile Layer Metadata defines how the spatial/spatiotemporal domain is discretized into tiles, and what the key bounds are.

scala> import spray.json._
import spray.json._

scala> // NB: The `fold` is required because an `Either` is returned, depending on the key type.
     | rf.tileLayerMetadata.fold(_.toJson, _.toJson).prettyPrint
res12: String =
{
  "extent": {
    "xmin": 431902.5,
    "ymin": 4313647.5,
    "xmax": 443512.5,
    "ymax": 4321147.5
  },
  "layoutDefinition": {
    "extent": {
      "xmin": 431902.5,
      "ymin": 4313467.5,
      "xmax": 445342.5,
      "ymax": 4321147.5
    },
    "tileLayout": {
      "layoutCols": 7,
      "layoutRows": 4,
      "tileCols": 128,
      "tileRows": 128
    }
  },
  "bounds": {
    "minKey": {
      "col": 0,
      "row": 0
    },
    "maxKey": {
      "col": 6,
      "row": 3
    }
  },
  "cellType": "uint16",
  "crs": "+proj=utm +zone=16 +datum=WGS84 +units=m +no_defs "
}