There are a couple of setup steps necessary anytime you want to work with RasterFrames. The first is to import the API symbols into scope:
import astraea.spark.rasterframes._
import org.apache.spark.sql._
Next, initialize the SparkSession and call the withRasterFrames method on it:
implicit val spark = SparkSession.builder().
  master("local").appName("RasterFrames").
  getOrCreate().
  withRasterFrames
And, as is standard Spark SQL practice, we import additional DataFrame support:
import spark.implicits._
Now we are ready to create a RasterFrame.
The most straightforward way to create a RasterFrame is to read a GeoTIFF file using a RasterFrame DataSource designed for this purpose.
First, add the following imports:
import astraea.spark.rasterframes.datasource.geotiff._
import java.io.File
(This is what adds the .geotiff method to spark.read below.)
Then we use the DataFrameReader provided by spark.read to read the GeoTIFF:
val samplePath = new File("src/test/resources/LC08_RGB_Norfolk_COG.tiff")
// samplePath: java.io.File = src/test/resources/LC08_RGB_Norfolk_COG.tiff
val tiffRF = spark.read.
  geotiff.
  loadRF(samplePath.toURI)
// tiffRF: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, extent: struct<xmin: double, ymin: double ... 2 more fields> ... 4 more fields]
Let's inspect the structure of what we get back:
scala> tiffRF.printSchema()
root
|-- spatial_key: struct (nullable = false)
| |-- col: integer (nullable = false)
| |-- row: integer (nullable = false)
|-- extent: struct (nullable = false)
| |-- xmin: double (nullable = false)
| |-- ymin: double (nullable = false)
| |-- xmax: double (nullable = false)
| |-- ymax: double (nullable = false)
|-- metadata: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = false)
|-- tile_1: rf_tile (nullable = false)
|-- tile_2: rf_tile (nullable = false)
|-- tile_3: rf_tile (nullable = false)
As reported by Spark, RasterFrames extracts 6 columns from the GeoTIFF we selected. Some of these columns depend on the contents of the source data, and some are always available. Let's take a look at these in more detail.
spatial_key: GeoTrellis assigns a SpatialKey or a SpaceTimeKey to each tile, mapping it to the layer grid from which it came. If it has a SpaceTimeKey, RasterFrames will split it into a SpatialKey and a TemporalKey (the latter with column name temporal_key).
extent: The bounding box of the tile in the tile's native CRS.
metadata: The TIFF format header tags found in the file.
tile or tile_n (where n is a band number): For singleband GeoTIFF files, the tile column contains the cell data split into tiles. For multiband files, each column with the tile_ prefix contains one of the source's bands, in the order they were stored.
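The naming rule for tile columns described above can be sketched in a few lines of plain Scala. The helper tileColumnNames is hypothetical and written only for illustration; it is not part of the RasterFrames API, and it assumes band numbers start at 1, as in the schema printed earlier.

```scala
// Hypothetical helper (not part of RasterFrames): the tile column names
// one would expect for a GeoTIFF with the given number of bands.
def tileColumnNames(bandCount: Int): Seq[String] = {
  require(bandCount >= 1, "a raster needs at least one band")
  if (bandCount == 1) Seq("tile")             // singleband: a single `tile` column
  else (1 to bandCount).map(n => s"tile_$n")  // multiband: tile_1 .. tile_n, in band order
}

println(tileColumnNames(1)) // singleband case
println(tileColumnNames(3)) // three-band case, as in the schema above
```

For the three-band sample file this yields the same tile_1, tile_2, tile_3 names that printSchema reports.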
See the section Inspecting a RasterFrame (below) for more details on accessing the RasterFrame's metadata.
If your imagery is already ingested into a GeoTrellis layer, you can use the RasterFrames GeoTrellis DataSource. There are two parts to this GeoTrellis Layer support. The first is the GeoTrellis Catalog DataSource, which lists the GeoTrellis layers available at a URI. The second part is the actual RasterFrame reader for pulling a layer into a RasterFrame.
Before we show how all of this works we need to have a GeoTrellis layer to work with. We can create one with the RasterFrame we constructed above.
import astraea.spark.rasterframes.datasource.geotrellis._
import java.nio.file.Files
val base = Files.createTempDirectory("rf-").toUri
val layer = Layer(base, "sample", 0)
tiffRF.write.geotrellis.asLayer(layer).save()
Now we can point our catalog reader at the base directory and see what was saved:
scala> val cat = spark.read.geotrellisCatalog(base)
cat: org.apache.spark.sql.DataFrame = [index: int, layer: struct<base: struct<uri: string>, id: struct<name: string, zoom: int>> ... 9 more fields]
scala> cat.printSchema
root
|-- index: integer (nullable = false)
|-- layer: struct (nullable = true)
| |-- base: struct (nullable = false)
| | |-- uri: string (nullable = false)
| |-- id: struct (nullable = false)
| | |-- name: string (nullable = true)
| | |-- zoom: integer (nullable = false)
|-- format: string (nullable = true)
|-- keyClass: string (nullable = true)
|-- path: string (nullable = true)
|-- valueClass: string (nullable = true)
|-- bounds: struct (nullable = true)
| |-- maxKey: struct (nullable = true)
| | |-- col: long (nullable = true)
| | |-- row: long (nullable = true)
| |-- minKey: struct (nullable = true)
| | |-- col: long (nullable = true)
| | |-- row: long (nullable = true)
|-- cellType: string (nullable = true)
|-- crs: string (nullable = true)
|-- extent: struct (nullable = true)
| |-- xmax: double (nullable = true)
| |-- xmin: double (nullable = true)
| |-- ymax: double (nullable = true)
| |-- ymin: double (nullable = true)
|-- layoutDefinition: struct (nullable = true)
| |-- extent: struct (nullable = true)
| | |-- xmax: double (nullable = true)
| | |-- xmin: double (nullable = true)
| | |-- ymax: double (nullable = true)
| | |-- ymin: double (nullable = true)
| |-- tileLayout: struct (nullable = true)
| | |-- layoutCols: long (nullable = true)
| | |-- layoutRows: long (nullable = true)
| | |-- tileCols: long (nullable = true)
| | |-- tileRows: long (nullable = true)
scala> cat.show()
+-----+--------------------+------+--------------------+--------+--------------------+-------------+--------+--------------------+--------------------+--------------------+
|index| layer|format| keyClass| path| valueClass| bounds|cellType| crs| extent| layoutDefinition|
+-----+--------------------+------+--------------------+--------+--------------------+-------------+--------+--------------------+--------------------+--------------------+
| 0|[[file:///var/fol...| file|geotrellis.spark....|sample/0|geotrellis.raster...|[[4,3],[0,0]]| uint16|+proj=utm +zone=1...|[395295.0,364455....|[[395295.0,364455...|
+-----+--------------------+------+--------------------+--------+--------------------+-------------+--------+--------------------+--------------------+--------------------+
As you can see, there's a lot of information stored in each row of the catalog. Most of this is associated with how the layer is discretized. However, there may be other application-specific metadata serialized with a layer that can be used to filter the catalog entries or select a specific one. But for now, we're just going to load a RasterFrame in from the catalog using a convenience function.
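Because the catalog is an ordinary DataFrame, filtering it should work like filtering any other Spark SQL table. The following is a minimal sketch under stated assumptions: it does not touch a real GeoTrellis catalog, but builds a small stand-in DataFrame (named fakeCatalog, with made-up rows) whose nested layer.id.name and layer.id.zoom fields mirror the schema printed above, and then selects one entry with standard Spark column expressions.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.struct

// A local session just for this sketch; in the walkthrough, `spark` already exists.
val spark = SparkSession.builder().master("local[1]").appName("catalog-filter-sketch").getOrCreate()
import spark.implicits._

// Stand-in for the catalog: made-up rows, nested to mirror the
// `layer.id.name` and `layer.id.zoom` fields of the real catalog schema.
val fakeCatalog = Seq(("sample", 0, "uint16"), ("sample", 1, "uint16"), ("other", 0, "float32"))
  .toDF("name", "zoom", "cellType")
  .select(struct(struct($"name", $"zoom").as("id")).as("layer"), $"cellType")

// Keep only the entry for the layer named "sample" at zoom level 0.
val selected = fakeCatalog.where($"layer.id.name" === "sample" && $"layer.id.zoom" === 0)

selected.show(false)
```

With the real catalog, the same where clause would presumably be applied to cat before the layer column is selected and loaded.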
scala> val rfAgain = cat.select(geotrellis_layer).loadRF
rfAgain: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, extent: struct<xmin: double, ymin: double ... 2 more fields> ... 3 more fields]
scala> rfAgain.show()
+-----------+--------------------+--------------------+--------------------+--------------------+
|spatial_key| extent| tile_1| tile_2| tile_3|
+-----------+--------------------+--------------------+--------------------+--------------------+
| [4,3]|[389127.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [4,1]|[389127.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [4,2]|[389127.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [1,0]|[370623.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [0,0]|[364455.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [3,0]|[382959.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [1,1]|[370623.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [0,1]|[364455.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [2,0]|[376791.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [2,1]|[376791.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [3,1]|[382959.0,4095150...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [1,2]|[370623.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [0,3]|[364455.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [0,2]|[364455.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [4,0]|[389127.0,4102567...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [1,3]|[370623.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [2,3]|[376791.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [3,3]|[382959.0,4080315...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [3,2]|[382959.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
| [2,2]|[376791.0,4087732...|geotrellis.raster...|geotrellis.raster...|geotrellis.raster...|
+-----------+--------------------+--------------------+--------------------+--------------------+
If you already know the LayerId of the layer you want to read, you can bypass working with the catalog:
scala> val anotherRF = spark.read.geotrellis.loadRF(layer)
anotherRF: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, extent: struct<xmin: double, ymin: double ... 2 more fields> ... 3 more fields]
If you are used to working directly with the GeoTrellis APIs, there are a number of additional ways to create a RasterFrame, as enumerated in the sections below.
First, some more imports:
import geotrellis.raster.io.geotiff.SinglebandGeoTiff
import geotrellis.spark.io._
The simplest mechanism for getting a RasterFrame is to use the toRF(tileCols, tileRows) extension method on ProjectedRaster.
scala> val scene = SinglebandGeoTiff("src/test/resources/L8-B8-Robinson-IL.tiff")
scene: geotrellis.raster.io.geotiff.SinglebandGeoTiff = SinglebandGeoTiff(geotrellis.raster.UShortConstantNoDataArrayTile@6b7ad1fb,Extent(431902.5, 4313647.5, 443512.5, 4321147.5),EPSG:32616,Tags(Map(AREA_OR_POINT -> POINT),List(Map())),GeoTiffOptions(geotrellis.raster.io.geotiff.Striped@7f3d4fc3,geotrellis.raster.io.geotiff.compression.DeflateCompression$@11179621,1,None))
scala> val rf = scene.projectedRaster.toRF(128, 128)
rf: astraea.spark.rasterframes.RasterFrame = [spatial_key: struct<col: int, row: int>, tile: rf_tile]
scala> rf.show(5, false)
+-----------+--------------------------------------------------------+
|spatial_key|tile |
+-----------+--------------------------------------------------------+
|[0,0] |geotrellis.raster.UShortConstantNoDataArrayTile@1e1cfae6|
|[1,1] |geotrellis.raster.UShortConstantNoDataArrayTile@2f29e65f|
|[6,1] |geotrellis.raster.UShortConstantNoDataArrayTile@10fdb904|
|[3,1] |geotrellis.raster.UShortConstantNoDataArrayTile@1b970790|
|[4,2] |geotrellis.raster.UShortConstantNoDataArrayTile@62a515ce|
+-----------+--------------------------------------------------------+
only showing top 5 rows
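To get a feel for what the tileCols and tileRows arguments imply, the arithmetic below estimates how many tiles a raster is cut into. It is a plain-Scala sketch, not RasterFrames code, and it assumes the raster is divided into a regular grid starting at its upper-left corner, with partly filled tiles along the far edges. The raster size of 774 x 500 cells is an assumption made for illustration.

```scala
// Number of tiles needed to cover `cells` cells with tiles of `tileSize` cells
// (integer ceiling division; the last tile may be only partly filled).
def tilesAcross(cells: Int, tileSize: Int): Int = {
  require(cells > 0 && tileSize > 0, "sizes must be positive")
  (cells + tileSize - 1) / tileSize
}

// Assumed raster dimensions, for illustration only.
val rasterCols = 774
val rasterRows = 500

val layoutCols = tilesAcross(rasterCols, 128) // tiles in the x direction
val layoutRows = tilesAcross(rasterRows, 128) // tiles in the y direction

println(s"$layoutCols x $layoutRows tiles") // prints "7 x 4 tiles"
```

A smaller tile size produces more rows in the RasterFrame, each carrying less data; a larger one does the opposite.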
Another option is to use a GeoTrellis LayerReader to get a TileLayerRDD, for which there's also a toRF extension method.
import geotrellis.spark._
val tiledLayer: TileLayerRDD[SpatialKey] = ???
val rf = tiledLayer.toRF
RasterFrame has a number of methods providing access to metadata about the contents of the RasterFrame.
rf.tileColumns.map(_.toString)
// res8: Seq[String] = ArraySeq(tile)
rf.spatialKeyColumn.toString
// res9: String = spatial_key
The temporalKeyColumn method returns an Option[Column], since not all RasterFrames have an explicit temporal dimension.
rf.temporalKeyColumn.map(_.toString)
// res10: Option[String] = None
The Tile Layer Metadata defines how the spatial/spatiotemporal domain is discretized into tiles, and what the key bounds are.
scala> import spray.json._
import spray.json._
scala> // NB: The `fold` is required because an `Either` is returned, depending on the key type.
| rf.tileLayerMetadata.fold(_.toJson, _.toJson).prettyPrint
res12: String =
{
"extent": {
"xmin": 431902.5,
"ymin": 4313647.5,
"xmax": 443512.5,
"ymax": 4321147.5
},
"layoutDefinition": {
"extent": {
"xmin": 431902.5,
"ymin": 4313467.5,
"xmax": 445342.5,
"ymax": 4321147.5
},
"tileLayout": {
"layoutCols": 7,
"layoutRows": 4,
"tileCols": 128,
"tileRows": 128
}
},
"bounds": {
"minKey": {
"col": 0,
"row": 0
},
"maxKey": {
"col": 6,
"row": 3
}
},
"cellType": "uint16",
"crs": "+proj=utm +zone=16 +datum=WGS84 +units=m +no_defs "
}
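The layoutDefinition above is enough to work out which part of the map a given spatial key covers. The sketch below does that arithmetic in plain Scala. The BoundingBox and GridLayout case classes and the keyToBox function are stand-ins invented for this example (they are not the GeoTrellis types), and the code assumes the usual GeoTrellis convention that row 0 is the northernmost row of the grid.

```scala
// Minimal stand-ins for the metadata fields used here (not the GeoTrellis classes).
final case class BoundingBox(xmin: Double, ymin: Double, xmax: Double, ymax: Double)
final case class GridLayout(layoutCols: Int, layoutRows: Int, tileCols: Int, tileRows: Int)

// Area covered by the tile at (col, row), assuming row 0 sits at the top (north) edge.
def keyToBox(layoutExtent: BoundingBox, layout: GridLayout, col: Int, row: Int): BoundingBox = {
  val tileWidth  = (layoutExtent.xmax - layoutExtent.xmin) / layout.layoutCols
  val tileHeight = (layoutExtent.ymax - layoutExtent.ymin) / layout.layoutRows
  val xmin = layoutExtent.xmin + col * tileWidth
  val ymax = layoutExtent.ymax - row * tileHeight
  BoundingBox(xmin, ymax - tileHeight, xmin + tileWidth, ymax)
}

// Values copied from the layoutDefinition printed above.
val layoutExtent = BoundingBox(431902.5, 4313467.5, 445342.5, 4321147.5)
val layout = GridLayout(7, 4, 128, 128)

val topLeft = keyToBox(layoutExtent, layout, 0, 0)     // minKey from the bounds
val bottomRight = keyToBox(layoutExtent, layout, 6, 3) // maxKey from the bounds

println(topLeft)
println(bottomRight)
```

Each tile here spans 1920 map units per side, which at 128 cells per tile corresponds to 15 units per cell. Note that the layout extent is slightly larger than the data extent, since the grid is padded out to whole tiles.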