@VehpuS
Created June 8, 2022 08:35
Accessing datasets located in buffers using MemoryFile and ZipMemoryFile (based on https://github.com/rasterio/rasterio/issues/977)

Rasterio has different ways to access datasets located on disk or at network addresses and datasets located in memory buffers. This document reviews the former and then introduces the latter.

Accessing datasets on your filesystem

To access datasets on disk, give a filesystem path to rasterio.open().

import rasterio

# Open a dataset located in a local file.
with rasterio.open('data/RGB.byte.tif') as dataset:
    print(dataset.profile)

Equivalently, use a file:// URL.

with rasterio.open('file://data/RGB.byte.tif') as dataset:
    print(dataset.profile)

Accessing datasets in a zip archive

To access a dataset located in a local zip file, pass a zip:// URL (Apache VFS style) to rasterio.open().

with rasterio.open('zip://data/files.zip!RGB.byte.tif') as dataset:
    print(dataset.profile)

Accessing network datasets

Datasets at http://, https://, or s3:// (AWS CLI style) network locations can be accessed by passing these locators to rasterio.open(). See #942 for details.

The difference from GDAL

If you're a GDAL user, you may be used to passing strings like /vsizip/foo.zip to request zip file handling and strings like /vsicurl/https://example.com/foo.tif to request HTTP protocol handling. Rasterio registers handlers by URL scheme instead. Rasterio uses GDAL's special strings internally, but they are not part of the Rasterio API.
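The scheme-based idea described above can be sketched in plain Python. The function name to_vsi_path and its lookup table are hypothetical and exist only for illustration; this is not Rasterio's internal code, just a rough picture of how a URL-style locator might be translated into one of GDAL's special strings.

```python
from urllib.parse import urlparse

# Hypothetical lookup table for illustration; not Rasterio's internal code.
VSI_PREFIXES = {
    "zip": "/vsizip/",
    "http": "/vsicurl/",
    "https": "/vsicurl/",
    "s3": "/vsis3/",
}


def to_vsi_path(locator):
    """Translate a URL-style locator into a GDAL-style /vsi... string."""
    scheme = urlparse(locator).scheme
    if scheme not in VSI_PREFIXES:
        # Plain filesystem paths pass through unchanged.
        return locator
    prefix = VSI_PREFIXES[scheme]
    if scheme in ("http", "https"):
        # /vsicurl/ expects the complete URL after the prefix.
        return prefix + locator
    # Drop "scheme://" and keep the remainder.
    rest = locator[len(scheme) + 3:]
    if scheme == "zip":
        # Apache VFS style: archive path and member are separated by "!".
        archive, _, member = rest.partition("!")
        if not member:
            return prefix + archive
        return prefix + archive + "/" + member.lstrip("/")
    return prefix + rest


print(to_vsi_path("zip://data/files.zip!RGB.byte.tif"))
print(to_vsi_path("https://example.com/foo.tif"))
```

Keeping the translation in one place like this is what lets callers work only with URLs while the GDAL-specific strings stay an implementation detail.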

Accessing datasets in memory buffers

Rasterio can access datasets located in the buffers of Python objects without writing the buffers to disk. To see this, read any GeoTIFF file into a bytes object.

# Read the whole file into a bytes object; the file is closed afterwards.
with open('data/RGB.byte.tif', 'rb') as f:
    data = f.read()

The bytes object bound to data now contains that GeoTIFF. To make it available to Rasterio (and GDAL), give data to a MemoryFile and then open the dataset using MemoryFile.open().

from rasterio.io import MemoryFile

with MemoryFile(data) as memfile:
    with memfile.open() as dataset:
        print(dataset.profile)

As there is only one dataset per MemoryFile, MemoryFile.open() needs no filename or path argument. In many cases the usage can be condensed to the following.

with MemoryFile(data).open() as dataset:
    print(dataset.profile)

MemoryFile is like Python's BytesIO class but has an additional special feature: the bytes buffer is mapped to a virtual file for use by GDAL. The virtual file is deleted when the MemoryFile closes.

You can also pass a file-like object opened in binary mode to MemoryFile(). This is for convenience only: the bytes of the file are read immediately into a bytes object.

with open('data/RGB.byte.tif', 'rb') as fp:
    with MemoryFile(fp).open() as dataset:
        print(dataset.profile)
        # Keep the profile and band data for the examples below.
        rgb_profile = dataset.profile
        rgb_data = dataset.read()

Note that the profile and band data of that dataset have been captured for use in other examples below.

Performance notes

Note that the approach above is a more memory-intensive way of getting the same results as the very first example in this document. Generally speaking, raster data formats are optimized for random access, and GDAL format drivers need datasets to be written entirely onto disk or into memory and mapped to a virtual file. Using MemoryFile to hold a large GeoTIFF doesn't require a hard disk (which is good for serverless applications) but loads the entire GeoTIFF into RAM.
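Because the whole file ends up in RAM, it can be worth checking a file's size before reading it into a buffer. The following is a minimal sketch in plain Python; the helper name read_if_small and the max_bytes budget are assumptions made for illustration and are not part of Rasterio.

```python
import os
import tempfile


def read_if_small(path, max_bytes):
    """Return the file's bytes if it fits within max_bytes, else None."""
    if os.path.getsize(path) > max_bytes:
        return None  # Too large for the budget: open it by path instead.
    with open(path, "rb") as f:
        return f.read()


# Demonstrate with a throwaway 1 KiB file so no real dataset is required.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "example.bin")
    with open(path, "wb") as f:
        f.write(b"\x00" * 1024)

    small = read_if_small(path, max_bytes=4096)
    too_big = read_if_small(path, max_bytes=512)

print(len(small), too_big)
```

When the helper returns bytes they could be handed to MemoryFile; when it returns None, passing the path to rasterio.open() avoids loading the whole file.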

Writing to MemoryFile

A MemoryFile can also be written to. You can create a GeoTIFF (for example) in memory and then stream its bytes elsewhere without writing to disk. In this case you must bind the MemoryFile to a name so it can be referenced later.

with MemoryFile() as memfile:
    with memfile.open(**rgb_profile) as dataset:
        dataset.write(rgb_data)

    memfile.seek(0)
    print(memfile.read(1000))

Writing band data to the opened dataset modifies the virtual file and consequently the MemoryFile buffer.

Be kind: rewind

Note well: after the dataset closes, the memfile position is left at its end. Call memfile.seek(0) before reading, as in the example above.
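The same end-of-buffer behavior can be seen with Python's own io.BytesIO, which the text above compares MemoryFile to. This sketch uses BytesIO as a stand-in (an analogy only, not Rasterio itself), and the bytes written are placeholders rather than a valid GeoTIFF.

```python
from io import BytesIO

# BytesIO stands in for MemoryFile here (an analogy, not Rasterio itself).
buffer = BytesIO()
buffer.write(b"II*\x00 placeholder bytes, not a real GeoTIFF")

# After writing, the position sits at the end, so read() finds nothing.
at_end = buffer.read()

# Rewind to the start before reading the contents back.
buffer.seek(0)
rewound = buffer.read(4)

print(at_end)
print(rewound)
```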

Zip files in a buffer

The ZipMemoryFile class is mostly the same, but is for use with a buffer that contains a zip archive.

from rasterio.io import ZipMemoryFile

# Open the archive in binary mode; the file is closed when the block exits.
with open('data/files.zip', 'rb') as fp:
    with ZipMemoryFile(fp) as zipmem:
        with zipmem.open('RGB.byte.tif') as dataset:
            print(dataset.profile)

This is much the same interface as that of zipfile.ZipFile.
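If the member names inside a zipped buffer are not known in advance, Python's zipfile module can list them first. The sketch below is standard library only and builds a placeholder archive in memory; the member names and contents are made up for illustration and are not real datasets.

```python
from io import BytesIO
import zipfile

# Build a small archive in memory so the sketch needs no files on disk.
# The member names and contents are placeholders, not real datasets.
archive = BytesIO()
with zipfile.ZipFile(archive, "w") as zf:
    zf.writestr("RGB.byte.tif", b"placeholder bytes")
    zf.writestr("README.txt", b"notes")

# Rewind, then list the members to find candidate dataset names.
archive.seek(0)
with zipfile.ZipFile(archive) as zf:
    tif_names = [name for name in zf.namelist() if name.lower().endswith(".tif")]

print(tif_names)
```

Each name found this way is a candidate argument for ZipMemoryFile.open(), assuming the member is a format GDAL can read.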

Writing in-memory zip files

Writing to a ZipMemoryFile is not currently supported, but it is possible to do so using Python's zipfile library and Rasterio's MemoryFile together.

from io import BytesIO
import zipfile

with BytesIO() as bytes_buffer:
    with zipfile.ZipFile(bytes_buffer, 'w') as zf:

        with MemoryFile() as memfile:
            with memfile.open(**rgb_profile) as dataset:
                dataset.write(rgb_data)
                
            memfile.seek(0)
            zf.writestr('foo.tif', memfile.read())

    bytes_buffer.seek(0)
    with ZipMemoryFile(bytes_buffer).open('foo.tif') as dataset:
        print(dataset.profile)

Final notes on convenience features

By popular request, rasterio.open() can also take a file object opened in binary modes 'rb' or 'wb' as its first argument.

with open('data/RGB.byte.tif', 'rb') as f:
    with rasterio.open(f) as dataset:
        print(dataset.profile)

A MemoryFile is created internally to hold the bytes read from the input file object. This is therefore not the best way to read or write datasets already on disk and addressable by name.

As is the case for every printed profile, the output is the following.

{'tiled': False, 'transform': Affine(300.0379266750948, 0.0, 101985.0,
       0.0, -300.041782729805, 2826915.0), 'width': 791, 'dtype': 'uint8', 'interleave': 'pixel', 'driver': 'GTiff', 'crs': CRS({'init': 'epsg:32618'}), 'count': 3, 'height': 718, 'nodata': 0.0}

In summary, Rasterio has different ways to access datasets located on disk or at network addresses and datasets located in memory buffers. The features come from GDAL, but the abstractions are different and more Pythonic.
