Avro

Schema

Avro schemas are defined using JSON and built from primitive types and complex types.

A schema definition may contain multiple record definitions.

A record definition must, at minimum, have type, name, and fields. The full name of a schema is composed of its optional namespace and name. For example, example.avro.User:

{
  "namespace": "example.avro",
  "type": "record",
  "name": "User",
  "fields": []
}  

Fields are defined via an array of objects, each of which defines a name and type.
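For example, a minimal sketch of the User record above with two fields, written as the Python dict that fastavro's parse_schema accepts (the field names are purely illustrative):

from fastavro.schema import parse_schema

# Same User record as above, with two illustrative fields added
user_schema = parse_schema({
    "namespace": "example.avro",
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "name", "type": "string"},
        {"name": "age", "type": "int"},
    ],
})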

Mapping

A mapping relates an Avro data type to a language-specific data type. For example, an Avro double is a float in Python. Any given language may support multiple mappings, but all languages support the Generic mapping.

Reading and Writing

Serialized data is stored alongside its schema, which makes it possible to read Avro records without supplying the schema separately.

TODO: Is using a schema during read useful for validation?

Always use binary mode when working with files to avoid potential corruption due to automatic replacement of newline characters with their platform-specific representations.

Schema Evolution

When a data file is saved, the schema used to write that file (the "writer's schema") is embedded in the header. However, you can optionally supply a different, newer schema (the "reader's schema") when reading that file. Any fields added in the newer schema must have a default value.

Defining a default of null requires a union type:

{ "name": "description", "type": ["null", "string"], "default": null }

Other reasons to read a data file with a different schema than the one it was written with:

  • create a projection of the original data by loading only certain fields
  • rename fields by declaring aliases for the old field names
  • sort fields in a different order

Python

The official package is avro-python3 and is written entirely in Python. fastavro uses C extensions for better performance but lacks certain features. The following examples all use the fastavro API.

avro-python3 validates records on write. fastavro does not by default, but can be asked to, either at write time or standalone (see the sketch below).
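A minimal sketch of standalone validation with fastavro (the schema and record here are illustrative):

from fastavro.validation import validate

# Illustrative schema and record; validate returns True on success and
# raises a ValidationError when the record does not conform
schema = {
    "type": "record",
    "name": "User",
    "fields": [{"name": "name", "type": "string"}],
}
validate({"name": "Alice"}, schema)   # True
# validate({"name": 42}, schema)      # raises ValidationError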

fastavro includes a command-line tool that can be used to dump data files (or their schema) as JSON.

# Load a schema
import json
from fastavro.schema import parse_schema, fullname

with open("weather.avsc", "r") as file:
    parsed_schema = parse_schema(json.load(file))

print(fullname(parsed_schema))  # => avro.test.Weather

Writing a Record

from fastavro import writer

# Non-validating write; records is an iterable of dicts matching the schema
with open("weather.avro", "wb") as out:
    writer(out, parsed_schema, records)
    
# Writing with compression
with open("weather.avro", "wb") as out:
    writer(out, parsed_schema, records, codec="deflate")

# Validating write: any malformed records will cause an exception
with open("weather.avro", "wb") as out:
    writer(out, parsed_schema, records, validator=True)

Reading a Record

The reader will use the schema stored in the data file's header.

from fastavro import reader

with open('weather.avro', 'rb') as fo:
    for record in reader(fo):
        print(record)
        
# Reading in blocks
from fastavro import block_reader

with open("weather.avro", "rb") as fo:
    avro_reader = block_reader(fo)
    for block in avro_reader:
        print(
            f"num_records={block.num_records} offset={block.offset} "
            f"size={block.size}"
        )

File Types

.avdl - Avro IDL: a more human-friendly way to describe schemas and protocols

.avsc - Avro Schema: the conventional extension for an Avro schema. A schema file may contain only a single schema definition.

.avpr - Avro Protocol: a collection of schemas

Schemas and protocols are what get stored in a schema registry, not the IDL.

.avro - the data file (object container file): holds sequences of Avro objects. It has a header with metadata (the schema and a sync marker), followed by a series of optionally compressed data blocks. Blocks contain serialized objects and are separated from one another by the sync marker from the header. Because of the sync marker, files are splittable, which supports map-reduce processing.

Compression

Blocks can be compressed using a codec. fastavro and BigQuery both support only none (null), deflate, and snappy. Snappy is faster, while deflate produces slightly smaller files.
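A minimal sketch of a snappy-compressed write, reusing parsed_schema and records from the writing examples above (assumes the python-snappy package is installed, which fastavro needs for this codec):

from fastavro import writer

# codec="snappy" needs python-snappy; codec="deflate" uses only the stdlib
with open("weather.avro", "wb") as out:
    writer(out, parsed_schema, records, codec="snappy")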

Other

Two Avro files containing the same data will never be byte-for-byte identical, because of the random sync marker generated when each file is written. To compare two data files you need to deserialize them and compare their contents (or convert both to JSON and compare that).
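A hedged sketch of such a content comparison with fastavro (file names are illustrative; records are compared in order):

from fastavro import reader

def read_records(path):
    # Materialize every record in the data file as a plain dict
    with open(path, "rb") as fo:
        return list(reader(fo))

same = read_records("a.avro") == read_records("b.avro")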

Codegen

In Java there is support for generating classes from an Avro schema. These classes are then subclassed to add functionality (so that the base classes can be regenerated). There are [some attempts](https://pypi.org/project/avro-gen/) to do something similar in Python.

Pydantic

An alternative to Avro is JSON + pydantic.

Avro performance is worse than JSON, and a contract can be achieved with validated JSON. But what about schema evolution?

pydantic will deserialize JSON into typed objects, e.g. creating a Resource object from a JSON payload.

It may be possible to do the same with Avro records (dicts) with some glue.
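
A minimal sketch of that glue (the Resource fields and the record shape are assumptions for illustration, not from the source):

import json
from pydantic import BaseModel

class Resource(BaseModel):
    name: str
    description: str = ""

# From JSON
resource = Resource(**json.loads('{"name": "thing"}'))

# From an Avro record, which fastavro yields as a plain dict
avro_record = {"name": "thing", "description": "read from .avro"}
resource_from_avro = Resource(**avro_record)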
