@jsquire
Last active June 22, 2023 17:51
Schema Registry: JSON Serializer Thoughts

Schema Registry: JSON Serializer

Azure Schema Registry currently supports the Avro schema type and is adding support for JSON Schema. The Azure SDK library for Schema Registry offers an Avro serializer that is integrated with the Schema Registry client. In order to provide parity between the schema formats, serializer support for JSON Schema with a developer experience consistent with Avro is needed.

Business impact

  • Contributes to the competitive story for Event Hubs against Kafka/Confluent. (marketing "checkbox" feature)

  • Contributes to the Kafka interoperability story for Event Hubs, focusing on cross-product producing and consuming scenarios. (marketing "checkbox" feature)

  • Enables a consistent developer experience with Schema Registry across different schema formats, reducing support costs and special-case documentation needs.

Champion Scenarios

An event published with schema by the Event Hubs clients can be consumed/validated by the Kafka client

  • An object is serialized into EventData using the JSON Serializer:

    • The schema for the object is provided by the caller.
    • The serializer requests the schema from the service.
    • If the schema doesn't exist, validation fails and an error is thrown.
    • If the schema exists, the EventData is created and returned with the content type (including schema Id) set.
  • The EventData instance is published using an Event Hubs client.

  • The Kafka client receives the event:

    • It uses the content type to invoke the Schema Registry JSON serializer plug-in.
    • The plug-in parses the Schema Id from the content-type.
    • The plug-in retrieves the schema from the service.
    • The plug-in validates the schema against the type of object being deserialized and invokes either the error callback (if invalid) or the process message callback (if valid).

An event published with schema by the Kafka client can be consumed by the Event Hubs clients

  • The Kafka client receives a request to publish data:

    • It uses context hints to decide whether to invoke the Schema Registry JSON serializer plug-in.
    • The plug-in validates the object being serialized against the schema in the service (somehow; details unknown).
    • The client either invokes the error callback (if invalid) or publishes the event and invokes the process message callback (if valid).
  • An EventData that was received is deserialized into an object using the JSON Serializer:

    • The serializer parses the schema id from the content-type.
    • The serializer retrieves the schema from the service.
    • If the schema does not exist, validation fails and an error is thrown.
    • If the schema exists, the serializer attempts to deserialize the EventData body into the requested object. If this succeeds, it implies that the object is valid according to the schema.

Data can be serialized and deserialized for a JSON Schema which uses any valid version accepted by the service

  • The serializer should be able to work regardless of the schema version that was registered.

  • The serializer should be able to use future JSON Schema versions without the need to upgrade the package.

NOTE: To support this, the serializer should not assume that it can infer a schema from the type being serialized/deserialized, since it is not possible to know which version the registered schema uses. Even if the inferred schema matches the registered one in everything but the version, the lookup will fail.
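
The lookup failure described in the note can be illustrated with a minimal sketch. It assumes the service matches on the exact definition text; the real matching rule may normalize some differences. `FakeRegistry`, `register`, `find_id`, and `person_schema` are hypothetical names used only for illustration and are not part of the Schema Registry client API.

```python
import json


class FakeRegistry:
    """Hypothetical in-memory stand-in for the service (exact-match lookup)."""

    def __init__(self):
        self._ids_by_definition = {}

    def register(self, definition: str) -> str:
        # Assign a made-up id and remember the definition text verbatim.
        schema_id = f"schema-{len(self._ids_by_definition) + 1}"
        self._ids_by_definition[definition] = schema_id
        return schema_id

    def find_id(self, definition: str):
        # Returns None when the definition text is not registered.
        return self._ids_by_definition.get(definition)


def person_schema(draft_uri: str) -> str:
    # The same shape every time; only the declared JSON Schema version varies.
    return json.dumps(
        {
            "$schema": draft_uri,
            "type": "object",
            "properties": {"name": {"type": "string"}},
            "required": ["name"],
        },
        sort_keys=True,
    )


registry = FakeRegistry()

# The schema as it was registered, declaring draft-07.
registered = person_schema("http://json-schema.org/draft-07/schema#")
registry.register(registered)

# A schema inferred from the type by a generator that emits draft 2020-12.
inferred = person_schema("https://json-schema.org/draft/2020-12/schema")

same_lookup = registry.find_id(registered)    # found
inferred_lookup = registry.find_id(inferred)  # not found: only the version differs
```

Under this assumption, the only reliable input to the lookup is the definition that the caller registered, which is why the proposed API accepts the schema definition explicitly.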

Goals

  • Create a JSON serializer package for Schema Registry that has the same general structure and follows the same conventions as the Avro serializer.

  • Ensure that the payload and content-type are populated on the resulting event in the expected JSON format, enabling interop with Kafka plug-ins.

Out of scope

  • Building our own JSON Schema generator
  • Building our own JSON Schema validator

Key challenges

  • A schema cannot be generated automatically from the type being serialized/deserialized, because the version of the schema that was previously registered is unknown. If a schema is inferred, its version may differ from the registered one. Any difference in the schema used to query the service, even in the version number alone, causes the service to treat the schema as unregistered.

  • There is no built-in JSON Schema support in any language. Most languages do not have a free, open-source library backed by an entity trusted to continue long-term support. This leaves a choice between not supporting automatic schema validation and investing in building our own validator.

Proposed API

Serialize

  • string Serialize<T>(T data)

    • Allow the application to infer a schema from the data, for example by using a callback function.
    • If a schema was inferred, use it to query the service. If either no schema was inferred or the service did not return one, the operation fails.
    • If a schema was returned, allow the application to validate the type against it, for example by using a callback function.
    • Serialize and return the data.
  • string Serialize<T>(T data, string schemaDefinition)

    • Use the schemaDefinition to query the service; if no schema is returned, the operation fails.
    • If a schema was returned, allow the application to validate the type against it, for example by using a callback function.
    • Serialize and return the data.
  • T Deserialize<T>(MessageContent message)

    • Parse the content type of the message to determine the schemaId.
    • Query the service for the schemaId; if no schema is returned, the operation fails.
    • If a schema was returned, allow the application to validate the return type against it, for example by using a callback function.
    • Attempt to deserialize and return the data. If deserialization fails, the operation fails.
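
The steps above can be sketched roughly as follows. This is an illustration in Python rather than the signature style used above, and everything in it is an assumption made for the sketch: `InMemoryRegistry` stands in for the Schema Registry client, the `application/json+{schemaId}` content type is still an open question, and `serialize` returns a `MessageContent` instead of a `string` so that the content type travels with the payload. Because Python values carry no static type, the validation callback runs on the parsed data after deserialization rather than on the return type before it.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Dict, Optional

CONTENT_TYPE_PREFIX = "application/json+"  # assumed format, not confirmed


@dataclass
class MessageContent:
    data: bytes
    content_type: str


class InMemoryRegistry:
    """Hypothetical stand-in for the Schema Registry client."""

    def __init__(self) -> None:
        self._by_id: Dict[str, str] = {}

    def register(self, definition: str) -> str:
        schema_id = f"id-{len(self._by_id) + 1}"
        self._by_id[schema_id] = definition
        return schema_id

    def get_id(self, definition: str) -> Optional[str]:
        # Exact-match lookup by definition text.
        for schema_id, known in self._by_id.items():
            if known == definition:
                return schema_id
        return None

    def get_definition(self, schema_id: str) -> Optional[str]:
        return self._by_id.get(schema_id)


class JsonSchemaSerializer:
    def __init__(
        self,
        registry: InMemoryRegistry,
        validate: Callable[[Any, str], None],
        infer_schema: Optional[Callable[[Any], Optional[str]]] = None,
    ) -> None:
        self._registry = registry
        self._validate = validate          # application-supplied; raises if invalid
        self._infer_schema = infer_schema  # application-supplied; optional

    def serialize(self, data: Any, schema_definition: Optional[str] = None) -> MessageContent:
        if schema_definition is None and self._infer_schema is not None:
            schema_definition = self._infer_schema(data)
        if schema_definition is None:
            raise ValueError("No schema was provided or inferred.")
        schema_id = self._registry.get_id(schema_definition)
        if schema_id is None:
            raise LookupError("The schema is not registered with the service.")
        self._validate(data, schema_definition)
        payload = json.dumps(data).encode("utf-8")
        return MessageContent(payload, CONTENT_TYPE_PREFIX + schema_id)

    def deserialize(self, message: MessageContent) -> Any:
        if not message.content_type.startswith(CONTENT_TYPE_PREFIX):
            raise ValueError("The content type does not carry a schema id.")
        schema_id = message.content_type[len(CONTENT_TYPE_PREFIX):]
        definition = self._registry.get_definition(schema_id)
        if definition is None:
            raise LookupError("No schema was returned for the schema id.")
        data = json.loads(message.data)  # raises ValueError on malformed JSON
        self._validate(data, definition)
        return data


def require_properties(data: Any, schema_definition: str) -> None:
    # Toy validation callback: checks only the "required" keyword.
    for key in json.loads(schema_definition).get("required", []):
        if key not in data:
            raise ValueError(f"Missing required property: {key}")


registry = InMemoryRegistry()
schema = json.dumps({"type": "object", "required": ["name"]})
registry.register(schema)

serializer = JsonSchemaSerializer(registry, validate=require_properties)
message = serializer.serialize({"name": "Ada"}, schema)
round_tripped = serializer.deserialize(message)
```

Keeping inference and validation as callbacks means the package does not need to ship its own JSON Schema generator or validator, which matches the out-of-scope list above.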

Open questions

  • What is the content type? Does this follow the same conventions as Avro, making it "application/json+{schemaId}"?
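
If the convention in the question above were adopted, composing and parsing the content type could look like the sketch below. The `{mimeType}+{schemaId}` shape is the assumption under discussion, not a confirmed format, and `build_content_type` and `parse_content_type` are hypothetical helper names.

```python
from typing import Tuple


def build_content_type(mime_type: str, schema_id: str) -> str:
    # Assumed convention: "<mime type>+<schema id>".
    return f"{mime_type}+{schema_id}"


def parse_content_type(content_type: str) -> Tuple[str, str]:
    # Split on the first "+"; everything after it is treated as the schema id.
    mime_type, separator, schema_id = content_type.partition("+")
    if not separator or not mime_type or not schema_id:
        raise ValueError(f"No schema id found in content type: {content_type!r}")
    return mime_type, schema_id


content_type = build_content_type("application/json", "abc123")
parsed = parse_content_type(content_type)
```

A content type without the `+{schemaId}` suffix raises `ValueError` here, which corresponds to the "operation fails" outcome in the proposed Deserialize steps.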

References
