@mgodave
Created January 26, 2018 20:54
  • Status: Proposal
  • Author: Dave Rusek - Streamlio
  • Pull Request: See Below
  • Mailing List discussion:

Motivation

Data flowing through a messaging system is typically untyped. It travels end to end as bytes, and only the producers and consumers know its type and structure. This forces systems to coordinate out of band and makes it hard for other systems to discover useful data on which they can operate. Schema registries alleviate these problems by providing a centralized store for structural definitions of system data. With a centralized repository, systems producing data can communicate the structure of that data to downstream systems.

This document proposes a schema registry service tightly integrated with Pulsar's topic hierarchy. Schema integration is an opt-in feature and will not affect existing or future properties, clusters, namespaces, or topics that do not choose to use it. If, however, an administrator chooses to use this functionality, it will serve as a self-describing integrity check for data in the system and will allow integrations between Pulsar and other systems that can discover and take advantage of this type information.

Design

Data Model

message Schema {
    enum Format {
        AVRO = 0;
        JSON = 1;
        PROTOBUF = 2;
        THRIFT = 3;
    }

    enum State {
        STAGED = 1;
        ACTIVE = 2;
    }

    optional string name = 1;
    optional int32 version = 2;
    optional Format format = 3;
    optional State state = 4;
    optional string modified_user = 5;
    optional string modified_time = 6;
}

Storing Schema Data

Schema data will be stored alongside message data in BookKeeper. Much like a managed ledger, schema entries will be stored as an append-only, ordered list of entries. Schema entries occupy a BookKeeper ledger, and a topic with an associated schema will require a ZooKeeper node. Topics without any associated schema data will incur no overhead.
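
The append-only, ordered storage described above can be pictured with a small in-memory stand-in. This is only a sketch under stated assumptions: it does not use BookKeeper or ZooKeeper, the class name SchemaLog and its methods are hypothetical, and assigning versions as 1-based positions in the log is an assumption the proposal does not state.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory stand-in for a per-topic schema ledger: append-only and ordered. */
final class SchemaLog {

    /** Mirrors the Format enum of the Schema message in the Data Model. */
    enum Format { AVRO, JSON, PROTOBUF, THRIFT }

    /** One immutable entry; the version is assigned by the log on append. */
    static final class Entry {
        final String name;
        final int version;
        final Format format;

        Entry(String name, int version, Format format) {
            this.name = name;
            this.version = version;
            this.format = format;
        }
    }

    // Entries are only ever added at the end; nothing is updated or removed.
    private final List<Entry> entries = new ArrayList<>();

    /** Appends a new entry and returns its version (assumed: 1-based position in the log). */
    synchronized int append(String name, Format format) {
        int version = entries.size() + 1;
        entries.add(new Entry(name, version, format));
        return version;
    }

    /** Returns the most recent entry, or null if no schema has been stored. */
    synchronized Entry latest() {
        return entries.isEmpty() ? null : entries.get(entries.size() - 1);
    }

    /** Returns the entry for the given version, or null if no such version exists. */
    synchronized Entry get(int version) {
        if (version < 1 || version > entries.size()) {
            return null;
        }
        return entries.get(version - 1);
    }
}
```

A topic that never calls append holds an empty log, which loosely mirrors the point that topics without schema data incur no overhead.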

Serving Schema Data

Serving schemas from the Pulsar brokers would allow us to take advantage of the topic ownership routing logic to co-locate a schema with its topic, as well as ensure a single owner per schema ledger in the case of the Streamlio schema registry. Such an arrangement would serve both reads and writes through the same broker. This will require a new admin API to expose the schema data model as a collection of REST resources.

@GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
@GET @Path("/{property}/{cluster}/{namespace}/{topic}/schema/{version}")
@DELETE @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
@POST @Path("/{property}/{cluster}/{namespace}/{topic}/schema")
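
To make the resource layout above concrete, the following sketch builds the relative paths a client would request. It performs no I/O and is not part of any Pulsar client; the class SchemaPaths and its methods are hypothetical names introduced for illustration. The proposal does not say what a GET without a version returns, so the sketch only builds paths and assumes nothing about response semantics.

```java
/** Hypothetical helper that builds the relative resource paths listed above. */
final class SchemaPaths {

    private SchemaPaths() {}

    /** Path shared by GET (no version), POST, and DELETE. */
    static String schema(String property, String cluster, String namespace, String topic) {
        return "/" + String.join("/",
                segment(property), segment(cluster), segment(namespace), segment(topic), "schema");
    }

    /** Path for GET of one specific schema version. */
    static String schemaVersion(String property, String cluster, String namespace,
                                String topic, int version) {
        return schema(property, cluster, namespace, topic) + "/" + version;
    }

    /** Rejects values that would change the number of path segments. */
    private static String segment(String value) {
        if (value == null || value.isEmpty() || value.indexOf('/') >= 0) {
            throw new IllegalArgumentException("invalid path segment: " + value);
        }
        return value;
    }
}
```

For example, SchemaPaths.schemaVersion("my-prop", "us-west", "my-ns", "my-topic", 3) yields "/my-prop/us-west/my-ns/my-topic/schema/3"; the base URL of the broker admin endpoint is left to the caller.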

Changes

  • Implement a Schema Repository in Pulsar brokers (Staged PR)
  • Add Schema resources to broker admin API (Staged PR)
  • Extend client/server binary protocol to expose schema to client (PR)