CRDs Ideas

Data resource-focused CRDs and Operators

A lot of operators focus on the service being deployed, that is, the technology itself and its scale. Why not focus instead on the state resources those services provide?

Schema-first design

Applications have schemas for the data that they consume and produce. Building schema-first declares these data dependencies explicitly. The schema system should be relatively agnostic to the schema language used; a "type" field may indicate whether the schema is FlatBuffers, Protobuf, Avro, Thrift, JSON Schema, Parquet, etc.

The schema type will focus on message formats, not SQL schemas for database tables.
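For illustration, here is a minimal kubebuilder-style sketch of what such a Schema type could look like in Go. The package, field names, and the set of accepted "type" strings are assumptions made for the example, not a settled API:

```go
// schema_types.go: a hypothetical Schema CRD, sketched with kubebuilder markers.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SchemaSpec declares a message schema in one of several schema languages.
type SchemaSpec struct {
	// Type identifies the schema language, e.g. "protobuf", "avro",
	// "flatbuffers", "thrift", "json-schema" or "parquet".
	Type string `json:"type"`

	// Definition holds the schema body itself (IDL text, JSON, etc.).
	Definition string `json:"definition"`

	// ID is the 32-bit local schema ID; left empty by the author and
	// filled in by the mutating admission webhook (see the sketch below).
	ID *int32 `json:"id,omitempty"`
}

// SchemaStatus reports whether the schema was accepted as compatible.
type SchemaStatus struct {
	Compatible bool   `json:"compatible"`
	Message    string `json:"message,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// Schema is the top-level custom resource.
type Schema struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   SchemaSpec   `json:"spec,omitempty"`
	Status SchemaStatus `json:"status,omitempty"`
}
```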

Registered schemas could be validated as compatible with a ValidatingAdmissionWebhook. Once validated, the new schema could be issued a 32-bit local schema ID (i.e., the admission controller is also mutating), along with any patches (as indicated by the schema language, or as a JSON patch).

This would function like the "Schema Registry" in an Enterprise Kafka System.
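A rough sketch of that admission path, using controller-runtime's webhook package, is below. The compatibility check and the ID-allocation strategy are placeholders (a real registry would persist its counter durably and compare against previously registered versions), and the import of the Schema type above is hypothetical:

```go
// schema_webhook.go: hypothetical validating+mutating webhook that plays the
// role of a schema registry: reject incompatible schemas, assign local IDs.
package webhook

import (
	"context"
	"encoding/json"
	"fmt"
	"sync/atomic"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"

	v1alpha1 "example.com/schemas/api/v1alpha1" // the Schema type sketched above (assumed path)
)

type schemaRegistry struct {
	nextID int32 // naive in-memory counter, for illustration only
}

// Handle implements admission.Handler; it would be registered with the
// operator's webhook server.
func (r *schemaRegistry) Handle(ctx context.Context, req admission.Request) admission.Response {
	var schema v1alpha1.Schema
	if err := json.Unmarshal(req.Object.Raw, &schema); err != nil {
		return admission.Denied(fmt.Sprintf("cannot decode Schema: %v", err))
	}

	// Placeholder compatibility check: a real one would apply the rules
	// of schema.Spec.Type against the previously registered version.
	if schema.Spec.Definition == "" {
		return admission.Denied("empty schema definition")
	}

	// Mutate: issue a 32-bit local schema ID if one is not already set.
	if schema.Spec.ID == nil {
		id := atomic.AddInt32(&r.nextID, 1)
		schema.Spec.ID = &id
	}

	mutated, err := json.Marshal(&schema)
	if err != nil {
		return admission.Errored(500, err)
	}
	// PatchResponseFromRaw computes the JSON patch between the incoming
	// object and our mutated copy.
	return admission.PatchResponseFromRaw(req.Object.Raw, mutated)
}
```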

Realtime-next design

General event buses such as Kafka, NATS Streaming, etc. could play the role of the transport for these messages.

Primitives from those systems, such as materialized views, changelog events/tables, enrichment, and even windowed joins, could be represented generically in terms of the message types.

Most database tables are one of the following:

  • a materialized view of some expression over the events in the system;
  • an API for recording events about new entities (inserts);
  • an API for recording events which influence existing data (updates);
  • an API for recording events which note the end of relevance of existing data (deletes).

So you can look at databases as simply a source and sink for events.

There could be a View CRD which can represent either a materialized table (global or partitioned) or a time-series stream.
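As a sketch, such a View resource might reference a registered Schema and declare how it should be materialized; the field names and kind values here are invented for illustration:

```go
// view_types.go: hypothetical View CRD covering both materialized tables
// and time-series streams built from registered message schemas.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type ViewKind string

const (
	// GlobalTable materializes the view as a single, fully replicated table.
	GlobalTable ViewKind = "GlobalTable"
	// PartitionedTable materializes the view partitioned by key.
	PartitionedTable ViewKind = "PartitionedTable"
	// Stream exposes the view as an append-only time-series stream.
	Stream ViewKind = "Stream"
)

type ViewSpec struct {
	// SchemaRef names the Schema resource describing the message type.
	SchemaRef string `json:"schemaRef"`
	// Kind selects how the view is materialized.
	Kind ViewKind `json:"kind"`
	// Expression is an optional query/transform over the source events,
	// in whatever language the backing processor understands.
	Expression string `json:"expression,omitempty"`
}

// +kubebuilder:object:root=true
type View struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ViewSpec `json:"spec,omitempty"`
}
```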

Similarly, the schema CRD could be used as a basis to automatically capture DB writes and convert them into events.

Where supported/possible, the operators could use these objects to stand up stream processors (Kafka Connect, KSQL, goka, etc.) to transform the data.
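For the goka case, the processor the operator stands up could be as small as the counting example below. This is a sketch only: the topic name, group name, and the count-per-key logic are stand-ins for whatever a View's expression would actually ask for.

```go
// A minimal goka processor that materializes a per-key event count,
// roughly what an operator might template out for a PartitionedTable view.
package main

import (
	"context"
	"log"

	"github.com/lovoo/goka"
	"github.com/lovoo/goka/codec"
)

func main() {
	g := goka.DefineGroup("view-user-event-count", // hypothetical group name
		// Consume the source topic declared by the View/Schema objects.
		goka.Input("user-events", new(codec.String), func(ctx goka.Context, msg interface{}) {
			var count int64
			if v := ctx.Value(); v != nil {
				count = v.(int64)
			}
			ctx.SetValue(count + 1) // update the materialized value for this key
		}),
		// Persist the materialized view in the processor's group table.
		goka.Persist(new(codec.Int64)),
	)

	p, err := goka.NewProcessor([]string{"kafka:9092"}, g)
	if err != nil {
		log.Fatal(err)
	}
	if err := p.Run(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```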

Kafka

This could use a "Topic" or perhaps "TopicGroup" resource that lists the topics, along with things like acceptable partition counts and replication levels (replication level is an application concern where applications don't use the default ISR settings).
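A sketch of what a TopicGroup spec might carry; the resource name and fields are assumptions:

```go
// topicgroup_types.go: hypothetical TopicGroup CRD listing related topics
// together with acceptable partitioning and replication settings.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type TopicSpec struct {
	Name string `json:"name"`
	// Partitions is the minimum acceptable partition count.
	Partitions int32 `json:"partitions"`
	// ReplicationFactor may be set when the application does not rely on
	// the cluster's default ISR configuration.
	ReplicationFactor *int16 `json:"replicationFactor,omitempty"`
}

type TopicGroupSpec struct {
	Topics []TopicSpec `json:"topics"`
}

// +kubebuilder:object:root=true
type TopicGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              TopicGroupSpec `json:"spec,omitempty"`
}
```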

Kafka RBAC

There could be a CRD for "KafkaClientRole", which declares which topics a given client type needs to consume from or produce to, essentially defining a Kafka ACL in resource form. This could in turn generate SSL certs and/or SASL secrets in the appropriate namespace, and a configmap that can be used to connect to the cluster.
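Sketched as a type, again with invented field names, this might look like:

```go
// kafkaclientrole_types.go: hypothetical KafkaClientRole CRD, essentially a
// Kafka ACL in resource form plus the credentials/config to be generated.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type KafkaClientRoleSpec struct {
	// Consume and Produce list the topics (or TopicGroup names) the client
	// is allowed to read from and write to.
	Consume []string `json:"consume,omitempty"`
	Produce []string `json:"produce,omitempty"`
	// ConsumerGroups the client may join.
	ConsumerGroups []string `json:"consumerGroups,omitempty"`
	// Auth selects the credential type to generate: "tls" or "sasl".
	Auth string `json:"auth"`
}

type KafkaClientRoleStatus struct {
	// SecretName and ConfigMapName report the generated objects that a
	// workload can mount to connect to the cluster.
	SecretName    string `json:"secretName,omitempty"`
	ConfigMapName string `json:"configMapName,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type KafkaClientRole struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              KafkaClientRoleSpec   `json:"spec,omitempty"`
	Status            KafkaClientRoleStatus `json:"status,omitempty"`
}
```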

The starting point for this would be things like the Role and ClusterRole used for Kubernetes RBAC. Those types can't be used directly, as they control access to Kubernetes API resources, not to resources that live inside services running on the cluster.

Namespace considerations

The Kafka broker should run in its own namespace, and could potentially be in a different cluster altogether, or externally hosted; the operator should act as an AdminClient for the in-Kafka operations it needs to perform. ZooKeeper access may still need to be figured out.
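In Go, that admin-side work could lean on sarama's ClusterAdmin. A rough sketch of how the operator might reconcile one topic from a TopicGroup (broker version, addresses, and error handling are simplified assumptions):

```go
// Reconciling a single topic against the broker, as the operator's
// AdminClient role might do. Sketch only; a real reconciler would also
// diff partition counts and configs rather than just create-if-missing.
package kafka

import (
	"errors"

	"github.com/Shopify/sarama"
)

func EnsureTopic(brokers []string, name string, partitions int32, replication int16) error {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_3_0_0 // assumed broker version

	admin, err := sarama.NewClusterAdmin(brokers, cfg)
	if err != nil {
		return err
	}
	defer admin.Close()

	err = admin.CreateTopic(name, &sarama.TopicDetail{
		NumPartitions:     partitions,
		ReplicationFactor: replication,
	}, false)
	// Treat "already exists" as success so reconciliation stays idempotent.
	var terr *sarama.TopicError
	if errors.As(err, &terr) && terr.Err == sarama.ErrTopicAlreadyExists {
		return nil
	}
	return err
}
```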

Postgres

One approach here is to simply collect SQL schemas and migrations as their own CRs.
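A sketch of that simpler option, a CR that just carries ordered migration statements (the resource and field names are made up for illustration):

```go
// sqlmigration_types.go: hypothetical SqlMigration CRD that simply carries
// versioned DDL to be applied to a target database.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type MigrationStep struct {
	Version string `json:"version"` // e.g. "0003_add_orders_table"
	Up      string `json:"up"`      // DDL to apply
	Down    string `json:"down,omitempty"`
}

type SqlMigrationSpec struct {
	// Database names the target database/instance resource.
	Database string          `json:"database"`
	Steps    []MigrationStep `json:"steps"`
}

// +kubebuilder:object:root=true
type SqlMigration struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              SqlMigrationSpec `json:"spec,omitempty"`
}
```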

Another, more deeply modelled approach would be to tie them to schemas, such that a table can be created from a view defined with a schema. Inserts, updates, and deletes could potentially be performed via Pg triggers or an extension that instead emits events, or they could be captured via CDC (Debezium/StorageTapper/etc.).

With the definition of a table, the schema is migrated and, if necessary, populated. A schema would refer to a set of tables using selectors, and give them names.

The operator would then create a secret which holds the login credentials, as well as a configmap which contains the configuration details for the database.
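That step is plain controller-runtime/client-go territory. A trimmed sketch, where the namespace, object names, and keys are illustrative rather than prescribed:

```go
// Creating the credentials Secret and connection ConfigMap for a database,
// roughly as the Postgres operator's reconcile step might do.
package postgres

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func CreateConnectionObjects(ctx context.Context, c client.Client, ns, app, password string) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: app + "-db-credentials", Namespace: ns},
		StringData: map[string]string{
			"username": app,
			"password": password, // generated elsewhere; placeholder here
		},
	}
	if err := c.Create(ctx, secret); err != nil {
		return err
	}

	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: app + "-db-config", Namespace: ns},
		Data: map[string]string{
			"host":   "postgres." + ns + ".svc",
			"port":   "5432",
			"dbname": app,
		},
	}
	return c.Create(ctx, cm)
}
```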

A different 'deployment' operator would coordinate all of this. It could then create a deployment with the configmap and secret mounted.

New versions of the application being deployed could re-use these if there is no schema migration; otherwise the operator could create a new DB schema (or instance) with the tables, populate them as required, and then switch to the new version using a rolling update.

References / Reading list

  • Interesting read, but probably overkill: API Aggregation allows you to plug your own web server into the kube-apiserver API itself; given that it would also require a modified 'kubectl', it has limited, local use only, except perhaps in the context of CI automation.

  • Kubebuilder, a scaffolding tool for making CRD implementations: https://github.com/kubernetes-sigs/kubebuilder
  • The Kubebuilder book: https://book.kubebuilder.io/
