CRDs Ideas

Data resource-focused CRDs and Operators

A lot of operators focus on the service being deployed, that is, the technology itself and its scale. Why not focus instead on the state resources those services provide?

Schema-first design

Applications have schemas for the data that they consume and produce. Building schema-first declares these data dependencies explicitly. The schema system should be relatively agnostic to the schema language used; a "type" field may indicate whether the schema is FlatBuffers, Protobuf, Avro, Thrift, JSON Schema, Parquet, etc.

The schema type will focus on message formats, not SQL schemas for database tables.
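For illustration, here is a minimal kubebuilder-style sketch of what such a Schema type could look like in Go. The package, field names, and the set of accepted "type" strings are assumptions made for the example, not a settled API:

```go
// schema_types.go: a hypothetical Schema CRD, sketched with kubebuilder markers.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// SchemaSpec declares a message schema in one of several schema languages.
type SchemaSpec struct {
	// Type identifies the schema language, e.g. "protobuf", "avro",
	// "flatbuffers", "thrift", "json-schema" or "parquet".
	Type string `json:"type"`

	// Definition holds the schema body itself (IDL text, JSON, etc.).
	Definition string `json:"definition"`

	// ID is the 32-bit local schema ID; left empty by the author and
	// filled in by the mutating admission webhook (see the sketch below).
	ID *int32 `json:"id,omitempty"`
}

// SchemaStatus reports whether the schema was accepted as compatible.
type SchemaStatus struct {
	Compatible bool   `json:"compatible"`
	Message    string `json:"message,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status

// Schema is the top-level custom resource.
type Schema struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   SchemaSpec   `json:"spec,omitempty"`
	Status SchemaStatus `json:"status,omitempty"`
}
```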

Registered schemas could be validated as compatible with a ValidatingAdmissionWebhook. Once validated, the new schema could be issued a 32-bit local schema ID (i.e., the admission controller is also mutating), along with any patches (as indicated by the schema language, or as a JSON patch).

This would function like the "Schema Registry" in an Enterprise Kafka System.
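A rough sketch of that admission path, using controller-runtime's webhook package, is below. The compatibility check and the ID-allocation strategy are placeholders (a real registry would persist its counter durably and compare against previously registered versions), and the import of the Schema type above is hypothetical:

```go
// schema_webhook.go: hypothetical validating+mutating webhook that plays the
// role of a schema registry: reject incompatible schemas, assign local IDs.
package webhook

import (
	"context"
	"encoding/json"
	"fmt"
	"sync/atomic"

	"sigs.k8s.io/controller-runtime/pkg/webhook/admission"

	v1alpha1 "example.com/schemas/api/v1alpha1" // the Schema type sketched above (assumed path)
)

type schemaRegistry struct {
	nextID int32 // naive in-memory counter, for illustration only
}

// Handle implements admission.Handler; it would be registered with the
// operator's webhook server.
func (r *schemaRegistry) Handle(ctx context.Context, req admission.Request) admission.Response {
	var schema v1alpha1.Schema
	if err := json.Unmarshal(req.Object.Raw, &schema); err != nil {
		return admission.Denied(fmt.Sprintf("cannot decode Schema: %v", err))
	}

	// Placeholder compatibility check: a real one would apply the rules
	// of schema.Spec.Type against the previously registered version.
	if schema.Spec.Definition == "" {
		return admission.Denied("empty schema definition")
	}

	// Mutate: issue a 32-bit local schema ID if one is not already set.
	if schema.Spec.ID == nil {
		id := atomic.AddInt32(&r.nextID, 1)
		schema.Spec.ID = &id
	}

	mutated, err := json.Marshal(&schema)
	if err != nil {
		return admission.Errored(500, err)
	}
	// PatchResponseFromRaw computes the JSON patch between the incoming
	// object and our mutated copy.
	return admission.PatchResponseFromRaw(req.Object.Raw, mutated)
}
```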

Realtime-next design

General event buses such as Kafka, NATS Streaming, etc. could play the role of the transport for these messages.

Primitives from those systems, such as materialized views, changelog events/tables, enrichment, and even windowed joins, could be represented generically in terms of the message types.

Most database tables are one of the following:

  • a materialized view of some expression over the events in the system;
  • an API for recording events about new entities (inserts);
  • an API for recording events which influence existing data (updates);
  • an API for recording events which note the end of relevance of existing data (deletes).

So you can look at databases as simply a source and sink for events.

There could be a View CRD which can represent either a materialized table (global or partitioned) or a time-series stream.
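As a sketch, such a View resource might reference a registered Schema and declare how it should be materialized; the field names and kind values here are invented for illustration:

```go
// view_types.go: hypothetical View CRD covering both materialized tables
// and time-series streams built from registered message schemas.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type ViewKind string

const (
	// GlobalTable materializes the view as a single, fully replicated table.
	GlobalTable ViewKind = "GlobalTable"
	// PartitionedTable materializes the view partitioned by key.
	PartitionedTable ViewKind = "PartitionedTable"
	// Stream exposes the view as an append-only time-series stream.
	Stream ViewKind = "Stream"
)

type ViewSpec struct {
	// SchemaRef names the Schema resource describing the message type.
	SchemaRef string `json:"schemaRef"`
	// Kind selects how the view is materialized.
	Kind ViewKind `json:"kind"`
	// Expression is an optional query/transform over the source events,
	// in whatever language the backing processor understands.
	Expression string `json:"expression,omitempty"`
}

// +kubebuilder:object:root=true
type View struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              ViewSpec `json:"spec,omitempty"`
}
```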

Similarly, the schema CRD could be used as a basis to automatically capture DB writes and convert them into events.

Where supported/possible, the operators could use these objects to stand up stream processors (Kafka Connect, KSQL, goka, etc.) to transform the data.
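For the goka case, the processor the operator stands up could be as small as the counting example below. This is a sketch only: the topic name, group name, and the count-per-key logic are stand-ins for whatever a View's expression would actually ask for.

```go
// A minimal goka processor that materializes a per-key event count,
// roughly what an operator might template out for a PartitionedTable view.
package main

import (
	"context"
	"log"

	"github.com/lovoo/goka"
	"github.com/lovoo/goka/codec"
)

func main() {
	g := goka.DefineGroup("view-user-event-count", // hypothetical group name
		// Consume the source topic declared by the View/Schema objects.
		goka.Input("user-events", new(codec.String), func(ctx goka.Context, msg interface{}) {
			var count int64
			if v := ctx.Value(); v != nil {
				count = v.(int64)
			}
			ctx.SetValue(count + 1) // update the materialized value for this key
		}),
		// Persist the materialized view in the processor's group table.
		goka.Persist(new(codec.Int64)),
	)

	p, err := goka.NewProcessor([]string{"kafka:9092"}, g)
	if err != nil {
		log.Fatal(err)
	}
	if err := p.Run(context.Background()); err != nil {
		log.Fatal(err)
	}
}
```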

Kafka

This could use a "Topic" or perhaps "TopicGroup" resource that lists the topics, along with things like acceptable partition counts and replication levels (replication level is an application concern where applications don't use the default ISR settings).
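A sketch of what a TopicGroup spec might carry; the resource name and fields are assumptions:

```go
// topicgroup_types.go: hypothetical TopicGroup CRD listing related topics
// together with acceptable partitioning and replication settings.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type TopicSpec struct {
	Name string `json:"name"`
	// Partitions is the minimum acceptable partition count.
	Partitions int32 `json:"partitions"`
	// ReplicationFactor may be set when the application does not rely on
	// the cluster's default ISR configuration.
	ReplicationFactor *int16 `json:"replicationFactor,omitempty"`
}

type TopicGroupSpec struct {
	Topics []TopicSpec `json:"topics"`
}

// +kubebuilder:object:root=true
type TopicGroup struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              TopicGroupSpec `json:"spec,omitempty"`
}
```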

Kafka RBAC

There could be a CRD for "KafkaClientRole", which declares which topics a given client type needs to consume from or produce to, essentially defining a Kafka ACL in resource form. This could in turn generate SSL certs and/or SASL secrets in the appropriate namespace, and a configmap that can be used to connect to the cluster.
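Sketched as a type, again with invented field names, this might look like:

```go
// kafkaclientrole_types.go: hypothetical KafkaClientRole CRD, essentially a
// Kafka ACL in resource form plus the credentials/config to be generated.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type KafkaClientRoleSpec struct {
	// Consume and Produce list the topics (or TopicGroup names) the client
	// is allowed to read from and write to.
	Consume []string `json:"consume,omitempty"`
	Produce []string `json:"produce,omitempty"`
	// ConsumerGroups the client may join.
	ConsumerGroups []string `json:"consumerGroups,omitempty"`
	// Auth selects the credential type to generate: "tls" or "sasl".
	Auth string `json:"auth"`
}

type KafkaClientRoleStatus struct {
	// SecretName and ConfigMapName report the generated objects that a
	// workload can mount to connect to the cluster.
	SecretName    string `json:"secretName,omitempty"`
	ConfigMapName string `json:"configMapName,omitempty"`
}

// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
type KafkaClientRole struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              KafkaClientRoleSpec   `json:"spec,omitempty"`
	Status            KafkaClientRoleStatus `json:"status,omitempty"`
}
```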

The starting point for this would be things like the Role and ClusterRole used for Kubernetes RBAC. Those types can't be used directly, as they control access to Kubernetes API resources, not to resources that live inside services running on the cluster.

Namespace considerations

The Kafka broker should run in its own namespace, and could potentially be in a different cluster altogether, or externally hosted; the operator should act as an AdminClient for the in-Kafka operations it needs to perform. ZooKeeper access may still need to be figured out.
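In Go, that admin-side work could lean on sarama's ClusterAdmin. A rough sketch of how the operator might reconcile one topic from a TopicGroup (broker version, addresses, and error handling are simplified assumptions):

```go
// Reconciling a single topic against the broker, as the operator's
// AdminClient role might do. Sketch only; a real reconciler would also
// diff partition counts and configs rather than just create-if-missing.
package kafka

import (
	"errors"

	"github.com/Shopify/sarama"
)

func EnsureTopic(brokers []string, name string, partitions int32, replication int16) error {
	cfg := sarama.NewConfig()
	cfg.Version = sarama.V2_3_0_0 // assumed broker version

	admin, err := sarama.NewClusterAdmin(brokers, cfg)
	if err != nil {
		return err
	}
	defer admin.Close()

	err = admin.CreateTopic(name, &sarama.TopicDetail{
		NumPartitions:     partitions,
		ReplicationFactor: replication,
	}, false)
	// Treat "already exists" as success so reconciliation stays idempotent.
	var terr *sarama.TopicError
	if errors.As(err, &terr) && terr.Err == sarama.ErrTopicAlreadyExists {
		return nil
	}
	return err
}
```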

Postgres

One approach here is to simply collect SQL schemas and migrations as their own CRs.
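A sketch of that simpler option, a CR that just carries ordered migration statements (the resource and field names are made up for illustration):

```go
// sqlmigration_types.go: hypothetical SqlMigration CRD that simply carries
// versioned DDL to be applied to a target database.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type MigrationStep struct {
	Version string `json:"version"` // e.g. "0003_add_orders_table"
	Up      string `json:"up"`      // DDL to apply
	Down    string `json:"down,omitempty"`
}

type SqlMigrationSpec struct {
	// Database names the target database/instance resource.
	Database string          `json:"database"`
	Steps    []MigrationStep `json:"steps"`
}

// +kubebuilder:object:root=true
type SqlMigration struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`
	Spec              SqlMigrationSpec `json:"spec,omitempty"`
}
```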

Another, more deeply modelled approach would be to tie them to schemas, such that a table can be created from a view defined with a schema. Inserts, updates, and deletes could potentially be performed via Pg triggers or an extension that instead emits events, or they could be captured via CDC (Debezium/StorageTapper/etc.).

With the definition of a table, the schema is migrated and, if necessary, populated. A schema would refer to a set of tables using selectors, and give them names.

The operator would then create a secret which holds the login credentials, as well as a configmap which contains the configuration details for the database.
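That step is plain controller-runtime/client-go territory. A trimmed sketch, where the namespace, object names, and keys are illustrative rather than prescribed:

```go
// Creating the credentials Secret and connection ConfigMap for a database,
// roughly as the Postgres operator's reconcile step might do.
package postgres

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func CreateConnectionObjects(ctx context.Context, c client.Client, ns, app, password string) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: app + "-db-credentials", Namespace: ns},
		StringData: map[string]string{
			"username": app,
			"password": password, // generated elsewhere; placeholder here
		},
	}
	if err := c.Create(ctx, secret); err != nil {
		return err
	}

	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: app + "-db-config", Namespace: ns},
		Data: map[string]string{
			"host":   "postgres." + ns + ".svc",
			"port":   "5432",
			"dbname": app,
		},
	}
	return c.Create(ctx, cm)
}
```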

A different 'deployment' operator would coordinate all of this. It could then create a deployment with the configmap and secret mounted.

New versions of the application being deployed could re-use these if there is no schema migration; otherwise the operator could create a new DB schema (or instance) with the tables, populate them as required, and then switch to the new version using a rolling update.

References / Reading list

  • Interesting read, but probably overkill: API Aggregation allows you to plug your own web server into the kube-apiserver API itself; given that it would also require a modified 'kubectl', it has limited, local use only, except perhaps in the context of CI automation.

  • Kubebuilder, a scaffolding tool for making CRD implementations: https://github.com/kubernetes-sigs/kubebuilder
  • The Kubebuilder book: https://book.kubebuilder.io/
