george-angel/operator.md

## operator.md

      
So you want to deploy an Operator

https://kubernetes.io/docs/concepts/extend-kubernetes/operator/
You might be exploring a tool and find a section titled "Run this as an
operator in your Kubernetes cluster". That sounds swell: you run one magic
command, some make install, and now you can define some config to just make
things work. On paper that sounds good, but there are a few unspoken
considerations.
Costs

Maintenance - once the operator is deployed, its supporting RBAC components and
CRD definitions need to be kept up to date. You will also need to monitor and
manage the operator: the usual list of metrics, logs, resource usage, etc.
Setup - Operators comprise both cluster-scoped and namespaced resources.
Things like "Deployment", "Service" and "ServiceAccount" are namespaced, whereas
"ClusterRole", "ClusterRoleBinding" and "CustomResourceDefinition" are
cluster scoped. The first step is to separate the two, because cluster-scoped
resources should live in "kube-system", where the CI agent has permission to
apply this scope. It also means @system needs to review changes for that scope;
why this is important is explained later on.
Audit - Before deploying, we need to review what permissions the operator
requests and what actions it will perform once deployed. This is discussed in
detail in the next section, and it needs to be done on every upgrade.
Risks

APIServer - This is a core Kubernetes component that sits in front of ETCD and
exposes an API to Kubernetes resources. All Operators will want to talk to
APIServer, at the very least to watch for their own CRD instances. Often
they will also want read-write access to Deployments and Pods, and read-only
access to Secrets. Here an operator can be a bad resident by calling APIServer
too aggressively, DoSing it and bringing down the cluster. Examples:
kubernetes-sigs/external-dns#484,
kyverno/kyverno#6977
ETCD - the database that keeps Kubernetes cluster state. It's very good at
certain things, but very bad at others. It does not respond well to large
objects, a large number of objects, or a large overall size, and the etcd
documentation suggests 8GB as the upper limit for storage size. More:
https://etcd.io/docs/v3.5/dev-guide/limit/. Some operators treat ETCD as tmpfs
for their needs, which can have negative and sometimes problematic
consequences. Example: kyverno/kyverno#5830
Metrics cardinality - Some operators, like Airflow, work by creating a large
number of short-lived containers to perform tasks. This is not a problem for
Kubernetes itself, but presents a big problem for our Prometheus + Thanos
monitoring stack. There are a number of common metrics that we export for all
Pods, things like cpu, memory, disk, etc. This allows us to debug issues
without having every team expose these types of metrics themselves. Every new
Pod adds its own set of series for each of these metrics, so by spinning up and
tearing down a large number of Pods we grow their cardinality, which is
problematic for Prometheus:
https://www.robustperception.io/cardinality-is-key/
Secrets [permissions] - this is a big one, and it is encountered surprisingly
often. By default, a lot of operators will request permission to read all
Secrets in the cluster. Example, k6:
https://github.com/grafana/k6-operator/blob/main/config/rbac/role.yaml#L74-L81,
https://github.com/grafana/k6-operator/blob/main/config/rbac/role_binding.yaml
That is quite an opening. Lesser in magnitude, but it's also not great that
someone can delete my Deployments:
https://github.com/grafana/k6-operator/blob/main/config/rbac/role.yaml#L8C15-L19
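
To make the Secrets risk concrete, here is a rough sketch of the kind of rule to look for during an audit, next to a narrower alternative. Every name here (`example-operator`, `my-team`, `example-operator-credentials`) is hypothetical and nothing is copied from a real operator. Whether a given operator still works with the narrower Role depends on how it was written.

```yaml
# Hypothetical example: when bound with a ClusterRoleBinding, this rule lets
# the operator read every Secret in every namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-operator
rules:
  - apiGroups: [""]            # "" is the core API group, where Secrets live
    resources: ["secrets"]
    verbs: ["get", "list", "watch"]
---
# Hypothetical narrower alternative: a namespaced Role that only allows
# fetching one named Secret in the operator's own namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: example-operator
  namespace: my-team
rules:
  - apiGroups: [""]
    resources: ["secrets"]
    resourceNames: ["example-operator-credentials"]
    verbs: ["get"]
```

When reviewing real manifests, the things to check are the `resources` and `verbs` lists, and whether the rule ends up bound cluster-wide or to a single namespace.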
How to deploy


You will need to unpick Helm charts / install scripts and PR the cluster-scoped
resources to kube-system/<operator>. This is for you to understand what the
operator does, and for us to review permissions and have our CI/CD agent apply
these resources.


Ideally your operator can run "namespace scoped":
https://sdk.operatorframework.io/docs/building-operators/golang/operator-scope/
Then only the CRD definition needs to live in kube-system, perhaps along with
the ClusterRole, and everything else can live inside your namespace,
importantly including the RoleBinding. Using a RoleBinding on a ClusterRole
means that the permissions granted apply only to the namespace where the
RoleBinding lives.


All namespace scoped components then get deployed inside your namespace.
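
The RoleBinding-on-ClusterRole arrangement described above might look roughly like this. It is a minimal sketch: the namespace `my-team` and the name `example-operator` are made up for illustration.

```yaml
# Lives in your namespace. It references a ClusterRole, but because this is a
# RoleBinding, the permissions only apply inside "my-team".
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: example-operator
  namespace: my-team            # hypothetical namespace
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole             # cluster-scoped definition, reviewed separately
  name: example-operator        # hypothetical ClusterRole name
subjects:
  - kind: ServiceAccount
    name: example-operator      # the ServiceAccount the operator Pod runs as
    namespace: my-team
```

The ClusterRole stays a reusable, reviewed definition of what the operator may do, while each RoleBinding decides where it may do it.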


This is an example of Keda Operator manifests that are broken up by scope and
exposed as a Kustomize base: https://github.com/utilitywarehouse/keda-manifests
The cluster-scoped base is then referenced from kube-system:
https://github.com/utilitywarehouse/kubernetes-manifests/blob/master/dev-merit/kube-system/keda/kustomization.yaml#LL4C1-L4C68
where the extra resources are bindings:
https://github.com/utilitywarehouse/kubernetes-manifests/blob/master/dev-merit/kube-system/keda/kustomization.yaml#L5-L7
These are kept out of the base as they specify the namespace where the
operator is deployed.
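
As a rough illustration of that layout (not the contents of the linked files), a kube-system kustomization could pull in a remote cluster-scoped base and add a locally defined binding. The repository `example-org/example-operator-manifests`, the `cluster` directory, the `v1.0.0` ref and all resource names are assumptions made up for this sketch.

```yaml
# kube-system/example-operator/kustomization.yaml (hypothetical path)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Remote base holding only cluster-scoped resources (CRDs, ClusterRoles).
  - github.com/example-org/example-operator-manifests//cluster?ref=v1.0.0
  # Binding kept local because it names the namespace the operator runs in.
  - clusterrolebinding.yaml
```

```yaml
# kube-system/example-operator/clusterrolebinding.yaml (hypothetical path)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: example-operator
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: example-operator
subjects:
  - kind: ServiceAccount
    name: example-operator
    namespace: my-team          # the namespace where the operator is deployed
```

Keeping the binding out of the shared base means the base can be reused unchanged by any team, whatever namespace they deploy the operator into.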