@timflannagan
Last active September 13, 2022 16:22
OLM-2668 Overview

Overview

Implements the components described in the phase 0 platform operators EP.

Payload Introduction

o/platform-operators#58 introduces the PO components to the payload.

High Level Components

Note: These components will only be present when admins enable the "TechPreviewNoUpgrade" feature set.

Terminology

  • CPOM: cluster platform operator manager
  • ACO: aggregate ClusterOperator controller
  • PO: platform operator(s)
  • BD: rukpak BundleDeployment API
  • CO: ClusterOperator API

Overview

  • CPOM: Reconciles the PlatformOperator resource and creates BundleDeployment resources. Interacts with OLM's CatalogSource API and registry+v1 bundle format.
  • ACO: Reconciles an "aggregate" ClusterOperator resource and acts as a proxy to the CVO. Lists any PlatformOperator resources in the cluster, inspects their status, and bubbles up any failure states to the "platform-operators-aggregated" ClusterOperator resource.
  • RukPak: Manages the Bundle and BundleDeployment resources. Dynamically watches the underlying resources, and ensures they're present on the cluster.

Configuring Platform Operators

Enabling the "TechPreviewNoUpgrade" Feature Set

This feature set can either be enabled before cluster creation or after cluster rollout.

Before Cluster Creation

Configure a "FeatureGate" YAML manifest and include it with the rest of the cluster's Kubernetes manifests before running the OpenShift installer. The installer will bootstrap the cluster using those configured manifests.
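A sketch of that flow, assuming the openshift-install binary and an install directory named install-dir (both names are illustrative):

```shell
# Generate the installer's Kubernetes manifests, then add the FeatureGate
# manifest before creating the cluster. Directory and file names are examples.
openshift-install create manifests --dir=install-dir

cat <<EOF > install-dir/manifests/99-feature-gate.yaml
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
EOF

openshift-install create cluster --dir=install-dir
```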

After Cluster Creation

Patch the "FeatureGate" cluster singleton resource and enable this feature set:

cat <<EOF | oc apply -f -
---
apiVersion: config.openshift.io/v1
kind: FeatureGate
metadata:
  annotations:
    include.release.openshift.io/self-managed-high-availability: "true"
    include.release.openshift.io/single-node-developer: "true"
    release.openshift.io/create-only: "true"
  name: cluster
spec:
  featureSet: TechPreviewNoUpgrade
EOF

After patching that resource, you can run the kubectl wait --for=condition=Available=True clusteroperators.config.openshift.io/platform-operators-aggregated command to verify a successful rollout.
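To confirm the change end to end, something like the following can be used (assumes a running cluster and the oc/kubectl binaries; the timeout value is illustrative):

```shell
# Check the configured feature set, then wait for the aggregated
# ClusterOperator to report Available.
oc get featuregate cluster -o jsonpath='{.spec.featureSet}'
kubectl wait --for=condition=Available=True \
  clusteroperators.config.openshift.io/platform-operators-aggregated \
  --timeout=5m
```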

Installing Platform Operators

Note: Only the redhat-operators catalog source can be used during phase 0; this is a short-term limitation. The small pool of testing candidates includes the cert-manager, service-binding-operator, and local-storage-operator packages.

Note: During phase 0, automatic upgrades are unsupported. When the generated BundleDeployment resource already exists, the reconciliation logic skips the sourcing logic to avoid performing unnecessary work.

An admin may wish to create a PO resource before or after cluster creation.

Before Cluster Creation

The process for creating a PO before cluster creation is similar to enabling the feature set before cluster creation: the admin configures a PO YAML manifest and includes it with the rest of the cluster's Kubernetes manifests.

After Cluster Creation

After cluster creation, an admin can create a PO resource. The following manifest attempts to install the cert-manager component as a platform operator:

apiVersion: platform.openshift.io/v1alpha1
kind: PlatformOperator
metadata:
  name: cert-manager
spec:
  package:
    name: openshift-cert-manager-operator

Once that resource has been created, the CPOM component will attempt to find the "openshift-cert-manager-operator" package in the redhat-operators catalog source, and then generate a rukpak BundleDeployment resource that references a registry+v1 bundle image from that package.
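To observe that sourcing flow on a live cluster, the PO status and the generated BundleDeployment (which shares the PO's name) can be inspected; a sketch using the cert-manager example above:

```shell
# Inspect the PlatformOperator's status conditions and the generated
# BundleDeployment resource. Names follow the example manifest above.
kubectl get platformoperator cert-manager -o yaml
kubectl get bundledeployment cert-manager -o yaml
```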

It's important to note that an "invalid" PlatformOperator resource can influence cluster lifecycle events, which is similar to how the ClusterOperator components behave. During phase 0, this means that an "invalid" PlatformOperator can block cluster rollout.

Deleting Platform Operators

In the case that an admin wants to delete a previously installed platform operator from the cluster, the following workflow could be used:

  • Run kubectl get platformoperators to determine the metadata.name of the platform operator they wish to delete from the cluster.
  • Run kubectl delete platformoperator <name of PO resource>.
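Using the cert-manager PO from earlier as an illustrative name, that workflow looks like:

```shell
# List installed POs, delete the target, and (optionally) confirm the
# generated BundleDeployment was garbage collected alongside it.
kubectl get platformoperators
kubectl delete platformoperator cert-manager
kubectl get bundledeployment cert-manager
```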

After running those commands, a cascading deletion of the PO resources will be performed. If the deleted PO had successfully sourced and installed a registry+v1 bundle, the expectation is that all of the bundle's resources (e.g. CRDs, operator deployments, etc.) will be deleted.

Note: This cascading deletion behavior can only account for the Kubernetes manifests defined in a registry+v1 bundle image. If a platform operator creates or manages resources that are defined outside of its bundle image, those resources won't be garbage collected and will require manual admin intervention to remove any remaining resources. This is a known limitation of the cert-manager component when testing on 4.12 clusters.

Manually Updating Platform Operators

As mentioned above, the phase 0 implementation doesn't support automatic upgrades. After a PO has been successfully installed on the cluster, the CPOM component only ensures that the generated BundleDeployment resource is still present on the cluster.

If the redhat-operators catalog source has been updated and new bundle content is available, either of the following manual steps can be performed:

  • Delete the underlying BundleDeployment resource. This may result in Kubernetes garbage collecting cluster-scoped resources, e.g. CRDs, which can impact workload health in unsupported configurations.
  • Update the underlying BundleDeployment resource and change the registry+v1 bundle container image being referenced.

In order to update the underlying BundleDeployment resource:

  • Identify the desired registry+v1 bundle container image. This can be done by inspecting the redhat-operators catalog source and using OLM's registry gRPC APIs to find bundles within the admin-configured package. The CPOM component will choose the highest semver value among all the bundles in that list.
  • Patch the generated BundleDeployment resource with that desired registry+v1 bundle image. The CPOM component generates a BundleDeployment resource with the same name as the PO resource that manages it.
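The "highest semver" selection in the first step can be illustrated locally with GNU sort's version ordering; the version values below are made up:

```shell
# Pick the highest semver from a list of candidate bundle versions,
# mirroring the selection behavior described above.
versions='1.0.1
1.2.0
1.10.0
1.9.3'
printf '%s\n' "$versions" | sort -V | tail -n1   # prints 1.10.0
```

Note that version sort correctly places 1.10.0 above 1.9.3, which a plain lexicographic sort would not.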

After performing those steps, you can watch the BundleDeployment resource status to ensure a successful pivot occurs. This can be verified by running the kubectl get bd -w command, and eventually seeing a populated "InstallationSucceeded" message in the "INSTALL STATE" custom column output.

Summary

The CPOM component is responsible for reconciling the PlatformOperators API. During phase 0, admins will configure a desired OLM-based operator package name that's present in the redhat-operators catalog source.

During reconciliation, this component is responsible for finding a registry+v1 bundle that satisfies the admin-configured package name. In the case of an invalid configuration, e.g. the configured package doesn't exist, that state will be bubbled up to the PlatformOperator resource the manager is reconciling. After "sourcing" a registry+v1 bundle, the reconciliation logic will generate a rukpak BundleDeployment resource under the hood, and delegate the management of that registry+v1 bundle's contents to the rukpak stack.

After generating the BundleDeployment resource that will manage the sourced registry+v1 bundle, the manager will be responsible for inspecting that resource's status sub-resource, and bubbling up any failure states to the PlatformOperator resource it's currently reconciling. Common examples include a registry+v1 bundle that rukpak's registry+v1 provisioner cannot successfully "unpack", or a bundle containing invalid manifests whose creation the Kubernetes API server rejects.

Testing

As mentioned above, this work will be delivered under tech preview guidelines. As a result, it may be reasonable to only block payload introduction on systematic issues that prevent a PO from being installed.

Reporting Bugs

Any bugs uncovered while testing can be filed against the OLM bugzilla component. Any issues with individual operators should be filed against their bug reporting system(s).

Suggested Testing Plan

  • An "invalid" PO can block cluster rollout. If a PlatformOperator references an OLM package via the spec.package.name field, and that package doesn't exist in the redhat-operators catalog source, then cluster rollout should be blocked.
  • A generated BundleDeployment resource that references a registry+v1 bundle image that doesn't support the AllNamespace install mode should result in an installation failure. The aggregate CO resource should be updated to reflect that failure condition.
  • When no POs are installed on the cluster, the aggregate CO should report an available state.
  • When there are multiple POs installed on the cluster, and one of those POs has failed to find a registry+v1 bundle image or fails to install successfully, then the aggregate CO should report an unavailable state.
  • The platform-operators-aggregated ClusterOperator should have a populated status.versions that matches the other ClusterOperator versions in the cluster.
  • Deleting an installed PlatformOperator should result in the underlying operator resources being deleted.
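A few of the plan items above can be spot-checked with commands along these lines (assumes a running tech-preview cluster):

```shell
# Check the aggregate CO's Available condition and reported versions,
# then list POs to correlate any failures with individual resources.
kubectl get clusteroperator platform-operators-aggregated \
  -o jsonpath='{.status.conditions[?(@.type=="Available")].status}'
kubectl get clusteroperator platform-operators-aggregated \
  -o jsonpath='{.status.versions}'
kubectl get platformoperators
```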

Optional Testing Checks

  • Installing an OLM operator that references the same package as an installed PO results in a failed installation.
  • Deleting an underlying platform operator resource (e.g. deployment) results in that resource being recreated.
  • Manually updating a PO results in the rollout of new platform operator contents.
  • POs are excluded from the hypershift cluster profile during phase 0.
  • Any PSA/webhook HA/service-ca-operator configuration issues.

Known Limitations

  • OLM's marketplace component is an optional cluster capability that can be disabled. This has implications for phase 0 as we only source from the redhat-operators catalog source that's managed by the marketplace component. Available workarounds include creating this catalog source yourself.
  • The rukpak provisioner implementations don't have the ability to inspect the health/state of the resources they manage. This has implications when bubbling up the generated BundleDeployment state to the PlatformOperator resource that owns it. If a registry+v1 bundle contains manifests that can be successfully applied to the cluster but will fail at runtime (e.g. a Deployment referencing a non-existent image), the result is a successful status being reflected on the individual PlatformOperator/BundleDeployment resources.
  • Admins configuring these PlatformOperator custom resources before cluster creation cannot easily determine the desired package name without leveraging an existing cluster, or relying on downstream documented examples. There's no validation logic that ensures an individually configured PlatformOperator resource will be able to successfully roll out to the cluster.
  • There's no logic that guarantees package name uniqueness.
  • There's no core ClusterOperator that reflects the status of the CPOM component.
  • There's no logic that accommodates the current cluster topology mode. All webhooks are deployed using the HA configuration, which has implications on SNO cluster environments.
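For the first limitation (marketplace disabled), an admin can create the redhat-operators catalog source manually; a sketch, where the index image tag should match the cluster version and is illustrative here:

```yaml
apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: redhat-operators
  namespace: openshift-marketplace
spec:
  sourceType: grpc
  displayName: Red Hat Operators
  image: registry.redhat.io/redhat/redhat-operator-index:v4.12
```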
Comment from @timflannagan:
Also have a hackmd for this: https://hackmd.io/w0SXzm3-QQiKvOAlH73ErQ. I'm not sure if that link is viewable, or how I can easily share this so I moved it to gists instead.
