Skip to content

Instantly share code, notes, and snippets.

@machadovilaca
Last active September 28, 2023 18:05
Show Gist options
  • Save machadovilaca/c28b79cb4725dedc84baa6153961fc84 to your computer and use it in GitHub Desktop.
Save machadovilaca/c28b79cb4725dedc84baa6153961fc84 to your computer and use it in GitHub Desktop.

TestOperatorDown

Meaning

This alert fires when no test-operator pod in the Running state has been detected for 10 minutes.

The test-operator is the first Operator to start in a cluster. Its primary responsibilities include the following:

The test-operator deployment has a default replica of 2 pods.

Impact

This alert indicates a failure at the level of the cluster. Critical cluster-wide management functionalities, such as certification rotation, upgrade, and reconciliation of controllers, might not be available.

The test-operator is directly responsible for the management of the resources in the cluster. Therefore, its temporary unavailability affects significantly the workloads.

Diagnosis

  1. Set the NAMESPACE environment variable:

    $ export NAMESPACE="$(kubectl get tests -A \
      -o custom-columns="":.metadata.namespace)"
  2. Check the status of the test-operator deployment:

    $ kubectl -n $NAMESPACE get deploy test-operator -o yaml
  3. Obtain the details of the test-operator deployment:

    $ kubectl -n $NAMESPACE describe deploy test-operator
  4. Check the status of the test-operator pods:

    $ kubectl get pods -n $NAMESPACE -l=app=test-operator
  5. Check for node issues, such as a NotReady state:

    $ kubectl get nodes

Mitigation

Based on the information obtained during the diagnosis procedure, try to find the root cause and resolve the issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment