@darkowlzz
Last active March 4, 2022 16:41

kstatus conditions

The kstatus document describes a few concepts related to status conditions and a few standard conditions that controllers can implement. This document provides more detail about status conditions, using examples to describe the mechanics of how the conditions work, what they mean and what we can expect from them.

Starting with an example of status conditions:

status:
  conditions:
    - type: DiskPressure
      status: "False"
      reason: KubeletHasNoDiskPressure
      message: kubelet has no disk pressure
      lastHeartbeatTime: "2019-11-17T14:18:26Z"
      lastTransitionTime: "2019-10-22T16:27:53Z"
    - type: Ready
      status: "True"
      reason: KubeletReady
      message: kubelet is posting ready status. AppArmor enabled
      lastHeartbeatTime: "2019-11-17T14:18:26Z"
      lastTransitionTime: "2019-10-22T16:27:53Z"

The above example status contains two conditions, of type DiskPressure and Ready. The status field of a condition indicates its state: True or False (Kubernetes also allows Unknown). reason is a short, single-word (CamelCase) description of the reason behind the condition. message is a human-readable phrase or sentence, which may contain details about the condition.

The status conditions are a way to communicate to others about the status of an object. An observer should be able to read the status of an object it's interested in and know the state of the object. The kstatus status and polling packages provide higher level APIs for a client to read object status. The status conditions are machine readable.
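
As a minimal sketch of how an observer might read such a status, the following Python snippet looks up a condition by type in a status that has already been parsed from YAML into a dictionary. The helper name find_condition is hypothetical and is not part of the kstatus packages.

```python
from typing import Optional

# Status from the example above, already parsed from YAML into Python data.
status = {
    "conditions": [
        {"type": "DiskPressure", "status": "False",
         "reason": "KubeletHasNoDiskPressure",
         "message": "kubelet has no disk pressure"},
        {"type": "Ready", "status": "True",
         "reason": "KubeletReady",
         "message": "kubelet is posting ready status. AppArmor enabled"},
    ]
}


def find_condition(status: dict, cond_type: str) -> Optional[dict]:
    """Return the condition with the given type, or None if it is absent."""
    for cond in status.get("conditions", []):
        if cond.get("type") == cond_type:
            return cond
    return None


ready = find_condition(status, "Ready")
print(ready["status"], ready["reason"])  # True KubeletReady
```

Returning None for a missing condition matters under the kstatus conventions, where the absence of a condition carries meaning of its own.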

Abnormal-true/Negative-polarity Condition

The conditions defined in the kstatus document and packages are designed to adhere to the "abnormal-true" polarity pattern.

In the above example conditions, Ready is an example of a positive-polarity/normal-true condition. DiskPressure is an example of a negative-polarity/abnormal-true condition: when its status is True, it indicates that something isn't right and the system is not in a normal state. Negative-polarity/abnormal-true conditions are present, with a value of True, whenever something unusual happens. The absence of such conditions means that everything is normal.

NOTE: The above example is not based on kstatus, hence the negative-polarity condition DiskPressure with value False is present in the status. As described above, it should not be present when it's false. Kubernetes native objects don't implement kstatus conventions at the moment.
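
The polarity rule can be sketched in Python as follows. This is an illustration only: the function name abnormal_conditions and the assumption that Ready is the only positive-polarity condition are made up for this example.

```python
# Assumption: Ready is the only positive-polarity (normal-true) condition;
# every other condition type is treated as abnormal-true.
POSITIVE_POLARITY = {"Ready"}


def abnormal_conditions(conditions):
    """Return the types of the conditions that signal an abnormal state."""
    abnormal = []
    for cond in conditions:
        is_true = cond["status"] == "True"
        if cond["type"] in POSITIVE_POLARITY:
            # Normal-true: anything other than True is abnormal.
            if not is_true:
                abnormal.append(cond["type"])
        elif is_true:
            # Abnormal-true: only a True value is abnormal.
            abnormal.append(cond["type"])
    return abnormal


node_conditions = [
    {"type": "DiskPressure", "status": "False"},
    {"type": "Ready", "status": "True"},
]
print(abnormal_conditions(node_conditions))  # []
```

A DiskPressure condition with value False is ignored here, which gives the same result as if it were absent.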

Object Generation

The kstatus document describes using generation and observedGeneration to indicate any discrepancies in the reconciliation of objects. The generation identifies the version of the configuration, or desired state, of a resource and is part of the object metadata. The Kubernetes API conventions describe it as:

generation: a sequence number representing a specific generation of the desired state. Set by the system and monotonically increasing, per-resource. May be compared, such as for RAW and WAW consistency.

The observedGeneration specifies the generation of the configuration or the desired state that has been fully reconciled with the actual state. It is not a built-in property of object status. It needs to be added in the status API of objects.

Example:

kind: HelmRepository
apiVersion: source.toolkit.fluxcd.io/v1beta1
metadata:
  ...
  generation: 1
  name: podinfo
  namespace: default
spec:
  ...
status:
  ...
  observedGeneration: 1

In the above example, metadata.generation is the HelmRepository object's generation and status.observedGeneration is the observedGeneration. Since they are equal here, the desired state matches the actual state of the Helm repository.
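
The comparison can be sketched in a few lines of Python, assuming the object has been parsed into a dictionary. The helper name generation_observed is hypothetical.

```python
def generation_observed(obj: dict) -> bool:
    """True when status.observedGeneration equals metadata.generation."""
    generation = obj["metadata"]["generation"]
    # status.observedGeneration may be missing before the first
    # reconciliation has completed; None never equals a generation.
    observed = obj.get("status", {}).get("observedGeneration")
    return observed == generation


helm_repo = {
    "kind": "HelmRepository",
    "metadata": {"generation": 1, "name": "podinfo", "namespace": "default"},
    "status": {"observedGeneration": 1},
}
print(generation_observed(helm_repo))  # True
```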

Ready, Reconciling and Stalled Conditions

The Ready condition shows if everything is normal or if there's some abnormality due to which an object isn't in the normal state.

As per the kstatus document:

Reconciling: The controller is currently working on reconciling the latest changes.

and:

Stalled: The controller has encountered an error during the reconcile process or it has made insufficient progress (timeout).

Ready is a positive-polarity condition.

Reconciling and Stalled are negative-polarity conditions.

Based on the above definitions, we can define some rules about how to use these three conditions.

  • Reconciling and Stalled conditions are mutually exclusive. They can't both be present at the same time.
  • In the presence of either the Reconciling or the Stalled condition, the Ready condition has a value of False, since both indicate an abnormality.
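
These two rules are easy to check mechanically. The following Python sketch returns a list of violations for a set of conditions; the function names and the violation messages are made up for this example.

```python
def is_true(conditions, cond_type):
    """True when a condition of the given type is present with status True."""
    return any(
        c["type"] == cond_type and c["status"] == "True" for c in conditions
    )


def rule_violations(conditions):
    """Check the two rules above and describe any that are broken."""
    violations = []
    reconciling = is_true(conditions, "Reconciling")
    stalled = is_true(conditions, "Stalled")
    ready = is_true(conditions, "Ready")
    if reconciling and stalled:
        violations.append("Reconciling and Stalled are both present")
    if ready and (reconciling or stalled):
        violations.append("Ready is True while Reconciling or Stalled is present")
    return violations


conditions = [
    {"type": "Reconciling", "status": "True"},
    {"type": "Ready", "status": "False"},
]
print(rule_violations(conditions))  # []
```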

Some general rules for the Reconciling condition:

  • The Reconciling condition can be set when a new generation of configuration is available. This can be detected by comparing the object generation with the status observedGeneration.
  • The Reconciling condition should be removed on a successful reconciliation, when the desired state matches the actual state.
  • The Reconciling condition can be set when a domain-specific drift is detected. This happens when an object depends on another object or system and that dependency changes, so the object has to be reconciled again to match the desired state.
  • The Reconciling condition should be set, and should persist across reconciliations, during long-running operations, which may consist of multiple retries.

Some general rules for the Stalled condition:

  • The Stalled condition can be set when the controller is sure about some misconfiguration that can't be resolved by retrying. Such situations require the desired configuration to be updated, which results in a new generation.
  • When the Stalled condition is set, it means that the provided object generation has been fully processed and that this resulted in a stalled state. The observedGeneration can be updated to be equal to the current generation.

Along with Ready, Reconciling and Stalled conditions, controllers may add extra conditions to provide extra information about the status of the objects.
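
The rules above can be combined into a rough Python sketch of how a controller might update the status after one reconciliation attempt. This is an assumption-laden illustration, not how any particular controller is implemented: the function names and the three outcome labels ("success", "retryable_failure", "terminal_failure") are hypothetical, and reason, message and lastTransitionTime are omitted for brevity.

```python
import copy


def set_condition(conditions, cond_type, status, generation):
    """Add a condition, replacing any existing condition of the same type."""
    kept = [c for c in conditions if c["type"] != cond_type]
    kept.append(
        {"type": cond_type, "status": status, "observedGeneration": generation}
    )
    return kept


def remove_condition(conditions, cond_type):
    """Drop the condition of the given type, if present."""
    return [c for c in conditions if c["type"] != cond_type]


def apply_outcome(status, generation, outcome):
    """Return a new status reflecting one reconciliation attempt."""
    new = copy.deepcopy(status)
    conds = new.get("conditions", [])
    is_new_generation = new.get("observedGeneration") != generation
    if outcome == "success":
        # Desired state matches actual state: clear abnormal conditions.
        conds = remove_condition(conds, "Reconciling")
        conds = remove_condition(conds, "Stalled")
        conds = set_condition(conds, "Ready", "True", generation)
        new["observedGeneration"] = generation
    elif outcome == "terminal_failure":
        # Retrying won't help: the generation is processed, but stalled.
        conds = remove_condition(conds, "Reconciling")
        conds = set_condition(conds, "Stalled", "True", generation)
        conds = set_condition(conds, "Ready", "False", generation)
        new["observedGeneration"] = generation
    elif outcome == "retryable_failure":
        # Keep retrying; mark Reconciling only for a new generation.
        conds = remove_condition(conds, "Stalled")
        if is_new_generation:
            conds = set_condition(conds, "Reconciling", "True", generation)
        conds = set_condition(conds, "Ready", "False", generation)
    else:
        raise ValueError(f"unknown outcome: {outcome}")
    new["conditions"] = conds
    return new


first = apply_outcome({}, 1, "retryable_failure")
print([c["type"] for c in first["conditions"]])  # ['Reconciling', 'Ready']
```

Note that a retryable failure leaves status.observedGeneration untouched, while both a success and a terminal failure set it to the current generation.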

Example Scenarios

Following are examples of a few scenarios with various status conditions. They are meant to show and describe the mechanics of status conditions and observedGeneration and how they behave with change in status. The examples are based on the flux helm repository reconciler.

NOTE: The scenarios are not strictly sequential. There may be cases where a scenario may appear to be based on the previous scenario.

Scenario 1

Failure on first reconciliation:

status:
  conditions:
  - lastTransitionTime: "2021-12-17T11:53:35Z"
    message: Reconciling new generation 1
    observedGeneration: 1
    reason: NewGeneration
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2021-12-17T11:53:36Z"
    message: 'Failed to get secret ''default/my-secret'': Secret "my-secret" not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "False"
    type: Ready
  - lastTransitionTime: "2021-12-17T11:53:36Z"
    message: 'Failed to get secret ''default/my-secret'': Secret "my-secret" not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "True"
    type: FetchFailed

The above shows status conditions resulting from a failure in the first reconciliation of the provided configuration. Since it's the first generation of configuration and it failed to reconcile successfully, there's no status.observedGeneration set. The individual status conditions have observedGeneration with the value of the generation that caused those conditions to be set, in this case 1.

The Reconciling condition specifies that the reason for reconciliation is a new generation of object configuration.

The FetchFailed condition specifies the reason for failure.

The Ready condition specifies the actual cause of the overall failure, for which it shows the same reason and message as in the FetchFailed condition.

In this scenario, the failure seems to be due to a missing secret reference. The secret may have been deleted accidentally, or it may not have been created yet. So, the controller can retry looking for the secret until it becomes available. The object continues to be Reconciling.

If the object configuration is updated to refer to another secret that actually exists, it would result in a successful reconciliation. But since the configuration is updated, it'd result in a new generation. Following is the status after a successful reconciliation with new configuration:

status:
  artifact:
    checksum: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    lastUpdateTime: "2021-12-17T11:55:22Z"
    path: helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
    revision: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    url: http://localhost:9090/helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
  conditions:
  - lastTransitionTime: "2021-12-17T11:55:22Z"
    message: Stored artifact for revision '83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111'
    observedGeneration: 2
    reason: Succeeded
    status: "True"
    type: Ready
  observedGeneration: 2
  url: http://localhost:9090/helmrepository/default/podinfo/index.yaml

This status shows a normal state, without any abnormal conditions. The abnormal conditions that existed before have been removed because they are no longer true; abnormal-true conditions are absent when things are normal. The Ready condition has an observedGeneration value of 2, which is also the value of status.observedGeneration. This means that the desired state matches the actual state.

Scenario 2

Failure introduced by an update in the configuration:

status:
  artifact:
    checksum: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    lastUpdateTime: "2021-12-17T11:55:22Z"
    path: helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
    revision: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    url: http://localhost:9090/helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
  conditions:
  - lastTransitionTime: "2021-12-17T11:56:14Z"
    message: Reconciling new generation 3
    observedGeneration: 3
    reason: NewGeneration
    status: "True"
    type: Reconciling
  - lastTransitionTime: "2021-12-17T11:56:14Z"
    message: 'Failed to get secret ''default/my-secret'': Secret "my-secret" not found'
    observedGeneration: 3
    reason: AuthenticationFailed
    status: "False"
    type: Ready
  - lastTransitionTime: "2021-12-17T11:56:14Z"
    message: 'Failed to get secret ''default/my-secret'': Secret "my-secret" not found'
    observedGeneration: 3
    reason: AuthenticationFailed
    status: "True"
    type: FetchFailed
  observedGeneration: 2
  url: http://localhost:9090/helmrepository/default/podinfo/index.yaml

The above shows status conditions similar to Scenario 1, but status.observedGeneration is set to 2 and the observedGeneration in each of the individual conditions is 3. This means that the object was in a normal state in generation 2 of the configuration, and that reconciling generation 3 resulted in an abnormal state. The reason for the failure is similar to the failure in Scenario 1: a reference to a secret that does not exist. The new generation may have updated the secret reference.
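
The relation between the two kinds of observedGeneration can be sketched in Python. The helper name pending_generation is hypothetical, and a missing status.observedGeneration is treated as 0, meaning that nothing has been fully reconciled yet; that default is an assumption of this sketch.

```python
def pending_generation(status):
    """Return the newest generation recorded in the conditions that is newer
    than status.observedGeneration, or None if there is no such generation."""
    reconciled = status.get("observedGeneration", 0)
    seen = [
        c.get("observedGeneration", 0) for c in status.get("conditions", [])
    ]
    newest = max(seen, default=0)
    return newest if newest > reconciled else None


# Trimmed-down version of the status shown above.
scenario_2 = {
    "conditions": [
        {"type": "Reconciling", "status": "True", "observedGeneration": 3},
        {"type": "Ready", "status": "False", "observedGeneration": 3},
        {"type": "FetchFailed", "status": "True", "observedGeneration": 3},
    ],
    "observedGeneration": 2,
}
print(pending_generation(scenario_2))  # 3
```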

Scenario 3

Initially, when everything is normal, status contains the happy conditions:

status:
  artifact:
    checksum: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    lastUpdateTime: "2021-12-17T11:57:10Z"
    path: helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
    revision: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    url: http://localhost:9090/helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
  conditions:
  - lastTransitionTime: "2021-12-17T11:57:10Z"
    message: Stored artifact for revision '83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111'
    observedGeneration: 1
    reason: Succeeded
    status: "True"
    type: Ready
  observedGeneration: 1
  url: http://localhost:9090/helmrepository/default/podinfo/index.yaml

In this example, the object depends on another object, a referenced secret, to perform the reconciliation. If the secret gets deleted without any change in this object's configuration, the next reconciliation of this object will result in a failure. The following shows an example of the status due to the failure:

status:
  artifact:
    checksum: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    lastUpdateTime: "2021-12-17T11:57:10Z"
    path: helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
    revision: 83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111
    url: http://localhost:9090/helmrepository/default/podinfo/index-83a3c595163a6ff0333e0154c790383b5be441b9db632cb36da11db1c4ece111.yaml
  conditions:
  - lastTransitionTime: "2021-12-17T11:58:10Z"
    message: 'Failed to get secret ''default/my-secret'': Secret "my-secret" not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "False"
    type: Ready
  - lastTransitionTime: "2021-12-17T11:58:10Z"
    message: 'Failed to get secret ''default/my-secret'': Secret "my-secret" not found'
    observedGeneration: 1
    reason: AuthenticationFailed
    status: "True"
    type: FetchFailed
  observedGeneration: 1
  url: http://localhost:9090/helmrepository/default/podinfo/index.yaml

The failure seems to be due to the same reason as in the previous scenarios. But in this case, the object wasn't updated, so the status.observedGeneration value and the observedGeneration value in the conditions are equal. The current generation of the configuration was reconciled successfully before, but a periodic reconciliation of the same configuration has now failed. This specific failure is not due to a domain-specific drift (for example, an update in the remote source that the controller observes, which would require rebuilding the artifact), so there's no Reconciling condition. Only the Ready and FetchFailed conditions are present.

If a domain-specific drift had been detected instead of the above failure, a Reconciling condition would have been added.

Scenario 4

When a value in the configuration is invalid, the object enters a stalled state. Usually, invalid values can be caught by object schema validation and validating webhooks, but for the sake of this example, the provided configuration has some bad values, resulting in the following status:

status:
  conditions:
  - lastTransitionTime: "2021-12-17T12:00:22Z"
    message: 'parse "https+$://stefanprodan.github.io/podinfo": first path segment
      in URL cannot contain colon'
    observedGeneration: 1
    reason: URLInvalid
    status: "True"
    type: Stalled
  - lastTransitionTime: "2021-12-17T12:00:22Z"
    message: 'Invalid Helm repository URL: parse "https+$://stefanprodan.github.io/podinfo":
      first path segment in URL cannot contain colon'
    observedGeneration: 1
    reason: URLInvalid
    status: "False"
    type: Ready
  - lastTransitionTime: "2021-12-17T12:00:22Z"
    message: 'Invalid Helm repository URL: parse "https+$://stefanprodan.github.io/podinfo":
      first path segment in URL cannot contain colon'
    observedGeneration: 1
    reason: URLInvalid
    status: "True"
    type: FetchFailed
  observedGeneration: 1

This shows that the object is in a stalled state, indicated by the Stalled condition.

FetchFailed shows the reason for the failure.

Stalled condition shows the actual reason for the failure, similar to the FetchFailed condition.

Ready condition shows the overall reason for failure based on the FetchFailed condition.

If we compare this with Scenario 1, where the first generation of configuration failed, this status has status.observedGeneration set to 1, which is also the observedGeneration of each of the conditions. This means that the controller has fully processed the given configuration generation and that this resulted in a stalled state. Retrying the same configuration will not resolve this failure. A new generation of configuration is needed to resolve it.
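
To tie the scenarios together, here is a rough Python sketch of how an observer might reduce a status to a single summary. The result names Current, InProgress and Failed are borrowed from kstatus, but the function summarize and the order of its checks are assumptions made for this example, not the library's actual algorithm.

```python
def summarize(generation, status):
    """Reduce generation, observedGeneration and conditions to one word."""
    conds = {c["type"]: c["status"] for c in status.get("conditions", [])}
    if conds.get("Stalled") == "True":
        return "Failed"  # retrying the same generation won't help
    if status.get("observedGeneration") != generation:
        return "InProgress"  # latest generation not fully reconciled yet
    if conds.get("Reconciling") == "True":
        return "InProgress"
    if conds.get("Ready") == "True":
        return "Current"
    return "InProgress"  # not ready, but the controller can still retry


# Scenario 1, first status: generation 1 failed, no status.observedGeneration.
scenario_1 = {"conditions": [
    {"type": "Reconciling", "status": "True"},
    {"type": "Ready", "status": "False"},
    {"type": "FetchFailed", "status": "True"},
]}
# Scenario 4: generation 1 was processed and resulted in a stalled state.
scenario_4 = {"observedGeneration": 1, "conditions": [
    {"type": "Stalled", "status": "True"},
    {"type": "Ready", "status": "False"},
    {"type": "FetchFailed", "status": "True"},
]}
print(summarize(1, scenario_1))  # InProgress
print(summarize(1, scenario_4))  # Failed
```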
