Home for notes/TODOs/etc. around getting downstream CI green by the end of the sprint.
- Setting up cluster test environment
- Individual test case failures
- "Catalog represents a store of bundles which OLM can use to install Operators image update"
- "Disabling copied CSVs when an operator is installed in AllNamespace mode should have Copied CSVs in all other namespaces"
- "Metrics are generated for OLM managed resources Metrics emitted by CatalogSources when A valid CatalogSource object is created when The CatalogSource object is deleted deletes the metrics for the CatalogSource"
- "Operator API when a subscription to a package exists should automatically adopt components"
- Launch a clusterbot cluster:
launch openshift/operator-framework-olm#264 gcp
- Wait until the clusterbot cluster has become available
- Download the kubeconfig
- [Optional] Check out the downstream sync PR locally:
gh pr checkout 264
- Override the local $KUBECONFIG environment variable to point to the clusterbot generated kubeconfig
- Run e2e suite locally and focus on a specific test 1
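Condensed into commands, the steps above look like this (the PR number and kubeconfig path are the examples from this session; cluster-touching commands are commented out):

```shell
# Launch a clusterbot cluster (via Slack): launch openshift/operator-framework-olm#264 gcp
# gh pr checkout 264    # optional: check out the downstream sync PR locally

# Point the local environment at the downloaded clusterbot kubeconfig.
export KUBECONFIG="$HOME/Downloads/cluster-bot-2022-03-21-181202.kubeconfig.txt"
echo "KUBECONFIG=$KUBECONFIG"

# Run the e2e suite focused on a single failing test:
# make e2e/olm TEST="Catalog represents a store of bundles which OLM can use to install Operators image update"
```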
For a list of individual CI failures, see the Tracking CI Failures spreadsheet.
"Catalog represents a store of bundles which OLM can use to install Operators image update"
- Reproducibility: Always
- Decision: Flake
- Update: Regression
- Summary: We're running into authentication issues when copying images to OCP's internal image registry.
$ make e2e/olm TEST="Catalog represents a store of bundles which OLM can use to install Operators image update"
make e2e WHAT=operator-lifecycle-manager E2E_INSTALL_NS=openshift-operator-lifecycle-manager E2E_TEST_NS=openshift-operators E2E_TIMEOUT=120m
make[1]: Entering directory '/home/tflannag/Documents/openshift/operator-framework-olm'
scripts/e2e.sh
~/Documents/openshift/operator-framework-olm/staging/operator-lifecycle-manager ~/Documents/openshift/operator-framework-olm
Populating nested vendor directory
all modules verified
Running operator-lifecycle-manager e2e tests
make[2]: Entering directory '/home/tflannag/Documents/openshift/operator-framework-olm/staging/operator-lifecycle-manager'
GO111MODULE=on GOFLAGS="-mod=vendor" go run github.com/onsi/ginkgo/ginkgo -focus 'Catalog represents a store of bundles which OLM can use to install Operators image update' -flakeAttempts 1 -nodes 1 -timeout 120m -v -randomizeSuites -race -trace -progress ./test/e2e -- -namespace=openshift-operators -olmNamespace=openshift-operator-lifecycle-manager -dummyImage=bitnami/nginx:latest -kubeconfig=/home/tflannag/Downloads/cluster-bot-2022-03-21-181202.kubeconfig.txt
Running Suite: End-to-end
=========================
Random Seed: 1647890444
Will run 1 of 179 specs
SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
Catalog represents a store of bundles which OLM can use to install Operators
image update
/home/tflannag/Documents/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/catalog_e2e_test.go:664
[BeforeEach] Catalog represents a store of bundles which OLM can use to install Operators
/home/tflannag/Documents/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/catalog_e2e_test.go:44
created the catsrc-e2e-qlfv7 testing namespace
[It] image update
/home/tflannag/Documents/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/catalog_e2e_test.go:664
Diving into the skopeo Pod that gets created locally (note: we do properly create this skopeo Pod in its own testing namespace):
$ k get pods
NAME     READY   STATUS   RESTARTS   AGE
skopeo   0/1     Error    0          2m19s
$ k logs skopeo --tail 15
Copying blob sha256:5ed0db9fc7bca9b1823320ef608fe0afafe02d3f447acdd9164cbe51ca85d862
time="2022-03-21T19:22:41Z" level=debug msg="Detected compression format gzip"
time="2022-03-21T19:22:41Z" level=debug msg="Using original blob without modification"
time="2022-03-21T19:22:41Z" level=debug msg="Checking /v2/openshift-operators/catsrc-update/blobs/sha256:5ed0db9fc7bca9b1823320ef608fe0afafe02d3f447acdd9164cbe51ca85d862"
time="2022-03-21T19:22:41Z" level=debug msg="HEAD https://image-registry.openshift-image-registry.svc:5000/v2/openshift-operators/catsrc-update/blobs/sha256:5ed0db9fc7bca9b1823320ef608fe0afafe02d3f447acdd9164cbe51ca85d862"
time="2022-03-21T19:22:41Z" level=debug msg="... not authorized"
time="2022-03-21T19:22:41Z" level=debug msg="... not authorized"
time="2022-03-21T19:22:42Z" level=debug msg="... not authorized"
Copying blob sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4
time="2022-03-21T19:22:42Z" level=debug msg="Detected compression format gzip"
time="2022-03-21T19:22:42Z" level=debug msg="Using original blob without modification"
time="2022-03-21T19:22:42Z" level=debug msg="Checking /v2/openshift-operators/catsrc-update/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
time="2022-03-21T19:22:42Z" level=debug msg="HEAD https://image-registry.openshift-image-registry.svc:5000/v2/openshift-operators/catsrc-update/blobs/sha256:a3ed95caeb02ffe68cdd9fd84406680ae93d633cb16422d00e8a7c22955b46d4"
time="2022-03-21T19:22:42Z" level=debug msg="... not authorized"
time="2022-03-21T19:22:42Z" level=fatal msg="Error writing blob: Error checking whether a blob sha256:df20fa9351a15782c64e6dddb2d4a6f50bf6d3688060a34c4014b0d9a752eb4c exists in image-registry.openshift-image-registry.svc:5000/openshift-operators/catsrc-update: unauthorized: authentication required"
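To triage the auth failure outside the Pod, the failing HEAD request can be replayed by hand; this is a sketch where the bearer token is a placeholder and the command is only echoed (it needs in-cluster network access to actually run):

```shell
TOKEN="<serviceaccount-bearer-token>"   # placeholder: token for the SA the test uses
BLOB="sha256:5ed0db9fc7bca9b1823320ef608fe0afafe02d3f447acdd9164cbe51ca85d862"
URL="https://image-registry.openshift-image-registry.svc:5000/v2/openshift-operators/catsrc-update/blobs/${BLOB}"

# Echo the command rather than run it; a 401 here reproduces the skopeo failure.
echo curl -sk -I -H "Authorization: Bearer ${TOKEN}" "$URL"
```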
Update: brought up another 4.11 cluster through clusterbot using Per's downstream sync PR as the release image, ran the e2e suite with the same content as that PR, and realized we're likely hitting fallout from refactoring the CatalogSource e2e tests so that each test generates its own namespace to cut down on test pollution:
skopeoArgs := skopeoCopyCmd(testImage, tag, catsrcImage, "new", registryAuth)
- err = createSkopeoPod(c, skopeoArgs, testNamespace)
+ err = createSkopeoPod(c, skopeoArgs, ns.GetName())
Expect(err).NotTo(HaveOccurred(), "error creating skopeo pod: %s", err)
// wait for skopeo pod to exit successfully
- awaitPod(GinkgoT(), c, testNamespace, skopeo, func(pod *corev1.Pod) bool {
+ awaitPod(GinkgoT(), c, ns.GetName(), skopeo, func(pod *corev1.Pod) bool {
return pod.Status.Phase == corev1.PodSucceeded
})
- err = deleteSkopeoPod(c, testNamespace)
+ err = deleteSkopeoPod(c, ns.GetName())
The problematic change was updating how we authenticated to OCP's internal image registry:
registryURL = openshiftregistryFQDN
- registryAuth, err = openshiftRegistryAuth(c, testNamespace)
+ registryAuth, err = openshiftRegistryAuth(c, ns.GetName())
Expect(err).NotTo(HaveOccurred(), "error getting openshift registry authentication: %s", err)
We updated the openshiftRegistryAuth(...) helper function, which is responsible for building up a bearer token for authenticating to OCP's internal image registry, but neglected to update the openshiftregistryFQDN global constant, which hardcodes the openshift-operators namespace:
- registryURL = openshiftregistryFQDN
+ registryURL = fmt.Sprintf("%s/%s", openshiftregistryFQDN, ns.GetName())
registryAuth, err = openshiftRegistryAuth(c, ns.GetName())
Expect(err).NotTo(HaveOccurred(), "error getting openshift registry authentication: %s", err)
}
@@ -1156,7 +1162,7 @@ var _ = Describe("Catalog represents a store of bundles which OLM can use to ins
})
const (
- openshiftregistryFQDN = "image-registry.openshift-image-registry.svc:5000/openshift-operators"
+ openshiftregistryFQDN = "image-registry.openshift-image-registry.svc:5000"
catsrcImage = "docker://quay.io/olmtest/catsrc-update-test:"
)
After making those changes, the skopeo Pod was able to authenticate to the internal image registry, and the test case was no longer (perma)failing.
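As a quick sanity check on the fix, the destination registry URL now carries the per-test namespace rather than the hardcoded openshift-operators one (the namespace name below is an example generated value):

```shell
openshiftregistryFQDN="image-registry.openshift-image-registry.svc:5000"
ns="catsrc-e2e-qlfv7"   # example generated test namespace

# Mirrors the Go fix: fmt.Sprintf("%s/%s", openshiftregistryFQDN, ns.GetName())
registryURL="${openshiftregistryFQDN}/${ns}"
echo "$registryURL"
```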
"Disabling copied CSVs when an operator is installed in AllNamespace mode should have Copied CSVs in all other namespaces"
- Reproducibility: Low
- Decision: (Potential) Flake
- Summary: Difficult to triage this further locally. Best guess is we're running in namespaces that take longer to terminate, and we're not filtering out terminating namespaces in our test assumptions.
Ran this test in isolation five straight times and didn't encounter an error. Hypothesis: poor testing inferences, or long waits to tear down unrelated testing resources that influence this test's outcome.
The following was the error message that popped up while running CI:
{ Failure /go/src/github.com/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/copied_csv_e2e_test.go:70
Timed out after 60.001s.
Expected success, but got an error:
<*errors.errorString | 0xc0009b7510>: {
s: "67 copied CSVs found, expected 70",
}
67 copied CSVs found, expected 70
/go/src/github.com/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/copied_csv_e2e_test.go:95}
Is it possible those three namespaces are testing-related namespaces, and there's an issue around tearing them down quickly within a 60s window? Should we add labels to those test generated namespaces, and filter them out when performing the copied CSV check?
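The proposed filter could look like the following sketch, run here against sample `k get ns` output instead of a live cluster: drop Terminating namespaces before computing the expected copied-CSV count.

```shell
# Sample listing; ns-abc12 is illustrative of a generated test namespace.
ns_list='NAME                  STATUS        AGE
default               Active        30m
ns-abc12              Terminating   45s
openshift-operators   Active        30m'

# Keep only Active namespaces so slow-terminating test namespaces
# don't skew the copied-CSV count assertion.
active=$(printf '%s\n' "$ns_list" | awk 'NR > 1 && $2 == "Active" { print $1 }')
printf '%s\n' "$active"
```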
"Metrics are generated for OLM managed resources Metrics emitted by CatalogSources when A valid CatalogSource object is created when The CatalogSource object is deleted deletes the metrics for the CatalogSource"
- Reproducibility: High
- Decision: (Potential) Flake
- Update: (Potential) Regression
- Summary: Difficult to triage this further locally. Best guess is we're not handling signals correctly in that container.
- Potential Solution: use the supported grpc-based CatalogSource type instead of the internal/ConfigMap-based type for the metrics-related tests?
Ran this locally a couple of times and was able to consistently reproduce the failure outlined in CI.
Note: we create this CatalogSource in the default operator namespace, so there is the possibility of test pollution in the future.
It looks like the Pod was stuck in terminating for longer than the test case was configured to wait for the Pod to be deleted:
$ k get pods
NAME READY STATUS RESTARTS AGE
metrics-catsrc-valid-mvgkd 0/1 Terminating 0 52s
Simply bumping the wait timeout here could silence the error, but is it possible we need to tweak the terminationGracePeriod further?
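For triaging the stuck Pod, the fields worth pulling are deletionTimestamp, finalizers, and terminationGracePeriodSeconds; the sketch below parses a sample YAML excerpt standing in for `k get pod metrics-catsrc-valid-mvgkd -o yaml` (values are illustrative):

```shell
pod_yaml='metadata:
  deletionTimestamp: "2022-03-21T21:20:30Z"
  finalizers: []
spec:
  terminationGracePeriodSeconds: 30'

# A Pod still Terminating well past deletionTimestamp + grace period, with no
# finalizers, points at the container ignoring SIGTERM.
grace=$(printf '%s\n' "$pod_yaml" | awk '/terminationGracePeriodSeconds:/ { print $2 }')
echo "terminationGracePeriodSeconds=${grace}"
```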
Potentially relevant logs from the catalog-operator container:
```console
$ k -n openshift-operator-lifecycle-manager logs catalog-operator-7c7d86658b-d7jth --tail 5
time="2022-03-21T21:20:31Z" level=info msg="removed client for deleted catalogsource" source="{metrics-catsrc-valid openshift-operators}"
E0321 21:20:31.517006 1 queueinformer_operator.go:290] sync {"update" "openshift-operators/metrics-catsrc-valid"} failed: failed to get catalog config map metrics-catsrc-valid-configmap: configmaps "metrics-catsrc-valid-configmap" not found
time="2022-03-21T21:21:32Z" level=info msg="removed client for deleted catalogsource" source="{metrics-catsrc-valid openshift-operators}"
time="2022-03-21T21:21:32Z" level=error msg="UpdateStatus - error while setting CatalogSource status" error="Operation cannot be fulfilled on catalogsources.operators.coreos.com \"metrics-catsrc-valid\": StorageError: invalid object, Code: 4, Key: /kubernetes.io/operators.coreos.com/catalogsources/openshift-operators/metrics-catsrc-valid, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: d71a2b75-c92e-47bb-ada2-fc7e3af8bdfc, UID in object meta: " id=pXMM6 source=metrics-catsrc-valid
E0321 21:21:32.043437 1 queueinformer_operator.go:290] sync {"update" "openshift-operators/metrics-catsrc-valid"} failed: Operation cannot be fulfilled on catalogsources.operators.coreos.com "metrics-catsrc-valid": StorageError: invalid object, Code: 4, Key: /kubernetes.io/operators.coreos.com/catalogsources/openshift-operators/metrics-catsrc-valid, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: d71a2b75-c92e-47bb-ada2-fc7e3af8bdfc, UID in object meta:
$ # check whether the Pod is "stuck" in terminating for longer than the 60s timeout
$ k get pods
NAME READY STATUS RESTARTS AGE
metrics-catsrc-valid-rq5mz 0/1 Terminating 0 2m4s
```
Issue with the configmap-registry-server container not handling signals correctly? Is this only an issue with clusters that are more bloated compared to KinD clusters?
Note: terminationGracePeriodSeconds is set to 30s. There are no finalizers set on this Pod. Unlikely there's any node pressure on a fresh clusterbot cluster.
Update: ran this same test on a 4.10 cluster using the release-4.10 branch and wasn't able to reproduce anymore. Do we have a regression from some of the registry commit(s) being pulled into the downstream sync PR?
"Operator API when a subscription to a package exists should automatically adopt components"
- Reproducibility: Always
- Decision: Bug/Regression
- Update: Regression
- Summary: I'd guess the adoption controller isn't placing the requisite label/annotation such that the Operator controller can correctly propagate this underlying kiali-operator resource to its status.
Note: we're definitely leaking resources between runs here. See the following output after running this test locally five times:
$ k get operators
NAME AGE
kiali.ns-9q7cc 19m
kiali.ns-b62z5 9m25s
kiali.ns-h9z47 5m54s
kiali.ns-w9xbt 14m
kiali.ns-wplg5 3m33s
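A hedged cleanup sketch for those leaked Operator resources: derive the delete commands from the listing rather than deleting blindly (sample rows taken from the output above; commands are printed, not executed).

```shell
operators='NAME             AGE
kiali.ns-9q7cc   19m
kiali.ns-b62z5   9m25s'

# Only target the test-generated kiali.ns-* Operators.
cleanup=$(printf '%s\n' "$operators" | awk 'NR > 1 && $1 ~ /^kiali\.ns-/ { print "kubectl delete operator " $1 }')
printf '%s\n' "$cleanup"
```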
Diving into this test case failure more locally:
$ k get operators kiali.ns-9q7cc -o yaml
apiVersion: operators.coreos.com/v1
kind: Operator
metadata:
creationTimestamp: "2022-03-21T21:40:08Z"
generation: 1
name: kiali.ns-9q7cc
resourceVersion: "86937"
uid: 5970ac25-d785-4c75-bf12-c9d69b904b40
spec: {}
status:
components:
labelSelector:
matchExpressions:
- key: operators.coreos.com/kiali.ns-9q7cc
operator: Exists
refs:
- apiVersion: apps/v1
conditions:
- lastTransitionTime: "2022-03-21T21:40:43Z"
lastUpdateTime: "2022-03-21T21:40:43Z"
message: Deployment has minimum availability.
reason: MinimumReplicasAvailable
status: "True"
type: Available
- lastTransitionTime: "2022-03-21T21:40:27Z"
lastUpdateTime: "2022-03-21T21:40:43Z"
message: ReplicaSet "kiali-operator-67659cc6b8" has successfully progressed.
reason: NewReplicaSetAvailable
status: "True"
type: Progressing
kind: Deployment
name: kiali-operator
namespace: ns-9q7cc
- apiVersion: apiextensions.k8s.io/v1
conditions:
- lastTransitionTime: "2022-03-21T21:40:26Z"
message: no conflicts found
reason: NoConflicts
status: "True"
type: NamesAccepted
- lastTransitionTime: "2022-03-21T21:40:26Z"
message: the initial names have been accepted
reason: InitialNamesAccepted
status: "True"
type: Established
kind: CustomResourceDefinition
name: monitoringdashboards.monitoring.kiali.io
- apiVersion: apiextensions.k8s.io/v1
conditions:
- lastTransitionTime: "2022-03-21T21:40:26Z"
message: no conflicts found
reason: NoConflicts
status: "True"
type: NamesAccepted
- lastTransitionTime: "2022-03-21T21:40:26Z"
message: the initial names have been accepted
reason: InitialNamesAccepted
status: "True"
type: Established
kind: CustomResourceDefinition
name: kialis.kiali.io
- apiVersion: operators.coreos.com/v1alpha1
conditions:
- lastTransitionTime: "2022-03-21T21:40:10Z"
message: all available catalogsources are healthy
reason: AllCatalogSourcesHealthy
status: "False"
type: CatalogSourcesUnhealthy
kind: Subscription
name: sub-9jrkx
namespace: ns-9q7cc
- apiVersion: operators.coreos.com/v1alpha1
conditions:
- lastTransitionTime: "2022-03-21T21:40:26Z"
lastUpdateTime: "2022-03-21T21:40:26Z"
status: "True"
type: Installed
kind: InstallPlan
name: install-kvmbv
namespace: ns-9q7cc
- apiVersion: operators.coreos.com/v1alpha1
conditions:
- lastTransitionTime: "2022-03-21T21:40:43Z"
lastUpdateTime: "2022-03-21T21:40:43Z"
message: install strategy completed with no errors
reason: InstallSucceeded
status: "True"
type: Succeeded
kind: ClusterServiceVersion
name: kiali-operator.v1.4.2
namespace: ns-9q7cc
- apiVersion: operators.coreos.com/v2
kind: OperatorCondition
name: kiali-operator.v1.4.2
namespace: ns-9q7cc
Note in the paste above that there's no kiali-operator ServiceAccount resource propagated to the kiali Operator's status:
$ k get sa kiali-operator -o yaml
apiVersion: v1
imagePullSecrets:
- name: kiali-operator-dockercfg-8hd78
kind: ServiceAccount
metadata:
creationTimestamp: "2022-03-21T21:40:26Z"
name: kiali-operator
namespace: ns-9q7cc
ownerReferences:
- apiVersion: operators.coreos.com/v1alpha1
blockOwnerDeletion: false
controller: false
kind: ClusterServiceVersion
name: kiali-operator.v1.4.2
uid: 40663f9c-040a-4da7-b39f-b092a086c6e8
resourceVersion: "86595"
uid: f9674065-1500-4239-a1eb-8b0f4963616c
secrets:
- name: kiali-operator-token-6g895
- name: kiali-operator-dockercfg-8hd78
It's difficult to debug which resource we're waiting for in that gomega matcher, so I made the following changes locally:
-func ReferenceComponents(refs []*corev1.ObjectReference) gomegatypes.GomegaMatcher {
+func ReferenceComponents(expectedRefs []*corev1.ObjectReference) gomegatypes.GomegaMatcher {
return &OperatorMatcher{
matches: func(operator *operatorsv1.Operator) (bool, error) {
- actual := map[corev1.ObjectReference]struct{}{}
+ actualRefs := map[corev1.ObjectReference]struct{}{}
for _, ref := range operator.Status.Components.Refs {
- actual[*ref.ObjectReference] = struct{}{}
+ actualRefs[*ref.ObjectReference] = struct{}{}
}
- for _, ref := range refs {
- if _, ok := actual[*ref]; !ok {
+ for _, ref := range expectedRefs {
+ if _, ok := actualRefs[*ref]; !ok {
+ ctx.Ctx().Logf("waiting for the %s expected reference to exist in the list of actual references", ref.String())
return false, nil
}
}
return true, nil
},
- name: fmt.Sprintf("ReferenceComponents(%v)", refs),
+ name: fmt.Sprintf("ReferenceComponents(%v)", expectedRefs),
}
}
Which resulted in the following output:
$ make e2e/olm TEST="Operator API when a subscription to a package exists should automatically adopt components"
...
[It] should automatically adopt components
/home/tflannag/Documents/openshift/operator-framework-olm/staging/operator-lifecycle-manager/test/e2e/operator_test.go:336
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
waiting for the &ObjectReference{Kind:ServiceAccount,Namespace:ns-h9z47,Name:kiali-operator,UID:,APIVersion:v1,ResourceVersion:,FieldPath:,} expected reference to exist in the list of actual references
And querying for that kiali-operator ServiceAccount resource results in the following output:
$ k get sa kiali-operator -o yaml
apiVersion: v1
imagePullSecrets:
- name: kiali-operator-dockercfg-98ptm
kind: ServiceAccount
metadata:
creationTimestamp: "2022-03-21T21:55:46Z"
name: kiali-operator
namespace: ns-wplg5
ownerReferences:
- apiVersion: operators.coreos.com/v1alpha1
blockOwnerDeletion: false
controller: false
kind: ClusterServiceVersion
name: kiali-operator.v1.4.2
uid: 133b7bd5-c5a7-4768-ae0f-53c483e3d477
resourceVersion: "94861"
uid: f2c4ca1c-951f-4d0a-b12f-5ec2d6db2096
secrets:
- name: kiali-operator-dockercfg-98ptm
- name: kiali-operator-token-bqzj5
And that seems to confirm the assumption that there's a potential controller issue.
Potential next step: fire off a 4.10 clusterbot cluster and see whether this same test passes consistently. If so, dive into the c-r v0.11.x changes and see whether we need to continue maintaining a custom fork.
Update: ran this on a 4.10 clusterbot cluster and this looks like a 4.11 regression. The immediate next step is to work out whether this is simply an issue with the partial-metadata implementation that Joe had tried to fix upstream in the v0.11.x c-r release.
We're still leaking resources between CI runs but the individual test case at least passes:
$ k get operators kiali.ns-6h2mj -o yaml
apiVersion: operators.coreos.com/v1
kind: Operator
metadata:
creationTimestamp: "2022-03-21T23:23:42Z"
generation: 1
name: kiali.ns-6h2mj
resourceVersion: "37005"
uid: 3d6b140c-c212-439f-87fd-94686566f8e6
spec: {}
status:
components:
labelSelector:
matchExpressions:
- key: operators.coreos.com/kiali.ns-6h2mj
operator: Exists
refs:
- apiVersion: apiextensions.k8s.io/v1
conditions:
- lastTransitionTime: "2022-03-21T23:24:21Z"
message: no conflicts found
reason: NoConflicts
status: "True"
type: NamesAccepted
- lastTransitionTime: "2022-03-21T23:24:21Z"
message: the initial names have been accepted
reason: InitialNamesAccepted
status: "True"
type: Established
kind: CustomResourceDefinition
name: kialis.kiali.io
- apiVersion: apiextensions.k8s.io/v1
conditions:
- lastTransitionTime: "2022-03-21T23:24:21Z"
message: no conflicts found
reason: NoConflicts
status: "True"
type: NamesAccepted
- lastTransitionTime: "2022-03-21T23:24:21Z"
message: the initial names have been accepted
reason: InitialNamesAccepted
status: "True"
type: Established
kind: CustomResourceDefinition
name: monitoringdashboards.monitoring.kiali.io
Footnotes
- make e2e/olm TEST="Catalog represents a store of bundles which OLM can use to install Operators image update" ↩
@timflannagan maybe don't, and see if it turns up green downstream. Then we can, e.g., run it 20 times locally and untoggle the flake tag if it's all good.