liggitt/deflaking.md

## deflaking.md

      
    Raw
  

              deflaking.md
            
          
    Video

This content was presented at a sig-testing meeting on 8/25/2020, available as a video here
Links / References


http://git.k8s.io/community/contributors/devel/sig-testing/flaky-tests.md
https://github.com/kubernetes/kubernetes/labels/kind%2Fflake
http://storage.googleapis.com/k8s-metrics/flakes-latest.json
https://testgrid.k8s.io
http://go.k8s.io/triage
https://godoc.org/golang.org/x/tools/cmd/stress
https://golang.org/doc/articles/race_detector.html

Deflaking Mindset

A flake means there's a problem in one or more of these places:

in the thing being tested
in the test
in the thing running the test (infrastructure)

Temptation: assume flakes are always infrastructure (#3)
We've worked really hard to improve CI infrastructure consistency so this is less often the issue
Temptation: assume flakes are always a "test-only issue"
Always work to understand why a flake is happening before compensating in the test
Examples:

before polling/sleeping in a test, make sure the thing you're testing is supposed to be happening asynchronously
before lengthening a timeout in a test, understand what is taking up the time

sometimes the longer time is expected and we actually need to tolerate longer delays

kubernetes/kubernetes#88053


sometimes the longer time is not expected and we need to fix a bug

kubernetes/kubernetes#88297 (comment)
kubernetes/kubernetes#88382


watch out for changing a test in such a way that it is longer testing the same thing

initial change in kubernetes/kubernetes#93622 (comment) fixed the test flake, but did not fix a real underlying issue
final change in kubernetes/kubernetes#93622 fixed a real bug


Finding things to fix


Reported flakes

https://github.com/kubernetes/kubernetes/labels/kind%2Fflake
http://storage.googleapis.com/k8s-metrics/flakes-latest.json


Tests you see flake in your PRs

before you /retest, see if there's an issue for the failed test


Tests that flake in your SIGs jobs/packages

https://testgrid.k8s.io
http://go.k8s.io/triage


Good flake reports

Template: https://github.com/kubernetes/kubernetes/issues/new?labels=kind%2Fflake&template=flaking-test.md
Example: kubernetes/kubernetes#93358

is this flaking in multiple jobs (can search the failing test or message via http://go.k8s.io/triage)
are there multiple tests in the same package/suite failing with the same apparent error?
link to testgrid history for the jobs, filtered to the relevant tests
include the failed test output (makes issue searchable)
link to the triage query
link to specific failures
tag the relevant sig, if known

Running unit tests to reproduce flakes

Example: kubernetes/kubernetes#93905
Simple attempt:
go test ./pkg/kubelet/config -run TestInvalidPodFiltered

Run bypassing go test cache:
go test ./pkg/kubelet/config -count=1 -run TestInvalidPodFiltered

Run with race detection on:
go test ./pkg/kubelet/config -race -count=1 -run TestInvalidPodFiltered

Build a standalone test binary (with race detection enabled) and stress test with https://godoc.org/golang.org/x/tools/cmd/stress:
go install golang.org/x/tools/cmd/stress@latest

go test ./pkg/kubelet/config -race -c

stress ./config.test -test.run TestInvalidPodFiltered

411 runs so far, 0 failures
/var/folders/7f/9xt_73f12xlby0w362rgk0s400kjgb/T/go-stress-20200825T115041-341977266
--- FAIL: TestInvalidPodFiltered (0.00s)
    config_test.go:126: Expected no update in channel, Got types.PodUpdate{Pods:[]*v1.Pod{(*v1.Pod)(0xc00059e400)}, Op:1, Source:"test"}
FAIL
ERROR: exit status 1
815 runs so far, 1 failures

Running integration tests to reproduce flakes

Similar to unit tests, though most integration tests expect a started etcd instance
In another tab:
etcd
[WARNING] Deprecated '--logger=capnslog' flag is set; use '--logger=zap' flag instead
2020-08-25 11:56:46.975384 I | etcdmain: etcd Version: 3.4.9
...

Example failure/flake: kubernetes/kubernetes#93496 (comment)
go test -c -race ./test/integration/endpointslice
stress ./endpointslice.test -test.run TestEndpointSliceMirroring
...
ls -lat $TMPDIR/go-stress*
cat $TMPDIR/go-stress...

Example package deadlock/timeout: kubernetes/kubernetes#93890

Run each test individually, figure out what the average runtime is
Stress each test individually, bounding the timeout to 100x the average run time (avoids having to wait for 2 minutes for a timeout)
Isolate the particular test that is deadlocking
Add debug output to figure out what is causing the deadlock

Deflaking e2e tests

When running an e2e test, "the thing being tested" is actually the whole system, which is good and bad.
Good:

we do need to make sure our system actually works when you run the whole thing

Bad:

An e2e test of volume subpath behavior can flake because "something" deleted the test namespace (kubernetes/kubernetes#94221)

Takeaways:

Skew toward more unit and integration tests where possible
Don't assume the title of the failing e2e test identifies where the issue is

Gather information

Example: https://prow.k8s.io/view/gcs/kubernetes-jenkins/pr-logs/directory/pull-kubernetes-e2e-gce/1296558932902285312
Artifacts:

build-log.txt
e2e-171671cb3f-674b9-master

kube-apiserver-audit.log (and rotated files)
kube-apiserver.log
kube-controller-manager.log
kube-scheduler.log
...


e2e-171671cb3f-674b9-minion-group-drkr, e2e-171671cb3f-674b9-minion-group-lr2z, e2e-171671cb3f-674b9-minion-group-qkkz

kubelet.log
docker.log
kube-proxy.log
...


Filter/Correlate information

Example: kubernetes/kubernetes#87668 (comment)

Pick relevant components (e.g. test log, API server, controller manager, kubelet)
Filter logs to relevant info, e.g.:

things that happened around the time of the failure
things that happened to relevant objects/namespaces


Prefix log lines with component
Sort by time

Example: kubernetes/kubernetes#94159 (comment)

Log lines pointed at something in runc
Specific lines correlated to 1.0.0-rc10
Searching runc issues revealed a relevant issue with retrying cgroup writes
kubernetes/kubernetes#94159 (comment)

Add debugging output if needed

If you don't have visibility to timing or specific code branches/interactions, adding debug logging is fine:
Example: kubernetes/kubernetes#88297 (comment)
Things to look for

Asynchronous operations in the thing under test

does the test assume something that runs in a goroutine or via a watch happens synchronously?
to verify the issue, simulate slow operations to reproduce issues

time.Sleep(time.Second) at the top of a goroutine
time.Sleep(time.Second) at the beginning of a watch event handler
time.Sleep(time.Second) at the end of a watch event handler
time.Sleep(time.Second) at the beginning of a sync loop worker
time.Sleep(time.Second) at the end of a sync loop worker


Races between the test and the thing under test

does the test conflict with some asynchronous process?
to verify the issue, simulate the test losing the race by putting a time.Sleep(time.Second) between test steps
Example: kubernetes/kubernetes#93496 (comment)

Tests that assume a "fast" operation

A thing that takes < 1 second locally could take a few seconds in CI environments

CI environments are generally more resource-constrained
CI environments often run multiple tests in parallel


Unless your test is specifically testing performance/timing, don't set tight timing tolerances

wait.ForeverTestTimeout is a reasonable stand-in for "things that should not take very long"... use it instead of polling for 1-10 seconds


Incorrectly assuming deterministic output:

map iteration in go is non-deterministic

kubernetes/kubernetes#94185


random allocation conflicting with static allocation in the same test

kubernetes/kubernetes#93906


if you are using a fake client with a watcher, it can relist/rewatch at any point

look for specific actions in the fake client rather than asserting exact content of the full set


Once you find the problem

Always ask whether the fix belongs in the test or in the thing being tested
Example: http://issue.k8s.io/93922
A test flaked because the explicit health port conflicted with an auto-allocated node port.
We could fix it by explicitly setting both in the test, or by fixing the auto-allocation to avoid explicitly specified ports.