# Perf Bot Sheriffing

The perf bot sheriff is responsible for keeping the bots on the chromium.perf waterfall up and running, and triaging performance test failures and flakes.

## Key Responsibilities

### Keeping the chromium.perf waterfall green

The primary responsibility of the perf bot sheriff is to keep the chromium.perf waterfall green.

#### Understanding the Waterfall State

Everyone can view the chromium.perf waterfall at https://build.chromium.org/p/chromium.perf/, but Googlers should use https://uberchromegw.corp.google.com/i/chromium.perf/ instead. To make the performance tests as realistic as possible, the chromium.perf waterfall runs release official builds of Chrome, and the logs from release official builds may leak information from our partners that we do not have permission to share outside of Google. So the logs are available to Googlers only.

Note that there are four different views:

  1. Console view makes it easier to see a summary.
  2. Waterfall view shows more details, including recent changes.
  3. Sheriff-o-Matic attempts to group and summarize test failures with bug links. But it is much more useful if all sheriffs use it.
  4. Firefighter shows traces of recent builds. It takes URL parameter arguments (see the sketch after this list):
    • master can be chromium.perf or tryserver.chromium.perf
    • builder can be a builder or tester name, like "Android Nexus5 Perf (2)"
    • start_time is seconds since the epoch.
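
As a rough illustration of how those parameters fit together, here is a small Python sketch that assembles a Firefighter URL. The FIREFIGHTER_BASE_URL placeholder and the firefighter_url helper are made up for the example; only the three parameter names above come from the tool.

```python
import time

# Placeholder host: substitute the real Firefighter endpoint here.
FIREFIGHTER_BASE_URL = 'https://<firefighter-host>/'

def firefighter_url(master, builder, hours_back=6):
  # start_time is seconds since the epoch; here, a few hours back.
  start_time = int(time.time()) - hours_back * 3600
  # Minimal query assembly; only spaces in the builder name are escaped.
  query = 'master=%s&builder=%s&start_time=%d' % (
      master, builder.replace(' ', '%20'), start_time)
  return FIREFIGHTER_BASE_URL + '?' + query

print(firefighter_url('chromium.perf', 'Android Nexus5 Perf (2)'))
```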

You can see a list of all previously filed bugs using the Performance-Waterfall label in crbug.

Please also check the perf status doc and keep it up to date throughout your shift with known issues and ongoing problems.

#### Handling Test Failures

You want to keep the waterfall green! So any bot that is red or purple needs to be investigated. When a test fails:

  1. File a bug using this template. You'll want to be sure to include:
    • Link to buildbot status page of failing build.
    • Copy and paste of relevant failure snippet from the stdio.
    • CC the test owner from go/perf-owners.
    • The revision range the test occurred on.
    • A list of all platforms the test fails on.
  2. Disable the failing test if it is failing more than one out of five runs (see below for instructions on telemetry and other types of tests). Make sure your disable CL includes a BUG= line with the bug from step 1 and that the test owner is cc-ed on the bug.
  3. After the disable CL lands, you can downgrade the priority to Pri-2 and ensure that the bug title reflects something like "Fix and re-enable testname".
  4. Investigate the failure. Some tips for investigating:
    • Debugging telemetry failures
    • If you suspect a specific CL in the range, you can revert it locally and run the test on the perf trybots.
    • You can run a return code bisect to narrow down the culprit CL:
      1. Open up the graph in the perf dashboard on one of the failing platforms.
      2. Hover over a data point and click the "Bisect" button on the tooltip.
      3. Enter the Bug ID from step 1, set Good Revision to the last commit position that produced data, set Bad Revision to the most recent commit position, and set Bisect mode to return_code.

##### Disabling Telemetry Tests

If the test is a telemetry test, its name will have a '.' in it, such as thread_times.key_mobile_sites or page_cycler.top_10. The part before the first dot will be a Python file in tools/perf/benchmarks.

If a telemetry test is failing and there is no clear culprit to revert immediately, disable the test. You can do this with the @benchmark.Disabled decorator. Always add a comment next to your decorator with the bug ID that has background on why the test was disabled, and also include a BUG= line in the CL.

Please disable the narrowest set of bots possible; for example, if the benchmark only fails on Windows Vista you can use @benchmark.Disabled('vista'). Supported Disabled arguments include (a sketch of a disable CL follows the list):

  • win
  • mac
  • chromeos
  • linux
  • android
  • vista
  • win7
  • win8
  • yosemite
  • elcapitan
  • all (please use as a last resort)
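
In practice a disable CL is just the decorator plus a comment above the benchmark class in tools/perf/benchmarks. A minimal sketch, with a made-up bug number and an illustrative class and base class rather than the real ones:

```python
# tools/perf/benchmarks/thread_times.py (illustrative excerpt)
from telemetry import benchmark


# Disabled because of flaky timeouts on Vista; see crbug.com/123456
# (hypothetical bug number). Remember the matching BUG= line in the CL.
@benchmark.Disabled('vista')
class ThreadTimesKeyMobileSites(benchmark.Benchmark):
  """Illustrative benchmark class standing in for a real one."""
```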

If the test fails consistently in a very narrow set of circumstances, you may consider implementing a ShouldDisable method on the benchmark instead. Here is an example of disabling a benchmark which OOMs on svelte.
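
The ShouldDisable hook is a classmethod that receives the browser the benchmark is about to run against and returns True to skip the run. A minimal sketch with a hypothetical benchmark and a deliberately broad condition; check the real example linked above for how the actual svelte check is done:

```python
from telemetry import benchmark


class MemoryHeavyBenchmark(benchmark.Benchmark):
  """Hypothetical benchmark that OOMs on low-memory (svelte) devices."""

  @classmethod
  def ShouldDisable(cls, possible_browser):
    # Called before the benchmark runs; returning True skips this run.
    # Illustrative condition only: skip every Android run. A real CL
    # would narrow this to the configuration that actually OOMs.
    return possible_browser.platform.GetOSName() == 'android'
```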

Disabling CLs can be TBR-ed to anyone in tools/perf/OWNERS, but please do not submit with NOTRY=true.

##### Disabling Other Tests

Non-telemetry tests are configured in chromium.perf.json. You can TBR any of the per-file OWNERS, but please do not submit with NOTRY=true.

#### Handling Device and Bot Failures

##### Purple bots

When a bot goes purple, it's usually because of an infrastructure failure outside of the tests. But you should first check the logs of a purple bot to try to better understand the problem; sometimes a telemetry test failure can turn the bot purple, for example. If the bot is purple due to a test failure, file a bug following the instructions for handling test failures.

If the bot goes purple and you believe it's an infrastructure issue, file a bug with this template, which will automatically add the bug to the trooper queue. Be sure to note which step is failing, and paste any relevant info from the logs into the bug.

##### Android Device failures

There are two types of device failures:

  1. A device is blacklisted in the device_status_check step. You can look at the buildbot status page to see how many devices were listed as online during this step. You should always see 7 devices online. If you see fewer than 7 devices online, there is a problem in the lab.
  2. A device is passing device_status_check but still in poor health. The symptom of this is that all the tests are failing on it. You can see that on the buildbot status page by looking at the Device Affinity. If all tests with the same device affinity number are failing, it's probably a device failure.

For both types of failures, please file a bug with this template which will add an issue to the infra labs queue.

#### Follow up on failures and keep state

Pri-0 bugs should have an owner or contact on the speed infra team and be worked on as top priority. Pri-0 generally implies an entire waterfall is down.

Pri-1 bugs should be pinged daily and checked to make sure someone is following up. Pri-1 bugs are for a red test (not yet disabled), a purple bot, or a failing device.

Pri-2 bugs are for disabled tests. These should be pinged weekly, and work toward fixing them should be ongoing when the sheriff is not working on a Pri-1 issue.

If you need help triaging, here are the common labels you should use:

  • Performance-Waterfall should go on all bugs you file about the bots; it's the label we use to track all these issues.
  • Infra-Troopers adds the bug to the trooper queue. This is for high priority issues, like a build breakage. Please add a comment explaining what you want the trooper to do.
  • Infra-Labs adds the bug to the labs queue. If there is a hardware problem, like an android device not responding or a bot that likely needs a restart, please use this label. Make sure you set the OS- label correctly as well, and add a comment explaining what you want the labs team to do.
  • Infra label is appropriate for bugs that are not high priority, but we need infra team's help to triage. For example, the buildbot status page UI is weird or we are getting some infra-related log spam. The infra team works to triage these bugs within 24 hours, so you should ping if you do not get a response.
  • Cr-Tests-Telemetry for telemetry failures.
  • Cr-Tests-AutoBisect for bisect and perf try job failures.

If you still need help, ask the speed infra chat, or escalate to sullivan@.

### Triaging data stoppage alerts

Data stoppage alerts are listed on the perf dashboard alerts page. Whenever the dashboard is monitoring a metric, and that metric stops sending data, an alert is fired. Some of these alerts are expected:

  • When a telemetry benchmark is disabled, we get a data stoppage alert. Check the code for the benchmark to see if it has been disabled, and if so associate the alert with the bug for the disable.
  • When a bot has been turned down. These should be announced to perf-sheriffs@chromium.org, but if you can't find the bot on the waterfall and you didn't see the announcement, double check in the speed infra chat. Ideally these will be associated with the bug for the bot turndown, but it's okay to mark them invalid if you can't find the bug.

If there doesn't seem to be a valid reason for the alert, file a bug on it using the perf dashboard, and cc the owner. Then do some diagnosis:

  • Look at the perf dashboard graph to see the last revision we got data for, and note that in the bug. Click on the buildbot stdio link in the tooltip to find the buildbot status page for the last good build, and increment the build number to get the first build with no data, and note that in the bug as well. Check for any changes to the test in the revision range.
  • Go to the buildbot status page of the bot which should be running the test. Is it running the test? If not, note that in the bug.
  • If it is running the test and the test is failing, diagnose as a test failure.
  • If it is running the test and the test is passing, check the json.output link on the buildbot status page for the test. This is the data the test sent to the perf dashboard. Are there null values? Sometimes it lists a reason as well. Please put your findings in the bug (a small scanning sketch follows this list).
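
If you would rather not eyeball a large json.output blob, a generic scan for nulls is enough for this check. The sketch below assumes nothing about the chart JSON layout; it just walks whatever structure it finds and prints the path to every null value.

```python
import json
import sys

def find_nulls(node, path=''):
  # Walk dicts and lists generically and report the path to each null.
  if node is None:
    print(path or '<root>')
  elif isinstance(node, dict):
    for key, value in node.items():
      find_nulls(value, '%s/%s' % (path, key))
  elif isinstance(node, list):
    for index, value in enumerate(node):
      find_nulls(value, '%s[%d]' % (path, index))

if __name__ == '__main__':
  # Usage: python find_nulls.py downloaded_json_output.json
  with open(sys.argv[1]) as f:
    find_nulls(json.load(f))
```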