It's late and I'm tired, but I was lying in bed thinking about test scheduling (yeah, yeah), and had a moment of clarity. This is my attempt to write my thoughts down before falling asleep and losing them all.
We've been trying to think of ways to integrate code coverage into our scheduling algorithms. The problem is that we don't have a reliable way of mapping test paths to chunks. Chunks vary by platform, and even within the same platform tests can move between chunks from one push to another. My realization is that to date we have been thinking of code coverage (or machine learning) reduction as something that happens during the optimization phase: here is the full set of tasks, now let's filter them down with ccov data.
But instead we should be thinking of this reduction as something that happens
during task generation itself. Inside the transforms, not an optimization at
all. The key change is that a set of paths (e.g. determined by files modified,
or selected via try) needs to become an input to task generation, probably by
requiring it in the parameters. Then we can ensure task generation is
reproducible for the given set of paths, though passing in an entirely new set
of paths would result in a different taskgraph. If no paths are specified,
that is the same as saying all paths are specified.
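Here's a rough sketch of the "no paths means all paths" rule. The `paths` parameter key and both helper names are made up for illustration, and matching by directory prefix is just one plausible way to interpret a path input.

```python
from typing import List, Optional


def path_selected(test_path: str, paths: Optional[List[str]]) -> bool:
    """Return True if test_path falls under any of the input paths.

    A missing or empty `paths` is treated the same as "all paths".
    """
    if not paths:
        return True
    for p in paths:
        prefix = p.rstrip("/")
        # Match the exact path, or anything underneath it as a directory.
        if test_path == prefix or test_path.startswith(prefix + "/"):
            return True
    return False


def filter_manifests(manifests: List[str], parameters: dict) -> List[str]:
    # "paths" is a hypothetical parameter key, not an existing one.
    paths = parameters.get("paths")
    return [m for m in manifests if path_selected(m, paths)]
```

Because the result depends only on the manifests and the parameters, generating twice with the same `paths` gives the same selection, which is the reproducibility property described above.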
This means that chunks could become variable at task generation time. For
example we could have a transform that takes the files defined in
parameters.yml as input, downloads a ccov artifact and figures out which test
manifests need to run. Then based on the number of manifests, it would compute
the number of chunks it needs to schedule (e.g. by using the in-tree runtime
metadata). On some pushes we might have ~12 mochitest chunks, on others we might
only have 1.
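To make the variable chunk count concrete, here is a minimal sketch. The shape of the coverage map, the 30 minute target per chunk, and the fallback runtime are all assumptions of mine; a real transform would read these from the ccov artifact and the in-tree runtime metadata.

```python
import math
from typing import Dict, List


def manifests_for_files(modified_files: List[str],
                        coverage_map: Dict[str, List[str]]) -> List[str]:
    """Map modified source files to the manifests that cover them.

    coverage_map is assumed to look like {source_file: [manifest, ...]}.
    """
    selected = set()
    for f in modified_files:
        selected.update(coverage_map.get(f, []))
    return sorted(selected)


def chunk_count(manifests: List[str],
                runtimes: Dict[str, float],
                target_seconds: float = 1800,
                default_runtime: float = 60) -> int:
    """Compute how many chunks are needed to stay near target_seconds each."""
    if not manifests:
        return 0  # nothing to run, so nothing to schedule
    # Manifests without runtime data fall back to an assumed default.
    total = sum(runtimes.get(m, default_runtime) for m in manifests)
    return max(1, math.ceil(total / target_seconds))
```

With numbers like these, a push touching one small file might need a single chunk, while a push touching widely covered code needs many.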
This idea isn't terribly novel; we've had brainstorming sessions that go down this line of reasoning before. The problem is we always come up with a bunch of issues that make it start to fall apart. But thanks to a perfect storm of people working on various relevant things, I think I have viable solutions for most of them. Notice I say "viable" and not "easy".
Problem #1 - How to find tests?
It's already hard enough to figure out which chunk a test runs in (both for developers and sheriffs). With this proposal not only does each chunk run a completely unrelated set of tests from one push to the next, but the number of chunks itself isn't even predictable!
Solution: Luckily Armen has already started looking into ways to solve this. The problem to date has been where do we get the data from? Now that the decision task is aware of which manifests are in which chunks, it can upload a single artifact that contains all this information (across all platforms). Armen could download this and put it into the treeherder database (or even just some cache). The end result being that when developers type a path into the treeherder search bar, the data in this artifact is used to only show the tasks that ran a test containing that path.
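As a sketch of what that artifact and lookup might look like: the JSON layout (task label mapped to a list of manifests) and the function names are my own assumptions, not an existing Treeherder or decision task format.

```python
import json
from typing import Dict, List


def build_manifest_artifact(tasks: Dict[str, List[str]]) -> str:
    """Serialize {task_label: [manifest, ...]} for upload by the decision task."""
    return json.dumps(tasks, indent=2, sort_keys=True)


def tasks_for_path(artifact_json: str, path: str) -> List[str]:
    """Return labels of tasks that ran a manifest containing `path`."""
    data = json.loads(artifact_json)
    return sorted(
        label
        for label, manifests in data.items()
        if any(path in manifest for manifest in manifests)
    )
```

The search side only needs the artifact, so it works no matter how the chunks were laid out on that particular push.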
Problem #2 - How to bisect tasks?
Another major problem is how sheriffs will be able to bisect failures. Since there are a different number of tasks from one push to the next, they can't simply trigger "mochitest-2" everywhere; it would run different things.
Solution: Bob has already been poking around this problem space. But we should
be able to change the backfill action task to use the path inputs from the
push that has the failure on all the previous pushes (rather than those pushes'
modified paths). To avoid confusion we should put backfills into their own
groupSymbol, or even change the treeherder symbol entirely.
If the failure is tied to a specific test, we could even go a step further and
only specify the test paths that failed in parameters.yml and then schedule
all chunks that result from that taskgraph generation in the same "family"
(e.g. windows debug mochitests). In other words, the backfill tasks only run the
manifests that contained failures. This would be a major shift in how we think
about our CI, but means instead of backfilling tasks, we backfill paths (or
from the user's point of view, test failures).
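A tiny sketch of the backfill idea, assuming each push's parameters are available as a dict. The `paths` key and the function name are hypothetical, and this ignores how the action task would actually be wired up.

```python
from typing import Dict, List, Optional


def backfill_parameters(failing_push: Dict,
                        previous_pushes: List[Dict],
                        failed_paths: Optional[List[str]] = None) -> List[Dict]:
    """Build parameters for backfilling earlier pushes.

    Each earlier push keeps its own parameters, except that its paths are
    replaced by the failing push's paths (or by just the failed test paths).
    """
    if failed_paths:
        paths = list(failed_paths)
    else:
        paths = list(failing_push.get("paths", []))

    backfills = []
    for params in previous_pushes:
        new_params = dict(params)  # copy so the originals are not modified
        new_params["paths"] = list(paths)
        backfills.append(new_params)
    return backfills
```

Every backfilled push then generates the same set of manifests, so results are comparable across pushes even though the original chunks were not.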
Problem #3 - What about test interactions?
Since the number of chunks varies and tests get shuffled all over the place, we could run into problems with intermittents caused by interactions between tests.
Solution: We use manifests as the lowest unit. We also require
enabled in each harness that we do this plan with. This does mean we'll need to
continue supporting "classic" chunking for harnesses that don't have this
Problem #4 - How do we select tasks on try?
Kind of similar to #2, but now that there is no "definite" mochitest-3, what
shows up in e.g. the mach try fuzzy interface? How do I say which tasks to run?
Solution: Since mach try generates a taskgraph locally, we just need to pass
in whatever paths the user specifies and it'll just work. If no paths are
specified (as in the common case), we run the full set of manifests and chunks
should mirror what gets scheduled on central.
The shifting chunks will definitely get confusing though. So we could modify
./mach try fuzzy to hide chunks by default. E.g. we can collapse each chunked
group of tests down to a single task label. Then expand out the selected labels
afterwards. We could still provide some kind of --show-chunks flag for power
users that know what they're doing. Same concept goes for ./mach try chooser.
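The collapse and expand steps could look something like this. It naively assumes a chunked label ends in `-<number>`, which is my simplification and would need more care for labels that end in a number for other reasons; the function names are made up.

```python
import re
from collections import defaultdict
from typing import Dict, List

# Assumption: a chunked label looks like "<base>-<chunk number>".
CHUNK_RE = re.compile(r"^(?P<base>.+)-(?P<chunk>\d+)$")


def collapse_chunks(labels: List[str]) -> Dict[str, List[str]]:
    """Group chunked labels under a single label with the chunk suffix removed."""
    groups = defaultdict(list)
    for label in labels:
        match = CHUNK_RE.match(label)
        base = match.group("base") if match else label
        groups[base].append(label)
    return dict(groups)


def expand_selection(selected: List[str],
                     groups: Dict[str, List[str]]) -> List[str]:
    """Turn the collapsed labels a user picked back into real task labels."""
    expanded = []
    for base in selected:
        expanded.extend(groups.get(base, []))
    return expanded
```

The selector would show the keys of `collapse_chunks(...)`, and whatever the user picks gets passed through `expand_selection` before scheduling.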
Problem #5 - How do we compute chunks quickly?
The decision task will now be responsible for distributing manifests across chunks for every platform. This could be slow if we aren't careful (something we should avoid since the decision task is on the critical path of everything).
Solution: In-tree runtimes! Problem: But wait, those are huge / incomplete / difficult to update. Solution: We can modify the ActiveData query to compute the runtime for entire manifests at a time. This means we won't need to worry about disk space, and could even store manifests across all suites in a single file. The decision task could load this data once and cache it for the remainder of task generation. We can continue to use the same algorithm (currently in manifestparser) for chunking by runtime.
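For a feel of how cheap this can be with per-manifest runtimes, here is a sketch of a common greedy heuristic: longest manifest first, always into the currently lightest chunk. I'm not claiming this is exactly what manifestparser does; it's a stand-in to show the shape of the computation.

```python
import heapq
from typing import Dict, List


def chunk_by_runtime(runtimes: Dict[str, float],
                     total_chunks: int) -> List[List[str]]:
    """Distribute manifests across chunks so total runtimes stay balanced."""
    # Heap of (accumulated runtime, chunk index); the lightest chunk pops first.
    heap = [(0.0, i) for i in range(total_chunks)]
    heapq.heapify(heap)
    chunks: List[List[str]] = [[] for _ in range(total_chunks)]

    # Place the longest manifests first; ties are broken by name so the
    # result is deterministic for a given input.
    ordered = sorted(runtimes.items(), key=lambda kv: (-kv[1], kv[0]))
    for manifest, runtime in ordered:
        load, idx = heapq.heappop(heap)
        chunks[idx].append(manifest)
        heapq.heappush(heap, (load + runtime, idx))
    return chunks
```

This is O(n log n) in the number of manifests, so with one runtime entry per manifest (rather than per test) the cost per platform should be small.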
Problem #6 - Won't this make treeherder confusing?
It will take a big shift for developers and sheriffs to grasp this new world order. It will be very confusing to see M(1) on two pushes that don't even remotely run the same things.
Solution: Treeherder could hide chunks by default. Instead it could display a stand-in symbol that is "finished" once every chunk in that suite/platform is finished. Power users could tick "Display Chunks" or similar if they need a deeper view. That and outreach/education.
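The stand-in symbol boils down to aggregating chunk states. A minimal sketch, where the state names are placeholders of mine and not Treeherder's actual result values:

```python
from typing import List


def suite_status(chunk_states: List[str]) -> str:
    """Collapse the states of every chunk in a suite/platform into one state.

    Assumed chunk states: "pending", "running", "success", "failed".
    """
    if not chunk_states:
        return "unscheduled"
    # The stand-in is only finished once every chunk is finished.
    if any(state in ("pending", "running") for state in chunk_states):
        return "running"
    if any(state == "failed" for state in chunk_states):
        return "failed"
    return "success"
```

A real UI would probably want to surface a failure before the last chunk completes, but this follows the "finished once every chunk is finished" rule as written.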