ahal/scheduling-braindump.md

## scheduling-braindump.md

      
    Raw
  

              scheduling-braindump.md
            
          
    It's late and I'm tired, but I was lying in bed thinking about test scheduling
(yeah, yeah), and had a moment of clarity. This is my attempt to write my
thoughts down before falling asleep and losing them all.
We've been trying to think of ways to integrate code coverage into our
scheduling algorithms somehow. The problem is that we don't have a reliable way
of mapping test paths to chunks. Chunks vary by platform and even within the
same platform tests can move around between chunks from one push to another. My
realization is that to date we have been thinking of code coverage (or machine learning) reduction as
something that happens during the optimization phase. Here are the full set of
tasks, now let's filter them down with ccov data.
But instead we should be thinking of this reduction as something that happens
during task generation itself. Inside the transforms, not an optimization at
all. The key change is that a set of paths (e.g determined by files modified,
or selected via try) needs to become an input to task generation. Probably by
requiring it in the parameters. Then we can ensure task generation is
reproducible for the given set of paths, though passing in an entirely new set
of paths would result in a different taskgraph. If no paths are specified,
that is the same as saying all paths are specified.
This means that chunks could become variable at task generation time. For
example we could have a transform that takes the files defined in
parameters.yml as input, downloads a ccov artifact and figures out which test
manifests need to run. Then based on the number of manifests, it would compute
the number of chunks it needs to schedule (e.g by using the in-tree runtime
metadata). On some pushes we might have ~12 mochitest chunks, on others we might
only have 1.
This idea isn't terribly novel, we've had brainstorm sessions that go down this
line of reasoning before. The problem is we always come up with a bunch of
things that start to fall apart. But thanks to a perfect storm of people working
on various relevant things, I think I have viable solutions for most of them.
Notice I say "viable" and not "easy".
Problem #1 - How to find tests?

It's already hard enough to figure out which chunk a test runs in (both for
developers and sheriffs). With this proposal not only does each chunk run a
completely unrelated set of tests from one push to the next, but the number of
chunks itself isn't even predictable!
Solution: Luckily Armen has already started looking into ways to solve this. The
problem to date has been where do we get the data from? Now that the decision
task is aware of which manifests are in which chunks, it can upload a single
artifact that contains all this information (across all platforms). Armen could
download this and put it into the treeherder database (or even just some cache).
The end result being that when developers type a path into the treeherder search
bar, the data in this artifact is used to only show the tasks that ran a test
containing that path.
Problem #2 - How to bisect tasks?

Another major problem is how will sheriffs be able to bisect failures? Since
there are a different number of tasks from one push to the next, they can't
simply trigger "mochitest-2" everywhere.. it would run different things.
Solution: Bob has already been poking around this problem space. But we should
be able to change the backfill action task to use the path inputs from the
push that has the failure on all the previous pushes (rather than those pushes'
modified paths).  To avoid confusion we should put backfills into their own
groupSymbol, or even change the treeherder symbol entirely.
If the failure is tied to a specific test, we could even go a step further and
only specify the test paths that failed in parameters.yml and then schedule
all chunks that result from that taskgraph generation in the same "family" (e.g
windows debug mochitests). In other words, the backfill tasks only run the
manifests that contained failures. This would be a major shift in how we think
about our CI, but means instead of backfilling tasks, we backfill paths (or
from the users point of view, test failures).
Problem #3 - What about test interactions?

Since variable chunks are run and tests are shuffled all over the place to the
extreme, we could run into problems with intermittents caused by interactions
between tests.
Solution: We use manifests as the lowest unit. We also require run-by-manifest
enabled in each harness that we do this plan with. This does mean we'll need to
continue supporting "classic" chunking for harnesses that don't have this
support.
Problem #4 - How do we select tasks on try?

Kind of similar to #2, but now that there is no "definite"
mochitest-3, what shows up in e.g the mach try fuzzy interface? How do I say
which tasks to run?
Solution: Since mach try generates a taskgraph locally, we just need to pass
in whatever paths the user specifies and it'll just work. If no paths are
specified (as in the common case), we run the full set of manifests and chunks
should mirror what gets scheduled on central.
The shifting chunks will definitely get confusing though. So we could modify
./mach try fuzzy to hide chunks by default. E.g we can collapse each chunked
group of tests down to a single task label. Then expand out the selected labels
afterwards. We could still provide some kind of --show-chunks flag for power
users that know what they're doing. Same concept goes for ./mach try chooser.
Problem #5 - How do we compute chunks quickly?

The decision task will now be responsible for chunking manifests across chunks
for every platform. This could be slow if we aren't careful (something we should
avoid since the decision task is on the critical path of everything).
Solution: In-tree runtimes!
Problem: But wait, those are huge / incomplete / difficult to update.
Solution: We can modify the ActiveData query to compute the runtime for entire
manifests at a time. This means we won't need to worry about disk space, and
could even store manifests across all suites in a single file. The decision task
could load this data once and cache it for the remainder of task generation. We
can continue to use the same algorithm (currently in manifestparser) for
chunking by runtime.
Problem #6 - Won't this make treeherder confusing?

It will take a big shift for developers and sheriffs to grasp this new world
order. It will be very confusing to see M(1) on two pushes that don't even
remotely run the same things.
Solution: Treeherder could hide chunks by default. Instead it could display a
stand-in symbol that is "finished" once every chunk in that suite/platform is
finished. Power users could tick "Display Chunks" or similar if they need a
deeper view. That and outreach/education.