Manishearth/gist:f2971973e164be03890a

## gistfile1.md

      
Here's a design for a smarter build system that I've been toying with. It tweaks the current build system so that the problem of PRs piling up can be solved by adding more build machines. That isn't possible right now, since PRs need to be tested sequentially for a 100% guarantee that they won't break anything.

It introduces three priority levels:


- p=rollup: Treat as usual. Roll these up into a big ball every few days.

- p != rollup:

  1. Let's say we have X extra build machines. Take X such PRs. First check whether they can all be merged together without conflicts; if not, choose a different set of X PRs.
  2. Test these PRs in parallel on the extra build machines, each applied on top of master. This can be done while another build is running, provided these PRs apply cleanly on top of both master and the current build. (If not, pick a different set of PRs.)

     A slightly improved way of doing this: instead of testing A, B, C, D on tryservers, test A, A+B, A+B+C, A+B+C+D. This makes it easier to identify automatically which PRs conflict, and to move ahead without human intervention.
  3. Of the PRs that pass, merge them all into a rollup and push it to auto.
  4. Merge the rollup if it passes. This should be likely, since we've already determined that there are no individual failures, and cross-PR failures are less likely.
  5. If it fails, we have multiple options:
     - If the queue has other PRs, shift focus onto them and notify the authors of the failed PRs. Lower the priority of the PRs that were just tested and wait for them to be fixed.
     - If the queue does not have other PRs, do one of the following:
       - Exclude half the PRs from the mergeball (binary-search-like) and push to auto. There's a good chance it will pass. This is more complicated, but ought to be more efficient. (Not sure, but some experimentation should tell.)
       - Just test the PRs sequentially and individually. This is the worst-case scenario, with speed similar to what we have today.

- p=individual: Use for LLVM updates and other changes which are likely to break things. These will be tested individually, with high priority.
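
The batch selection and the A, A+B, A+B+C, ... stacking from the p != rollup steps could be sketched roughly as below. This is only an illustration in Python, not how any existing bot does it: `merges_cleanly` is a hypothetical callback standing in for whatever conflict check would actually be run, the greedy selection is just one assumed way of "choosing a different set", and all function names are made up for this sketch.

```python
from typing import Callable, List, Optional, Sequence, Tuple


def pick_batch(
    queue: Sequence[str],
    x: int,
    merges_cleanly: Callable[[List[str]], bool],
) -> List[str]:
    """Greedily pick up to x PRs from the queue that merge together cleanly.

    merges_cleanly is a hypothetical conflict check supplied by the caller.
    """
    batch: List[str] = []
    for pr in queue:
        if len(batch) == x:
            break
        # Keep the PR only if it still merges with everything chosen so far.
        if merges_cleanly(batch + [pr]):
            batch.append(pr)
    return batch


def prefix_stacks(prs: Sequence[str]) -> List[Tuple[str, ...]]:
    """Return the cumulative stacks A, A+B, A+B+C, ... (one per build machine)."""
    return [tuple(prs[: i + 1]) for i in range(len(prs))]


def first_culprit(prs: Sequence[str], results: Sequence[bool]) -> Optional[str]:
    """Given one pass/fail result per stack, return the PR whose addition
    first made a stack fail, or None if every stack passed."""
    for pr, passed in zip(prs, results):
        if not passed:
            return pr
    return None
```

With four PRs and results of pass, pass, fail, fail for the four stacks, `first_culprit` points at the third PR, which is the sort of answer the bot could act on without a human looking at the logs.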


The main cause of rollup failures (one of the PRs failing on its own) is remedied by this design, so we can indiscriminately make large rollups of medium-sized PRs without having to deal with it. Doing the individual builds in parallel makes it possible to automate this, whereas in the current system the rollup creator needs to figure out which PR caused the failure and why.
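
For the remaining case, where a rollup fails even though every PR passed on its own, the "exclude half the PRs" fallback could look something like this. It is a rough sketch under assumptions: `build_passes` is a hypothetical stand-in for pushing a set of PRs to auto and waiting for the result, and the exact halving strategy is my guess at what "binary-search-like" would mean in practice.

```python
from typing import Callable, List, Sequence, Tuple


def merge_by_halving(
    prs: Sequence[str],
    build_passes: Callable[[List[str]], bool],
) -> Tuple[List[str], List[str]]:
    """Try to land prs, halving any batch that fails.

    build_passes is a hypothetical callback: it represents one full build
    of the given PRs on auto. Returns (merged, deferred).
    """
    merged: List[str] = []
    deferred: List[str] = []
    pending: List[List[str]] = [list(prs)]
    while pending:
        batch = pending.pop(0)
        # Each attempt is built on top of everything merged so far.
        candidate = merged + batch
        if build_passes(candidate):
            merged = candidate
        elif len(batch) == 1:
            # A single PR that breaks the build gets sent back to its author.
            deferred.extend(batch)
        else:
            mid = len(batch) // 2
            # Retry the two halves, first half first.
            pending[:0] = [batch[:mid], batch[mid:]]
    return merged, deferred
```

In the good case this costs a single build. In the bad case it costs more builds than going one by one, which is why some experimentation would be needed to tell whether it is worth it.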

This is similar to Firefox's system, though it's more automated. In Firefox, the following is done:

- All patches except tiny ones require a "try push" to the tryserver.
- Only if the try push passes will the sheriff take the patch.
- ~10 patches are merged into fx-team or mozilla-inbound ("auto") at one time. If the rollup passes, they are merged to mozilla-central ("master"). This merge isn't immediate; auto is forwarded to master after a couple of cycles IIRC.

Firefox's system is different from ours since it is built to tolerate failures on master: there are multiple auto branches (mozilla-inbound, fx-team), and the sheriffs fix breakages via backouts if two patches end up breaking each other's tests. We don't need the flexibility of being able to break master (and are far better off without it), and we don't have Firefox's volume, so we can avoid needing sheriffs and still implement this system.