I touched on Merge Queues in an earlier post: blog.aspect.build/monorepo-shared-green. Today I'll expand on the requirements that lead to considering them, and the alternative design we recommend instead.
When main is red, it causes problems:
- Can teams release their application if tests are failing?
- Developers who rebase into the red region wonder if they broke something.
- Someone (let's call them the "Build Cop") needs to take action to repair it. There is often no clear ownership, leading to the Slack thread "hey is anyone looking at the red CI?"
- New pull requests are also red, and if merged can cause a "compound" breakage, making repair harder to reason about.
How do we prevent these problems? What policy decisions should we make, and how do we avoid causing new problems?
Reasons why main goes red
In a naive repository, the main branch goes red when someone merges broken code. Most teams now have policies about what may be merged, and these usually require a passing CI status. This prevents the majority of broken changes from being merged, but not all:
1. Two changes which were green independently may be broken when combined, even though the git merge operation doesn't cause a merge conflict. This "in-flight collision" is usually rare, but it can be more common when many engineers are working on closely adjacent code.
2. "Stale Green" statuses from PRs which aren't rebased "close" to main are more likely to cause an in-flight collision. This can happen for two reasons: 2a) the PR was originally tested with an old "base" commit, so the status was stale when it was reported, or 2b) there was a delay between the status being reported and the PR being merged, e.g. after vacation a developer merges a week-old PR which was green at the time.
3. Tests are non-hermetic. They depend on external factors such as what OS version the CI machines have installed, or some hosted service being available, or a package manager serving some files. Whatever commit happens to execute on CI after one of these preconditions breaks is now red, through no fault of the changes in the commit.
4. If the CI is unreliable, the policy may allow a "break the glass" to merge anyway. Developers may reason "my change is surely safe, this status was a flake" but such reasoning is often flawed.
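To make the first reason concrete, here is a toy sketch of an in-flight collision. Everything in it is made up for illustration: the file names, the `area` function, and the `is_green` check that stands in for CI. One change renames a function, the other adds a new caller of the old name in a different file, so a file-level merge sees no conflict.

```python
# Toy repository: file name -> source. All names here are hypothetical.
BASE = {
    "util.py": "def area(w, h):\n    return w * h\n",
    "app.py": "def run():\n    return area(2, 3)\n",
}

# Change A renames area -> rect_area and updates the only caller it knows about.
CHANGE_A = {
    "util.py": "def rect_area(w, h):\n    return w * h\n",
    "app.py": "def run():\n    return rect_area(2, 3)\n",
}

# Change B adds a new file that still calls the old name.
CHANGE_B = {
    "report.py": "def report():\n    return area(4, 5)\n",
}


def apply(base, *changes):
    """File-level merge: only conflicts when two changes touch the same file."""
    merged = dict(base)
    touched = set()
    for change in changes:
        overlap = touched.intersection(change)
        if overlap:
            raise ValueError(f"merge conflict in {sorted(overlap)}")
        merged.update(change)
        touched.update(change)
    return merged


def is_green(tree):
    """Stand-in for CI: load every file into one namespace, then call the entry points."""
    namespace = {}
    try:
        for source in tree.values():
            exec(source, namespace)
        for entry_point in ("run", "report"):
            if entry_point in namespace:
                namespace[entry_point]()
        return True
    except NameError:
        return False


print(is_green(apply(BASE, CHANGE_A)))            # True: A alone is green
print(is_green(apply(BASE, CHANGE_B)))            # True: B alone is green
print(is_green(apply(BASE, CHANGE_A, CHANGE_B)))  # False: merged cleanly, but red
```

Neither author did anything wrong against the base they tested, which is why this kind of breakage tends to surprise people.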
I'll refer back to these reasons as we look at solutions.
Solution? Merge queues
A merge queue is an extra step where the CI system is triggered at merge-time, on the presumed result of the merge. This can prevent #1 and #2 above, since the PR is first rebased on "current" HEAD and then tested on top of other changes. It's called a "queue" because the simplest implementation requires each PR to wait until the prior ones are merged, so that it may be rebased and tested in isolation. Some systems fudge on this by batching up PRs, introducing a ton of complexity around determining which member of the batch caused a red status, then replaying them individually, or "smart batching" by re-ordering the queue first based on some heuristic of which changes are least likely to collide.
I think this is a bad solution for a few reasons.
- Developers now have to wait for their changes to merge. They have to "stay at their desk" in case some action is required of them, and their teammates can't pull HEAD to build on top of their work until the merge queue runs.
- It increases costs. The CI system already runs tests for each PR snapshot and then again after merge. Adding a third trigger takes a single-snapshot PR from two runs to three, so it increases load on CI by up to 50%, and in many orgs this is a substantial increase on their cloud compute bill.
- It doesn't address reasons #3 and #4, so you still need a second solution anyway.
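The "up to 50%" figure in the cost argument is simple arithmetic, sketched below. It assumes every CI run costs about the same, which is my simplification; the function and parameter names are only for illustration.

```python
def ci_load_increase(pr_snapshots: int, post_merge_runs: int = 1, queue_runs: int = 1) -> float:
    """Fractional increase in CI runs per PR when a merge-queue run is added.

    Assumes every run costs roughly the same (a simplification).
    """
    before = pr_snapshots + post_merge_runs
    after = before + queue_runs
    return (after - before) / before


print(ci_load_increase(pr_snapshots=1))  # 0.5: two runs become three
print(ci_load_increase(pr_snapshots=3))  # 0.25: four runs become five
```

The worst case is the PR that is pushed once and merged; PRs with many pushed snapshots dilute the extra run.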
Better Solution: On-call plus policy
Here we approach the problem with two tactics:
- other ways to avoid main going red
- reduce the impact of red main
Avoiding red main
Improve reliability: To deal with reason #4, we don't want developers to have a "break glass". But we shouldn't just take away something they have a legitimate need for. Instead we should "post-mortem" every time the CI gave a developer the wrong status - that means mitigating the effects of flaky tests, or refactoring tests to rely less on unreliable external services they expect to connect to.
Freshness policy: A green PR status has an expiration date. If the status is more than a couple of days old, or the PR is more than a hundred commits behind HEAD, or whatever cutoff you choose, then that status no longer satisfies the policy requirement of "green status required to merge". This policy can also be fancier, using some knowledge of the dependency graph or "we think these are dangerous folders" to adjust the policy requirements to merge.
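A minimal sketch of the freshness check might look like this. The `GreenStatus` type, the function name, and the two cutoffs are placeholders I picked for the example; a real policy would pull these values from your CI and code-review systems.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class GreenStatus:
    """A green CI result for a PR (hypothetical shape, for illustration)."""
    reported_at: datetime
    commits_behind_head: int


# Example cutoffs only; choose whatever fits your repository.
MAX_AGE = timedelta(days=2)
MAX_COMMITS_BEHIND = 100


def satisfies_merge_policy(
    status: GreenStatus,
    now: datetime,
    max_age: timedelta = MAX_AGE,
    max_commits_behind: int = MAX_COMMITS_BEHIND,
) -> bool:
    """A green status only counts while it is both recent and close to HEAD."""
    age = now - status.reported_at
    return age <= max_age and status.commits_behind_head <= max_commits_behind


now = datetime(2024, 1, 10, 12, 0)
fresh = GreenStatus(reported_at=datetime(2024, 1, 9, 12, 0), commits_behind_head=10)
week_old = GreenStatus(reported_at=datetime(2024, 1, 3, 12, 0), commits_behind_head=10)
far_behind = GreenStatus(reported_at=datetime(2024, 1, 10, 11, 0), commits_behind_head=150)

print(satisfies_merge_policy(fresh, now))       # True
print(satisfies_merge_policy(week_old, now))    # False: expired by age
print(satisfies_merge_policy(far_behind, now))  # False: too far behind HEAD
```

The "fancier" variants would replace the two fixed cutoffs with values computed per PR, for example tighter limits when the change touches a risky folder.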
Test the prospective merge: an easy solution for reason #2a is to ignore whatever base commit the developer chose (or got by accident because they forgot to pull before they started work), and instead test the result of rebasing their changes onto current HEAD. In cases where the rebase fails, you could choose to either force the engineer to rebase, or proceed with testing against their original base commit and warn. We prefer the latter because forcing developers to rebase can take them out of their "flow" by throwing merge conflict resolution into their product development.
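As a rough sketch, a CI step implementing this could try the rebase and fall back to the original commit when it fails. The function name, the return values, and the `origin/main` default are assumptions for the example; the git commands themselves (`checkout --detach`, `rebase`, `rebase --abort`) are standard. The runner is injectable so the decision logic can be exercised without a real repository.

```python
import subprocess
from typing import Callable, Sequence

# A runner takes git arguments and reports whether the command succeeded.
Runner = Callable[[Sequence[str]], bool]


def run_git(args: Sequence[str]) -> bool:
    """Run a git command in the current directory; True if it exits with status 0."""
    return subprocess.run(["git", *args]).returncode == 0


def checkout_commit_to_test(
    pr_head: str, target: str = "origin/main", git: Runner = run_git
) -> str:
    """Leave the working tree at the commit CI should test.

    Returns "rebased" when the PR applied cleanly on top of the target, or
    "original-base" when the rebase failed and we fell back to the PR as pushed
    (the caller should then attach a warning to the status).
    """
    git(["checkout", "--detach", pr_head])
    if git(["rebase", target]):
        return "rebased"
    # Rebase stopped (for example on a conflict): undo it and test the PR as-is.
    git(["rebase", "--abort"])
    return "original-base"
```

Working on a detached HEAD keeps the developer's branch untouched: the rebased commits exist only for the CI run.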
Reduced impact of red main
stable ref: Also called "Latest Known Green" (lkg). Whenever main CI has a green status, and you'd perform next steps like Continuous Delivery, also advance a "pointer" into the git history. A ref is a lightweight, named pointer. You could have it be an actual branch, although that's not necessary since it will never have a diverging history from main. Now update typical developer workflows to pull/rebase from stable rather than main. (It's also possible to rename the branches so that main represents latest-known-green and unstable or bleeding-edge represents whatever has merged, though this is a bigger change to our mental model of version control.)
This avoids developers accidentally rebasing into a red region, but of course the trade-off is that they may need the "unstable" commit history if they want to pick up from work their teammate just merged.
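Here is a small sketch of how the pointer could be advanced. The status strings and function names are invented for the example, and publishing the ref as a branch via `git push <remote> <sha>:refs/heads/stable` is just one option among several.

```python
from typing import List, Optional, Sequence, Tuple


def latest_known_green(history: Sequence[Tuple[str, str]]) -> Optional[str]:
    """Return the newest commit on main whose CI status is "green".

    history is oldest-first: (commit sha, status), where status is one of
    "green", "red", or "pending" (hypothetical values for this sketch).
    """
    for sha, status in reversed(history):
        if status == "green":
            return sha
    return None


def update_stable_command(sha: str, remote: str = "origin") -> List[str]:
    """Build the git command that moves the stable branch to the given commit."""
    return ["git", "push", remote, f"{sha}:refs/heads/stable"]


main_history = [("a1", "green"), ("b2", "green"), ("c3", "red"), ("d4", "pending")]
lkg = latest_known_green(main_history)
print(lkg)                         # b2
print(update_stable_command(lkg))  # ['git', 'push', 'origin', 'b2:refs/heads/stable']
```

Because stable only ever moves along main's history, the push is always a fast-forward and never needs force.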
Build Cop as on-call: Someone is quite literally on-call for reverting broken commits from main. They need policy approval from all teams in the repo that they are permitted to "revert first, ask questions later". In regulation-compliant repositories, you also need to inform your regulators that unreviewed commits on main should be permitted if they simply "rewind history" to a previously-reviewed state.
This makes more sense in bigger monorepos where the burden of carrying a pager is offset by having one person perform this responsibility for a bunch of teams at once. At first, engineers are skeptical of having someone outside their own project or codebase doing this, but they are quickly relieved that main stays green and they don't really need to be involved.
Aspect Workflows
If you're a Bazel user, you should know that we are building our recommended workflow into our product: docs.aspect.build/workflows
I highly recommend taking a look at this. Comment or reach out to me for a demo if you think this can help your team stay green!