Monorepo Shared Green

Updated May 7, 2024

The journey to a monorepo shouldn't stop when all the code is in a single version-control repository. After all, much of the touted benefit of monorepos is the increased code sharing and consistency between projects. If each team continues to use the term "repo" for their top-level folder in the monorepo, and works in isolation from the other folders, then there was no benefit in moving them in! Instead, we want to continue the "mono" effort, bringing the concept to more of the developer workflow.

Aspect writes a lot about Bazel, which is the "monobuild" for your monorepo, allowing for code sharing between projects. In this post, I want to cover "mono-CI/CD". That is, how many continuous integration pipelines do you have in a monorepo, and how many continuous delivery mechanisms? I'll advocate for the "shared green" model.

First, some concepts are needed:

Buildcopping

Each red/green status in the repo needs to be kept green. Discovering a breakage (green->red) and repairing it quickly is the job of a "Build Cop".

Some teams are not so great at being build cops. Usually the responsibility is unclear ("hey, anyone looking at why master is red?") or is assigned to someone who has more pressing product development work and who isn't really proficient at reading logs and reasoning about what broke. They often don't have authority to revert bad commits, instead asking the commit author to resolve it. And too often, authors are attached to what they landed and spend precious time repairing the problem by rolling-forward (adding new commits) rather than reverting.

In a large organization, it's more economical to have a small rotation of people who are well-trained and have an obvious runbook for keeping the pipeline green, than for each team to do this themselves.

It's very risky to deploy a service when tests are failing. In some cases, it's even a violation of regulatory compliance. So, when it's time to release, any brokenness on CI is finally a critical issue to product owners. Thus, a red repo halts deployments and can cause real cost to the business.

The Hard Way: each project has their own pipeline and status

To gate commits, we need to decide which pipelines to run a change against. We could use the dependency graph to determine which targets are potentially affected by a change, but the existing approaches are incorrect and/or slow:

tinder/bazel-diff is very incorrect
target-determinator is very slow
the Skyframe system Google uses for this is occasionally incorrect, as Ulf recently reminded me
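
For readers who haven't seen it, the general idea of graph-based target determination can be sketched in a few lines of Python. This is only an illustration of the concept, not how bazel-diff, target-determinator, or Skyframe work internally; the `affected_targets` function and the toy graph are assumptions made up for the example.

```python
from collections import deque


def affected_targets(deps: dict[str, set[str]], changed: set[str]) -> set[str]:
    """Return the changed targets plus everything that transitively depends on them.

    `deps` maps each target to the set of targets it depends on directly.
    """
    # Invert the graph: for each target, who depends on it?
    rdeps: dict[str, set[str]] = {}
    for target, direct_deps in deps.items():
        for dep in direct_deps:
            rdeps.setdefault(dep, set()).add(target)

    # Breadth-first walk over reverse dependencies, starting from the changes.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        current = queue.popleft()
        for dependent in rdeps.get(current, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return affected
```

With a toy graph where `app` depends on `lib` and `lib` depends on `util`, a change to `util` marks all three as affected, while an unrelated `other` target is skipped. The result is only as correct as the declared graph: any dependency that is missing from `deps` silently drops tests that should have run.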

It's even harder to do CD. Can we release our service? Which tests should be green? Do you release if a library you depend on has a failing test? How do you communicate to engineers why the CD system didn't produce an output for their commit? There are no satisfying solutions on the market today.

The Easy Way: shared green

Shared green simply says that there is one build-and-test pipeline for the monorepo. This "monostatus" applies across all libraries, applications, and services in the repository. If anything is red, nothing can release.

It's easy for engineers to reason about and requires no code: all CI systems have a way to gate the delivery step on the success of the testing step.
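
In a real CI system this gate is a line or two of pipeline configuration rather than a program. Purely to make the rule concrete, here is a minimal sketch in Python; `TargetResult`, `monostatus`, and `may_release` are hypothetical names invented for this example, not part of any CI product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TargetResult:
    label: str    # hypothetical target name, e.g. "//services/api:test"
    passed: bool  # did this target build and test successfully?


def monostatus(results: list[TargetResult]) -> str:
    """One status for the whole repository: green only if everything passed."""
    return "green" if all(r.passed for r in results) else "red"


def may_release(results: list[TargetResult]) -> bool:
    """Shared green: a single red target blocks every release."""
    return monostatus(results) == "green"
```

Note that the release decision does not ask which service is being released or which target failed. That is the whole point: there is one answer for everyone.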

Making shared green scale

The fundamental requirement of a shared green is that it has to be almost always green. Red regions block teams from releasing, and the more they are blocked, the more severe the pushback from those teams. Worse, if your time-to-repair is longer than the interval between breakages landing, they will compound and be much harder to reason about and resolve. Also, another team may have landed a change that depends on the commit that needs to be reverted, so that the oncall now has to revert more than one commit.
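
The compounding effect can be seen in a deliberately crude model. The sketch below assumes breakages land at a fixed interval and that a single oncall repairs them one at a time, each taking the same amount of time; the `open_breakages` function and these assumptions are invented for illustration and are not measurements from any real repository.

```python
def open_breakages(interval_min: float, repair_min: float, count: int) -> list[int]:
    """For each breakage, count how many are unrepaired at the moment it lands.

    The count includes the breakage that just landed.
    """
    finish_times: list[float] = []
    backlog: list[int] = []
    oncall_free_at = 0.0
    for i in range(count):
        landed = i * interval_min
        # The oncall starts on this breakage once the previous repair is done.
        start = max(landed, oncall_free_at)
        oncall_free_at = start + repair_min
        finish_times.append(oncall_free_at)
        backlog.append(sum(1 for t in finish_times if t > landed))
    return backlog
```

In this model, when repairs take 30 minutes and breakages land every 60 minutes, the oncall only ever faces one breakage at a time. Swap the numbers (60-minute repairs, a breakage every 30 minutes) and the pile of simultaneous breakages keeps growing.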

How can we stay green more of the time to avoid these shared-green failure modes?

Reduce the time between a bad commit landing and the breakage being reported. Perhaps introduce a "Failing" status in CI, for when the build and test is still running but it is already known that it will end up red.
Reduce the time it takes the oncall to respond. Make sure the paging system works well and escalates to a secondary. Avoid "false positive" pages where the oncall is paged for flakes, as this makes them less likely to respond to a real breakage.
Reduce the time it takes the oncall to repair the build. Point directly to the commit and give instructions for how to revert that commit, or build a button into your UI that reverts immediately.
Reduce the time it takes for the fix commit to be reported as green. This is simply a matter of keeping the master pipeline fast.
Give the oncall authority. No one may question a revert that's performed to keep the CI green.
Post-mortem each breakage. In-flight semantic collisions occur when two PRs are green individually but red when combined. If this becomes frequent, you may need a Merge Queue to re-test green PRs when landing (especially those which are further behind HEAD).
Allow a "break the glass" override in CD for teams who want to release despite red. Audit when this happens and work to reduce how often it is needed.
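
The "Failing" status from the first item can be derived from counts most CI systems already track. The following is a minimal sketch, assuming the pipeline can report how many test targets have passed, failed, and are still pending; the `Status` enum and `pipeline_status` function are hypothetical names, not an existing CI API.

```python
from enum import Enum


class Status(Enum):
    RUNNING = "running"  # still in progress, nothing has failed yet
    FAILING = "failing"  # still in progress, but already known to end red
    GREEN = "green"      # finished, everything passed
    RED = "red"          # finished, at least one failure


def pipeline_status(passed: int, failed: int, pending: int) -> Status:
    """Report the earliest status that is already certain."""
    if pending > 0:
        # One failure is enough to know the final result will be red.
        return Status.FAILING if failed > 0 else Status.RUNNING
    return Status.RED if failed > 0 else Status.GREEN
```

Paging the oncall on `FAILING` rather than waiting for `RED` saves however long the remaining tests take to finish.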

Can it really work at my scale?

This is a tough question. Here are the data points that I know:

Google circa 2015 had 2 billion SLOC and 50k engineers. There was no snapshot of the google3 monorepo where all the Starlark code could even successfully parse, let alone be analyzed, built and tested. No chance for shared green.
A large Aspect customer has 2 million SLOC and 500 engineers. They are still on shared green, but without the "break the glass" for CD. On-call is sometimes hard, and there are stretches of redness on master which prevent deployment. More investment in on-call responsiveness, as suggested above, could relieve some of that pain and leave room for more growth.
All other Aspect customers have a shared green.

This suggests that if your company is 2-3 orders of magnitude smaller than 2015 Google in both SLOC and number of engineers (100-1000 times smaller on each axis, like the customer above), then you may be able to keep a shared green.

Merge Queues

If you encounter in-flight collisions a lot, then you may need to run tests a third time. In addition to testing the developer's snapshot and the result of the merge, you can also run the tests right before merging. This uses more resources of course, and also slows down the developer, because you need to test linearly, one commit (or batch of commits) at a time.

A "Merge Queue" is a separate developer workflow system that manages this. GitHub offers one, documented under "Managing a merge queue". There's also a great research paper from Uber describing their fancy Merge Queue design: "Keeping Master Green at Scale".
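
To make the linear testing concrete, here is a minimal sketch of the simplest possible queue: one change at a time, each tested on top of everything merged before it. It is not GitHub's or Uber's design; `process_merge_queue` and the `tests_pass` callback are hypothetical names, and change identifiers are plain strings for brevity.

```python
from typing import Callable, Iterable


def process_merge_queue(
    head: list[str],
    queue: Iterable[str],
    tests_pass: Callable[[list[str]], bool],
) -> tuple[list[str], list[str]]:
    """Land queued changes one by one, re-testing each against the latest head.

    Returns (merged history, rejected changes).
    """
    merged = list(head)
    rejected: list[str] = []
    for change in queue:
        candidate = merged + [change]
        # Test the exact state main would be in if this change landed now.
        if tests_pass(candidate):
            merged = candidate
        else:
            rejected.append(change)
    return merged, rejected
```

If `rename-foo` and `call-foo` are each green alone but red together, the queue lands whichever arrives first and bounces the other back to its author, so main never sees the collision. The cost is visible in the loop: every change waits for a full test run of the change ahead of it.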

However, if in-flight collisions are rare, then we recommend you just allow the build cop to revert a colliding commit. The build cop has to monitor for other failures on main anyway, and the cost for that person to revert the occasional commit a few times a month is typically less than the cost of running a Merge Queue.

#ci-cd

Comments (1)

David Cardona, 3y ago

Hello Alex. Thanks for this amazing article. I have always had a big question about monorepos and their CI/CD workflow.

The direct question is: should I build and test all projects in the monorepo every time a merge to main/master/trunk occurs? Even if a project doesn't change, I would build and test it. Is that a good approach?

Why did this idea come to me? What happens if I merge into main a super feature which works and runs perfectly on the PR/MR, but when it runs on main there is a small error in a yml file or some misconfigured file? The code was already merged but none of the changes were deployed. So I fix the yml, but none of the paths defined in change/rules (depends on the CI tool) were touched, so nothing will be deployed… We can't just re-run the last failed pipeline because it is always checked out against the old/failed commit.

There is no way to run the same projects that failed unless I touch some files (with empty spaces or a newline or whatever change). IMO this is ugly.

So, an idea came up: what if I always build and test all projects on main, and somehow find a way to deploy only those that were changed? Maybe by creating a docker tag based on the content digest and so avoiding a rolling update, or by using the changes/rules, or something else.

What do you think about this?

Thank you very much

Alex Eagle, 3y ago

If you can use an incremental build and test tool!

Testing everything on every commit is probably too resource intensive (expensive and slow). You really want to test "only what needs to be tested". This problem is called "test selection". Some ppl manually curate a mapping from directories to tests to run, then write their own logic to query git and do the triggering. That's very error-prone.

Bazel gives automatic test selection just based on having high cache hit rates for anything unaffected by the developer's changes.

David Cardona, 3y ago

Hi Alex Eagle, thank you very much.

We don't have Bazel or incremental build/test. It's true that building everything on every merge is expensive and time-consuming.

What about this scenario:

What happens if I merge into main a super feature which works and runs perfectly on the PR/MR, but when it runs on main there is a small error in a yml file or some misconfigured file/script? The code was already merged but none of the changes were deployed. So I fix the yml/script, but none of the paths defined in change/rules (depends on the CI tool) were touched, so nothing will be deployed…
We cannot just re-run the last failed pipeline because it is always checked out against the old/failed commit.

Can you think of something to handle this situation?

Alex Eagle, 3y ago

David Cardona, that's a bug in whatever logic determined that "none of the paths defined in change/rules (depends on the CI tool) were touched, then nothing will be deployed".

It's really hard to get that logic correct without some tool like Bazel that provides developer feedback when the dependency graph isn't declared correctly.

David Cardona, 3y ago

Hi Alex Eagle, I really appreciate your comments.

So, can I say that if we don't have Bazel/Nx/etc. implemented (which would be the optimal solution), but we have a good cache and good CI runners which can run all CI jobs in an efficient way, it would be better to run all build/test/deploy jobs instead of trying to arrange the rules/change/chang_in of my CI tool? This is in a monorepo with about 20-30 projects.

Thank you very much