Automated testing of each commit != CI

Cross-posted from my Medium blog, 2019

On a recent trip I had several chances to engage first-hand with some enterprise-scale Angular customers. What I learned reinforced an impression I've been forming about our industry as microservice architecture rolls out: we no longer do Continuous Integration (CI).

These companies seem confident that they have good testing practices: they have an automated test suite, and they run it continuously. That's genuinely good: it catches some bugs earlier, and it makes for happier developers who don't have to go back as often and debug through the sludge of years-old code. So they have the C in CI, no problem.

We must remember what the I in CI stands for. What are we integrating?

Any large organization has to break its work down into departments. Ever wonder why a space agency's control room has so many desks, each staffed by its own controller? Rocketry and space flight are such a massive technical undertaking, with so many scientific and engineering disciplines working together, that they've built an operations process giving them immediate access to information from each discipline.

Imagine if, instead of the big control room, the astronauts were just talking to a "frontend" team, which then had to consult what the "backend" team thought, sometimes with a several-day round trip, and the backend team in turn was using outdated manuals for the spacecraft. Each department might be doing the right thing on its own, but they are not "integrated". They don't act as a single unit.

The problem I see in companies adopting a microservice architecture is similar. Each team has tested its own components and is ready to deploy. Then, when it's time to get some new software up there to our astronauts (or other users), we discover that it doesn't work in the QA environment, and it takes a day to trace the failure to a change in some other department. This is really bad for the business. If it takes weeks to integrate the software each time we want to deploy, then critical business initiatives have to budget extra time for software changes.

This delay in shipping causes a terrible feedback loop in the organization. Since it takes weeks to release a change through "the process", teams want more autonomy and their own release schedules. That seems like it should accelerate releases, but of course it exacerbates the problem: these newly autonomous units run only their own tests, so their interactions with the other systems they must integrate with in production go untested.

The usual software architect or consultant replies, “This is not a problem because rigorous API contracts are drawn at the boundaries of each service. Each part is tested against that API”. It sounds like a great answer. Does it work?

There's a guy who worked on the C++ team at Google named Hyrum Wright. He had to make global changes across Google's monorepo to migrate everyone to newer versions of some base library API. That API had been specified and tested against, and the API itself was not being changed. Yet his library changes would inevitably cause a bunch of test failures across Google. What were we doing wrong?

There’s now an observation called “Hyrum’s Law” based on this experience:

With a sufficient number of users of an API,
it does not matter what you promise in the contract:
all observable behaviors of your system
will be depended on by somebody.

Another way to say this is, while your API surface is constrained in spirit, it is unconstrained in practice. Change how a result is sorted? Someone relied on the prior ordering.
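Here's a tiny sketch of that sorting example in TypeScript. The names are made up for illustration: the contract promises only "returns the active users", saying nothing about order, yet a client quietly depends on the incidental newest-first ordering.

```typescript
interface User {
  id: number;
  createdAt: number; // epoch millis
}

// Hypothetical service function. The written contract says only:
// "returns all active users". Order is unspecified.
function listActiveUsers(): User[] {
  // Today, the storage layer happens to hand these back newest-first.
  return [
    { id: 3, createdAt: 300 },
    { id: 2, createdAt: 200 },
    { id: 1, createdAt: 100 },
  ];
}

// A client that quietly depends on that incidental ordering.
// It works today, so nothing in the client's own test suite flags it.
function newestUser(): User {
  // Assumes index 0 is the newest user, which the contract never promised.
  return listActiveUsers()[0];
}
```

If the service team later makes a "harmless" change, say re-sorting by id ascending, they are still honoring the written contract, but `newestUser` silently returns the oldest user instead. Only a test that runs both sides together catches it.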

Now comes the incorrect conclusion from some architects: "So then that was a bug in the client of the API. The contract never guaranteed that the data would be sorted. Shame on them." What is the business to make of this? That the client team made an avoidable error and is therefore negligent? Passing the blame this way doesn't solve the business problem: these integration errors keep happening, and they are preventable. API contracts alone are not sufficient to prevent them.

As software engineers, we know that even when we are quite diligent, we make accidental assumptions that happen to work today. The solution to humans making mistakes is QA that catches them. Therefore, we should be writing, and continuously running, tests that exercise the entire stack we'll deploy on: integrating it continuously. Only this can assure reliable delivery of our new code. And remember, the astronauts are depending on us.
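As a miniature sketch of what "exercise the entire stack" can mean (Node/TypeScript, with a hypothetical `/users` endpoint): start the real service inside the test and run the real client code against it, rather than having each side test only against a mock of the other.

```typescript
import * as http from "node:http";

// A toy "backend" service exposing a /users endpoint.
const server = http.createServer((_req, res) => {
  res.setHeader("content-type", "application/json");
  res.end(JSON.stringify({ users: [{ id: 1, name: "astro" }] }));
});

// The "frontend" client code under test, pointed at a real base URL.
// (Uses the global fetch available in Node 18+.)
async function fetchUserNames(baseUrl: string): Promise<string[]> {
  const res = await fetch(`${baseUrl}/users`);
  const body = (await res.json()) as { users: { id: number; name: string }[] };
  return body.users.map((u) => u.name);
}

// Integration test: the real server plus the real client, wired together,
// so incidental assumptions on either side actually get exercised.
async function integrationTest(): Promise<string[]> {
  await new Promise<void>((resolve) => server.listen(0, resolve));
  const { port } = server.address() as { port: number };
  const names = await fetchUserNames(`http://127.0.0.1:${port}`);
  server.close();
  return names;
}
```

In a real system the server would be a deployed QA environment rather than an in-process toy, but the principle is the same: the test crosses the team boundary instead of stopping at a mock of it.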

Let's fix it!

In my article I talk about the technical details of how to get Continuous Integration back.