Memory management is generally a hard topic in computer systems operations. Debugging it inside a cloud-hosted build system is even worse!
There are two potential problems:
- The Bazel server runs in a JVM, and it internally tries to allocate more objects than the max heap size it's allowed.
- Bazel spawns subprocesses (called "actions", including test actions) and they collectively exhaust the memory in the machine or VM that Bazel runs in.
I'll cover these scenarios separately since they're mostly unrelated.
Of course, the Bazel JVM heap does occupy system memory, so they're related in the sense that a smaller Bazel server footprint would allow for more actions to run, but I've never considered that to be a potential remediation.
Bazel server out-of-memory
How to tell this is happening:
- Bazel exits with code 33 (see ExitCodes.java).
- Check the output of `bazel info | grep heap` if the Bazel server is still running, and see if it is near the max.
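To make "near the max" concrete, here is a minimal Python sketch that turns the heap lines into a usage fraction. It assumes `bazel info` prints lines shaped like `max-heap-size: 4096MB`; that format and the function names `parse_heap_info` and `heap_pressure` are my assumptions, so check them against the output of your Bazel version.

```python
import re

# Assumed line format from `bazel info`, e.g. "max-heap-size: 4096MB".
# Verify this against your Bazel version before relying on it.
_HEAP_LINE = re.compile(
    r"^(used|committed|max)-heap-size:[ \t]*(\d+)MB[ \t]*$", re.MULTILINE
)


def parse_heap_info(info_output: str) -> dict:
    """Return heap sizes in MB, keyed by 'used', 'committed' and 'max'."""
    return {kind: int(mb) for kind, mb in _HEAP_LINE.findall(info_output)}


def heap_pressure(info_output: str) -> float:
    """Fraction of the max heap currently in use (0.0 to 1.0)."""
    heap = parse_heap_info(info_output)
    return heap["used"] / heap["max"]
```

Feed it the text captured from `bazel info`. A value that stays close to 1.0 suggests the server is about to run out of heap.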
Some things you can do about it:
- Give it more RAM! Assuming the system has some available, you can use the `--host_jvm_args` startup flag to adjust the usual JVM parameters, like `-Xmx2g`.
- Always turn on the `--heap_dump_on_oom` flag so that you get extra information in this case.
- "Memory saving mode" can be useful if you're hosting Bazel on ephemeral CI workers where you expect every build to be cold, though cold builds are slow and that setup is not recommended. If you are in that situation anyway, you can stop Bazel from tracking incremental state, which saves some memory. See bazel.build/configure/memory
- Roll up your sleeves and figure out what's consuming so much memory in Bazel's JVM. Start from Bazel's documentation on memory profiling. A common culprit is a ruleset that repeats data rather than using depsets. For an example analysis, see https://github.com/aspect-build/rules_js/pull/391
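One detail that is easy to get wrong when combining these flags is ordering: startup options such as `--host_jvm_args` go before the Bazel command, while `--heap_dump_on_oom` goes after it. The sketch below composes such a command line in Python. The helper `bazel_argv` and its parameters are hypothetical, and the three memory-saving flag names are the ones I recall from bazel.build/configure/memory, so verify them against your Bazel version.

```python
def bazel_argv(command, targets, max_heap=None, heap_dump_on_oom=True,
               memory_saving=False):
    """Compose a Bazel command line as an argv list (hypothetical helper)."""
    argv = ["bazel"]
    if max_heap:
        # Startup options must appear before the command (build, test, ...).
        argv.append(f"--host_jvm_args=-Xmx{max_heap}")
    argv.append(command)
    if heap_dump_on_oom:
        argv.append("--heap_dump_on_oom")
    if memory_saving:
        # Flag names as recalled from bazel.build/configure/memory; these give
        # up incremental state, so they only make sense for cold builds.
        argv += [
            "--discard_analysis_cache",
            "--nokeep_state_after_build",
            "--notrack_incremental_state",
        ]
    argv.extend(targets)
    return argv
```

For example, `bazel_argv("build", ["//..."], max_heap="2g")` yields the equivalent of `bazel --host_jvm_args=-Xmx2g build --heap_dump_on_oom //...`.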
System out-of-memory
Bazel schedules actions (build steps and test runners) based on the amount of system resources it thinks are available, using a heuristic for how much RAM a typical action requires. Two kinds of things can go wrong: either Bazel thinks more RAM is available than the system actually has free, or Bazel underestimates the resources to be reserved for a given action.
By default, Bazel's max concurrency is based on the heuristic that each action needs one CPU core, so the `--jobs` flag defaults to the number of (maybe virtual) CPUs on the machine. Note that Bazel reports progress with an "X running" indicator, which might lead you to believe that the concurrency is actually higher, but that message is misleading because it can include actions that are queued waiting for resources:
Did you know when @bazelbuild prints progress like
[12 / 100] 32 actions, 30 running
That "running" count includes "remote-cache" spawns! If you have --jobs=16 (the default on 16 core) the other 14 of them aren't actually running, they're queued for "local" spawn. https://t.co/a3w7TWZVTM
Alex 🦅 Eagle (@Jakeherringbone), August 17, 2022
How to know this is happening:
- The Bazel server gets killed by the operating system, and the client prints something like `Bazel server terminated abruptly (error code: 14, error message: 'Socket Closed', log file: ...)`
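Since the two scenarios show different symptoms, a CI wrapper can tell them apart automatically. This is a rough Python sketch of that triage. The function name `classify_failure` and its return labels are hypothetical, and an abruptly terminated server can have causes other than the OS killing it, so treat the second label as a hint rather than a diagnosis.

```python
import re

# Exit code 33 is the out-of-memory code from ExitCodes.java, as noted earlier.
JVM_OOM_EXIT_CODE = 33

# Client-side message printed when the server process disappears.
_SERVER_DIED = re.compile(r"Bazel server terminated abruptly \(error code: 14")


def classify_failure(exit_code: int, client_output: str) -> str:
    """Map a failed Bazel invocation onto one of the two OOM scenarios."""
    if exit_code == JVM_OOM_EXIT_CODE:
        return "bazel-jvm-oom"
    if _SERVER_DIED.search(client_output):
        # Possibly killed by the OS for exhausting system memory.
        return "server-killed"
    return "other"
```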
Some things you can do about it:
- If Bazel is running inside a container, it may calculate available RAM based on the host system rather than what is allocated to the container, due to the issue "Default local resources should respect cgroup limits on Linux". The remediation is to explicitly tell Bazel's scheduler what the container limits are by setting the `--local_ram_resources` flag to match the container runtime.
- Reduce `--jobs` so that fewer things run concurrently. This is a blunt approach and makes all builds take longer, but it saves you the effort of figuring out which actions didn't get a large enough resource reservation.
- Figure out which actions consume a lot of RAM, and tell Bazel's scheduler to reserve more resources for them. For tests, use the `size` attribute: a larger size gets more reserved memory, per the table in that documentation. For build actions, the Bazel team recommends using execution properties, though to be honest, that looks really complex and I haven't used it myself.
- Try out the `--experimental_local_memory_estimate` flag to make Bazel smarter about knowing the available system resources at the time it's scheduling the subprocess to spawn.
- Investigate using Remote Build Execution so that heavy workloads move off the machine and run on a cloud of executors.
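For the container case, the value for `--local_ram_resources` can be derived from the cgroup limit instead of being hard-coded. The Python sketch below assumes the standard Linux cgroup paths (`memory.max` for v2, `memory/memory.limit_in_bytes` for v1) are visible inside the container. The function names and the `headroom_mb` parameter, which leaves room for the Bazel JVM itself, are hypothetical choices of mine.

```python
from pathlib import Path
from typing import Optional

# cgroup v1 reports "no limit" as a huge number; treat such values as unlimited.
_UNLIMITED_THRESHOLD = 1 << 60


def container_memory_limit_bytes(cgroup_root: str = "/sys/fs/cgroup") -> Optional[int]:
    """Return the container memory limit in bytes, or None if unlimited/unknown."""
    root = Path(cgroup_root)
    candidates = [
        root / "memory.max",                          # cgroup v2
        root / "memory" / "memory.limit_in_bytes",    # cgroup v1
    ]
    for path in candidates:
        try:
            raw = path.read_text().strip()
        except OSError:
            continue  # file not present in this cgroup layout
        if raw == "max":  # cgroup v2 spelling of "no limit"
            return None
        limit = int(raw)
        return None if limit >= _UNLIMITED_THRESHOLD else limit
    return None


def local_ram_flag(limit_bytes: int, headroom_mb: int = 2048) -> str:
    """Build a --local_ram_resources flag (value in MB), minus some headroom."""
    usable_mb = max(limit_bytes // (1024 * 1024) - headroom_mb, 1)
    return f"--local_ram_resources={usable_mb}"
```

With an 8 GiB container limit and the default headroom, this produces `--local_ram_resources=6144`. How much headroom to leave depends on the max heap you gave the Bazel server.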