The Silo Dance
One of the fundamentals of professional software engineering is understanding how to build well at scale. It’s a timeless topic, and one where I wanted to share some of the fundamentals. On one of my old teams, the process we went through when promoting code was jokingly called “the silo dance”. It was a critical ritual (and yes, sometimes we literally did a funky little dance while waiting for the builds to go). To help explain it, let me give a bit of background: in software development, there are always dependencies between components, often spanning teams. There are also different layers (and degrees of tightness) of dependencies. Managing that is tricky.
Sometimes there’s a very clear separating line between layers, like the boundary with the OS, but more commonly it’s something closer, like the version of a framework (e.g., node.js 16) or utility open-source libraries. The trickiest are very tight linkages between components that share deep knowledge of data layout in memory, such as the just-in-time (JIT) compiler and the garbage collector (GC) of a language runtime. It’s usually different teams working on different components, so they develop in their own silos until they promote their work to other teams. The underlying concern is the same in the end: every time there’s a change, there’s the possibility of incompatibilities between components that lead to broader failures.
Some kinds of change are safer than others. For instance, I rarely worry now about Linux kernel version changes on machines, but I do worry about Linux distro version changes. For example, going from Ubuntu 18.04 to 20.04 is not something I treat as trivial, even though it’s robust and well-tested software. Moving framework versions (node.js 12 to 14) needs planning and considerable testing. Figuring out how to do that without “swap it and pray” is critical.
During development in the large (i.e., when spanning teams, especially in platform development), one of the hardest phases is integration testing — meaning when all the components are assembled and tested together for the first time. Doing it well means a timely delivery and a stable system. Doing it poorly leads to weeks or months of lost productivity where the integration builds never work: the various components end up in an “it all has to work for any of it to work” state, and teams struggle to ever get it all right at the same time. The path to success is simply stated: ensure that the number of changed components in any one build is low (ideally one), and that testing is broad across all components.
Going deeper, the idea is to continuously integrate component changes in side-channels off the main stable build channel. All teams should have their devs using “all stable build channel” components, modulo the private/new version of their own component. When component X wants to deliver a new level (call it X.v2) to the main channel, a side-channel build/test gets launched that is “all components from main but with X.v2 replacing X.v1”. If it succeeds, then X.v2 gets added to the main channel. If it fails, it’s rejected immediately, and the main channel is unaffected (and still works).
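The promotion flow above can be sketched in a few lines of Python. This is a minimal sketch, not a real build system’s API: the `promote` function, the dictionary-of-versions channel model, and `build_and_test` are all hypothetical stand-ins for whatever your infrastructure provides.

```python
def promote(main_channel: dict, component: str, new_version: str,
            build_and_test) -> bool:
    """Try to promote component@new_version into the main channel.

    main_channel maps component name -> currently stable version,
    e.g. {"jit": "v1", "gc": "v1"}. build_and_test is a callable
    that builds and tests a full set of component versions.
    """
    # Side-channel build: all stable components, but with the one
    # candidate version swapped in.
    candidate = dict(main_channel)
    candidate[component] = new_version

    if build_and_test(candidate):
        # Success: the candidate version becomes the new stable one.
        main_channel[component] = new_version
        return True

    # Failure: reject immediately; the main channel is untouched.
    return False
```

The key property is in the failure branch: a failed side-channel build mutates nothing, so the main channel always describes a combination that has actually built and tested green.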
You might be saying “but what happens if 2 teams both add new components at the same time?” There are a couple of approaches: add both to the same build, or run separate builds. Let’s explore both paths…
If you add both into the same side-channel build, there are two problems to watch for. First, it’s hard to assign ownership to problems if the build fails. Now two teams need to figure something out together, which is more communication than is ideal (and there’s more potential for finger-pointing). The second problem is that pairs of teams start trying to “time” their builds so that API changes get slipped through at the same time. This is an anti-pattern, as APIs should evolve gracefully, and we don’t want to encourage it by making it easy (or even possible).
The preferred approach is to do two separate builds. You’re always only testing one new thing at a time. There’s a question of whether they can be done in parallel… and the answer is “it depends”. Running in parallel uses more build resources but gives teams more wall-clock responsiveness. At the same time, it leads to an increased risk of breaking the main channel: each component might test well alone, but combined with the other new component, the bigger product no longer works. This is a trade-off teams need to decide for themselves. A clearly established “back-out” protocol is needed too, as you never, ever want the main channel broken.
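One way to avoid the combined-breakage risk is to serialize the promotions: each candidate is validated against the *latest* main channel, so the second candidate is always tested on top of whatever just landed. Here’s a hedged sketch of that idea; `drain_promotion_queue` and `build_and_test` are illustrative names, not a real system’s interface.

```python
from collections import deque

def drain_promotion_queue(main_channel: dict, candidates, build_and_test):
    """Process pending (component, version) candidates one at a time.

    Each candidate is tested against the current main channel, which
    includes any promotions that happened earlier in the queue. Returns
    the list of rejected candidates; the main channel is only ever
    mutated on a green build (the "back-out" is simply never landing).
    """
    rejected = []
    pending = deque(candidates)
    while pending:
        component, version = pending.popleft()
        candidate = dict(main_channel)
        candidate[component] = version
        if build_and_test(candidate):
            main_channel[component] = version   # promote
        else:
            rejected.append((component, version))  # main channel untouched
    return rejected
```

With this ordering, if X.v2 and Y.v2 each pass alone but fail together, the second one in the queue is the one that gets rejected, and ownership of the failure is unambiguous.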
This brings us back to where we started: the “silo dance”. The team I was part of was struggling with broken builds due to a very common and simple approach to integration: “take latest from every team from the (single) nightly build and stir it all together along with my change”. This worked well at times, but when regressions got into the main channel, things went poorly. It caused major frustration because the developers felt like they were working on quicksand. When you feel like you’re debugging across dozens of changes from across the organization, including in components you don’t own, all at once, it’s hard to move fast. The ideal scenario is that I’m only debugging one set of changes at a time: mine. We fixed this by having a series of “silo builds” that always followed the good pattern: “all stable but with one component changed”. To get a feature all the way through to being delivered to the rest of the org, there’d be a set of moves (thus the “dance”) to ensure all the current stable components were pulled into the various silos and proven stable locally; then the change was applied, the results examined, and if good, the change was promoted.
There are more sophisticated approaches to doing integration at scale, but the fundamentals remain the same: minimize the number of changes in each build, ensure there’s strong automated testing to catch problems before they escape, and never allow code that has failed side-channel builds into the main channel.