Organizations and technical Jenga

Systems are held together by the teams and human beings supporting them. What happens when that's suddenly disrupted?

November 14, 2022

Infrastructure staff occasionally joke with their colleagues about how long their systems would survive without them.

If you've done your job well, it's actually quite difficult to answer that question. The simple things have been fixed: you've automated away the obvious failures; you've got autoscaling; your corporate credit card is in the black, so the auto-provisioning you implemented will work as long as you have money in the bank; and so on. Congratulations!

But the simple things are not all of the things, and the basic reality is that most of these systems are actually held together, at the limit, by the teams and human beings supporting them.

Let me give you an example: disaster recovery (DR).

The Google DR program started in 2007 and got Google to a pretty good place - eventually. It happened every year for a week, and the main goal was to try to successfully run the company from outside of HQ in California.

The early years were pretty funny. Teams ran tests on each other. People flailed around and mostly coped, though sometimes tests had to be rolled back and the exercise ended early.

We discovered a bunch of interesting things. The search engine had a hard dependency on the contents of a file hosted on an individual engineer’s workstation in Mountain View. We also discovered that Europe couldn't authorise the expenditure to buy enough diesel to cover the datacenter backup generators during an outage. (That was actually relatively easy to fix, compared to other issues.) Interestingly, we’d usually get new dependency problems cropping up in between the anniversaries (unless we actively took steps to prevent them).

After many years of dogged work, it eventually got to the point where the more mature systems were relatively stable and could (in most cases) survive with only half the infra staff (the European half, in this case) looking after them. The less mature systems, and under more chaotic conditions? Unknown.

The great success of this program was to get Google to a point where it could survive for a week under controlled circumstances. A week! Any longer period of time and the answer is essentially “don’t know”.

Azure was very different: much more autonomy for individual business units led to very different approaches to DR. The over-arching disaster recovery programs existed but were much less prioritised and mostly focused on exec communication. (That might have changed since I left, but there was no parallel centralised effort to run the company outside of Redmond at that point.) There were a lot of non-US staff, but permissions, capabilities and so on varied greatly per team. Ultimately, Azure definitely required the active input of humans on a day-to-day basis in order to keep going, but it knew that fact organizationally. My experience is a little less relevant here, but as I understand it, Amazon is more resilient, not because the active input of humans isn't required, but because careful attention is paid to cross-system dependencies and ops culture. Does that stop unexpected failures and large outages? No, alas.

All of the above is a complicated way of saying that the act of organizational and technical Jenga which is currently going on cannot assess its own ultimate outcomes. Those outcomes are unknowable. Incalculable in some cases, even. Some things are clear and simple (e.g. storage limits) but many things are less so (e.g. the degradation of ML model behaviour over time).

It's a vast act of violence, and if it does work out, it works out because humans behind the scenes are keeping it going while the unknown-unknowns are happening.

My sympathies to everyone involved.