What’s Still Hard about Software Operations in 2022?

In 2022 we have tools we couldn’t have dreamed of ten years ago, but despite all these tools and services, most organizations running large production software systems still experience challenges.

Laura Nolan

October 11, 2022

In 2022 we have tools we couldn’t have dreamed of ten years ago - amazing open source projects like Kubernetes, OpenTelemetry, Prometheus, and Backstage, and many, many SaaS and Cloud infrastructure products, including a number aimed specifically at operations concerns like CI/CD, SLOs, and incident management.

Despite all these tools and services, most organizations running large production software systems still experience challenges.

Software infrastructure projects tend to be of two kinds: platforms that do everything, but require your application to be architected in a very specific way, and specialized tools that do one thing. 

All-in-one platforms are not practical past a certain scale, and they may not be workable with certain compliance requirements (for example, data residency or PCI DSS compliance).

Specialized tools can be very useful, but they will generally solve only one problem for you - and only a relatively self-contained problem, at that. Certain types of cross-cutting concerns - such as how to build a ‘big red button’ that can stop all of your automation during a production incident, or how to safely manage multiple overlapping processes like loadtesting, chaos testing, rollouts and releases, migrations, and scaling - can’t be solved by buying a specialized tool.
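
To make the ‘big red button’ idea concrete, here is a minimal sketch (in Python) of automation that checks a shared kill-switch flag before taking a potentially disruptive action. The flag path, the scale_down action, and the checkout service name are illustrative assumptions rather than any particular tool’s API; in practice the flag has to live somewhere that every piece of automation can read, which is exactly the cross-cutting part that no single tool solves for you.

```python
# A minimal sketch of a 'big red button' check, assuming a file-based kill switch.
# The flag path and the scale_down() action are hypothetical placeholders: in
# practice the flag would live somewhere every automation system can read it
# (a config service, a feature-flag store, a well-known object in your cloud).

from pathlib import Path

KILL_SWITCH = Path("/etc/automation/BIG_RED_BUTTON")  # hypothetical shared flag location


def kill_switch_engaged() -> bool:
    """Return True if a human has pressed the big red button."""
    return KILL_SWITCH.exists()


def scale_down(service: str, replicas: int) -> None:
    """Placeholder for a potentially disruptive automated action."""
    print(f"scaling {service} down to {replicas} replicas")


def run_automation_step(service: str, replicas: int) -> None:
    # Every automated actuator checks the switch before acting, so a single
    # flag can freeze cron jobs, autoscalers, and rollout tooling alike.
    if kill_switch_engaged():
        print(f"kill switch engaged; skipping scale-down of {service}")
        return
    scale_down(service, replicas)


if __name__ == "__main__":
    run_automation_step("checkout", replicas=2)
```

The hard part is not the check itself, but making sure every cron job, autoscaler, and rollout pipeline honors the same flag.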

Many specialized tools also come with overhead. In the case of SaaS tools, that overhead includes tool selection, verifying compliance, managing billing, budgets and quota, and managing authentication and authorization. You may have little insight into changes or problems occurring with the service: you might experience an outage because of a bug in a SaaS application, or you might experience persistent and hard-to-debug performance issues in a tool that you don’t have much visibility into. If you are using multiple SaaS products, even from the same cloud vendor, it can sometimes be difficult to determine quickly what is causing problems, and escalation can be frustratingly slow.

With specialized open source software, on the other hand, you will incur overhead in keeping that system up to date, a process which might include managing breaking changes to configurations and APIs. You will have to track CVEs and roll out patches for your OSS tools. You may find you need to fork the software or push changes upstream, and you may need to develop significant expertise in operating and debugging it.

By the time they are a few years old, most organizations have developed a bespoke infrastructure with a fair degree of complexity: fragile internal tooling and custom integrations between platforms, all with their own quirks and limitations. There is organizational pressure to do so: in a world of specialized single-use tools, you build glue code to survive. Before you know it, you end up with an ‘infrastructure hairball’ of cron jobs, lambdas, sidecars, automation tools, run scripts, and custom lifecycle hooks, with important logic scattered across many different systems, some of which are rarely modified and easily forgotten. Problems in these systems can be very difficult to diagnose (the SNAFUCatchers Report provides some great case studies).

Teams using Kubernetes exclusively aren’t immune to this phenomenon either: read How a simple admission webhook lead to a cluster outage by Charlie Egan of Jetstack, in which interactions between a Kubernetes admission webhook and an OPA agent led to an outage during a cluster upgrade.

This kind of complex setup does not lend itself easily to discoverability or to monitoring. It does lend itself to the creation of ‘islands of knowledge’: specific individuals who know how this complex skein is supposed to hang together and can diagnose faults when things go wrong. These environments are high-context - a lot of background knowledge is needed to be productive. Onboarding is tough. It’s difficult and risky to reduce tech debt and simplify, particularly if there aren’t a lot of experts around with deep context.

One of the core tasks for any software operations team is to build and maintain a shared mental model of our systems and how they work. We update that mental model when we build something new, when we make changes, when we start or finish a migration, when we investigate something strange and find out how things really work. 

There is no tool that helps us to build and maintain that mental model of production. Watch any engineer investigating an anomaly: they will likely flit between source code management tools, monitoring graphs, SaaS consoles, logs, ssh terminals, and more. When they figure it out, what do they do? Maybe nothing, maybe a post in a team messaging channel, maybe an update to a wiki page, or a note on a pull request. Will the next engineer who needs that knowledge be able to find it easily? Or can we build better tools to help us move from high-context operations towards lower-context operations?

Software catalogs like Backstage can help us to track what services and infrastructure exist. But we need to know how our control systems behave, why they are the way they are, what no longer matters, and what the known issues and compromises are. We need to be able to see into the half-hidden infrastructure glue that we too often forget. This is not an easy problem to solve: it’s a sprawling, cross-cutting set of concerns.

Too many of our tools hide the complexity of production from us; instead, we need systems that support us in dealing with that complexity. Running reliable software is a team activity, with learning and knowledge sharing at its core. This is hardly reflected in the tools we have as an industry. It is time to move beyond wikis and the chaos of messaging apps and shared documents, and create dedicated products to support software operations teams in building and sharing knowledge. Confusion, toil, and stale documentation are out; discoverability, collaboration, and shared understanding of our systems are in.

Laura Nolan

Doing great engineering doesn't mean taking yourself seriously. Laura values whimsicality as much as she values clarity, precision, and learning.

Laura’s experience in operating distributed systems at scale ranges from data pipelines to edge networking, e-commerce, and messaging. She is hands-on and technical while also being involved in setting strategy.

As a resilience engineering advocate, Laura is integrating human factors awareness into Stanza’s designs from the start. She is a member of the board of the USENIX Association and volunteers with the Campaign to Stop Killer Robots. Laura lives in rural Ireland, where she is chief-of-staff to two demanding cats.