Should we apologize for the Site Reliability Engineering book?

It’s fair to say that the SRE book, published in 2016, has had a lot of impact. It contributed to broad awareness of the SRE role, and it popularized concepts such as Service Level Objectives (SLOs) and Post-Incident Reporting (sometimes known as post-mortems). It also drove a wave of general consulting service companies offering recycled material from the book as ‘SRE consulting’, not to mention the deluge of regurgitated articles about ‘Five SRE Practices You Can Implement Today’. But the book is not universally praised.

The SRE book is not a manual for running your production operations (the subtitle is ‘How Google Runs Production Systems’, not ‘How You Should Run Your Production Systems’). The book describes some techniques for running production systems, but there are no guarantees that they will work for you (or anyone else, for that matter). It’s definitely not a win-this-argument-for-free card. Your engineering organization has different problems from Google’s and different resources for tackling them: this means that your prioritization will be different and your solutions may make different tradeoffs from Google’s.

SRE at Google, although a small fraction of the overall Google engineering headcount, is nevertheless a pretty sizable organization with dozens of teams dedicated to running different services. The diversity of projects described in the SRE book is partly a consequence of that organizational size. No single team could take on a workload like that - not even SREs!

The diversity of SRE work highlights another facet of SRE: identifying and solving problems in our systems is a core part of the job, and that’s not something that can ever be standardized, or something you can write a fully-specified guaranteed-terminating algorithm for. Some SRE practices - such as SLOs or incident response - are fairly portable from one organization to another. However, other SRE practices are not so portable. Figuring out how to solve hotspotting, or determining how best to manage overload, will look slightly different in each system. Assessing risks is context-dependent; so is planning for growth - you have to know what your bottlenecks are. A good post-incident analysis requires detailed understanding of the systems involved. There are lots of other examples of this kind of non-portable and non-standardizable systems work. This type of systems work is the core of the profession.

SREs are fundamentally systems thinkers. We want to know how our systems work, in as much detail as possible, so we can make them more robust and efficient. The SRE book is a set of examples of systems and practices at Google. Even if you’re not going to replicate them in your organization, they still have value as what Donella Meadows called a ‘systems zoo’: a set of case studies of ways of thinking about our systems, both human and technical, and effectively solving problems in those systems.

It is tremendously educational to see how people identify and solve difficult problems, even if those problems aren’t exactly identical to our own. We all only get to see a certain number of problems and solutions firsthand in our careers, and learning from others in our profession can broaden our perspective. Often some elements of previously-successful solutions can be adapted to novel situations. Over time, useful patterns emerge.

Another valuable thing about the SRE book is its focus on making the work of operations-focused teams sustainable. This is a cross-cutting concern throughout the book. It shows up in the repeated warnings against getting bogged down in operations work at the expense of engineering: teams need time for engineering work that reduces operational load, and teams with a lot of manual operations work have higher attrition. The same principle makes an appearance in the chapters on oncall, on managing interrupts, and on recovering from operational overload. The emphasis throughout is on making work sustainable: by making oncall infrequent enough not to be an onerous burden; by fixing or pushing back on unsustainable interrupts work; and by fixing overload through adding expertise and identifying systemic causes rather than through working longer hours.

In comparison to all the ink spilled (and webinars recorded and startups founded) about SLOs and incident management, the repeated points made in the book about sustainable working do not seem to have captured the imagination of industry thought leaders to the same degree. This is a shame because burnout and attrition are still problems that many organizations are grappling with, particularly since the pandemic.

So don’t rip up your copy of the SRE book just yet. Treat it as what it is: not a manual, but a useful set of case studies about how Google (not your organization) runs its production systems. What you take away from Google’s approach is, of course, your own choice.