What does an SRE do?

Are you a software engineering director in charge of some Site Reliability Engineers (SREs) and wondering what they’re doing, or what they should be doing? Then read on!

January 10, 2023

What does an SRE do: high-level view?

From a career point of view, software engineers are quite a diverse bunch. Sure, you can find them writing business logic for a web-app, but you can also find them designing large language models, formally deriving safety properties for nuclear power plant control firmware, and supporting a business analytics framework for the enterprise.

These days, as most SREs have a software engineering background, SRE could be seen as one more of those specializations. A key difference is that while SREs will write code, they’re more concerned with the operation of the software and how it interacts with the broader ecosystem. Consequently, a wide variety of types of work can fall under the SRE umbrella.

Here are some popular flavours of SRE:

  • A senior IC or TL working on the next-gen version of an existing key product. The bulk of their contributions probably weighs towards design, but often significant code work is done as well. The task is to take the lessons learned from running v1 in production and use them to inform the design & operation of v2.
  • A senior IC working on a project to achieve zero-downtime cluster expansion/turndown, designing an automation framework that does this without any manual action. Their key contribution is to understand the business requirements, the best implementation given how things stand today, and what is tractable: in other words, whole-systems thinking.
  • A mid-level IC or manager working on post-mortems and post-incident response across a suite of systems. In the ITIL context, this is known as problem management, but in the SRE context it is part of the feedback loop that improves not only the system you’re working on, but cognate systems too. A particularly popular new focus is on the social and organizational aspects of improving systems.
  • A junior SRE writing monitoring and trace span code to improve visibility into how a key production system works. Once this is done, unnecessary duplication of metric collection can be turned off and large cost-savings achieved.
  • Other emerging functions of SRE include: looking after machine learning systems (particularly useful with the “systems thinking” flavour of SRE), and platform engineering.

Without wishing to be less than inclusive, there are also things which are called SRE that aren’t. You can find more discussion of this in my piece on what SRE isn’t.

What does an SRE do: day-to-day?

  • Monitoring. Writing monitoring code, adapting dashboards, piloting monitoring services (like Datadog, Honeycomb, Prometheus or Dynatrace), discovering new ways in which existing visibility is not good enough, or is too much.
  • Cost management, efficiency, and performance. SREs often find themselves caring about performance, cost, and efficiency, while product developers ship features. This involves lots of looking at dashboards, creating spreadsheets, and code or design work.
  • Automation. An SRE drives toil out of the system, either by core code contributions, or leveraging/writing automation frameworks. The idea is to drive the manual touches required to keep the system going down to zero. This helps to scale the people needed sublinearly, even if the system/customer numbers are growing faster than that.
  • On-call and incident response. SREs are often to be found on-call or doing incident response. This is natural and fine, as long as the on-call work is held at a reasonable level. If it’s not, or if on-call is the team’s sole purpose, then it’s an operations team, not SRE.
  • Post-mortems & post-incident response. This is a continuous improvement loop, intended to improve the systems and make them more resilient over time. We analyse the outages, write post-mortems, discover action items, and implement them.
  • Testing. In the limit, this covers everything to do with release engineering, including automatic deployment, figuring out how to do effective rollouts, assessing the quality of releases, and so on. It often involves a lot of platform work.
  • Capacity planning. Manage capacity for systems. Make sure they have room to grow into. Discover cliff-edges before they are walked off (for example, hidden limits on lambda invocation). Can overlap a lot with performance work, since you can buy yourself more envelope by improving performance.
  • Architecture. Consulting on or implementing good architectural practices. This is typically whiteboard and design-document work.
  • Security. Some SRE teams (notably LinkedIn) add various components of security work to their SRE teams; I’m not very familiar with this so I don’t cover it here, but I suspect it’s mostly response rather than analysis.
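
To make the capacity-planning item above a little more concrete, here is a minimal sketch of the kind of back-of-the-envelope calculation involved: how long until steady growth hits a hard limit. The function name, the parameters and the assumption of constant compound monthly growth are all illustrative; real systems rarely grow this smoothly, so treat the result as a rough warning rather than a forecast.

```python
import math


def months_until_limit(current_usage: float, hard_limit: float,
                       monthly_growth: float) -> float:
    """Estimate months until usage reaches a hard limit.

    Assumes (hypothetically) constant compound growth per month,
    e.g. monthly_growth=0.10 means usage grows 10% each month.
    """
    if current_usage <= 0 or hard_limit <= 0:
        raise ValueError("usage and limit must be positive")
    if current_usage >= hard_limit:
        return 0.0  # already at or past the cliff edge
    if monthly_growth <= 0:
        return math.inf  # flat or shrinking usage never reaches the limit
    # Solve current_usage * (1 + g)^n = hard_limit for n.
    return math.log(hard_limit / current_usage) / math.log(1.0 + monthly_growth)


if __name__ == "__main__":
    # Made-up numbers: 500 concurrent invocations today, a limit of 1000,
    # growing 10% per month.
    print(f"{months_until_limit(500, 1000, 0.10):.1f} months of headroom")
```

With these made-up numbers the answer is a little over seven months, which is the sort of figure that tells you whether to raise a limit, improve performance, or redesign before the cliff edge arrives.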

You’ll see we use the word platform above a few times, and you may be wondering how this interacts with the phrase of the moment, Platform Engineering. There’s a lot to be written here, but for now what we mean to say by the above list is that platform work is SRE work. Work that puts useful functionality into a platform that other teams can use definitely falls within the SRE remit. You’ll find SREs doing platform work, operational work, development work, and everything else we describe above.

From this brief overview you can see the value that SRE brings to an organization. While every engineer should have some concern for reliability, safety and efficiency, a specialist eye and skillset helps us reach the standard demanded by the indispensable nature of the systems we run today.