SRE in Stormy Weather.

What do you do when the graph doesn't go up and to the right? Welcome to the doldrums. SREing in challenging times, from Tiarnán

Tiarnán de Burca

November 30, 2022

You know the way you go to the internet and it’s there? You’re welcome. – Old SRE Proverb. 

For the last few years Site Reliability Engineering (SRE) has been growing like hell, and we needed it. When you tell the whole world to go to their room and order take-out food, someone better be there to answer the call for Szechuan.

Given that industry growth, it’s probable that this is your first downturn. For all my many strengths, I’m not psychic, so I can’t tell you how long it’ll last or how bad it’ll get, but as an old fart, I can give recommendations on what helps keep the lights on and the plates spinning.

What exactly is it you do here?

Since the last time we had a significant global economic disruption, we have only put more of our infrastructure ‘on-line’. In fact, we’ve put so much of our lives ‘on-line’ that it feels awkward to say it. Like, I mean, you could talk to a human being and have them rent you a DVD (or book a hospital appointment).

[Chart: the incidence of the word ‘multimedia’ in all books published from 1980 till today.]

When was the last time you interacted with something on your computer that wasn’t multimedia? Once something becomes ubiquitous, it becomes invisible, but that doesn’t make it worthless.

There’s a joke about a grouchy goldfish. His friend swims by and says, ‘Isn’t the water lovely today?’ The goldfish thinks, ‘I have to get me some of that water he’s talking about.’ We’re in the business of providing the water.

Communicating Value

Now more than ever, it’s important to be able to clearly communicate the value of reliability. What is it and why do the folks in the C-suite with the sharp suits care? 

In SRE’s case it can be illustrative to think about where the approach came from — Google — and where Google made its money: ads. Here’s the thing about ads: nobody asks for them and nobody misses them when they’re gone. If you don’t serve the best possible ad when someone asks for it they don’t ask again. That opportunity is gone, like money in the rain. 

Every ad query you drop on the floor has a dollar value. That focuses the mind.

It might not be as easy for your business to put a dollar value on each request, but something the computers are doing is helping the business and it behooves you to know what it is.

9s don’t matter if Finance isn’t happy.

Do you know how your company makes its money? And I don’t mean in a hand-wavy, “it happens over there” sense. I mean, which routes, paths, and binaries give the beancounters the ability to pay you? Because if you don’t know, right now would be a good time to learn.

Does your business care deeply about paid customers, but not about free trials? Or are your users really sticky, so those folks in their first 7 days are the most important to serve? In a pinch, could you turn off reporting to allow credit card processing to continue?

To steal a phrase from Mike Julian at The Duckbill Group, “cost management is primarily an engineering problem, not a financial problem.” What’s your most expensive feature? Does the team that wrote it know?
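Going back to the question of turning off reporting so credit card processing can continue: one way to be ready for that pinch is to decide ahead of time which routes matter most and shed the rest under load. Here's a minimal sketch of the idea. The route names, tier names, and utilization thresholds are all made up for illustration; your own list should come from how your business actually earns its money.

```python
from enum import IntEnum


class Criticality(IntEnum):
    """Lower number = closer to the money. Tier names are hypothetical."""
    REVENUE = 0      # e.g. credit card processing
    CUSTOMER = 1     # things paying users look at directly
    BEST_EFFORT = 2  # e.g. reporting


# Hypothetical mapping of routes to tiers; unknown routes are best effort.
ROUTE_TIERS = {
    "/checkout": Criticality.REVENUE,
    "/dashboard": Criticality.CUSTOMER,
    "/reports": Criticality.BEST_EFFORT,
}


def should_serve(route: str, utilization: float) -> bool:
    """Decide whether to serve a request given utilization in [0.0, 1.0].

    The 0.80 and 0.95 thresholds are illustrative assumptions.
    """
    tier = ROUTE_TIERS.get(route, Criticality.BEST_EFFORT)
    if utilization >= 0.95:
        # Severe overload: only revenue-critical traffic gets through.
        return tier == Criticality.REVENUE
    if utilization >= 0.80:
        # Moderate overload: drop best-effort work first.
        return tier <= Criticality.CUSTOMER
    return True
```

With this in place, `should_serve("/reports", 0.85)` says no while `should_serve("/checkout", 0.99)` still says yes. The code is the easy part; agreeing on the tiers with the business is the real work.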

While we’re talking about expensive teams, let’s make sure we’re not one of them. Are we replicating data unnecessarily between clouds? How long are we keeping logs for? Are we keeping a record of every HTTP 200 we’ve ever served? Did we mean to?

When we’re focused on growth, we can lose track of our unit cost. When our focus changes, we need to know how much it costs us to serve each of our customers, both the model citizens and the noisy neighbors.
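As a rough starting point for unit cost, you can split a shared bill across customers in proportion to something you already measure, such as requests or CPU-seconds. The sketch below assumes exactly that. The function name, the customer names, and the numbers are invented for illustration, and proportional allocation is only one of several reasonable ways to slice a bill.

```python
from collections import defaultdict


def cost_per_customer(usage, monthly_bill):
    """Split a shared bill across customers in proportion to their usage.

    usage: iterable of (customer, units) records, where units is whatever
           you already measure (requests, CPU-seconds, GB stored).
    monthly_bill: total spend for the same period, in dollars.
    """
    totals = defaultdict(float)
    for customer, units in usage:
        totals[customer] += units

    all_units = sum(totals.values())
    if all_units == 0:
        # No recorded usage: nothing to attribute.
        return {customer: 0.0 for customer in totals}

    return {
        customer: monthly_bill * units / all_units
        for customer, units in totals.items()
    }


# Made-up numbers: two model citizens and one noisy neighbor.
usage = [("acme", 100), ("globex", 100), ("initech", 800)]
costs = cost_per_customer(usage, monthly_bill=1000.0)
```

Here `costs["initech"]` comes out at 800.0 of the 1000.0 bill. Put that next to what each customer pays you and the noisy neighbors tend to identify themselves.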

Inefficiency is an opportunity for great engineering work, and there are plenty of examples to inspire you. In this piece from Netflix they detail the journey down the stack to understand the CPU performance and latency of a microservice. The changes they made led to a better than 3x improvement in requests per second on the same hardware. Or how about this piece from Honeycomb showing the potential to eventually save 40% off their AWS EC2 instance bill by switching to ARM instances?

Not to sound like a wall hanging from Bed Bath & Beyond, but this Challenge is an Opportunity!

Nobody has more access than us. Nobody has more data than us. Nobody has more context than us. This is our opportunity.

It’s not what you do, it's the way that you do it.

In a piece entirely about being good for the company, here’s a bit about being good to you. Before you make a change that improves something, record the “before” state, preferably in a pretty graph. Measure the transition. There is nothing as useful at performance review time as art that shows that a change you made paid your salary.
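The "before" and "after" don't need to be fancy. A minimal sketch, assuming you can export the same metric (say, daily cost or p50 latency) for a window on each side of your change; the function name and the sample values are hypothetical:

```python
from statistics import median


def summarize_change(before, after):
    """Compare the median of a metric before and after a change.

    before, after: non-empty lists of samples of the same metric,
    taken over comparable windows. Medians are used so that a single
    bad day doesn't dominate the comparison.
    """
    b = median(before)
    a = median(after)
    return {
        "before": b,
        "after": a,
        "change_pct": (a - b) / b * 100,
    }


# Hypothetical daily cost samples, in dollars, for the week on each side.
before = [120, 130, 125, 140, 135]
after = [90, 95, 100, 105, 110]
summary = summarize_change(before, after)
```

For these made-up samples the median drops from 130 to 100, a change of roughly -23%. Plot both windows on one graph and you have your performance review art.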

Sisyphus did not have a point. 

What is the magic that allows us to keep pushing more of our lives over the internet? 

In our industry we tend to think of this as a reflection of Moore’s Law, but it turns out that Moore’s Law is just a special case of Wright’s Law: the more you do something, the cheaper it gets. From making McDonald’s to solar cells, the rule holds.

And we, my friends, are the living embodiment of Wright’s Law. 
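For the curious, Wright's Law is usually written as a power law: every time cumulative production doubles, the cost of a unit falls by a fixed percentage. Here's a small sketch of that arithmetic. The 20% learning rate and the starting cost are illustrative assumptions, not figures from any real system.

```python
import math


def wright_unit_cost(cumulative_units, first_unit_cost, learning_rate):
    """Cost of the n-th unit under Wright's Law.

    cost(n) = first_unit_cost * n ** log2(1 - learning_rate)

    learning_rate is the fractional cost reduction per doubling of
    cumulative production, e.g. 0.20 means each doubling is 20% cheaper.
    """
    if cumulative_units < 1:
        raise ValueError("cumulative_units must be at least 1")
    if not 0.0 <= learning_rate < 1.0:
        raise ValueError("learning_rate must be in [0, 1)")
    exponent = math.log2(1.0 - learning_rate)  # zero or negative
    return first_unit_cost * cumulative_units ** exponent


# Illustrative assumption: first unit costs 100, 20% cheaper per doubling.
costs = [wright_unit_cost(n, 100.0, 0.20) for n in (1, 2, 4, 8)]
```

Under these assumptions the cost per unit goes from 100 to about 80, 64, and 51.2 as the cumulative count doubles from 1 to 2, 4, and 8.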

The next computer will be able to do more work, the next version of the software should be more battle tested, and we get better at understanding our systems over time. In return for this we will demand more features, we will give the service to more users, and they will give us more data, but we will still get better at our job. 

In hard times we have to focus even more on eliminating toil. I define toil here as any work you’re doing that isn’t making your service better. Work you’re doing to stand still. Work that, if nothing changes, you’ll have to do again in a week, a month, a year. 

Sublinear scaling is one of the core value propositions of SRE. We tend to think of this in terms of growth, but in the absence of growth it still holds. If we’re doing our job correctly, systems should get easier to run over time. Note that the last sentence isn’t passive: there’s a job for us to do.

Tanya Reilly talks about giving your future self gifts. What did you do today to make your life easier tomorrow? What process is manual that you can automate? Are there processes that SREs are approving that could be fully delegated to product engineers?

It’s not grubby to care about money. It’s literally part of the job. For the last few years we’ve been optimizing for growth and velocity, and that was the right choice at the time. Right now it might be time to reintroduce ourselves to efficiency.