Silence is golden for 48 hours; after that, check your alerts. - Old SRE Proverb
SREs are hired in the presence of instability. I’m sure there exists a company with perfect foresight that sees the challenges of growth coming, but I’m equally sure I don’t know its name. So let’s think about the common case.
Too often on-call is trial by fire. It shouldn’t be.
Can you answer these questions about your team’s alerts?
- How many happened out-of-hours?
Do you even have a definition of out-of-hours?
- How many resolved without intervention?
We find a good proxy for this is ‘under 5 mins’ because most humans can’t move that fast.
- Is your alerting load getting better or worse?
What’s your noisiest alert? Is it useful?
When was the last time you deleted an alert?
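If you can export your page history, most of these questions are a few lines of code away. Here is a minimal sketch, not a definitive implementation: the `Page` record, its field names, and the 09:00-18:00 weekday definition of "in hours" are all assumptions for illustration, so substitute whatever your paging tool and your team actually use. The five-minute cutoff is the proxy mentioned above.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Page:
    # Hypothetical record shape; map your paging tool's export onto it.
    name: str
    fired_at: datetime
    resolved_at: datetime


def is_out_of_hours(ts: datetime) -> bool:
    # Assumption: "in hours" is Monday-Friday, 09:00 up to 18:00 local time.
    return ts.weekday() >= 5 or not (9 <= ts.hour < 18)


def summarize(pages, self_resolve_cutoff=timedelta(minutes=5)):
    # Pages that closed faster than the cutoff probably resolved themselves.
    quick = sum((p.resolved_at - p.fired_at) < self_resolve_cutoff for p in pages)
    noisiest = Counter(p.name for p in pages).most_common(1)
    # Pages per (ISO year, ISO week) show whether the load is trending up or down.
    per_week = Counter(tuple(p.fired_at.isocalendar())[:2] for p in pages)
    return {
        "total": len(pages),
        "out_of_hours": sum(is_out_of_hours(p.fired_at) for p in pages),
        "likely_self_resolved": quick,
        "noisiest": noisiest[0] if noisiest else None,
        "per_week": dict(sorted(per_week.items())),
    }
```

Run it over the last month of pages and compare `per_week` across weeks; if the numbers only ever go up, that is your answer to "better or worse".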
Charity Majors is fond of saying “every dashboard is an answer to some long-forgotten question”. In a similar vein, alerts are monuments to past outages. It is good to remember; it is not good to be woken by a memory in the night. That’s a nightmare.
Does this alert tell you something now or did it tell you something once? Does it spark joy?
Ok, so joy is a high bar. Instead ask: does it bring you anger or knowledge? Assume that, outside a small and regrettable set of situations, receiving a page is worse than whatever you were going to do instead: walk the dog, make a cup of tea, continue being asleep. Given that it’s strictly worse than what you were doing, there needs to be a great reason for it to interrupt you.
Is it timely? Is it actionable? Your response to every page should include asking: does the next person (who might be me) want to receive this?
By timely I mean: could it have waited till morning? If I called you at 4am to say “THE DISK WILL FILL IN THREE WEEKS”, you’d want to beat me about the head and body with a fish. Why should we accept it from computers?
By actionable I mean: will I, the receiver of the page, do something right now that will positively impact my users right now? By way of example, “Our primary site is on battery power and if we don’t fail over right now we go dark” is a great page. Alerting because “There was one time when we saw slightly increased 99th percentile latency on database queries when memory got high on this box” is a crime.
For every alert you should be able to succinctly explain “Why do I care?”.
If the system alerted in the night and nobody was there to hear it, did it alert at all?
Do you have a Slack channel full of alerts that the team has been marking-as-read or muting for years? Why? Are you planning to mint NFTs to commemorate an outage? No? Delete it.
Every alert should go to the person who’s going to own it. Note that this may not be the person who fixes the problem - some problems are gnarly all-hands-on-deck affairs - but when the bell rings it should be calling someone to the fight.
While I’m on the topic - heroism. It’s bad, and we’re not the first to work that out. This is from NASA:
“Early airline captains were solo performers whose technical skills were sharpened by absolute necessity. Their selection, their environment, and their culture reinforced strong, independent personalities and isolated, individual decision making.”
Does this remind you of anyone we know? As large aircraft became more complex, a single pilot could no longer operate the aircraft, or as NASA puts it:
“The rugged, isolated individual was perhaps not the ideal model for what was now clearly a team activity.”
One person catches a page, but when a page is answered successfully it’s a team success. This team success can be in the form of training the person on-call, or in the form of docs/playbooks the receiver has access to, or help that the receiver pulls in if they need it.
Ask the question: how many people on the team could have fixed every page that went off in the last month? If that question fills your team with fear, then too much knowledge is kept in too few heads.
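One way to make that question concrete is to write down, for each alert that fired last month, who could have handled it. The sketch below is only an illustration under assumptions: the `can_fix` mapping is hypothetical data you would have to gather by asking the team, not something any tool exports.

```python
from typing import Dict, Set, Tuple


def coverage(can_fix: Dict[str, Set[str]]) -> Tuple[Set[str], Dict[str, str]]:
    """can_fix maps an alert name to the set of people able to resolve it."""
    # People who could have fixed every page: the intersection of all the sets.
    could_fix_everything = set.intersection(*can_fix.values()) if can_fix else set()
    # Alerts that only one person can resolve are where knowledge is too concentrated.
    single_owner = {
        alert: next(iter(people))
        for alert, people in can_fix.items()
        if len(people) == 1
    }
    return could_fix_everything, single_owner
```

If the first result is one name, or the second result is long, that is a list of playbooks to write and shadowing sessions to schedule.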
What can I do for you?
Managers, I know you’ve been nodding along so far thinking “Yes, yes! My engineers should totally do all those things!”, but maybe, just maybe, you were feeling left out. If so, here’s something for you.
The person catching the page needs certain things to be true or they’re going to have a bad time. These are controllable things, but they’re often not controllable by the engineer on-call.
If something can page you, you must have the power to fix it. That means visibility into the things that impact the services you’re responsible for and - this is important - the delegated authority (social and technical) to modify them so it’s better next time.
If all you can do is acknowledge a page and do the bare minimum of remediation, the situation isn’t getting better. And if it’s not getting better, it’s getting worse.
Engineers shouldn’t be expected to do sprint work while on-call. If your on-call shift is already excruciatingly peaceful and you’re worried the on-call engineer might fall asleep then opportunistic improvements in tooling are acceptable. But work that can block other people’s work should be banned during weeks on-call.
What does winning look like? When is on-call good enough?
For any process we should be able to define success.
Some organizations are moving towards a “volunteer only” on-call model: a reasonable payment for a reasonable amount of out-of-hours work. This allows management to have a very strong signal about the life of an on-call engineer. If this role is easy to staff, on-call is healthy. If it’s hard to staff, then more time must be spent improving the operational health of the supported services.
While not every organization will be interested in pursuing this approach, I think it’s illustrative of the hallmarks of a healthy alerting load.
Operational health is achieved when you’re confident that you’re meeting the needs of your users and your team is working in a sustainable way.
If it’s not getting better, it’s getting worse. Entropy doesn’t sleep, and if you don’t fight it, neither will you.