Incidents and Accidents

Dear Reader,

I have a confession.

I, an SRE in good standing, adherent to blamelessness, lesser saint of SLOs, minor daemon of the pager, have broken with orthodoxy.

Even as I kneel at the altar of the good book (which, in his defense, really upsets Niall), I am a heretic!

I believe in repeat incidents. Whisper it! Shout it! Believe it! I believe it because I have seen it.

“I saw Goody Proctor with the Pager!”

There is a trope in safety systems writing that says “no two incidents are ever the same”, and in some ways it’s right. Maybe it’s the first time this combination of folks has caught this page. Maybe it’s the first time that you caught a page in the shower (spoilers: if you do this job for any length of time, it won’t be the last). It might be the time that you caught a page in your delightful new green sweater.

Some of these are distinctions without a difference.

Repeat incidents happen and they happen all the time.

Ship a config that didn’t parse? That’s an incident.

Do it again? That’s the same incident!

Truncate a config? That’s an incident.

Do it again? That’s the same incident!

And finally, a favorite of mine: ship a website that renders blank on old versions of Internet Explorer? That’s an incident.

Heraclitus might say “No man ever steps in the same river twice, for it's not the same river and he's not the same man”, but I’ll say “if you rolled your ankle the first time you crossed the river, wear boots the second time”.

My goal is to keep the positive parts of ‘no two incidents are the same’ (give people respect and the benefit of the doubt, and understand that the fog of war during an incident sometimes leads us to the wrong decisions) without extending that mantra so far that we never attempt to eliminate entire classes of error.

Jennifer Mace, in her piece on “Generic Mitigations” for O’Reilly, broke down a number of excellent approaches that you can prepare and drill to make you more ‘ready for anything’. Rollback, scale, isolate: these are all great and you should read the piece. What I want to talk about today are more “Generic Preparations” (patent pending).

What can you do to make sure you’re not making a mistake right now? What are the robust checks you can add that protect you from common issues?

Let’s start with the most fundamental. You’re probably already doing this; I’m just mentioning it because it illustrates the point: requiring a working build before committing to main.

Why do we do this? Because it’s an easy way to check that we haven’t accidentally added local dependencies and that our code works with whatever the most recent changes are.

What are other things that are in the same form?

Should this file ever get shorter?

Two of the biggest companies you know, famous for their operational excellence, have had significant outages in their corporate environments because they removed half of a zone file. Do you have an interest in resolving hosts that begin with any letter after F? Hosts like LDAP? NFS? PHONE? OOPS!

Rebuilding enough of your internal zone file from the contents of your ssh known_hosts file is a lot less fun than it sounds.

If the default case for a file is ‘it grows’, why not make shrinking it by more than 1-2% trigger a warning?
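Here is a minimal sketch of that check, assuming you can get the old and new contents of the file as strings (say, from your version control system in a pre-merge hook). The function name `shrink_warning` and the 2% default are my own illustrative choices, not anybody's real tooling.

```python
from typing import Optional


def shrink_warning(old_text: str, new_text: str,
                   max_shrink: float = 0.02) -> Optional[str]:
    """Return a warning string if the file lost more than max_shrink of its lines.

    Hypothetical helper: the threshold and message format are assumptions.
    """
    old_lines = len(old_text.splitlines())
    new_lines = len(new_text.splitlines())
    if old_lines == 0:
        return None  # nothing to shrink from
    shrink = (old_lines - new_lines) / old_lines
    if shrink > max_shrink:
        return f"file shrank by {shrink:.0%} ({old_lines} -> {new_lines} lines)"
    return None  # file grew, stayed the same, or shrank within tolerance
```

A warning rather than a hard failure is probably the right default: sometimes you really do mean to delete half the zone file, and a human saying "yes, I meant that" is cheap.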

What operating system are you using? Is it the same as your users? How about browsers?

What happens if a web hosting company has a significant portion of its users running an old version of Internet Explorer, but its own engineers all run the latest browsers? You’re left in a situation where sometimes the JavaScript doesn’t load and some percentage of your users are looking at a blank page.

Does it matter if the bug that caused the page to be blank is different from the last bug? No. All the user sees is an outage, and depending on how you’re monitoring your site the first you hear about it might be the folks with pitchforks coming for your customer operations team.

Could we test to make sure we render ‘not blank’ on all the major browsers?

Seems like a choice we could make.
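As a rough sketch of what ‘not blank’ could mean in practice: assume you already have some way (not shown here, and very much your choice of browser automation) to capture the page’s HTML after scripts have run in each browser you care about. Given those snapshots, the check itself is small. The names `looks_blank` and `blank_browsers` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Dict, List


class _VisibleText(HTMLParser):
    """Collects text that sits outside tags a browser would not display."""

    HIDDEN = {"head", "script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.depth = 0          # how many hidden tags we are currently inside
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.HIDDEN:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.HIDDEN and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth == 0:
            self.chunks.append(data)


def looks_blank(dom_html: str) -> bool:
    """True if the snapshot contains no visible text at all."""
    parser = _VisibleText()
    parser.feed(dom_html)
    parser.close()
    return "".join(parser.chunks).strip() == ""


def blank_browsers(snapshots: Dict[str, str]) -> List[str]:
    """Given {browser name: rendered HTML}, list the browsers that saw nothing."""
    return sorted(name for name, html in snapshots.items() if looks_blank(html))
```

This only looks for text, so a page that is all images would need a different definition of blank. The point is that the test does not care which bug emptied the page, only that the page is empty.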

This particular issue was the largest single contribution to SLO burn for a very large web hosting company whose name you know.

Why make new mistakes when we can make old mistakes, in a new order?

A man walks up to the microphone. You feel it before he says it: "more of a comment than a question".

He says: “Tiarnán! We're hip, we're modern! Every behavioral change in our code is behind a… a Feature Flag!”

Applause fills his ears. A single tear rolls down his cheek. The beauty of it. The wonder.

Then I whisper: “How many of the flags have been deployed at 100% for months? What would happen if your feature flag service failed?”

Well, I mean, the flag would revert.

Well, more correctly "those" flags would revert.

And has that combination ever been run together?

I'm not asking for tested.

I'm not asking for run in production.

Has that binary ever run with those flags before?

Automate the auditing of all flags to find any that have been 100% for more than a month.

Then nail them on.

In code.

Thank me later.
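The audit half of that might look something like the sketch below. I’m assuming your flag service can export, for each flag, its current rollout percentage and the date it reached that percentage; the `Flag` record and its field names are invented for illustration and will not match any particular flag product.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List


@dataclass
class Flag:
    # Hypothetical export format; adapt to whatever your flag service provides.
    name: str
    rollout_percent: int
    at_current_percent_since: date


def stale_full_rollouts(flags: List[Flag], today: date,
                        max_age: timedelta = timedelta(days=30)) -> List[str]:
    """Names of flags that have sat at 100% for longer than max_age."""
    return [
        flag.name
        for flag in flags
        if flag.rollout_percent == 100
        and today - flag.at_current_percent_since > max_age
    ]
```

Run it on a schedule and file a ticket per flag it returns. The nailing-on part is still a code change, and that one is on you.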

Gaze not into the JSON, for the JSON gazes also into you.

Are you committing JSON? Which is better at ensuring that it’s valid JSON, you or the machine?
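The machine, obviously. A minimal sketch of the check, using only Python’s standard `json` module; the wrapper name `json_error` is mine, and wiring it into a pre-commit hook or CI step is left to you.

```python
import json
from typing import Optional


def json_error(text: str) -> Optional[str]:
    """Return None if text parses as JSON, otherwise a short description of why not."""
    try:
        json.loads(text)
    except json.JSONDecodeError as err:
        # lineno and colno point at where the parser gave up
        return f"line {err.lineno}, column {err.colno}: {err.msg}"
    return None
```

This catches the trailing comma and the truncated file. It does not tell you whether the values make sense to the program reading them, which is a separate (and also automatable) check.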

At SREcon this year the brave and noble souls of LinkedIn talked about what happened when they shipped a broken config to their Apache hosts and the hosts started to crashloop. What I found interesting was that a number of generic preparations they had used historically had been removed during an upgrade.

If they’d made the same mistake a year or two earlier, their editor would have caught the invalid file format, the CI system would have noted that --config-test failed, and the crash loop would have automatically rolled back to a known good config.

These were all removed by well-meaning folks during an upgrade.
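To make the layering concrete, here is a toy sketch of the last two layers: validate before shipping, and fall back to the known good config if the service will not start. This is not how LinkedIn did it (I don’t know their implementation); `validate` and `start` are hypothetical callables standing in for your own config test and your own service start-up check.

```python
from typing import Callable


def apply_config(candidate: str, known_good: str,
                 validate: Callable[[str], bool],
                 start: Callable[[str], bool]) -> str:
    """Try to put candidate live; return whichever config ends up running.

    validate(cfg) -> True if the config passes its test (hypothetical hook).
    start(cfg)    -> True if the service came up healthy with cfg (hypothetical hook).
    """
    if not validate(candidate):
        return known_good          # rejected before it ever ships
    if not start(candidate):
        start(known_good)          # crashed on start: roll back
        return known_good
    return candidate
```

Each layer is a few lines. That is exactly why they are easy to delete during an upgrade without anyone noticing what they were for.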

Vuja De, the strange feeling that, somehow, none of this has ever happened before.

It’s not good enough to put the fix in; we need to document and communicate the value of that fix so that the knowledge propagates forward through time.

Even with these preparations, outages will happen, and they’ll be exciting and you’ll get to tell stories about them, but can we try and keep them interesting by making new mistakes?