A Visit From the Ghosts of Failure Past, Present, and Future

Unlike poetry, learning from incidents is rarely done by rote, but in this case we'll make an exception.

Laura Nolan

January 10, 2023

'Twas the night before Christmas, when all through the land,

Not a pager was alerting, on the keyboard no hands.

SRE and DevOps and sysadmins, all without care,

Friends and family and festive treats were there.

The servers were nestled all snug in their racks,

Serving requests, writing logs, and the odd webhook to Slack.

Now deep in the system, problems begin,

User-visible failures now, we’re in a tailspin,

Triage and apply a generic mitigation,

Hold your breath until full system stabilization.

The aftermath begins, VP calls to customers,

They want RCAs now and no further unpleasantness,

They say repeat incidents are always unacceptable,

To fail in this way again would be just reprehensible.

The engineers to an incident review are summoned,

The executives sit there, all a-dudgeoned,

“How could this happen”, they angrily cry,

“We could be more reliable if you would only try,

If we had this alert, that process, this runbook”.

Counterfactuals aplenty and engineers on the hook.

They worked through the best part of the holidays,

Action items were completed, and they even fixed the interface,

Executives were satisfied and happily they said,

“The problem is fixed, we need not live in dread!”

For three more nights all was jolly and well,

Festive cheer was had and the parties were swell,

When suddenly the problem happened again,

Overload that the system just could not contain.

The CTO went striding out on the warpath,

When he was stopped short by the Ghost of Failure Past,

Unearthly and strange with bright scintillations,

It showed the CTO the problems with shallow remediations:

“If you do nothing but add more alarms and more process,

Then learning and analysis is what you must suppress,

If you think only of quick wins and what-ifs,

Then you cannot make a real engineering fix,

You should focus less on shallow action items,

And spend much more time on real learning outcomes.”

The CTO was wholly unconvinced,

“We know the root cause: it’s just bad engineers!

We must train them better, and then perhaps,

They’ll stop messing up and write reliable apps.”

Now the Ghost of Failure Present appears,

A sinister glowing beast with long pointed ears,

“What happens”, it asks, “when you blame engineers?

They learn to hide problems out of doubt and fear,

Nobody will tell the truth if their job is on the line,

They’ll cover things up and pretend all is fine,

And now you’re truly in trouble, my friend,

Your teams will make the same mistakes, time and again.”

“No no no!” cries the CTO, “I will not accept this!”,

“Engineers must get it right and no more excuses.”

But here now emerges the Ghost of Failure Future,

Pale and smooth like an eerie white sculpture.

Silently it leads the CTO onwards

To a terrifying vision set in the office,

Dashboards are red, SLOs are shattered,

Pagers are howling, responders are exhausted.

“As soon as I put one fire out another gets started,

These system problems have us totally outsmarted,

If only we understood what was really going on,

Everything we try just turns out to be wrong.”

Looking at the office the CTO sees,

How many people are brand-new to their teams.

“Why do my best staff leave?” he wonders,

The ghost shows him the answer as they turn the corner.

An incident meeting, engineer by the whiteboard,

Head bowed sadly as her manager roars,

Disapproval and blame for human error,

She’s planning the end of her three-year tenure.

“Maybe I was wrong”, the CTO thinks,

“We need better incident reviews or we’re going to sink”.

Cracking the HOWIE guide he settles down to implement

Better ways to learn from an unplanned event. 

The next day he convenes an all-hands meeting,

“In future our emphasis will be on reaching,

A deeper understanding of how we got here,

In each incident, without blame or fear”.

And incidents were analysed with care and attention,

Responder perspectives replaced the previous aggression,

Patterns were noticed and problems eliminated,

The engineering teams felt quite reinvigorated.

And the CTO said, thinking on that spectral night,

“Learning from incidents to all, and to all deep insight!”

Laura Nolan

Doing great engineering doesn't mean taking yourself seriously. Laura values whimsicality as much as she values clarity, precision, and learning.

Laura’s experience in operating distributed systems at scale ranges from data pipelines to edge networking, e-commerce, and messaging. She is hands-on and technical while also being involved in setting strategy.

As a resilience engineering advocate, Laura is integrating human factors awareness into Stanza's designs from the start. Laura is a member of the board of the USENIX Association and volunteers with the Campaign to Stop Killer Robots. Laura lives in rural Ireland, where she is chief-of-staff to two demanding cats.