Nodeconf EU Reflections

My name is Tiarnán and I’m a systems-sided site reliability engineer. I grew up a sysadmin, and while I can write enough code to be dangerous, I am generally most comfortable far from the front end, down in the bit mines, where GUIs are things that happen to other people.

I’ve just returned from spending time with folks who live in a very different part of the stack than me, and I had a whale of a time. Nodeconf EU is the leading Node.js event in Europe and home to an amazingly friendly and open community. I love being in the presence of excited technologists and the energy and enthusiasm at Nodeconf EU was second to none.

**Tiarnán and a bunch of Node.js devs assess risk**

As with any time I’m allowed to soak in the revelations of others, it got me thinking. Let’s consider a few talks.

First up, James Snell with “Transferable AbortController: The story of a Node.js performance bug”. This was a compelling deep dive into the causes of a performance degradation that had been introduced, by James himself, in the core of Node. The bug amounted to ‘crossing the JavaScript/C++ boundary is slow, and this operation did it a lot’, but the story was told with ownership and humility. James is a key contributor to the Node.js community, well known by everyone at the conference, and here he was owning a mistake. This is about as good an example of ‘blamelessness’ as I’m going to have in my back pocket for quite a while.

Then Yagiz Nizipli with “The Road to a fast url parser in Node.js”. In this talk he presented a series of experiments he’d done as part of his master’s thesis. I’d gotten a preview over dinner the night before, and the thing that struck me was its similarity to problems elsewhere in the stack. The question Yagiz was working on was ‘is copying this data over to the faster processing worth it, given that the size of the data determines the cost of the copy?’ In SREland this is a very common issue, but it usually shows up at the network/machine boundary rather than between JavaScript and C++.
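To make that trade-off concrete, here is a minimal sketch of the kind of break-even calculation involved. It is my own illustration, not anything from Yagiz’s thesis: the `CostModel` shape, the function names, and every number in `exampleModel` are hypothetical.

```typescript
// Hypothetical cost model for "copy the data to the fast side, or stay put?".
// None of these fields or numbers come from the talk; they are illustrative only.
interface CostModel {
  fixedCrossingNs: number; // one-off cost of crossing the boundary
  copyNsPerByte: number;   // cost of copying each byte across
  slowNsPerByte: number;   // per-byte processing cost if we stay where we are
  fastNsPerByte: number;   // per-byte processing cost on the fast side
}

// Cost of processing the data without moving it.
function costInPlace(m: CostModel, bytes: number): number {
  return m.slowNsPerByte * bytes;
}

// Cost of crossing the boundary, copying the data, and processing it there.
function costOffloaded(m: CostModel, bytes: number): number {
  return m.fixedCrossingNs + (m.copyNsPerByte + m.fastNsPerByte) * bytes;
}

// True when moving the data is cheaper overall than processing it in place.
function shouldOffload(m: CostModel, bytes: number): boolean {
  return costOffloaded(m, bytes) < costInPlace(m, bytes);
}

// Payload size at which both options cost the same.
// Returns Infinity when the fast side never pays for the copy.
function breakEvenBytes(m: CostModel): number {
  const savingPerByte = m.slowNsPerByte - m.copyNsPerByte - m.fastNsPerByte;
  return savingPerByte <= 0 ? Infinity : m.fixedCrossingNs / savingPerByte;
}

// Made-up numbers, chosen only to show the shape of the decision.
const exampleModel: CostModel = {
  fixedCrossingNs: 500,
  copyNsPerByte: 0.5,
  slowNsPerByte: 4,
  fastNsPerByte: 1,
};
```

With these made-up numbers the break-even point is 200 bytes: a 64-byte payload is cheaper to handle in place, while a 4 KiB payload is worth the trip. Swap the fixed crossing cost for a network round trip and the same arithmetic describes the SRE version of the problem.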

We also had Marian Villa & Santiago Gimeno with “Dot, line, Trace!” They showed how to add metrics, logs and traces to Node.js applications and broke down what insights each of those sources provides.
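For anyone who hasn’t met the three signals side by side, here is a dependency-free sketch of what each one records. This is not the code from the talk, and a real service would most likely use a telemetry library rather than roll its own; the `Telemetry` class, `handleRequest`, and the `requests_total` counter name are all hypothetical.

```typescript
// Dependency-free illustration of metrics, logs and traces.
// All names here are hypothetical; this is not an API from any real library.
type LogRecord = { ts: number; level: "info" | "error"; msg: string; traceId: string };
type Span = { traceId: string; name: string; startMs: number; durationMs: number };

class Telemetry {
  readonly counters = new Map<string, number>();
  readonly logs: LogRecord[] = [];
  readonly spans: Span[] = [];

  // Metric: a cheap aggregate that answers "how often?"
  increment(name: string): void {
    this.counters.set(name, (this.counters.get(name) ?? 0) + 1);
  }

  // Log: a discrete event that answers "what happened?"
  log(level: LogRecord["level"], msg: string, traceId: string): void {
    this.logs.push({ ts: Date.now(), level, msg, traceId });
  }

  // Trace span: timing for one unit of work, answering "where did the time go?"
  span<T>(traceId: string, name: string, fn: () => T): T {
    const startMs = Date.now();
    try {
      return fn();
    } finally {
      // Record the span even if fn throws.
      this.spans.push({ traceId, name, startMs, durationMs: Date.now() - startMs });
    }
  }
}

// A pretend request handler that emits all three signals.
function handleRequest(t: Telemetry, traceId: string, path: string): string {
  t.increment("requests_total");
  return t.span(traceId, "handleRequest", () => {
    t.log("info", `handling ${path}`, traceId);
    return `ok:${path}`;
  });
}
```

The shared `traceId` is what lets you hop from a slow span to the log lines written during that same request, while the counter stays cheap enough to keep for every request.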

And finally, Vitaly Aminev talked about scaling real-time sports statistics to hundreds of thousands of clients. Load testing, scaling, all the architectural challenges you’d expect from that type of undertaking.

These were great talks that were eaten up by the audience and they covered some very SRE-like values. Blamelessness! Scaling! Reliability! Observability!

The ecosystem of technologies and specialities is broad, and none of us can know it all, but it does a body good to spend time ‘undercover’ and realize that so many of our concerns are cross-cutting with lots of lessons to be learned from the other side.

While writing this post and reflecting on the conference with Maggie, she made a great point: We run around making developers get code out the door 7 seconds faster while ignoring how long it takes to do anything quickly or with confidence once that work hits production.

Amazing things are done that make it faster and easier to ship code. We polish and polish the Inner Dev Loop. Where’s my Inner Ops Loop?

Same but different.

I was surrounded by people who desperately wanted to talk about the thing they just did, the thing they just built, or the last thing that caught fire and dragged them out of bed.

Here’s the observation: The Node folks don’t brag about the last thing.

They had those stories, because when I asked they had tales to tell, but it wasn’t where their minds went. It’s not the excitement they’re interested in. This is not because they’re a breed apart, it’s because they’re rewarded for different things, and they seem to feel a lot less control over the things that happen beyond their realm.

There was a general feeling that the further they strayed from the core of their expertise, the harder it was to understand what was going on or where to push when they wanted the system’s behavior to change. Observability has advanced a lot in Node over the last few years, but is still maturing, and the practitioners are too busy within their scope to be ready to reach up to the load-balancer, or down to the database.

We need to build the tools to allow developers to rapidly understand the ground they’re standing on and the weather overhead. Product experts, when pulled in to deal with incidents, or even just looking for operational insights, must be able to rapidly establish context without needing to gain a deep understanding of every tool and every layer in the stack.

“Business Intelligence” has transformed how CEOs, CFOs, and COOs make decisions. Information that was once available only to specialists, buried in logistics apps and finance files, is transformed into actionable information. We need to make ‘Production Intelligence’ available to the software engineers closest to the users without them needing to learn every nuance of every component they rely on.

And that’s where we come in. Stanza helps engineers understand what’s happening, collaborate efficiently, and act intelligently by fusing SRE wisdom with a real-time model of your production environment. We will deliver insights to you, as you need them, regardless of where in your production environment that knowledge lives. Or at least we will soon. :) Stay tuned!