Stanza is a company founded by SREs. When we started out, we kicked around a few ideas about what to focus on first, but the one that kept coming through loud and clear from our conversations with a myriad of organizations was ‘I know I need to evolve my operations in order to get SRE-like results, but I don’t know how to start.’
Like good new founders, we set out to talk to customers more deeply about this problem. There were a lot of discussions with people about having better incident management processes, defining and implementing SLOs, and monitoring more effectively. There was a lot of pain around the ‘sprawl’ of tools out there, and the confusing array of options. There was also a consistent struggle to make reliability intelligible to product organizations and the business. In all of this, one thing we saw consistently was teams and tools struggling to cut to the real heart of the problem - not just building a better process, or measuring more effectively, or getting more information, but actually making the system more reliable for the people who use it.
Good is a dashboard that says what is broken.
Great is a system that fixes itself, and tells you about it afterwards.
Of course, “How do I make a system that fixes itself?” is tough to answer. After some contemplation, we went back to the first principle of creating developer tools: identify the core primitives, and build a great API for controlling them.
This may sound abstract, but it’s actually what every great developer tool out there does. For example:
- Git gives developers control over their code in the form of ‘commits’
- React gives front end developers control over each element of their UI in the form of ‘components’
- Stripe gives developers control over money in the form of ‘payments’
Each of these tools takes that core primitive, gives it a great API, and builds an expansive ecosystem around it that affords engineers incredible control and flexibility.
But what is the core primitive of a large-scale distributed system? One might think it is services, but in a world with so much PaaS and so much innovation in the browser and on the edge, “service” is an incredibly complex and abstract concept that, at minimum, exists beyond a Kubernetes pod - not a primitive construct. The core primitive needs to be something else. In wrestling with this, the key insight came from (who else?) a hands-on SRE with 20 years’ experience, and one of Stanza’s first engineers. Matthew Girard said it in one sentence:
“Control requests, and you control everything.”
This is something that SREs at hyperscale companies know at a cellular level - that so many big bad problems have this aspect of requests getting out of control. And it’s something that every SRE at a growing company learns in a series of trials-by-fire, often struggling against suboptimal tools and a complex ecosystem that requires duct-taping and paperclipping together various tools to get results.
In fact, we dug into the data in the VOID (the Verica Open Incident Database) and found that “too many requests” was a pronounced commonality in incidents, with somewhere around 40% of incidents having a traffic management component.
When we spoke to customers, we also found this problem everywhere - hidden in things that all look rather different. Stuff like:
- Cloud region outages cutting processing power in half.
- Getting rate limited by GitHub, Stripe or OpenAI at high load.
- Hanging UIs that are constantly refreshed during backups - making the problem worse.
- Runaway serverless bills due to bot attacks or accidental recursion.
- Failures in Redis with no clear cause, that can be solved by reducing requests.
- Fan-outs blocking completion of requests on one thread.
In addition, we found that existing tools in this space were afflicted with two serious problems:
- No concept of user experience - requests are dropped without regard to their importance to customers. Because all work is done in a proxy layer, developers have no way to create fallback behaviors that provide a better user experience.
- Bolted to the “advanced infrastructure” stack - requiring specialized operators to configure and manage them, and only usable if you already have other specific technologies.
We can do better.
Stanza is a re-imagining of how engineering teams control requests - bringing powerful tooling directly to developers in good, clean code. At one level, it’s fancy rate limiting. At another level, it is a new way of protecting critical customer experiences. Stanza enables you to:
- Protect the customer experience by defending the most critical user journeys of your software from overload and interference by other workloads, and building UIs responsive to the infrastructure itself.
- Transparently communicate across your organization about the priority of features, whether the infrastructure can serve them, and when to gracefully degrade.
- Optimize reliability by calculating availability of critical features, and giving you insight into how to tune the infrastructure that supports them for optimum performance.
More concretely, with Stanza you can:
- Prioritize serving transactional workloads, like entering new data, over resource-heavy workloads, like reporting, when calling shared resources.
- Adaptively remove UI features that you cannot support right now, with custom messages.
- Calculate the SLO of a feature, or of requests of a certain priority.
- Prioritize paying customers over free customers in calls to rate-limited services like OpenAI or GitHub.
- Prevent services from being slammed by bugs in other services.
- Protect from infinite recursion in serverless environments.
- Write custom load balancers to distribute workload across resources.
- Enable customers of APIs to define the priority of their requests, within their own rate limit.
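To make a few of these ideas tangible, here is a minimal Python sketch of priority-aware request control with an in-code fallback and a per-priority availability number. It is not Stanza’s actual API: `PriorityBudget`, `handle`, the `"high"`/`"low"` priority labels, and the fixed per-window budget are all hypothetical names and simplifications chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class PriorityBudget:
    """Hypothetical request budget for one shared resource (illustration only).

    capacity: total requests allowed per window.
    reserved: part of the capacity that only "high" priority requests may use.
    """
    capacity: int
    reserved: int
    used: int = 0
    served: Dict[str, int] = field(default_factory=dict)
    rejected: Dict[str, int] = field(default_factory=dict)

    def try_acquire(self, priority: str) -> bool:
        # Low-priority traffic may only consume the unreserved share,
        # so high-priority traffic always has headroom left.
        limit = self.capacity if priority == "high" else self.capacity - self.reserved
        allowed = self.used < limit
        if allowed:
            self.used += 1
        counter = self.served if allowed else self.rejected
        counter[priority] = counter.get(priority, 0) + 1
        return allowed

    def reset_window(self) -> None:
        # Called at the start of each rate-limit window (assumed, e.g. once a second).
        self.used = 0

    def availability(self, priority: str) -> float:
        # Fraction of requests at this priority that were served so far.
        ok = self.served.get(priority, 0)
        total = ok + self.rejected.get(priority, 0)
        return ok / total if total else 1.0


def handle(budget: PriorityBudget, priority: str,
           do_work: Callable[[], str], fallback: Callable[[], str]) -> str:
    # The decision happens in application code, so the developer chooses
    # what the user sees when a request is shed instead of a bare error.
    if budget.try_acquire(priority):
        return do_work()
    return fallback()


if __name__ == "__main__":
    budget = PriorityBudget(capacity=5, reserved=2)
    for _ in range(4):
        print(handle(budget, "low", lambda: "report rendered",
                     lambda: "reports are paused, try again shortly"))
    for _ in range(2):
        print(handle(budget, "high", lambda: "record saved",
                     lambda: "record queued"))
    print(budget.availability("low"), budget.availability("high"))
```

In this sketch the fourth low-priority request gets the custom fallback message while both high-priority requests are served, which is the behavior the proxy-only approach cannot express. A real system would also need distributed coordination and time-based windows, which are left out here.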
We are really excited to hear your feedback!