Why Rate Limiting?

Stanza is a company founded by SREs. When we started out, we kicked around a few ideas about what to focus on first, but the theme that kept coming through loud and clear from our conversations with a myriad of organizations was ‘I know I need to evolve my operations in order to get SRE-like results, but I don’t know how to start.’

Like good new founders, we set out to talk to customers more deeply about this problem. We had many discussions about better incident management processes, defining and implementing SLOs, and monitoring more effectively. There was a lot of pain around the ‘sprawl’ of tools out there and the confusing myriad of options. There was also a consistent struggle to make reliability intelligible to product organizations and the business. Through all of this, we consistently saw teams and tools struggle to cut to the real heart of the problem - not just building a better process, or measuring more effectively, or getting more information, but actually making the system more reliable for the people who use it.

Good is a dashboard that says what is broken.

Great is a system that fixes itself, and tells you about it afterwards.

Of course, “How do I make a system that fixes itself?” is tough to answer. After some contemplation, we went back to the first principle of creating developer tools: identify the core primitives, and build a great API for controlling them.

This may sound abstract, but it’s actually what every great developer tool out there does. For example:

Git gives developers control over their code in the form of ‘commits’
React gives front end developers control over each element of their UI in the form of ‘components’
Stripe gives developers control over money in the form of ‘payments’

Each of these tools takes that core primitive, gives it a great API, and builds an expansive ecosystem around it that affords engineers incredible control and flexibility.

But what is the core primitive of a large-scale distributed system? One might think it is services, but in a world with so much PaaS and so much innovation in the browser and on the edge, a “service” is an incredibly complex and abstract concept that, at minimum, extends beyond a Kubernetes pod. It is not a primitive construct. The core primitive needs to be something else. In wrestling with this, the key insight came from (who else?) a hands-on SRE with 20 years of experience and one of Stanza’s first engineers. Matthew Girard said it in one sentence:

“Control requests, and you control everything.”

This is something that SREs at hyperscale companies know at a cellular level - that so many big bad problems involve requests getting out of control. And it’s something that every SRE at a growing company learns in a series of trials-by-fire, often struggling against suboptimal tools and a complex ecosystem that requires duct-taping and paperclipping together various tools to get results.

In fact, we dug into the data in VOID (the Verica Open Incident Database) and found that “too many requests” was a pronounced commonality in incidents - roughly 40% of incidents had a traffic management component.

When we spoke to customers, we also found this problem everywhere - hidden in things that all look rather different. Stuff like:

Cloud region outages causing processing power to drop in half.
Getting rate limited by GitHub, Stripe or OpenAI at high load.
Hanging UIs that are constantly refreshed during backups - making the problem worse.
Runaway serverless bills due to bot attacks or accidental recursion.
Failures in Redis with no clear cause that can be solved by reducing requests.
Fan-outs blocking completion of requests on one thread.

In addition, we found that existing tools in this space were afflicted with two serious problems:

No concept of user experience - requests are dropped without regard to their importance to customers. Because all work is done in a proxy layer, developers have no way to create fallback behaviors that provide a better user experience.
Bolted to the “advanced infrastructure” stack - requiring specialized operators to configure and manage them, and only usable if you already have other specific technologies.
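
To make the first problem concrete: when the limit is checked in application code rather than in a proxy, a rejected request can fall through to a cheaper response instead of an error. The following is a minimal Python sketch of that idea under stated assumptions, not Stanza's SDK. `ConcurrencyGuard`, `serve`, and the two report functions are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ConcurrencyGuard:
    """Hypothetical in-process guard that admits at most `limit` requests at once.

    Not thread-safe; a real implementation would need a lock or a semaphore.
    """
    limit: int
    in_flight: int = 0

    def try_acquire(self) -> bool:
        # Refuse the request when every slot is already taken.
        if self.in_flight >= self.limit:
            return False
        self.in_flight += 1
        return True

    def release(self) -> None:
        self.in_flight -= 1


def serve(guard: ConcurrencyGuard,
          primary: Callable[[], str],
          fallback: Callable[[], str]) -> str:
    """Run `primary` if the guard admits the request, otherwise run `fallback`."""
    if not guard.try_acquire():
        # A proxy could only return an error here. Application code can
        # degrade gracefully instead, for example with cached data.
        return fallback()
    try:
        return primary()
    finally:
        guard.release()


def live_report() -> str:
    return "live report"


def cached_report() -> str:
    return "cached report (may be stale)"


guard = ConcurrencyGuard(limit=1)
print(serve(guard, live_report, cached_report))   # slot free: primary runs

guard.try_acquire()                               # simulate another request holding the slot
print(serve(guard, live_report, cached_report))   # slot busy: fallback runs
guard.release()
```

The point of the sketch is the shape of `serve`: the decision to shed load and the decision about what the user sees live in the same place, so the developer controls both.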

We can do better.

Stanza is a re-imagining of how engineering teams control requests - bringing powerful tooling directly to developers in good, clean code. At one level, it’s fancy rate limiting. At another level, it is a new way of protecting critical customer experiences. Stanza enables you to:

Protect the customer experience by defending the most critical user journeys of your software from overload and interference by other workloads, and building UIs responsive to the infrastructure itself.
Transparently communicate across your organization about the priority of features, whether the infrastructure can serve them, and when to gracefully degrade.
Optimize reliability by calculating availability of critical features, and giving you insight into how to tune the infrastructure that supports them for optimum performance.

More concretely, with Stanza you can:

Prioritize serving transactional workloads, like entering new data, over resource-heavy workloads, like reporting, when calling shared resources.
Adaptively remove UI features that you cannot support right now, with custom messages.
Calculate the SLO of a feature, or of requests of a certain priority.
Prioritize paying customers over free customers in calls to rate-limited services like OpenAI or GitHub.
Prevent services from being slammed by bugs in other services.
Protect from infinite recursion in serverless environments.
Write custom load balancers to distribute workload across resources.
Enable customers of APIs to define the priority of their requests, within their own rate limit.
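
As an illustration of what prioritizing one class of traffic over another can look like, here is a minimal Python sketch of a token bucket that holds back part of its budget for higher-priority callers. It is a sketch under stated assumptions, not Stanza's implementation: the class name `PriorityTokenBucket`, the `reserves` parameter, and the `"paid"` and `"free"` labels are all hypothetical.

```python
import time
from typing import Callable, Dict


class PriorityTokenBucket:
    """Token bucket that keeps a reserve of tokens for higher-priority callers.

    Illustrative only; names and parameters are assumptions, not a real API.
    """

    def __init__(self, capacity: float, refill_per_sec: float,
                 reserves: Dict[str, float],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        # reserves[p] is the number of tokens that must remain in the bucket
        # after a request of priority p is admitted.
        self.reserves = reserves
        self.clock = clock
        self.tokens = capacity
        self.last = clock()

    def _refill(self) -> None:
        # Add tokens for the time elapsed since the last call, up to capacity.
        now = self.clock()
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.last = now

    def allow(self, priority: str) -> bool:
        """Spend one token if doing so leaves this priority's reserve intact."""
        self._refill()
        if self.tokens - 1 >= self.reserves[priority]:
            self.tokens -= 1
            return True
        return False


# Demo with a frozen clock so that no tokens are refilled between calls.
bucket = PriorityTokenBucket(
    capacity=10,
    refill_per_sec=1.0,
    reserves={"paid": 0, "free": 5},  # free traffic may not dip into the last 5 tokens
    clock=lambda: 0.0,
)
free_admitted = sum(bucket.allow("free") for _ in range(8))
paid_admitted = sum(bucket.allow("paid") for _ in range(8))
print(free_admitted, paid_admitted)  # prints "5 5"
```

With a shared budget of 10 calls, free traffic is cut off once half the budget is spent, while paying customers can still use what remains. The same shape applies to the transactional-versus-reporting case above: only the priority labels and reserve sizes change.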

Today Stanza officially launched our private beta with the open-sourcing of our JavaScript SDK (Go coming next week), Go demo, and documentation. We are issuing accounts upon request, and hanging out in Discord to answer questions. Email hello@stanza.systems to get started.

We are really excited to hear your feedback!
