Why Rate Limiting?

An announcement of Stanza's evaluation beta, and a discussion of why traffic management is so important to software operations.

Maggie Johnson-Pint

July 10, 2023

Stanza is a company founded by SREs. When we started out, we kicked around a few ideas about what to focus on first, but the one that kept coming through loud and clear from our conversations with a myriad of organizations was ‘I know I need to evolve my operations in order to get SRE-like results, but I don't know how to start.’

Like good new founders, we set out to talk to customers more deeply about this problem. There were a lot of discussions with people about having better incident management processes, defining and implementing SLOs, and monitoring more effectively. There was a lot of pain around the ‘sprawl’ of tools out there, and the confusing myriad of options. There was also a consistent struggle to make reliability intelligible to product organizations and the business. In all of this, one thing we saw consistently was that teams and tools struggled to cut to the real heart of the problem - not just building a better process, or measuring more effectively, or getting more information, but actually making the system more reliable for the people who use it.

Good is a dashboard that says what is broken.

Great is a system that fixes itself, and tells you about it afterwards.

Of course, “How do I make a system that fixes itself?” is tough to answer. After some contemplation, we went back to the first principle of creating developer tools: identify the core primitives, and build a great API for controlling them.

This may sound abstract, but it’s actually what every great developer tool out there does. For example:

  • Git gives developers control over their code in the form of ‘commits’
  • React gives front end developers control over each element of their UI in the form of ‘components’
  • Stripe gives developers control over money in the form of ‘payments’

Each of these tools takes that core primitive, gives it a great API, and builds an expansive ecosystem around it that affords engineers incredible control and flexibility.

But what is the core primitive of a large-scale distributed system? One might think it is services, but in a world with so much PaaS and so much innovation in the browser and on the edge, “service” is an incredibly complex and abstract concept that, at minimum, exists beyond a Kubernetes pod - not a primitive construct. The core primitive needs to be something else. In wrestling with this, the key insight came from (who else?) a hands-on SRE with 20 years’ experience, and one of Stanza’s first engineers. Matthew Girard said it in one sentence:

“Control requests, and you control everything.”

This is something that SREs at hyperscale companies know at a cellular level - that so many big bad problems have this aspect of requests getting out of control. And it’s something that every SRE at a growing company learns in a series of trials-by-fire, often struggling against suboptimal tools and a complex ecosystem that requires duct-taping and paperclipping together various tools to get results.

In fact, we dug into the data in the VOID (the Verica Open Incident Database) and found that “too many requests” was a pronounced commonality in incidents, with somewhere around 40% of incidents having a traffic management component.

When we spoke to customers, we also found this problem everywhere - hidden in things that all look rather different. Stuff like:

  • Cloud region outages causing processing power to drop in half.
  • Getting rate limited by GitHub, Stripe or OpenAI at high load.
  • Hanging UIs that are constantly refreshed during backups - making the problem worse.
  • Runaway serverless bills due to bot attacks or accidental recursion.
  • Failures in Redis with no clear cause that can be solved by reducing requests.
  • Fan-outs blocking completion of requests on one thread.

In addition, we found that existing tools in this space were afflicted with two serious problems:

  1. No concept of user experience - requests are dropped without regard for their importance to customers. Because all work is done in a proxy layer, developers have no way to create fallback behaviors that provide a better user experience.
  2. Bolted to the “advanced infrastructure” stack - requiring specialized operators to configure and manage them, and only usable if you already have other specific technologies.
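To make the first point concrete, here is a rough sketch of what becomes possible when the limit lives in application code instead of a proxy: when capacity runs out, the caller can serve a degraded result rather than a dropped request. This is not the Stanza SDK. `TokenBucket`, `withFallback`, and their parameters are hypothetical names invented for this illustration, and the token bucket is just one common way to meter requests.

```typescript
// Hypothetical illustration only; none of these names come from the Stanza SDK.
class TokenBucket {
  private tokens: number;
  private lastRefillMs: number;
  private readonly capacity: number;
  private readonly refillPerSecond: number;
  private readonly now: () => number;

  // `now` is injectable so the bucket can be driven by a fake clock in tests.
  constructor(capacity: number, refillPerSecond: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSecond = refillPerSecond;
    this.now = now;
    this.tokens = capacity;
    this.lastRefillMs = now();
  }

  // Refills based on elapsed time, then takes one token if one is available.
  tryAcquire(): boolean {
    const nowMs = this.now();
    const elapsedSeconds = (nowMs - this.lastRefillMs) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSeconds * this.refillPerSecond);
    this.lastRefillMs = nowMs;
    if (this.tokens >= 1) {
      this.tokens -= 1;
      return true;
    }
    return false;
  }
}

// Runs `work` when the bucket allows it; otherwise returns the fallback result
// (for example, cached data or a "try again shortly" message).
function withFallback<T>(bucket: TokenBucket, work: () => T, fallback: () => T): T {
  return bucket.tryAcquire() ? work() : fallback();
}
```

A caller might wrap an expensive report query with `withFallback(bucket, runReport, showCachedReport)`, where both functions are placeholders for application code. A proxy that only returns an error cannot make that choice on the user's behalf.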

We can do better.

Stanza is a re-imagining of how engineering teams control requests - bringing powerful tooling directly to developers in good, clean code. At one level, it’s fancy rate limiting. At another level, it is a new way of protecting critical customer experiences. Stanza enables you to:

  • Protect the customer experience by defending the most critical user journeys of your software from overload and interference by other workloads, and building UIs responsive to the infrastructure itself.
  • Transparently communicate across your organization about the priority of features, whether the infrastructure can serve them, and when to gracefully degrade.
  • Optimize reliability by calculating availability of critical features, and giving you insight into how to tune the infrastructure that supports them for optimum performance.

More concretely, with Stanza you can:

  • Prioritize serving transactional workloads, like entering new data, over resource-heavy workloads, like reporting, when calling shared resources.
  • Adaptively remove UI features that you cannot support right now, with custom messages.
  • Calculate the SLO of a feature, or of requests of a certain priority.
  • Prioritize paying customers over free customers in calls to rate-limited services like OpenAI or GitHub.
  • Prevent services from being slammed by bugs in other services.
  • Protect from infinite recursion in serverless environments.
  • Write custom load balancers to distribute workload across resources.
  • Enable customers of APIs to define the priority of their requests, within their own rate limit.
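The prioritization items in the list above share one idea: a single budget of requests, with lower-priority work shed first as the budget fills. The sketch below shows that idea in its simplest form. It is an assumption-laden illustration and not Stanza's implementation or API: the `Priority` levels, the `USABLE_SHARE` fractions, and the `PriorityQuota` class are all made up for the example.

```typescript
// Hypothetical sketch; the names and numbers below are illustrative, not a Stanza API.
type Priority = "critical" | "normal" | "background";

// Fraction of the shared budget each priority may consume before it is shed.
const USABLE_SHARE: Record<Priority, number> = {
  critical: 1.0,   // e.g. paying customers, transactional writes
  normal: 0.8,
  background: 0.5, // e.g. free-tier traffic, heavy reporting
};

class PriorityQuota {
  private used = 0;
  private readonly limit: number;

  constructor(limit: number) {
    this.limit = limit;
  }

  // Admits the request only while total usage is below this priority's share of the limit.
  tryAdmit(priority: Priority): boolean {
    if (this.used >= this.limit * USABLE_SHARE[priority]) {
      return false;
    }
    this.used += 1;
    return true;
  }

  // Call at the start of each rate-limit window (for example, once per second).
  reset(): void {
    this.used = 0;
  }
}
```

With a limit of 10 requests per window, background work is refused once 5 requests have been admitted and normal work once 8 have, while critical requests can use the whole budget. The last slots in every window are therefore always held back for the traffic that matters most.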

Today Stanza officially launched our private beta with the open-sourcing of our JavaScript SDK (Go coming next week), Go demo, and documentation. We are issuing accounts upon request, and hanging out in Discord to answer questions. Email hello@stanza.systems to get started.

We are really excited to hear your feedback! 
