Reliability, the buzziest of SRE buzzwords. It’s right there in the name really. Site Reliability Engineering. We want everything to be reliable, right? Who wants things to be mediocre? Haphazard? Unreliable?
Okay, so we’re designing a reliable service. What do we need?
Of course true reliability is more than just availability, but I think it’s fair to say that’s the first order bit. (If you aren’t available you aren’t reliable.)
Modern high availability follows 3 basic principles:
- Elimination of single points of failure (aka “redundancy”)
- Reliable crossover (aka “failover”)
- Detection of failures as they occur (aka “triggering”)
Okay – we need to build a redundant system, with reliable crossover, that can detect failure.
Our first building block of high availability is building redundant systems, which is often referred to as horizontal scalability. This involves running identical, redundant services that are “hot” and ready to handle traffic. This approach is a traffic problem, raising questions like: how do I balance incoming load over N services? Running redundant systems “cold” (i.e. “disaster recovery” sites) also presents a traffic problem, as it requires moving load from the hot to the cold systems.
In order to achieve high availability, the mechanisms for spreading load (active/hot) or for shifting load (spare/cold) must be put in place. The design patterns discussed below help us to achieve reliable crossover when failures occur.
The most straightforward and easily understandable of the three principles is to detect when failure occurs and initiate a reliable crossover. Note that we are not talking about human intervention, such as paging someone, although that is an important last line of defense. We’re talking about automated failure detection with automated remediation. A good example of high availability failure detection is a load balancer health checking a backend service and removing it from the pool of “healthy” serving services when the checks fail. Another example is TCP’s congestion-avoidance algorithm, which is so effective that unless you have a lot of experience working on networking you probably don’t realize it’s even there.
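To make the load balancer example a bit more concrete, here is a minimal sketch of health-check bookkeeping. Everything in it is an assumption for illustration: the `BackendPool` class, the `probe` callable (which stands in for whatever check your load balancer actually runs), and the thresholds for how many consecutive results flip a backend's state. Real load balancers have their own configuration for this.

```python
from typing import Callable, Dict, List


class BackendPool:
    """Tracks which backends are healthy, based on consecutive probe results."""

    def __init__(
        self,
        backends: List[str],
        probe: Callable[[str], bool],  # hypothetical check: True means "healthy"
        unhealthy_after: int = 3,      # assumed: failures in a row before removal
        healthy_after: int = 2,        # assumed: successes in a row before re-adding
    ) -> None:
        self.probe = probe
        self.unhealthy_after = unhealthy_after
        self.healthy_after = healthy_after
        self.healthy: Dict[str, bool] = {b: True for b in backends}
        self.streak: Dict[str, int] = {b: 0 for b in backends}

    def run_checks(self) -> None:
        """Probe every backend once and update its health state."""
        for backend, is_healthy in list(self.healthy.items()):
            ok = self.probe(backend)
            if ok == is_healthy:
                # Result agrees with the current state: reset the streak.
                self.streak[backend] = 0
                continue
            # Result contradicts the current state: count it.
            self.streak[backend] += 1
            needed = self.healthy_after if ok else self.unhealthy_after
            if self.streak[backend] >= needed:
                self.healthy[backend] = ok
                self.streak[backend] = 0

    def serving(self) -> List[str]:
        """Backends that should currently receive traffic."""
        return [b for b, healthy in self.healthy.items() if healthy]
```

Requiring several consecutive results before changing state is one way to avoid flapping on a single dropped probe; the right thresholds depend on your system.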
So, we’ve got redundant systems, with reliable crossover, and failure detection initiating automated remediation. 100% uptime, right?
A significant (second only to code/config changes) cause of major service interruption is overload. Overload is simply what happens when your systems are asked to do more than they are capable of doing. But what are the most common causes of overload and what happens to your services when they become overloaded? How do we avoid overloading our own services?
Top-down overload is a classic unexpected surge in demand. A long time ago we used to call this “slashdotting” but the idea is simple – something external (a larger website linking to you, a viral video/meme/tweet, or your own marketing team starting a new campaign, aka “the Super Bowl Ad”) increases your real system load. It may sound simple, but how do your systems behave when they get 10 times normal traffic? What about 1,000 times normal? What about even more than that?
Bottom-up overload is not always seen as overload, but it plays out the same way. Bottom-up overload is when capacity becomes reduced or constrained somewhere further down in your stack. In a bottom-up overload scenario you are seeing normal incoming traffic levels but you are no longer able to handle them as normal. Bottom-up overload always involves some form of loss, so we may not think of it as overload but simply as a failure of a critical system. However, that loss typically puts more load on the remaining services, and they can become overloaded if there isn’t enough slack in the system to absorb it. Examples of bottom-up overload are losing a database shard, a Redis cluster, or an entire AWS/Azure/GCP region.
Both forms of overload can lead to cascading failure: one service or part of your infrastructure is overloaded and fails, which shifts the load to the remaining services or nodes, which also become overloaded and fail, and so on. The impact and duration of cascading failures are multiplied as we scale into massively interconnected systems. Cascading failure is how a small outage of one component ends up becoming a major global outage which may take hours to resolve.
Which brings us to…
You have external dependencies
You have external dependencies; everyone does. Even the largest cloud providers depend on BGP to route to other networks, and DNS depends on root name servers. Sometimes external dependencies are explicit. For example, I know I am depending on a given Cloud Provider, or I know I am making a call out to a third-party payment provider. Sometimes they are implicit. A Frontend developer may feel confident they can make a call to a Backend service and it will always be there, ready to answer a user’s call, but what about the DNS lookup to resolve that domain name? What about the network hops and path you take to get there? What happens when the call to a payment processing API doesn’t respond?
Resiliency By Design
Resiliency is not about perfection, it’s about harm reduction. “Graceful degradation”, or “fail better”, starts from the assumption that overload is going to happen at some point. Knowing this, we can intentionally design our systems to handle it in the most effective and least detrimental way possible.
For services running in Cloud Provider availability zones, that means running in multiple zones with enough spare capacity that you can lose at least one zone at peak and still prevent a cascading failure situation. Additionally, plan and regularly test for shifting load between your availability zones.
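The “enough spare capacity to lose a zone at peak” rule can be turned into back-of-the-envelope arithmetic. The sketch below assumes load spreads evenly across zones and that capacity scales linearly, which real systems only approximate; the function names are made up for this example.

```python
def required_zone_capacity(peak_load: float, zones: int, zones_lost: int = 1) -> float:
    """Capacity each zone needs so the surviving zones can absorb the whole peak."""
    survivors = zones - zones_lost
    if survivors < 1:
        raise ValueError("at least one zone must survive")
    return peak_load / survivors


def max_healthy_utilization(zones: int, zones_lost: int = 1) -> float:
    """Highest per-zone utilization at peak, with all zones up, that still
    leaves room to absorb the load of the lost zones."""
    survivors = zones - zones_lost
    if survivors < 1:
        raise ValueError("at least one zone must survive")
    return survivors / zones
```

For example, with a peak of 9,000 requests per second over three zones, `required_zone_capacity(9000, 3)` gives 4,500 per zone, and `max_healthy_utilization(3)` says no zone should run hotter than about 67% at peak while everything is healthy. Anything above that and losing a zone pushes the survivors into overload.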
Resilience, however, is not just an infrastructure problem. We build resilient network paths on top of resilient BGP and TCP/IP, and resilient hostname lookup on resilient DNS. We build resilient Applications on top of resilient Cloud Providers. These building blocks are amazing, but they can only take you so far. Ultimately you have to plan for failure and implement resilience strategies into every individual service to reach reliability goals of 99.9% availability and higher.
Let's discuss four important software resilience patterns.
Pattern 1: Retry
Most people already know the humble “retry” – a simple and powerful tool to smooth over small, fleeting hiccups (heck, retransmission is a core feature of TCP/IP!). When done well, this resilience pattern can be quite potent.
If you say to an SRE that you are implementing retries, their immediate response is likely to be: “with exponential backoff?” – and for good reason! Retries frequently help resolve minor transient issues, but they represent new requests and traffic, which still needs to be served by the system processing them. For a system on the verge of overload, a “retry storm” can easily cause a cascading failure. So definitely use retries, but do so safely!
📝Safe Retry Tips:
- Evaluate received response messages and codes to determine whether they represent a transient issue or not.
- Use an exponential backoff algorithm to add more wait time between each retry. Adding jitter as well can help avoid retry storms.
- Ensure you respect any retry or wait timing information received (e.g. HTTP Retry-After headers).
- Set a maximum retry limit. Infinitely retrying is a dangerous pattern – computers never give up.
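Here is a minimal sketch that puts the four tips together. It is not tied to any particular HTTP client: `send` is a hypothetical callable that returns a status code plus the parsed Retry-After value in seconds (or `None`), and the set of “transient” status codes, the base delay, and the cap are all assumptions you would tune for your own service. The jitter used here is the “full jitter” variant, where the wait is a random value between zero and the exponential ceiling.

```python
import random
import time
from typing import Callable, Optional, Tuple

# Assumed set of status codes worth retrying; adjust for your own API.
TRANSIENT_STATUSES = {429, 502, 503, 504}


def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff: a random wait in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))


def call_with_retries(
    send: Callable[[], Tuple[int, Optional[float]]],
    max_attempts: int = 5,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-transient status or attempts run out.

    send() is a hypothetical callable returning (status_code, retry_after_seconds),
    where retry_after_seconds is None if the server sent no Retry-After header.
    Returns the last status code seen.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        status, retry_after = send()
        if status not in TRANSIENT_STATUSES:
            return status  # success or a permanent error: retrying will not help
        if attempt == max_attempts - 1:
            break  # retry budget exhausted, give up
        delay = backoff_delay(attempt)
        if retry_after is not None:
            delay = max(delay, retry_after)  # never retry sooner than the server asked
        sleep(delay)
    return status
```

Passing `sleep` in as a parameter is just a convenience so the function can be exercised in tests without actually waiting.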
Pattern 2: Throttling
The throttling (aka rate-limiting) design pattern can be used to control the consumption of resources by an instance of an application, an individual tenant, or an entire service. Throttling can allow the system to continue to function, even when an increase in demand is pushing a system towards overload. Throttling decisions are commonly based on:
- System resource usage above configured levels (for example CPU, memory, or network saturation)
- Configured request limits (typically set by “load testing” your service in a controlled environment)
- IP address (for example, allowing a maximum number of requests from an IP address)
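One common way to enforce a request limit per IP address is a token bucket: each client earns tokens at a steady rate, up to a burst ceiling, and every request spends one. The sketch below is an illustration under assumptions, not a production limiter. The class names, the `rate` and `burst` parameters, and the injectable `clock` are all invented for this example, and the per-client dictionary grows without bound, so a real implementation would need to expire idle entries.

```python
import time
from typing import Callable, Dict


class TokenBucket:
    """Allows `rate` requests per second on average, with bursts up to `burst`."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.tokens = burst          # start with a full bucket
        self.updated = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PerClientThrottle:
    """Keeps one token bucket per client key (for example, an IP address)."""

    def __init__(self, rate: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.burst = burst
        self.clock = clock
        self.buckets: Dict[str, TokenBucket] = {}

    def allow(self, client_ip: str) -> bool:
        bucket = self.buckets.get(client_ip)
        if bucket is None:
            bucket = TokenBucket(self.rate, self.burst, self.clock)
            self.buckets[client_ip] = bucket
        return bucket.allow()
```

When `allow()` returns `False`, the caller would typically reject the request rather than queue it, so that the throttle actually sheds load.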
Even better than basic throttling is prioritized throttling – which is a great way to disable or degrade the functionality of selected nonessential services or requests in order to allow more “business critical” requests to succeed.
Depending on the type of service or resource being protected, some prioritized throttling examples are:
- Serve /login requests but drop /report requests.
- Serve Enterprise requests before Paid requests before Free requests.
- Serve search requests but drop all other requests.
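As a rough sketch of the first example, prioritized throttling can be as simple as a lookup table plus a load signal. The route names, the priority tiers, the utilization thresholds, and the choice to treat unknown routes as normal priority are all assumptions made for illustration; where the utilization number comes from (CPU, queue depth, in-flight requests) is up to your system.

```python
from enum import IntEnum
from typing import Dict


class Priority(IntEnum):
    CRITICAL = 0     # never shed, e.g. /login
    NORMAL = 1
    BEST_EFFORT = 2  # first to be shed, e.g. /report


# Hypothetical route-to-priority mapping.
ROUTE_PRIORITY: Dict[str, Priority] = {
    "/login": Priority.CRITICAL,
    "/search": Priority.NORMAL,
    "/report": Priority.BEST_EFFORT,
}

# Assumed utilization levels (0.0 to 1.0) at which each tier starts being dropped.
# CRITICAL has no entry, so it is never shed by this function.
SHED_AT: Dict[Priority, float] = {
    Priority.BEST_EFFORT: 0.80,
    Priority.NORMAL: 0.95,
}


def should_serve(path: str, utilization: float) -> bool:
    """Decide whether to serve a request given current utilization."""
    priority = ROUTE_PRIORITY.get(path, Priority.NORMAL)  # unknown routes: NORMAL
    return utilization < SHED_AT.get(priority, float("inf"))
```

The same shape works for the customer-tier example: map Enterprise, Paid, and Free to priorities instead of routes.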
Throttling, often used in conjunction with retries, is built into HTTP through the 429 (Too Many Requests) and 503 (Service Unavailable) status codes. When generating 429 or 503 responses, consider setting a Retry-After response header to inform clients when it’s safe to try again.
The most important part of throttling is thinking about and planning for failure. Determine what is “least bad” for end users and the business when overload occurs. Ruthlessly prioritize the importance of each dependency accordingly.
Pattern 3: Circuit breaking
Like retries, the circuit breaker design pattern is used to detect failures, but contains logic to prevent a failure from constantly recurring. Typically it’s implemented as a finite state machine with three normal states: CLOSED, OPEN and HALF_OPEN, and two special states, DISABLED and FORCED_OPEN, to maintain emergency manual controls. (Never give up your emergency controls!)
The idea is to use count or time-based sliding windows to measure failure or latency rates over a specified period, which determines the current circuit breaker state.
Closed: Referring to “closed circuit,” meaning failure and latency rates are below a triggering threshold, indicating that everything is operating as intended.
Open: Failure or latency rates exceed the triggering threshold for the given time or count window. The circuit breaker “pops” open, and subsequent calls following the same circuit breaker path will “fail fast,” avoiding unnecessary load to the external system. This has three key benefits:
- Failing fast is efficient: it eliminates waiting to see if the request might succeed. We know it won’t, so promptly handle the error and move on.
- Failing before making a request to the external service reduces load on that external service and could potentially prevent a cascading failure.
- It compels service owners to consider failure and integrate appropriate failure paths into the service, addressing questions like: How should we fail here? What can we display back to the user if we get an error here?
It’s worth saying again, resilience is not about perfection, it’s about harm reduction.
Half Open: How do you know if an open circuit has recovered? In the electrical world, you don’t. When a circuit trips in your house, you have to find the breaker and manually flip it closed again (hopefully after fixing whatever issue caused it!). In the digital world, however, a half-open state allows a limited number of requests to pass through, allowing us to determine if the problem persists. A properly implemented circuit breaker will switch from an Open to a Half Open state after an appropriate-for-your-system wait time, in an attempt to probe for automated recovery.
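The three normal states can be sketched as a small state machine. This is a simplified, single-threaded illustration under several assumptions: it uses a count-based sliding window, lets exactly one probe request through in the half-open state, treats any exception as a failure, ignores latency, and leaves out the DISABLED and FORCED_OPEN manual states. The class and parameter names are invented for this example; established libraries offer far more complete implementations.

```python
import time
from collections import deque
from typing import Callable, Deque, TypeVar

T = TypeVar("T")


class CircuitOpenError(Exception):
    """Raised when a call is rejected without being attempted (fail fast)."""


class CircuitBreaker:
    def __init__(self, window: int = 10, failure_rate: float = 0.5,
                 open_seconds: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window = window              # number of recent calls to consider
        self.failure_rate = failure_rate  # trip when this fraction of the window failed
        self.open_seconds = open_seconds  # wait before probing for recovery
        self.clock = clock
        self.state = "CLOSED"
        self.results: Deque[bool] = deque(maxlen=window)  # True means failure
        self.opened_at = 0.0

    def call(self, fn: Callable[[], T]) -> T:
        if self.state == "OPEN":
            if self.clock() - self.opened_at < self.open_seconds:
                raise CircuitOpenError("circuit is open, failing fast")
            self.state = "HALF_OPEN"  # wait elapsed: let this call through as a probe
        try:
            result = fn()
        except Exception:
            self._record(failure=True)
            raise
        self._record(failure=False)
        return result

    def _record(self, failure: bool) -> None:
        if self.state == "HALF_OPEN":
            # The single probe decides: reopen on failure, close on success.
            if failure:
                self._trip()
            else:
                self.state = "CLOSED"
                self.results.clear()
            return
        self.results.append(failure)
        window_full = len(self.results) == self.window
        if window_full and sum(self.results) / self.window >= self.failure_rate:
            self._trip()

    def _trip(self) -> None:
        self.state = "OPEN"
        self.opened_at = self.clock()
        self.results.clear()
```

Callers wrap each outbound request as `breaker.call(lambda: do_request())` (where `do_request` is whatever makes the call) and handle `CircuitOpenError` on their failure path, which is exactly the “how should we fail here?” question the pattern forces you to answer.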
Pattern 4: Bulkhead
Bulkhead isolation is an application design pattern that promotes fault-tolerant architecture. It isolates application elements into pools, so that if one fails, others continue to function. The pattern is named after a ship’s sectioned partitions (bulkheads): if the hull of a ship is compromised, only the damaged section fills with water, preventing the ship from sinking.
Also known as a “concurrency rate limiter,” this approach creates separate, distinct thread pools to isolate business logic. Typical implementations include pre-allocating fixed thread pools or using semaphores to dynamically control the concurrent thread count.
When implementing bulkhead isolation on the server side you want to partition service instances into different groups, based on load and availability requirements. When implementing on the client side you can partition so that resources used to call one service can’t starve out calls to a different service.
An example of server side bulkhead isolation is creating two separate pools of (the same) service – one for “heavy” requests that you expect to be expensive and long running and one for “normal” requests. With this setup you can prevent the less frequent, expensive, long-running requests from starving out the cheaper, faster, high-frequency requests.
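The same heavy/normal split can be sketched inside a single process using the semaphore approach mentioned earlier. This is an illustration only: the `Bulkhead` class, the pool names, and the concurrency limits are assumptions, and the choice to reject immediately when a pool is full (rather than wait in a queue) is one design option among several.

```python
import threading
from typing import Callable, TypeVar

T = TypeVar("T")


class BulkheadFullError(Exception):
    """Raised when a pool has no free slots and the call is rejected."""


class Bulkhead:
    """Caps how many calls may run concurrently inside one isolated pool."""

    def __init__(self, name: str, max_concurrent: int) -> None:
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def run(self, fn: Callable[[], T]) -> T:
        # Non-blocking acquire: reject right away instead of queueing behind slow work.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            return fn()
        finally:
            self._slots.release()


# Hypothetical pools: expensive requests cannot use up the slots of cheap ones.
heavy = Bulkhead("heavy", max_concurrent=2)
normal = Bulkhead("normal", max_concurrent=20)
```

A request handler would route each request through `heavy.run(...)` or `normal.run(...)`. If every heavy slot is busy, further heavy requests fail fast while normal requests keep flowing.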
📝Bulkhead Isolation Tips:
- Think carefully about what you want out of your bulkhead resource pools and how they will be separated from each other (e.g. separate VMs, containers, or processes?)
- Remember that bulkhead isolation is a trade-off between reduced efficiency (that “heavy” pool might end up being under-utilized while the “normal” tier is struggling) and increased reliability (through isolation) – you’ll want to monitor your partitions closely to optimize this trade-off.
- Consider combining bulkhead isolation with retry, throttling, and circuit breaking! It’s not an either/or choice; these patterns build upon each other.
Your reliability problems are really traffic problems
In cases where you rely on external or third-party resources, you can use established and proven software resilience patterns – already extensively used and tested in load balancing layers – at the application layer. Here, access to vital information – such as local system resource state and post-login user or team information – is available. Designing for reliability in the application layer enables each service to make independent decisions regarding retry, throttling, circuit breaking, and bulkheading, resulting in higher reliability and better scale.
When building a service or application, assume failures will happen and design for them accordingly. You’ll be very glad you did someday.