Graceful degradation and SLOs

A discussion of graceful degradation mechanisms and their interaction with SLOs.

What is graceful degradation?

In our context, graceful degradation is the idea that, when you can’t serve the user precisely what they asked for, you serve them something in between rather than an error.

The details of this depend a lot on what exactly it is you’re trying to do. Let’s look at a few examples.

Image server

Generally speaking, an image server has a very large collection of images, stored on disk of some kind and accessed via a filename or unique identifier in the URL, and the aim is to serve the requested image as quickly as possible. (It’s a very close cousin of a blob server - binary large object server - except the blobs are explicitly designated as images.) If you can’t access the file (or the filesystem), or the content is otherwise missing, picking another image from the filesystem to serve is not tenable, for obvious random-roulette reasons.

There are three possible approaches here.

Taking advantage of the fact that you know it’s supposed to be an image, you sometimes have enough hints from the URL parameters or metadata to return an image of the right size and format, with hard-coded 404-equivalent content in the image itself. This is often useful in and of itself, because the user can tell that you received and processed the request, but the image is missing.
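
As a rough sketch of this first approach - assuming a Python service using Pillow, with store.get standing in for whatever blob lookup actually exists - a placeholder generator keyed off the requested dimensions and format might look like this:

```python
from io import BytesIO

from PIL import Image, ImageDraw  # Pillow; assumed to be available in the serving stack


def placeholder_image(width: int, height: int, fmt: str = "PNG") -> bytes:
    """Render hard-coded 'image not found' content at the requested size and format."""
    img = Image.new("RGB", (width, height), color=(230, 230, 230))
    draw = ImageDraw.Draw(img)
    draw.text((10, height // 2), "image not found", fill=(120, 120, 120))
    buf = BytesIO()
    img.save(buf, format=fmt)
    return buf.getvalue()


def serve(image_id: str, width: int, height: int, fmt: str, store) -> tuple[int, bytes]:
    """Return (status, body); degrade to a placeholder when the blob is missing."""
    blob = store.get(image_id)  # hypothetical blob-store lookup
    if blob is not None:
        return 200, blob
    # Degrade: right shape, right format, explicitly 404-equivalent content.
    return 404, placeholder_image(width, height, fmt)
```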

Another technique applies when the image content is intended to be versioned - i.e. version 2 of an image might be an evolution of version 1. In that case, you could return an older version (perhaps with out-of-band or in-band signalling that it’s old, or which precise version it is).
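
A minimal sketch of the versioned fallback, assuming a hypothetical store.get(image_id, version) lookup; the response header names used for signalling are likewise illustrative:

```python
def serve_versioned(image_id: str, requested_version: int, store) -> tuple[bytes, dict]:
    """Try the requested version, then walk back to older ones, signalling staleness."""
    for version in range(requested_version, 0, -1):
        blob = store.get(image_id, version)  # hypothetical versioned lookup, or None
        if blob is not None:
            headers = {"X-Image-Version": str(version)}
            if version != requested_version:
                headers["X-Image-Stale"] = "true"  # out-of-band signal that it's old
            return blob, headers
    raise FileNotFoundError(image_id)
```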

Finally - and this is a very generically reusable technique - it’s possible to keep a cache (usually LRU, or equivalent), so that even if you can’t find the file on disk, there’s a decent chance it can be found in a cache you’re keeping elsewhere. In cases like this, the user might not even perceive an error (unless the cache is unevenly distributed, or you need to update the image), but the fact that this happened at all should be recorded somewhere, for all kinds of good operational reasons.
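
Here is a sketch of the cache-fallback idea, including the operational logging argued for above; the FallbackCache class and the store.read call are illustrative, not any particular library:

```python
import logging
from collections import OrderedDict

log = logging.getLogger("image-server")


class FallbackCache:
    """A small LRU kept alongside the primary store; an illustrative sketch only."""

    def __init__(self, capacity: int = 10_000):
        self._items: OrderedDict[str, bytes] = OrderedDict()
        self._capacity = capacity

    def put(self, key: str, value: bytes) -> None:
        self._items[key] = value
        self._items.move_to_end(key)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # evict the least-recently-used entry

    def get(self, key: str):
        if key in self._items:
            self._items.move_to_end(key)
            return self._items[key]
        return None


def serve_with_cache(image_id: str, store, cache: FallbackCache) -> bytes:
    try:
        blob = store.read(image_id)  # hypothetical primary read from disk
        cache.put(image_id, blob)    # keep the cache warm on the happy path
        return blob
    except OSError:
        cached = cache.get(image_id)
        if cached is not None:
            # The user may never notice, but record the event for operational visibility.
            log.warning("served %s from fallback cache; primary read failed", image_id)
            return cached
        raise
```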

Blog

A blog often has a newsfeed-like structure, with posts organised by date, each of which can be individually accessed, or some notion of accessing “the latest content for poster X”.

Here we observe that if you can’t access the content for the specific post the requestor is looking for, there is generally not much value in serving them a different one. Occasionally, posts have strong mappings to particular topics, and a user could be offered the opportunity to pick other posts on those topics if the desired one is not available.

There is often value in serving them the entire corpus of posts available, even if it isn’t actually complete, if that’s what the user has requested - they may be interested in looking at the set as a whole without a commitment to any particular one of them.

As above, caching and versioning can often be deployed to good effect, but the key enabling technique is that a newsfeed architecture allows you some leeway to provide out-of-date content to some subset of users without them even realising things are broken. (Of course, there are some who will be very much aware…)
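
As a sketch of how a newsfeed might fall back to a recent snapshot when the fresh read fails - db and stale_cache here are placeholders for whatever storage is actually in use:

```python
import time


def get_feed(author_id: str, db, stale_cache, max_staleness_s: int = 3600):
    """Return (posts, is_stale); fall back to a recent snapshot if the fresh read fails."""
    try:
        posts = db.latest_posts(author_id)  # hypothetical fresh read
        stale_cache.store(author_id, posts, time.time())
        return posts, False
    except ConnectionError:
        snapshot = stale_cache.load(author_id)  # (posts, stored_at) or None
        if snapshot is not None:
            posts, stored_at = snapshot
            if time.time() - stored_at <= max_staleness_s:
                return posts, True  # degraded: not up to date, but usable
        raise
```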

Compute provision

The above examples are focused on providing content which is, in some sense, already computed: files on disk, posts in databases, etc. - things for which the request/response mapping is inherently clear, and the work has already been done. Another set of services are dynamic compute provision services, where the answer is not to ship something static out to the network as quickly as possible, but instead to dynamically calculate it. Examples include ML model questions and answers, fractal image calculation, lambdas/functions-as-a-service, and so on.

This is a significant change from the above, creating both constraints and opportunities. For (say) fractal image creation, if the local computation fails, it’s possible to retry on-site, or even off-site, where the chances of failure are different enough to make the attempt worthwhile. If it succeeds, the graceful degradation has merely come at the expense of latency - though with a large enough delay, some users will abandon the result before it’s provided. Similar opportunities for providing lower-resolution or cached images also exist.
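
A sketch of the retry-then-go-off-site pattern, assuming the requests library for the off-site call; the fallback URL and local_render callable are purely illustrative:

```python
import requests  # assumed HTTP client for the off-site fallback

FALLBACK_URL = "https://fallback.example.invalid/render"  # illustrative endpoint


def render_fractal(params: dict, local_render, max_local_attempts: int = 2) -> bytes:
    """Try local compute a couple of times, then an off-site renderer.

    The degradation here is latency: the user waits longer but still gets a result.
    """
    last_error = None
    for _ in range(max_local_attempts):
        try:
            return local_render(params)
        except Exception as exc:  # a sketch; be more selective in real code
            last_error = exc
    # Off-site retry: a different machine or provider has a different enough
    # failure profile to make the extra attempt worthwhile.
    resp = requests.post(FALLBACK_URL, json=params, timeout=30)
    if resp.ok:
        return resp.content
    raise RuntimeError(f"off-site render failed: {resp.status_code}") from last_error
```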

For ML model answers, it might be that the user has a strong preference for, say, an answer from GPT-5, but if GPT-5 is unavailable and Claude is, then perhaps Claude’s answer would be good enough for them. This is similar to certain high-availability architectures where you send a request to a number of back-ends at once and the first one to respond wins; here you could imagine some matrix of response quality, latency and availability going towards selecting the right response, though this necessarily leads to wasted compute. Arguably, anything other than the result that would have ranked top had everything worked correctly is a graceful degradation. Again, caching techniques can potentially be useful here, but the wider your query-space, the higher the risk that the questions won’t display the kind of power-law distribution that makes caching very relevant.
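
A sketch of the race-the-back-ends idea using Python’s concurrent.futures; the backend labels (including "preferred") and the callables behind them are illustrative rather than any vendor’s real API:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def first_acceptable_answer(prompt: str, backends: dict, timeout_s: float = 10.0):
    """Fan the question out to several model backends; take the first usable reply.

    backends maps a label (e.g. "preferred", "fallback") to a callable taking the prompt.
    """
    with ThreadPoolExecutor(max_workers=len(backends)) as pool:
        futures = {pool.submit(fn, prompt): name for name, fn in backends.items()}
        for future in as_completed(futures, timeout=timeout_s):
            if future.exception() is None:
                name = futures[future]
                # Anything other than the preferred backend's answer is a degraded
                # response; the still-running requests are wasted compute.
                return future.result(), name, (name != "preferred")
    raise RuntimeError("no backend produced an answer")
```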

Sometimes this is combined with static content provision: for example, search and ranking, where the dynamic computation is to return a set of documents ordered in some (generally) dynamically calculated way, but the content itself is static. Caching can often work wonders here, but if your index is unavailable, a graceful degradation might be to provide some set of documents known to contain the tokens in question, as a consequence of previously cached results.
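
A sketch of that index-unavailable fallback, with index and results_cache standing in for the real components (the cache maps normalised tokens to document ids seen in past results):

```python
def search(query: str, index, results_cache) -> tuple[list, bool]:
    """Return (doc_ids, degraded); reuse previously cached results if the index is down."""
    try:
        return index.rank(query), False  # hypothetical ranked retrieval
    except ConnectionError:
        candidates = []
        for token in query.lower().split():
            candidates.extend(results_cache.get(token, []))
        # No fresh ranking available: de-duplicate, keep insertion order, mark degraded.
        return list(dict.fromkeys(candidates)), True
```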

In general, dynamic compute provision can extend the variety of substitute results which can be provided, at the expense of other aspects of the user experience.

SLOs and graceful degradation

Now to consider the question of measurement.

If we consider a simple HTTP request/response service, the classic approach of dividing the successful responses by the total requests has some difficulties in the context of graceful degradation. Those difficulties can be summarised as: what do we consider a successful request?

Suppose, in the image server case above, we cannot find the actual image but provide a cached version (maybe at some essentially negligible latency cost). Is that a success? Well, the user got what they wanted, which is good, but we have an image missing on disk, which is bad. Should we include it as a success or not? If we don’t, we’re potentially missing SLOs where the user population has actually had a totally fine experience, which is contrary to the point of SLOs; if we do include them as successes, we’re operating in ignorance of key files missing on disk, and at some point, when we run out of cache, the user experience could nose-dive.

Furthermore, suppose we determined that we will include cached results as successes, but then we “upgrade” the cache system and now it operates about 50% more slowly. Do we still count them as successes, even though we’re now affecting web performance stats such as LCP (Largest Contentful Paint) and the user experience generally? Or suppose one popular image is much slower while the rest are fine - and so on and so forth.

You can see where I’m coming from - if there’s a set of criteria that would allow you to include something as a success, you can probably find a situation relevant to graceful degradation where a particular result could be argued to be on either side of the boundary for inclusion/exclusion.

The key problem to avoid is having different parts of your SLO system/pipeline/organisational architecture make different decisions about the semantics. That way lies perdition. Make the same decisions across your org, or centralise the decision-making, or you’ll be unable to agree on what the user experience is or should be. Bad news.
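
One way to keep that decision in one place is to express it as a single, shared classification function that every pipeline uses. A sketch, with entirely illustrative field names and thresholds:

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    SUCCESS = "success"
    DEGRADED_SUCCESS = "degraded_success"  # counts for the SLO, but tracked separately
    FAILURE = "failure"


@dataclass
class Response:
    status: int
    latency_ms: float
    served_from_cache: bool
    stale_version: bool


CACHED_LATENCY_BUDGET_MS = 400  # illustrative; the real number is a product decision


def classify(resp: Response) -> Outcome:
    """The single, org-wide answer to 'was this request a success?'."""
    if resp.status >= 500:
        return Outcome.FAILURE
    if resp.served_from_cache or resp.stale_version:
        if resp.latency_ms <= CACHED_LATENCY_BUDGET_MS:
            return Outcome.DEGRADED_SUCCESS
        return Outcome.FAILURE  # degraded *and* slow: doesn't count
    return Outcome.SUCCESS
```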

In general, though, there are two approaches to handling graceful degradation in SLO calculations.

Business-as-usual. The first is as outlined above: find a set of criteria allowing you to keep some set of the graceful degradation (GD) responses within the “normal” framework. Precisely how is obviously domain-specific. Handling caching is probably the easiest of these - it could also fit naturally within an overall latency SLO for your systems.
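
In SLI terms, business-as-usual might look like the following sketch, where degraded successes simply count towards the normal availability ratio (the outcome labels mirror the classification sketch above):

```python
from collections import Counter


def availability_sli(outcomes: list) -> float:
    """Business-as-usual: degraded successes count towards the normal availability target.

    outcomes holds one label per request, e.g. "success", "degraded_success" or "failure".
    """
    counts = Counter(outcomes)
    good = counts["success"] + counts["degraded_success"]
    total = sum(counts.values())
    return good / total if total else 1.0
```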

Separate-by-design. The second is to exclude all GD responses from the usual calculations, and essentially pretend they belong to an entirely separate serving system with its own goals, measurements, and (indeed) even SLOs. Again, the details are very domain-specific, but the notable thing about GD responses is that typically, by the time you are engaging them, something has already gone wrong - so your time budget for responding is correspondingly constrained.
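
A sketch of what a separate, internally-facing SLI for the GD path itself might look like; the field names and latency budget are illustrative:

```python
def gd_path_sli(gd_attempts: list) -> dict:
    """Separate-by-design: the degradation machinery gets its own internal SLIs.

    Each entry describes one invocation of the GD path, e.g.
    {"succeeded": True, "latency_ms": 35.0}.
    """
    if not gd_attempts:
        return {"success_ratio": 1.0, "within_budget_ratio": 1.0}
    # By the time the GD path runs, something has already gone wrong,
    # so its latency budget is deliberately tight.
    latency_budget_ms = 100
    succeeded = sum(1 for a in gd_attempts if a["succeeded"])
    within_budget = sum(
        1 for a in gd_attempts
        if a["succeeded"] and a["latency_ms"] <= latency_budget_ms
    )
    total = len(gd_attempts)
    return {
        "success_ratio": succeeded / total,
        "within_budget_ratio": within_budget / total,
    }
```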

Overall, we would argue that, if the range of potential degradations is large and there is significant infrastructure around them, keeping separate SLOs for those systems probably makes sense, though those SLOs are internally facing. (Of course, observability for those systems is required to do so.) But that doesn’t mean separate SLOs for the user experience - after all, SLOs are declarations of what you want the user experience to be, and the GD mechanisms are what help you maintain that target.