Protecting Critical User Journeys in Online Applications

We explore top-down and bottom-up overload, and the importance of protecting your most important user journeys against them. We discuss practical strategies for managing overload situations, and the challenges of providing uninterrupted, optimal user experiences regardless of traffic fluctuations.

Joseph Bironas

July 13, 2023

Note:

If you haven’t read Matthew Girard’s fantastic article about types of overload and methods to address them, you might want to do that first. This article builds on those concepts.

Introduction

Maintaining a resilient online presence is vital for almost any business. As your company grows, so does your online traffic, presenting a complex challenge: managing this increasing demand effectively. If not addressed properly, overload in the form of serving capacity shortfalls or demand estimation mistakes can disrupt the user experience, potentially harming your brand reputation and revenue. User journeys (the paths users take to achieve goals on your platform, such as making a purchase, subscribing to a service, or finding information) are crucial to your business's online success. Disruptions in these paths can lead to customer dissatisfaction, lost sales, and even customer attrition.

For technical leaders, understanding and managing overload is essential for safeguarding your most critical user journeys. In this article, I'll explore top-down and bottom-up overload, and the importance of protecting your most important user journeys against them. We'll discuss practical strategies for managing overload situations, and the challenges of providing uninterrupted, optimal user experiences regardless of traffic fluctuations.

Understanding Top-Down and Bottom-Up Overload

To recap Matthew Girard’s excellent blog post on the topic, top-down overload refers to a situation where the system's overall capacity is overwhelmed by a high volume of incoming requests or workload. This usually occurs when there's a sudden spike in traffic or demand, such as a flash sale or a viral marketing campaign that drives unexpected user demand to your platform. If your system isn't designed to scale or accommodate such peaks, the result can be slow response times, server timeouts, or complete system failure (also known as the “hug of death”), leaving your users unable to access or use your services.

On the other hand, bottom-up overload arises when a specific component or service within the system becomes a bottleneck, slowing down or halting the entire process. This could be due to inadequate resource allocation, suboptimal architecture, or a single point of failure that can't handle the demand. For instance, a database that's not designed to handle concurrent queries efficiently could slow down the whole system even if there's capacity available elsewhere. Both top-down and bottom-up overload can disrupt the user experience, making it essential to have strategies in place to manage them both effectively.

The Importance of User Journeys

User journeys represent the path a user takes through your digital platform to accomplish a specific goal, whether it's making a purchase, signing up for a service, or finding critical information. These journeys are the crux of your users' interaction with your platform and directly influence their perception of your brand and their overall satisfaction. Understanding and optimizing user journeys ensures an intuitive and positive experience, which can boost user engagement, conversion rates, and customer retention.

User journeys are not just crucial from a user experience standpoint; they're also vital to your business's revenue stream. Each step in a user journey can be a potential point of conversion or a chance to upsell or cross-sell. However, any disruption or friction in user journeys due to system overload or other technical issues can leave users frustrated, cause them to abandon transactions, or worse, drive them to stop using your product altogether. Protecting user journeys is vital to maintaining business continuity and securing your revenue.

Strategies to Protect Critical User Journeys from Overload

Protecting critical user journeys from overload requires a multifaceted strategy involving robust technologies, proactive monitoring, predictive analytics, system design for fault tolerance, and user experience tooling.

As a first step, employing technology solutions to manage overload is essential. Common solutions involve load balancing to evenly distribute network traffic across multiple servers, or autoscaling to adjust resources based on the current demand. Less common, but more precise methods of management include rate limiting and throttling to avoid or at least be intentional about overload, or support from frontend applications to retry with backoff when making backend calls. 
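As a rough illustration of the rate limiting mentioned above, here is a minimal token-bucket sketch in Python. The class name `TokenBucket` and its `rate`, `capacity`, and `clock` parameters are assumptions made for this example; they do not come from any particular product or library.

```python
import time


class TokenBucket:
    """Allow roughly `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate          # tokens added per second
        self.capacity = capacity  # maximum number of stored tokens
        self.clock = clock        # injectable clock, handy for testing
        self.tokens = capacity    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if the request may proceed, False if it should be rejected."""
        now = self.clock()
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would typically check `bucket.allow()` before doing expensive work and return an HTTP 429 or similar response when it is `False`, which makes the overload behavior intentional rather than accidental.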

Proactive monitoring and predictive analytics play a vital role in early detection of potential overload situations. Monitoring can track system performance and alert for anomalies, while predictive analytics can forecast traffic patterns, helping to prepare for anticipated surges in demand.
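To make the idea concrete, the sketch below smooths recent request-rate samples with an exponentially weighted moving average and raises an alert when the smoothed value eats into capacity headroom. It is a deliberately simple stand-in for real predictive analytics; the function names, the `alpha` smoothing factor, and the `headroom` threshold are all assumptions chosen for illustration.

```python
def ewma(samples, alpha=0.3):
    """Exponentially weighted moving average of a non-empty list of samples."""
    if not samples:
        raise ValueError("need at least one sample")
    estimate = samples[0]
    for value in samples[1:]:
        # Recent samples get weight alpha; history gets the remainder.
        estimate = alpha * value + (1 - alpha) * estimate
    return estimate


def should_alert(samples, capacity, headroom=0.8, alpha=0.3):
    """Alert when smoothed demand exceeds `headroom` (a fraction) of capacity."""
    return ewma(samples, alpha) > headroom * capacity
```

Alerting on a smoothed signal rather than on raw samples trades a little detection latency for fewer false alarms from momentary spikes.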

Designing systems for fault tolerance and resiliency is another crucial strategy. This involves creating redundancy in the system, so if one component fails or becomes overloaded, others can pick up the demand. Techniques such as circuit breakers can help isolate faults and prevent them from propagating across the system.
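A circuit breaker can be sketched in a few lines. The version below is a minimal, single-threaded illustration, assuming a hypothetical `CircuitBreaker` class with `max_failures` and `reset_after` parameters; production implementations usually add thread safety, metrics, and per-dependency configuration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected without being tried."""


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # seconds to wait before a trial call
        self.clock = clock                # injectable clock, handy for testing
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; failing fast")
            # Cool-down elapsed: half-open, so let this one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self.opened_at is not None or self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

Wrapping calls to a struggling dependency this way lets callers fail fast instead of piling more load onto a component that is already overloaded.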

In parallel, UX strategies can be employed to minimize the impact of overload on user journeys. This could involve graceful degradation (providing a reduced level of functionality when the system is under stress) or user-friendly error messages that keep users informed about any issues.
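One way to picture graceful degradation is to shed the least critical journeys first as utilization climbs, while returning a friendly message for anything that is shed. The journey names, priority numbers, and utilization thresholds below are hypothetical values picked for the example, not recommendations.

```python
# Hypothetical journey priorities: a lower number means more critical.
PRIORITIES = {"checkout": 0, "search": 1, "recommendations": 2}
LOWEST_PRIORITY = max(PRIORITIES.values())


def should_serve(journey, utilization):
    """Decide whether to serve a journey at the given utilization (0.0 to 1.0)."""
    # Unknown journeys are treated as least critical.
    priority = PRIORITIES.get(journey, LOWEST_PRIORITY)
    if utilization < 0.80:
        return True            # healthy: serve everything
    if utilization < 0.95:
        return priority <= 1   # stressed: drop nice-to-have features
    return priority == 0       # critical: protect only the top journey


def handle(journey, utilization):
    """Return a full response, or a degraded one with a user-friendly message."""
    if should_serve(journey, utilization):
        return {"status": 200, "body": f"full {journey} response"}
    return {
        "status": 503,
        "body": "This feature is temporarily unavailable. Please try again shortly.",
        "retry_after_seconds": 30,
    }
```

Under this sketch, recommendations disappear first and checkout keeps working the longest, which mirrors the goal of protecting the journeys that matter most to the business.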

Problems with Existing Overload Management Strategies

You may read the list of strategies above and think, “We’re doing all of those things,” yet still not be seeing the results you expect. There are a number of challenges to building an integrated and intelligent platform that works well and puts the user experience front and center. Here are a few:

Integration and Coordination Across Frontend/Backend Teams: In most companies we’ve spoken with, different teams manage frontend (FE) and backend (BE) applications or services, and integrating across the stack is often a complex task. This separation can lead to communication gaps, inefficiencies, and misaligned objectives. For example, the FE team might prioritize user interface improvements while the BE team might be more concerned with database resiliency or query optimization. Without effective communication and coordination, these disjointed efforts miss opportunities for coordinated solutions and lead to a suboptimal user experience, where the platform looks great but performs poorly, or vice versa.

Streaming and Real-Time Predictive Capabilities: Building real-time or streaming predictive capabilities is a massive technical challenge. It requires high computational power, robust data pipelines, and advanced algorithms. These systems are expensive to build and maintain, both in terms of infrastructure and the specialized skills required. Furthermore, they need to handle large volumes of data in real time, which poses challenges around data storage, processing, and latency.

Accuracy of Predictions and Opex Readiness: Prediction accuracy is a crucial aspect of managing overload. Inaccurate predictions can result in over-provisioning, leading to unnecessary costs, or under-provisioning, causing service disruptions. For example, if you're self-hosting services on premises (“on-prem”) or in a colocation data center and overestimate your future load, you might invest in excess servers that remain underutilized. Similarly, in cloud-based systems, inaccurate predictions can lead to higher-than-expected opex. Furthermore, many businesses lack robust opex readiness capabilities, such as well-equipped SRE or capacity planning teams, resulting in reactive rather than proactive management of system performance and costs.

Resilience Against Thundering Herd Problems and Sawtooth Patterns: Even with resilient systems, sometimes the solution can be as bad as the problem. Thundering herd problems occur when retries themselves cause a spike in traffic. Sawtooth patterns occur when recovery signals kick in too early and admit too much traffic, which reintroduces overload conditions and delays recovery. Both can pose significant challenges to system resiliency and health.
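A common mitigation for retry-driven spikes is exponential backoff with random jitter, so that clients that failed at the same moment do not all retry at the same moment. The sketch below uses the "full jitter" variant; the function names and the `base`, `cap`, and `attempts` defaults are assumptions for illustration rather than tuned values.

```python
import random
import time


def jittered_delay(attempt, base=0.1, cap=10.0, rng=random.random):
    """Full jitter: a random delay in [0, min(cap, base * 2**attempt))."""
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(func, attempts=5, base=0.1, cap=10.0,
                      sleep=time.sleep, rng=random.random):
    """Call `func`, retrying on any exception with jittered exponential backoff."""
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error to the caller
            # Spread retries out in time so clients do not retry in lockstep.
            sleep(jittered_delay(attempt, base, cap, rng))
```

Capping both the delay and the number of attempts matters as much as the jitter: unbounded retries are one of the ways a brief overload turns into a prolonged one. Sawtooth behavior calls for a similar instinct on the server side, namely readmitting traffic gradually rather than all at once.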

Cost of Overload Management Strategies: Overload management strategies often come with significant costs. For instance, implementing load balancing might require additional hardware or more expensive cloud services with load balancing features. Autoscaling, while effective for managing demand fluctuations, can lead to unexpected costs if not capped or managed properly. Similarly, advanced monitoring and analytics tools, which are crucial for proactive overload management, can also be quite expensive.

Implementing Overload Protection

Stanza is currently building an intelligent solution that offers a comprehensive approach to overload protection while prioritizing your crucial user journeys. Designed with the pain points of modern digital businesses in mind, we’re addressing the challenges of preserving the most important user experiences in situations of top-down and bottom-up overload.

Intelligent Control Plane for FE/BE Integration: At the heart of our product is an intelligent control plane that seamlessly integrates frontend and backend processes. This not only bridges the communication gap between frontend and backend teams but also provides a unified view of the entire system. It simplifies cross-team collaboration, enabling your teams to work cohesively towards enhancing user experiences and system performance.

Easy-to-Integrate Developer Experience for UX Prioritization: We provide easy-to-integrate tools that empower developers to prioritize the user experiences most important to your business. They offer the flexibility to define and adjust priorities, ensuring that resources are allocated effectively to protect critical user journeys, even during overload situations.

Real-Time, Easy-to-Understand Insights: With Stanza, you get insights that are not only easy to understand but also available in real-time. This dual capability makes it an invaluable tool both for predictive planning and for reactive measures during an incident. It helps you stay a step ahead, preparing for potential overload scenarios, and enables swift, informed decision-making during critical situations.

Protection of Business Priorities: Stanza is designed to protect your business priorities while preventing self-harm. It ensures that your critical user journeys remain unaffected even during overload scenarios, safeguarding your revenue streams. At the same time, it incorporates safeguards to prevent overprovisioning or inappropriate scaling or recovery actions that can lead to unnecessary costs or further system instability.

Cost-Effective and Focused: We understand that cost is a crucial factor for businesses. Stanza keeps costs low by offering the capabilities and features you need, eliminating the need for multiple disparate teams and tools. It allows your developers to focus on what truly matters: building and enhancing features that make your business successful, rather than getting bogged down becoming experts in overload management.

Conclusion

In this article I’ve attempted to stress the complex yet essential nature of managing both top-down and bottom-up overload. It's not just about maintaining server stability; it's about ensuring seamless user journeys, which are central to your business's success. We believe that tighter integration between frontend and backend teams, real-time data, and understandable insights for predictive and reactive measures, all delivered cost-effectively, can help you navigate the technical hurdles of system overload and focus on your main goal: providing superior user experiences that drive your business's ultimate success.

As we move forward, being well-prepared to manage overload scenarios is not just a technical necessity, but a strategic business decision. Looking ahead, the future of reliability and resilience engineering lies in the deployment of intelligent, integrated, data-driven solutions. If you’d like to learn more about what we’re building at Stanza, and how we might be able to help you protect critical user journeys, contact us at hello@stanza.systems.

Joseph Bironas

Joseph is passionate about how engineering, leadership, and infrastructure combine to shape reliability efforts. He spent years at Google where he wound up leading multiple automation and tooling initiatives.

He's held a number of roles from building infrastructure and SRE programs, to product and data software efforts, to bringing cloud modernization and reliability best practices to cloud customers.

When he's not thinking about software and its impact on people, he's probably playing guitar, tinkering with strings of LED lights, or growing hot chilies and making sauces out of them.