When the Cloud Fails: Operational Resilience Lessons from October 2025's Major Outages
Author: Abdul Khader Abdul Hanif, Founder, Zazo Tech
Date: 31 October 2025
Reading Time: 10 minutes
On 20 October 2025, Amazon Web Services experienced a catastrophic outage that brought down thousands of websites and applications across the globe. Snapchat, Roblox, Duolingo, and Signal went dark. UK banks including Lloyds, Halifax, and Bank of Scotland became inaccessible. Even Amazon's own retail site and Ring doorbells stopped working. Over 8.1 million users reported problems worldwide. The estimated cost to Amazon alone: $581 million.
Just nine days later, Microsoft Azure suffered a similar fate. Teams, Xbox, Minecraft, and countless customer service applications went offline. Peak user reports exceeded 18,000. The outage struck hours before Microsoft's quarterly earnings call—a timing that underscored just how unpredictable and damaging cloud failures can be.
These weren't isolated incidents. They were the latest in a series of high-profile cloud outages that have exposed an uncomfortable truth: the infrastructure underpinning the modern economy is more fragile than we'd like to admit. And for organisations that have embraced cloud transformation without designing for resilience, these outages represent existential risks.
The root cause of October's disruptions wasn't technical sophistication. AWS's outage originated from an error in an automated DNS management system: a mundane failure in a subsystem responsible for monitoring load balancers. Azure's issues stemmed from similar infrastructure problems. These weren't sophisticated cyber-attacks or unprecedented technical challenges. They were the kinds of failures that routinely happen in complex systems.
What made them catastrophic was concentration risk. The global internet now depends on a handful of providers—primarily AWS, Microsoft Azure, and Google Cloud—to operate. When one of these platforms fails, thousands of businesses fail simultaneously. As Dr Corinne Cath-Speth from Article 19 observed, "We urgently need diversification in cloud computing. The infrastructure underpinning democratic discourse, independent journalism and secure communications cannot be dependent on a handful of companies."
The UK government is taking notice. Following the AWS outage, the House of Commons Treasury Committee wrote to the Economic Secretary asking why Amazon hasn't been designated a "critical third party" to the UK's financial services sector—a designation that would expose AWS to financial regulatory oversight. The implication is clear: cloud providers have become systemically important infrastructure, and their failures have systemic consequences.
The Operational Resilience Imperative
For UK financial services firms, operational resilience isn't a theoretical concern. It's a regulatory requirement. As of March 2025, firms must have embedded operational resilience frameworks that identify important business services, set impact tolerances, and implement measures to remain within those tolerances during severe disruption.
The October outages demonstrated why these requirements exist. When AWS went down, UK banks lost access to critical systems. Customer-facing services became unavailable. Transaction processing stalled. The firms that weathered the outage best were those that had designed their systems with resilience from the start—multi-region architectures, failover capabilities, and disaster recovery procedures that had been tested repeatedly.
The firms that struggled were those that had treated cloud migration as a simple lift-and-shift exercise. They had moved workloads to the cloud for cost savings and agility but hadn't invested in the architectural patterns that make cloud systems truly resilient. When their single cloud provider failed, they had no fallback. Their important business services exceeded impact tolerances. And they faced uncomfortable conversations with regulators about why they hadn't adequately prepared for a foreseeable risk.
I've spent years designing cloud platforms for organisations where operational resilience isn't optional. At the UK's financial regulator, we architected systems that had to remain operational even during major disruptions. At major UK banks, I led the implementation of cyber resilience capabilities designed to withstand not just technical failures but sophisticated attacks.
The approach we developed rests on a fundamental principle: assume failure. Not as a pessimistic worldview, but as an architectural constraint. Design every system assuming that components will fail, regions will become unavailable, and even entire cloud providers might experience outages. Then build mechanisms to detect, contain, and recover from those failures automatically.
Multi-region architectures distribute workloads across geographically separate data centres. When AWS's US-East-1 region failed in October, organisations with multi-region deployments could fail over to US-West-2 or EU-West-1 automatically. Their users experienced brief disruptions, not day-long outages. The key is active-active configurations where multiple regions serve traffic simultaneously, not active-passive setups where failover is a manual, error-prone process.
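The routing logic behind such health-check-driven failover can be sketched in a few lines. This is a toy model, not a real Route 53 or Traffic Manager configuration; the region names and the health-check mechanics are illustrative assumptions.

```python
import random

# Hypothetical regions; in production these would be real endpoints
# behind DNS-based routing with automated health checks.
REGIONS = {"us-east-1": True, "us-west-2": True, "eu-west-1": True}

def mark_unhealthy(region: str) -> None:
    """Called when health checks against a region start failing."""
    REGIONS[region] = False

def route_request() -> str:
    """Active-active: pick any healthy region, since all serve traffic."""
    healthy = [r for r, ok in REGIONS.items() if ok]
    if not healthy:
        raise RuntimeError("no healthy regions available")
    return random.choice(healthy)

# Simulate the October scenario: us-east-1 goes dark, traffic
# continues to flow to the surviving regions with no manual step.
mark_unhealthy("us-east-1")
assert route_request() in {"us-west-2", "eu-west-1"}
```

The point of the sketch is that failover is a data-plane decision made on every request, not a manual runbook executed after an incident call.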
Multi-cloud strategies take resilience a step further by distributing workloads across different cloud providers. This is more complex—different providers have different services, APIs, and operational models—but for truly critical systems, it's increasingly necessary. The October outages demonstrated that relying on a single provider, no matter how reliable, introduces unacceptable concentration risk for important business services.
Immutable backups protect against both technical failures and malicious attacks. We implemented backup strategies where data was replicated across regions and providers, with immutability guarantees that prevented even privileged accounts from deleting backups. When ransomware attacks or accidental deletions occurred, we could restore systems to known-good states within minutes, not hours
Automated failover and recovery eliminate the human delays that turn incidents into disasters. We designed systems that detected failures automatically, initiated failover procedures without manual intervention, and validated that recovery was successful before resuming normal operations. This automation was tested regularly—not just in theory, but through chaos engineering exercises that deliberately introduced failures to validate our recovery procedures.
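A minimal chaos-engineering exercise of the kind described above reduces to: inject a failure, then assert that the system heals itself within a deadline. The `Service` class below is a hypothetical stand-in for a real worker-plus-supervisor setup, not any particular tool.

```python
import time

class Service:
    """Toy service with a supervisor that restarts failed workers."""
    def __init__(self) -> None:
        self.healthy = True

    def kill(self) -> None:
        """Fault injection: simulate a worker crash."""
        self.healthy = False

    def supervise(self) -> None:
        """Automated recovery loop (normally running asynchronously)."""
        if not self.healthy:
            self.healthy = True  # restart the failed worker

def chaos_test(service: Service, max_recovery_s: float = 1.0) -> bool:
    """Deliberately break the service, then verify it recovers in time."""
    service.kill()
    deadline = time.monotonic() + max_recovery_s
    while time.monotonic() < deadline:
        service.supervise()
        if service.healthy:
            return True
    return False

assert chaos_test(Service())
```

Run regularly, a test like this turns "we believe failover works" into "failover worked at 2 a.m. last Tuesday, unattended".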
Comprehensive monitoring and alerting provided visibility into system health before failures became outages. We implemented monitoring that tracked not just infrastructure metrics but business outcomes. If transaction processing rates dropped, if API latency increased, if error rates spiked, alerts triggered automatically. This early warning system allowed us to detect and respond to degradations before they impacted users.
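A business-metric alert of this kind can be sketched as a rolling success-rate check; the threshold and window values below are illustrative assumptions, not recommendations.

```python
from collections import deque

class DegradationAlert:
    """Fire when a rolling business metric (e.g. transaction success
    rate) drops below a threshold - before the service fails outright."""
    def __init__(self, threshold: float = 0.98, window: int = 100) -> None:
        self.threshold = threshold
        self.samples: deque = deque(maxlen=window)

    def record(self, success: bool) -> bool:
        """Record one transaction outcome; return True if the alert fires."""
        self.samples.append(success)
        rate = sum(self.samples) / len(self.samples)
        return len(self.samples) == self.samples.maxlen and rate < self.threshold

# Simulate a degraded (not down) system: a 10% transaction failure rate.
alert = DegradationAlert(threshold=0.98, window=100)
fired = False
for i in range(200):
    ok = i % 10 != 0
    fired = alert.record(ok) or fired
assert fired
```

Infrastructure dashboards would show every host green in this scenario; only the business metric reveals that one in ten customers is failing.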
The October outages have accelerated conversations about multi-cloud strategies. Organisations that previously dismissed multi-cloud as unnecessarily complex are now reconsidering. The question is no longer whether to adopt multi-cloud, but how to do so effectively.
The challenge is that multi-cloud isn't simply running the same workload on AWS and Azure. Cloud providers offer different services with different capabilities. A Lambda function on AWS doesn't map directly to an Azure Function. An RDS database isn't identical to Azure SQL Database. Building truly portable applications requires abstracting away provider-specific services—using Kubernetes for compute, Terraform for infrastructure provisioning, and open-source tools for observability and security.
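One common way to achieve that portability, sketched below under the assumption of a simple object-storage workload, is to code against a provider-neutral interface and supply per-provider adapters. The `InMemoryStore` here is a hypothetical stand-in; real adapters would wrap the S3 or Azure Blob SDKs.

```python
from abc import ABC, abstractmethod

class ObjectStore(ABC):
    """Provider-neutral interface; application code depends only on this."""
    @abstractmethod
    def put(self, key: str, data: bytes) -> None: ...

    @abstractmethod
    def get(self, key: str) -> bytes: ...

class InMemoryStore(ObjectStore):
    """Illustrative adapter; real ones would wrap boto3 (S3) or
    azure-storage-blob, each hidden behind the same interface."""
    def __init__(self) -> None:
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        return self._objects[key]

def archive_transaction(store: ObjectStore, txn_id: str, payload: bytes) -> None:
    """Business logic never imports a cloud SDK directly, so swapping
    providers means swapping one adapter, not rewriting the application."""
    store.put(f"txn/{txn_id}", payload)

store = InMemoryStore()
archive_transaction(store, "42", b"GBP 100.00")
assert store.get("txn/42") == b"GBP 100.00"
```

The cost of this pattern is exactly the trade-off discussed next: the neutral interface can only expose features every provider shares.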
This abstraction comes with trade-offs. Provider-specific services are often more feature-rich, better integrated, and more cost-effective than their open-source equivalents. A multi-cloud strategy that prioritises portability over optimisation can result in systems that are more expensive and less capable than single-cloud alternatives.
The solution is pragmatic multi-cloud. Not every workload needs to run on multiple providers. Critical systems—those that support important business services with tight impact tolerances—should be architected for multi-cloud resilience. Less critical systems can remain on a single provider, accepting the risk in exchange for simplicity and cost savings.
For financial services firms, this means identifying which systems are truly important. Customer-facing banking applications, payment processing systems, and regulatory reporting platforms likely qualify. Internal HR systems, development environments, and data warehouses might not. The operational resilience framework mandated by UK regulators provides a structured approach to making these determinations.
SMEs face a particular challenge. They lack the resources of large enterprises but face similar operational resilience requirements, especially if they serve regulated clients. The October outages demonstrated that cloud provider reliability, whilst generally high, isn't sufficient to guarantee business continuity.
Here are practical steps SMEs can take to improve operational resilience without enterprise budgets.
Start with impact tolerance mapping. Identify your important business services—the capabilities that, if disrupted, would cause unacceptable harm to customers, market integrity, or your business viability. For each service, define your maximum tolerable downtime. This exercise clarifies which systems require investment in resilience and which can tolerate occasional outages.
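The mapping exercise can be captured in something as simple as a table of services and tolerances, with a rule that converts each tolerance into a resilience tier. The service names, durations, and tier cut-offs below are hypothetical examples; your own impact analysis supplies the real values.

```python
from datetime import timedelta

# Hypothetical important business services and their maximum
# tolerable downtimes, as determined by impact analysis.
IMPACT_TOLERANCES = {
    "customer-payments":    timedelta(minutes=15),
    "online-banking-login": timedelta(hours=1),
    "regulatory-reporting": timedelta(hours=4),
    "internal-hr-portal":   timedelta(days=2),
}

def resilience_tier(max_downtime: timedelta) -> str:
    """Map a tolerance to an (illustrative) investment tier."""
    if max_downtime <= timedelta(hours=1):
        return "multi-region active-active"
    if max_downtime <= timedelta(hours=8):
        return "multi-region active-passive"
    return "single region + tested backups"

for service, tolerance in IMPACT_TOLERANCES.items():
    print(f"{service}: {resilience_tier(tolerance)}")
```

Even this crude tiering makes the budget conversation concrete: the 20-30% multi-region premium is paid only where the tolerance demands it.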
Implement multi-region deployments for critical systems. Most cloud providers make multi-region architectures straightforward. AWS offers services like Route 53 for DNS-based failover, RDS read replicas across regions, and S3 cross-region replication. Azure provides Traffic Manager, geo-redundant storage, and Azure Site Recovery. These services add cost—typically 20-30% more than single-region deployments—but dramatically improve resilience.
Automate your disaster recovery procedures. Manual failover processes fail under pressure. Document your recovery procedures, then automate them. Use infrastructure-as-code to define your entire environment so you can recreate it in a different region or provider if necessary. Test your automation regularly—monthly for critical systems, quarterly for others.
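The principle behind infrastructure-as-code recovery is that the environment becomes data, so it can be re-provisioned anywhere. A real implementation would use Terraform or CloudFormation; this Python sketch, with made-up resource names, only illustrates the idea.

```python
# Declarative description of an environment. Because the environment
# is data rather than hand-built state, it can be recreated in any
# region from the same definition.
DESIRED_STATE = {
    "region": "eu-west-1",
    "resources": [
        {"type": "vm", "name": "web", "count": 3},
        {"type": "db", "name": "ledger", "replicas": 2},
    ],
}

def provision(state: dict, target_region: str) -> list[str]:
    """Recreate the whole environment in another region from the same
    definition - the core move in automated disaster recovery."""
    plan = dict(state, region=target_region)  # copy; original untouched
    return [
        f"{res['type']}:{res['name']}@{target_region}"
        for res in plan["resources"]
    ]

# Primary region fails: re-provision everything in a fallback region.
assert provision(DESIRED_STATE, "eu-central-1") == [
    "vm:web@eu-central-1",
    "db:ledger@eu-central-1",
]
```

The same definition that builds the primary region rebuilds the fallback, which is why drift between documentation and reality disappears.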
Establish backup strategies that span providers. Don't rely solely on your primary cloud provider for backups. Use services like Veeam, Commvault, or open-source tools like Velero to replicate critical data to a second provider or on-premises storage. Ensure backups are immutable and test restoration procedures regularly.
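The two properties that matter here, immutability and verified restoration, can be illustrated with a toy write-once store. Real deployments would rely on provider features such as S3 Object Lock rather than application code; this sketch just demonstrates the behaviour you want to test for.

```python
import hashlib

def checksum(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

class ImmutableBackup:
    """Write-once store: even privileged callers cannot overwrite or
    delete an existing backup - the guarantee that defeats both
    ransomware and accidental deletion."""
    def __init__(self) -> None:
        self._store: dict[str, tuple[bytes, str]] = {}

    def write(self, key: str, data: bytes) -> None:
        if key in self._store:
            raise PermissionError(f"backup {key!r} is immutable")
        self._store[key] = (data, checksum(data))

    def restore(self, key: str) -> bytes:
        """Restoration verifies integrity, not just existence."""
        data, digest = self._store[key]
        if checksum(data) != digest:
            raise IOError(f"backup {key!r} failed integrity check")
        return data

b = ImmutableBackup()
b.write("ledger-2025-10-20", b"...snapshot...")
assert b.restore("ledger-2025-10-20") == b"...snapshot..."
try:
    b.write("ledger-2025-10-20", b"tampered")   # must be rejected
except PermissionError:
    pass
```

An untested backup is a hope, not a control: the `restore` path, including the integrity check, is the part worth exercising on a schedule.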
Monitor for degradation, not just failure. Implement monitoring that tracks business metrics—transaction success rates, API response times, user login success—not just infrastructure health. Set alerts that trigger when these metrics degrade, giving you early warning of problems before they become outages.
Develop relationships with multiple providers. Even if you're not running production workloads on multiple clouds, establish accounts, understand pricing, and build familiarity with alternative providers. When an outage strikes, you'll have options. Some organisations maintain "warm standby" environments on secondary providers—minimal infrastructure that can be scaled up quickly if the primary provider fails.
Communicate with stakeholders proactively. When outages occur, customers and regulators want to know what happened, what you're doing about it, and when services will be restored. Develop communication templates in advance. Establish escalation procedures. Practice incident response so that when real outages occur, your team responds effectively rather than panicking.
The Regulatory Landscape
The October outages will likely accelerate regulatory scrutiny of cloud providers. The UK Treasury Committee's letter about designating AWS as a critical third party signals a shift in how regulators view cloud infrastructure. If implemented, such designation would subject cloud providers to financial services regulation, including operational resilience requirements, incident reporting obligations, and potentially direct oversight by the Financial Conduct Authority and Prudential Regulation Authority.
For organisations using cloud services, this regulatory evolution has implications. Regulators are increasingly holding firms accountable not just for their own operational resilience but for the resilience of their critical third parties. If your cloud provider experiences an outage that causes you to breach impact tolerances, "it was AWS's fault" isn't an acceptable excuse. You're expected to have designed your systems to withstand provider failures.
This expectation aligns with the operational resilience frameworks now in force across UK financial services. Firms must identify dependencies on third parties, assess the risks those dependencies create, and implement measures to mitigate those risks. For cloud-dependent organisations, this means multi-region architectures, disaster recovery capabilities, and potentially multi-cloud strategies for the most critical systems.
The October 2025 cloud outages were a wake-up call. They demonstrated that cloud providers, despite their scale and sophistication, are not immune to failures. They exposed the concentration risk inherent in the modern internet's reliance on a handful of platforms. And they underscored the importance of operational resilience—not as a compliance exercise, but as a business imperative.
For organisations that have embraced cloud transformation, the lesson is clear: resilience must be designed in from the start. Multi-region architectures, automated failover, comprehensive monitoring, and tested disaster recovery procedures aren't luxuries. They're necessities for any system that supports important business services.
For SMEs, the challenge is balancing resilience with cost and complexity. Not every system needs enterprise-grade resilience. But critical systems—those that, if disrupted, would cause unacceptable harm—must be architected to withstand provider failures. The investment required is significant but far less than the cost of extended outages or regulatory sanctions.
The cloud isn't going away. Its benefits—scalability, agility, cost efficiency—are too compelling. But the era of blind trust in cloud provider reliability is over. The organisations that thrive in the post-October 2025 landscape will be those that embrace operational resilience as a core architectural principle, designing systems that remain operational even when the cloud fails.
Because as October demonstrated, the cloud will fail. The only question is whether your business will fail with it.
About the Author
Abdul Khader Abdul Hanif is the founder of Zazo Tech, a UK-based consultancy specialising in security-first digital transformation. He has led operational resilience initiatives for major UK financial institutions and designed cyber resilience capabilities for cloud platforms handling sensitive regulatory data. He holds SC Clearance, AWS Professional and Azure Expert certifications.
Contact us at admin@zazotech.com or call 020 3576 3613.