How AWS’ Outage Exposes the Risks of Cloud Dependency

AWS’ outage halted thousands of services worldwide, spotlighting cloud fragility and underscoring the need for multi-cloud strategies and resilience planning

AWS’ outage made headlines worldwide on Monday, with millions of people affected by downtime across apps from Zoom and Slack to monday.com and Duolingo.

Following Monday’s massive cloud disruption, AWS has completed its root-cause analysis, confirming that an internal automation fault triggered a cascade of DNS failures in its US-East-1 region. 

While full service was restored within hours, the aftermath was catastrophic – reigniting debate over resilience strategies, multi-cloud architecture and global businesses’ sheer dependence on Amazon’s infrastructure.

AWS outage: What went wrong?

AWS says the issue originated from an error in a configuration automation process that prevented domain names from resolving properly to IP addresses within DynamoDB, one of its core data services. 
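For engineers who want to see what this kind of failure looks like from the client side, the check below is a minimal sketch and not AWS’ own tooling. It assumes the standard regional endpoint name for DynamoDB in US-East-1; the `resolves` helper and its `resolver` parameter are hypothetical names introduced here for illustration.

```python
import socket


def resolves(hostname, resolver=socket.getaddrinfo):
    """Return True if the hostname resolves to at least one address.

    `resolver` is injectable (an assumption of this sketch) so the check
    can be exercised without touching the network.
    """
    try:
        return len(resolver(hostname, 443)) > 0
    except socket.gaierror:
        # Raised when the name cannot be resolved to an IP address.
        return False


if __name__ == "__main__":
    # Illustrative endpoint; during a DNS fault like the one described,
    # a lookup of this kind would fail even if the service behind it is healthy.
    print(resolves("dynamodb.us-east-1.amazonaws.com"))
```

A failed lookup means a client never learns which IP address to contact, which is why a DNS fault can look like a total outage to anything that depends on the service.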


This DNS failure disrupted connections across more than 1,000 interconnected sites globally, with Lloyds Bank and Venmo among those affected.

According to Amazon’s post-event summary, the fault appeared after a routine update and “caused a backlog of messages that took several hours to process”.

The incident highlights how a single regional glitch in AWS’ oldest and busiest data hub can ricochet across industries – freezing transactions, blocking communication tools and taking streaming and shopping platforms offline.

“We apologise for the impact this event caused our customers,” Amazon’s statement says.

“We know how critical our services are to our customers, their applications and end users and their businesses. We know this event impacted many customers in significant ways.”

Industry responds: Lessons in resilience from the AWS outage

For cloud leaders and engineers, the outage has served as yet another reminder that hyperscale doesn’t mean infallible.

Jamil Ahmed, Distinguished Engineer at Solace, says: “Even as cloud technology evolves, failures within the system will inevitably happen. 


“‘One-of-a-kind’, extremely rare outages or issues continue to plague every service provider from time to time, which is why the need to store valuable information on multiple provider services, known as an event mesh, has arisen... It is now ‘later on’ and the strategy of using one cloud service is demonstrably dangerous and negligent.”

Cybersecurity experts also warn of the broader risks that follow infrastructure failures. 

Christian Espinosa, Founder and CEO of Blue Goat Cyber, adds: “This widespread outage is a stark reminder that even massive infrastructure providers are not immune to cascading failures.

“What makes it more dangerous for businesses is how these disruptions magnify cyber-risk. When platforms go dark, organisations inadvertently shift into backup systems, remote tools are stressed and control lapses become exploitable.”

Analysts at Ookla recorded more than 17 million outage reports globally within the first few hours, the majority from US-based users connected to AWS’ East Coast infrastructure. 

According to estimates from Deployflow, enterprise downtime during this incident cost between US$5,000 and US$9,000 per minute.

Jake Madders, Director and Co-Founder at Hyve Managed Hosting, shares how organisations can dodge similar risks.


“Even the largest and most reliable cloud providers can experience significant outages – but these risks can be mitigated,” he says. 

“The key lies in building resilience into your infrastructure from the outset. Diversifying across multiple cloud providers and geographic regions is essential to ensure redundancy and enable seamless failover when disruption occurs.”
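In practice, the failover Madders describes often comes down to trying a request against an ordered list of endpoints and moving on when one is unreachable. The following is a minimal sketch under stated assumptions, not a production pattern from any of the companies quoted: `call_with_failover`, the endpoint labels and the `request` callable are all hypothetical names introduced here.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


def call_with_failover(endpoints: Iterable[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful result.

    `request` is any callable that takes an endpoint and either returns a
    result or raises OSError (which covers connection errors, timeouts and
    DNS resolution failures in Python's standard library).
    """
    last_error = None
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except OSError as exc:
            # This region or provider is unreachable; fall through to the next.
            last_error = exc
    raise RuntimeError("all endpoints failed") from last_error
```

The sketch only covers request routing. Real failover also depends on data being replicated to the secondary region or provider beforehand, which is the harder part of the redundancy he describes.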

Rob van Lubek, EMEA Vice President at Dynatrace, adds: “Global incidents like this are a clear reminder of how dependent our world has become on software and digital systems.


“The difference between disruption and recovery often comes down to visibility and speed – how fast an organisation can pinpoint what’s gone wrong, understand why and act to restore service continuity.”
