
When the Cloud Broke: AWS Trips Over Its Own DNS and Takes Half the Internet With It

Five days after the Amazon Web Services outage that left much of the internet clutching its coffee and muttering about “multi-region failover,” the post-mortem finally arrived.

by Mr Moonlight

Turns out, the world’s biggest cloud tripped over one of its oldest dependencies: DNS, the internet’s phone book.

On Monday, October 20, 2025, a quiet software glitch deep inside DynamoDB’s automated DNS management spawned an empty record in the US-EAST-1 region, Amazon’s notoriously overworked data-centre hub in Northern Virginia. The result: cascading failures across more than a hundred AWS services and thousands of customer applications. Alexa stopped listening, Fortnite stopped fighting, and developers everywhere stopped pretending they weren’t sweating.

What Happened

The trigger was a rare race condition in DynamoDB's automated DNS process. In plain English: two automated processes tried to update the same record at the same time and left it empty. That broken record erased the address for DynamoDB's main U.S. endpoint, breaking name resolution for everything that relied on it.
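For the curious, here is a minimal sketch of that failure mode in Python. It is not AWS's code; RecordStore is a hypothetical stand-in that just shows why an unconditional "last writer wins" update lets a stale process blank out a good record, while a version-checked write rejects it.

```python
import threading

class RecordStore:
    """Toy stand-in for a DNS record store; not AWS's implementation."""

    def __init__(self, value: str):
        self.value = value
        self.version = 1
        self._lock = threading.Lock()

    def put_unconditional(self, value: str) -> None:
        # Last writer wins: a slow, stale updater can blank out a good record.
        self.value = value

    def put_if_version(self, value: str, expected_version: int) -> None:
        # Optimistic concurrency: the write only lands if nobody got there first.
        with self._lock:
            if self.version != expected_version:
                raise RuntimeError("stale update rejected")
            self.value = value
            self.version += 1

# Racy path: an out-of-date updater overwrites the record with nothing.
store = RecordStore("dynamodb.us-east-1.example.com -> 192.0.2.10")
store.put_unconditional("")
print(repr(store.value))                      # -> ''

# Guarded path: the same stale write bounces off the version check.
store = RecordStore("dynamodb.us-east-1.example.com -> 192.0.2.10")
store.version = 2                             # someone else already updated it
try:
    store.put_if_version("", expected_version=1)
except RuntimeError as exc:
    print(exc)                                # -> stale update rejected
```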

Since AWS’s internal services are as interconnected as a suburban power grid, the outage spread fast. Loss of DNS resolution in DynamoDB crippled control-plane operations and higher-level managed services, including EC2 launches, IAM updates, and global identity checks.

When DNS fails, nothing knows where anything is. Cue the dominoes: Alexa, Ring, Reddit, and Fortnite all face-planted. It was less “cloud computing” and more “cloud collapse.”
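To make the domino effect concrete, here is a toy Python example (the hostname is deliberately fake): the servers behind a name can be perfectly healthy, but if the name itself won't resolve, every dependent call dies at step one.

```python
import socket

def call_dependency(hostname: str) -> str:
    try:
        # Step one of any networked call: turn the name into an address.
        socket.getaddrinfo(hostname, 443)
    except socket.gaierror as exc:
        # No address, no connection, no retry target: the caller is blind.
        return f"FAILED before connecting: {exc}"
    return "resolved; would connect next"

for service in ("checkout", "sessions", "telemetry"):
    print(service, "->", call_dependency("dynamodb.invalid"))
```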

How AWS Recovered

To its credit, AWS engineers moved quickly once the scope became clear. They disabled the faulty automation, pushed out emergency patches, and throttled some operations to stabilise load.
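Throttling is a standard move when nursing a system back to health: shed excess load so the recovering parts aren't stampeded. AWS hasn't published the exact mechanism it used, but the idea looks roughly like the token-bucket sketch below, which is purely illustrative.

```python
import time

class TokenBucket:
    """Admit requests at a steady rate, with a small burst allowance."""

    def __init__(self, rate_per_sec: float, burst: int):
        self.rate = rate_per_sec
        self.capacity = burst
        self.tokens = float(burst)
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, capped at the burst size.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False          # caller should back off and retry later

bucket = TokenBucket(rate_per_sec=5, burst=10)
accepted = sum(bucket.allow() for _ in range(100))
print(f"accepted {accepted} of 100 requests in the first instant")
```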

By Monday afternoon U.S. time, Amazon reported “significant recovery,” though users were still clearing backlogs into Tuesday. Internet measurement firms like ThousandEyes tracked the incident from the first error spikes at 07:48 UTC to full restoration around 22:00 UTC, showing how deeply one bad DNS entry can ripple through the world’s digital bloodstream.

Amazon’s official post-incident update was suitably contrite. The company confirmed a “latent software defect” in DynamoDB’s DNS automation and said it has now added additional safeguards to prevent similar issues. The bug has been patched, but the trust bruise will take longer to heal.

Why It Spread So Far

US-EAST-1 isn’t just another AWS region. It’s the beating heart of the company’s global infrastructure. Many AWS control-plane features and even third-party services are still anchored there because it’s been around the longest and houses a disproportionate share of global workloads.

That centralisation made the blast radius enormous. The outage wasn't the result of an internet backbone failure; it was the digital equivalent of locking your car keys inside the car and realising the car is every other car on Earth.

Lessons for Amazon

For Amazon, the takeaway is automation humility. DNS is a global primitive, and no automation that touches it should run without a leash. That means dual-control approvals, canary rollouts, and automatic rollbacks when health checks twitch.
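What does a "leash" look like in practice? Something like the sketch below: roll a DNS change out to a small canary slice first, watch a health check, and undo everything automatically the moment it twitches. The apply_record, remove_record, and resolver_health functions are hypothetical stand-ins, not any real AWS API.

```python
import time

def apply_record(scope: str, record: dict) -> None:
    print(f"applying {record['name']} to {scope}")

def remove_record(scope: str, record: dict) -> None:
    print(f"rolling back {record['name']} from {scope}")

def resolver_health(scope: str) -> bool:
    return True   # in real life: query the name and check the answer is sane

def canary_dns_change(record: dict, scopes: list[str], checks: int = 3) -> bool:
    applied = []
    for scope in scopes:                  # e.g. canary cell, then region, then global
        apply_record(scope, record)
        applied.append(scope)
        for _ in range(checks):
            time.sleep(0.1)               # short bake time between health checks
            if not resolver_health(scope):
                for done in reversed(applied):
                    remove_record(done, record)   # automatic rollback
                return False              # stop the rollout, alert a human
    return True

canary_dns_change({"name": "api.example.com", "value": "192.0.2.10"},
                  ["canary-cell", "regional", "global"])
```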

The company has already paused the offending process and promised more isolation between regional control planes so a problem in Virginia doesn’t ruin everyone’s Monday.

Lessons for Everyone Else

For businesses built on AWS, the lessons are older but now impossible to ignore. Design for DNS failure. Cache critical lookups. Tune TTLs wisely. Validate resolvers.
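"Cache critical lookups" can be as simple as keeping the last known-good answer and serving it stale (up to a cap) when the resolver goes dark, rather than falling over instantly. A minimal sketch, with illustrative TTL numbers rather than recommendations:

```python
import socket
import time

_cache: dict[str, tuple[list, float]] = {}   # hostname -> (addresses, fetched_at)
FRESH_TTL = 60          # seconds to trust a cached answer outright
MAX_STALE = 6 * 3600    # how long a stale answer may be reused during an outage

def resolve(hostname: str, port: int = 443) -> list:
    now = time.time()
    cached = _cache.get(hostname)
    if cached and now - cached[1] < FRESH_TTL:
        return cached[0]                     # fresh enough, skip the resolver
    try:
        addrs = socket.getaddrinfo(hostname, port)
        _cache[hostname] = (addrs, now)
        return addrs
    except socket.gaierror:
        if cached and now - cached[1] < MAX_STALE:
            return cached[0]                 # degraded but alive
        raise                                # nothing cached: fail loudly

# Assumes a working resolver for the demo call.
print(len(resolve("example.com")), "addresses for example.com")
```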

Go multi-region by design. If everything depends on one AWS region, you don’t have redundancy—you have a single point of global regret. And for the love of uptime, test your runbooks. Schedule chaos drills that simulate DNS blackouts, not just CPU spikes. If you can’t survive your own practice outage, you won’t survive Amazon’s.
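A DNS-blackout drill doesn't need a platform team and a war room to get started. The sketch below monkey-patches name resolution to fail and asserts that the code under test degrades gracefully; fetch_dashboard is a hypothetical stand-in for whatever your runbook says must keep working.

```python
import socket
from contextlib import contextmanager
from unittest import mock

@contextmanager
def dns_blackout():
    # Make every lookup fail for the duration of the drill.
    def fail(*args, **kwargs):
        raise socket.gaierror("simulated resolver outage")
    with mock.patch("socket.getaddrinfo", side_effect=fail):
        yield

def fetch_dashboard() -> str:
    try:
        socket.getaddrinfo("metrics.example.com", 443)
        return "live data"
    except socket.gaierror:
        return "cached data (degraded mode)"   # the behaviour the drill verifies

with dns_blackout():
    assert fetch_dashboard() == "cached data (degraded mode)"
print("drill passed: app survives a DNS blackout")
```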

Also, vendors: stop making devices that go brain-dead when the cloud sneezes. If your “smart” doorbell turns dumb the moment a DNS server blinks, it’s not smart; it’s decorative.

Convenience Over Resilience

Even after the dust settles, one question remains: Why does so much of the internet still hinge on a single AWS region?

Amazon has promised more architectural decoupling, but the company's DNA and its customers' convenience keep steering everyone back to US-EAST-1. The same region has been the epicentre of multiple historic outages, yet it continues to operate global control functions. It's the digital equivalent of rebuilding your house on the same floodplain every time it washes away.

So yes, AWS fixed the bug. But the bigger flaw is cultural: convenience over resilience, scale over separation. Until that changes, we’ll keep living in a world where a typo in Virginia can knock out your toaster in Berlin.

The cloud didn’t break because it’s fragile. It broke because we treat it like it isn’t.
