A single data centre’s cooling system fell behind. AWS shifted traffic away from the affected zone and warned that fully restoring the remaining services would take longer than expected.
Amazon Web Services said on Thursday that one of its data centres in northern Virginia was running hot enough to disrupt customer workloads, and that engineers were still working into the night to bring the site fully back online.
The trigger was prosaic: elevated temperatures inside a single data centre, attributed to a cooling-system shortfall, forced AWS to throttle and then partially reroute traffic away from the affected Availability Zone.
By the company’s account, additional cooling capacity began coming online a couple of hours after the first impact reports, and “early signs of recovery” appeared shortly after.
A later update was less reassuring: bringing in enough extra cooling to safely restart the remaining systems was taking longer than expected, and AWS was unwilling to put a clock on full restoration.
Coinbase confirmed that its trading platform problems were caused by the AWS event. After several hours of degraded markets, the exchange said all markets had been re-enabled, and trading was back to normal.
CME Group, the world’s largest derivatives marketplace, also reported issues with its CME Direct platform during the same window, although it described the cause only as “essential maintenance” and did not say whether the AWS event was a factor. Both companies declined further comment outside business hours.
The northern Virginia cluster, US-East-1 in AWS terminology, is the company’s oldest, busiest, and most concentrated region.
An Availability Zone in that region groups one or more physical data centres that are designed to operate independently, and AWS’s official guidance during recovery was the standard recommendation: customers running in the affected zone should fail over to one of the others. That works well for engineering teams who have built for it. It works less well for those who have not.
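In practice, “built for it” usually means capacity that is already spread across zones, so failover happens without a human in the loop. Here is a minimal sketch of that pattern using boto3, AWS’s Python SDK; the subnet IDs and launch template name are hypothetical placeholders, not details from the incident:

```python
import boto3

autoscaling = boto3.client("autoscaling", region_name="us-east-1")

# Spreading an Auto Scaling group across subnets in three Availability
# Zones means that if one zone degrades, the group replaces lost
# capacity in the healthy ones automatically.
autoscaling.create_auto_scaling_group(
    AutoScalingGroupName="web-app-asg",
    LaunchTemplate={
        "LaunchTemplateName": "web-app",  # assumed to exist already
        "Version": "$Latest",
    },
    MinSize=3,
    MaxSize=9,
    # One subnet per zone; these IDs are placeholders.
    VPCZoneIdentifier="subnet-aaa111,subnet-bbb222,subnet-ccc333",
    HealthCheckType="EC2",
    HealthCheckGracePeriod=120,
)
```

Teams that deploy this way can often ride out a single-zone event with nothing worse than a brief capacity dip.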
The pattern is becoming familiar. AWS suffered a far larger outage last October, when a DNS resolution failure in DynamoDB cascaded across more than a hundred services and took offline platforms ranging from Snapchat and Reddit to United Airlines and Coinbase. That event lasted roughly fourteen hours and was the largest internet-wide disruption since the faulty CrowdStrike software update of 2024.
A month later, CME suffered one of its longest trading outages in years, traced back to a cooling failure at a CyrusOne data centre in the Chicago area.
The repetition matters. Cooling failures, configuration errors and DNS misfires are different technical events, but they share an outcome: a single physical or logical site becomes the bottleneck for an outsized share of public-facing traffic. The northern Virginia region carries that load by historical accident more than by design.
AWS launched the region in 2006, and US-East-1 has accumulated workloads, regulatory dependencies and customer inertia ever since. The hyperscalers are spending tens of billions to expand other regions, but customer concentration in US-East-1 is unlikely to shift quickly.
Coinbase’s exposure to the cloud sits inside a longer arc. The Cloudflare-driven outage that took down Coinbase and other exchanges in 2019 was a different failure mode but taught the same lesson, and it is part of why crypto exchanges have spent the years since architecting for multi-region failover.
Thursday’s incident demonstrates that even with that work, a single warm-room shutdown still ripples into a market that is supposed to be open around the clock.
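“Architecting for multi-region failover” most often comes down to DNS-level health checks that shift traffic to a standby region when the primary stops answering. A hedged sketch of that pattern with Route 53, again via boto3; the hosted-zone ID, health-check ID and hostnames are illustrative, not Coinbase’s actual setup:

```python
import boto3

route53 = boto3.client("route53")

# Two failover records share one name: Route 53 serves the PRIMARY
# while its health check passes, and the SECONDARY once it fails.
route53.change_resource_record_sets(
    HostedZoneId="ZEXAMPLE123",  # placeholder hosted-zone ID
    ChangeBatch={
        "Changes": [
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "SetIdentifier": "primary-us-east-1",
                    "Failover": "PRIMARY",
                    "HealthCheckId": "hc-placeholder-id",  # placeholder
                    "ResourceRecords": [
                        {"Value": "api-us-east-1.example.com"}
                    ],
                },
            },
            {
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": "api.example.com",
                    "Type": "CNAME",
                    "TTL": 60,
                    "SetIdentifier": "secondary-us-west-2",
                    "Failover": "SECONDARY",
                    "ResourceRecords": [
                        {"Value": "api-us-west-2.example.com"}
                    ],
                },
            },
        ]
    },
)
```

The catch, as Thursday showed, is that DNS failover only helps if the standby region can actually absorb the load.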
CME’s situation is more delicate. Derivatives markets sit on top of complex margin and clearing pipelines that do not degrade gracefully; an outage at peak Asia hours, as Thursday’s was, collides with clearing-cycle deadlines that move money the next morning.
Whether the CME issue was directly tied to the AWS event will determine how the trading-resilience conversation lands with regulators.
AWS has not estimated how many workloads were affected, and Amazon has not yet said why the cooling system fell behind: whether the issue was equipment, ambient conditions, or a combination of the two.
The northern Virginia region has spent the past year absorbing a wave of new AI-training and inference capacity, which runs hotter and denser than traditional cloud workloads; whether that is incidentally relevant to Thursday’s failure or substantively part of the cause is the question the post-incident report will need to address.
For most customers, the fix is the one AWS recommended in its first update: stop running everything in a single Availability Zone in a single region. That advice has been on AWS’s own architecture-best-practice page for years. Each failure of this kind raises the cost of having ignored it.
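The first step is usually an audit rather than a re-architecture: knowing how concentrated a deployment actually is. A small sketch, assuming boto3 credentials are already configured, that tallies running EC2 instances per Availability Zone in the region:

```python
from collections import Counter

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Count running instances per Availability Zone; a single dominant
# zone is exactly the concentration risk described above.
zone_counts = Counter()
paginator = ec2.get_paginator("describe_instances")
for page in paginator.paginate(
    Filters=[{"Name": "instance-state-name", "Values": ["running"]}]
):
    for reservation in page["Reservations"]:
        for instance in reservation["Instances"]:
            zone_counts[instance["Placement"]["AvailabilityZone"]] += 1

for zone, count in sorted(zone_counts.items()):
    print(f"{zone}: {count} running instance(s)")
```

If most of the count lands in one zone, the deployment is one cooling failure away from a Thursday of its own.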