RESILIENCY ARCHITECT
INTELLECTUAL PROPERTY
© 2025 Resiliency Architect
All Rights Reserved
AWS N. Virginia Region Service Disruption
October 19-20, 2025
Region: us-east-1

Key Details

Start Time: Oct 20 @ 2:48 AM EST
End Time: Oct 20 @ 5:20 PM EST
Duration: 14h 32m
Severity: Critical
Region: US East (N. Virginia)
Scope: Regional (us-east-1)

Root Cause Analysis

Primary Trigger: A latent race condition in the DynamoDB DNS management system between two independent components: the DNS Planner (which monitors endpoint health and produces DNS plans) and the DNS Enactor (which applies those plans to Route 53). The system manages hundreds of thousands of DNS records across multiple endpoint types, including regional, FIPS, IPv6, and account-specific endpoints, using a single shared regional plan to simplify capacity management.

Race Condition Mechanism: One DNS Enactor experienced unusual processing delays while working through endpoint updates. During this delay, a second Enactor completed applying a newer generation plan and triggered the cleanup process to delete older plans. The delayed Enactor then applied its stale plan to the regional endpoint (dynamodb.us-east-1.amazonaws.com), which was immediately deleted by the cleanup process, leaving an empty DNS record. The initial staleness check had become invalid due to the processing delays, allowing the outdated plan to be applied.
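The ordering is easier to see in miniature. The sketch below is a toy model rather than AWS's DNS automation: the plan generations, endpoint table, and thread timings are invented, but the sequence mirrors the description above, where a staleness check that passed at pickup time is no longer valid by the time the plan is applied.

```python
# Toy model of the race described above (illustrative only, not AWS's DNS automation).
import threading
import time

plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}   # plan generation -> DNS records
endpoint = {"active_plan": 1}                # dynamodb.us-east-1 points at generation 1
lock = threading.Lock()

def resolve() -> list:
    """What a lookup of the regional endpoint would return."""
    return plans.get(endpoint["active_plan"], [])   # empty if the plan was deleted

def delayed_enactor() -> None:
    my_plan = 2
    assert my_plan == max(plans)             # staleness check passes *at pickup time*
    time.sleep(0.3)                          # unusual processing delay
    with lock:
        endpoint["active_plan"] = my_plan    # stale plan applied to the endpoint

def newer_enactor() -> None:
    time.sleep(0.1)
    with lock:
        plans[3] = ["10.0.0.3"]
        endpoint["active_plan"] = 3          # a newer generation lands first

def cleanup() -> None:
    time.sleep(0.5)                          # cleanup of older plans, triggered after the newer plan
    with lock:
        for gen in [g for g in plans if g < max(plans)]:
            del plans[gen]                   # deletes generation 2, which the endpoint now references

threads = [threading.Thread(target=fn) for fn in (delayed_enactor, newer_enactor, cleanup)]
for t in threads: t.start()
for t in threads: t.join()
print(resolve())                             # [] -- an "empty" DNS record
```

Running the model prints an empty record set: the endpoint ends up referencing a plan generation that cleanup has already deleted, which is the empty-record state described above.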

Cascading Failure Chain: Empty DNS record → All new DynamoDB connections failed → EC2 DWFM lost state check capability → Droplet leases timed out → EC2 launches failed → DWFM congestive collapse → Network Manager backlog → NLB health check oscillations → Lambda, ECS/EKS/Fargate, Connect, STS, Redshift, IAM, and Support Center failures.

Recovery Challenge: The empty DNS state was inconsistent and prevented automatic recovery mechanisms from functioning, requiring manual operator intervention at 5:25 AM EST to restore DNS information.

Service Impact Timeline (EST)
[Timeline chart: hourly per-service status, 2 AM through 6 PM EST, October 19-20]
DynamoDB
Total Outage

💥 Impact Overview

Complete service outage from 2:48 AM to 5:40 AM EST (2 hours 52 minutes). DNS endpoint became unresolvable, blocking all new connections.

Duration
2h 52m
Impact
100%
Recovery
Manual

🔗 Dependency Chain

Direct Dependencies:

DNS Planner, DNS Enactor, Route 53

Cascading Impacts:

EC2 (DWFM), Lambda, ECS/EKS, STS, Redshift, IAM, Connect, Support Center

⏱️ Detailed Timeline

2:48 AM DNS record for dynamodb.us-east-1.amazonaws.com becomes empty
2:48 AM All new DynamoDB connections begin failing
3:38 AM Engineers identify DNS state as source of outage
4:15 AM Temporary mitigations applied for internal services
5:25 AM All DNS information restored
5:40 AM Full recovery as DNS caches expire

🔧 Technical Root Cause

Component: DNS Enactor race condition

Trigger: One DNS Enactor experienced unusual delays processing endpoints. While delayed, another Enactor completed a newer plan and triggered cleanup of older plans. The delayed Enactor then applied its stale plan to the regional endpoint, which was immediately deleted by cleanup.

Failure Mode: Empty DNS record left the system in an inconsistent state that prevented automatic recovery.

⚠️ Customer Impact

  • 100% of new connection attempts failed
  • Existing connections remained functional
  • Global tables experienced replication lag to/from us-east-1
  • Other regions remained accessible
  • All internal AWS services dependent on DynamoDB impacted

🛡️ Prevention Measures

  • DNS Planner and Enactor automation disabled worldwide
  • Race condition scenario will be fixed before re-enabling
  • Additional protections to prevent incorrect DNS plan application
  • Enhanced validation before plan deletion (see the sketch after this list)
  • Automated recovery mechanisms for DNS inconsistencies
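To make the "incorrect plan application" and "validation before deletion" bullets concrete, here is a hypothetical sketch of a guarded apply/delete path; PlanStore, its methods, and the reference check are illustrative assumptions, not AWS's implementation, and concurrency control is elided for brevity.

```python
# Hypothetical guards in the spirit of the bullets above -- not AWS's implementation.
# Apply re-validates the plan at write time; cleanup never deletes a generation that
# an endpoint still references. (Locking/atomicity is elided for brevity.)

class PlanStore:
    def __init__(self) -> None:
        self.plans: dict[int, list[str]] = {}      # generation -> DNS records
        self.endpoints: dict[str, int] = {}        # endpoint name -> active generation

    def apply(self, endpoint: str, generation: int) -> bool:
        active = self.endpoints.get(endpoint, -1)
        if generation <= active:
            return False                           # stale plan: refuse to apply
        if not self.plans.get(generation):
            return False                           # deleted or empty plan: never publish it
        self.endpoints[endpoint] = generation
        return True

    def cleanup(self, keep_latest: int) -> None:
        in_use = set(self.endpoints.values())
        for gen in list(self.plans):
            if gen < keep_latest and gen not in in_use:
                del self.plans[gen]                # only unreferenced generations are deleted

store = PlanStore()
store.plans = {2: ["10.0.0.2"], 3: ["10.0.0.3"]}
store.apply("dynamodb.us-east-1.amazonaws.com", 3)
print(store.apply("dynamodb.us-east-1.amazonaws.com", 2))   # False -- stale apply rejected
store.cleanup(keep_latest=3)
print(store.plans)                                          # generation 3 kept while referenced
```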
EC2
Total Outage
Degraded
Recovering

💥 Impact Overview

New instance launches failed for 5 hours 40 minutes, followed by 5 hours 55 minutes of network connectivity issues.

Total Duration
11h 35m
Outage Phase
5h 40m
Degraded Phase
5h 55m

🔗 Dependency Chain

Depends On:

DynamoDB (DWFM)

Cascading Impacts:

ECS/EKS/Fargate, NLB, Lambda, Redshift

⏱️ Detailed Timeline

2:48 AM DWFM state checks begin failing
2:48-5:24 AM Droplet leases slowly time out across the fleet
5:25 AM DWFM begins re-establishing leases
5:25-7:14 AM DWFM enters congestive collapse
7:14 AM Engineers throttle work, restart DWFM hosts
8:28 AM All droplet leases established
8:28-11:21 AM Network Manager backlog causes delays
11:36 AM Network propagation normalized
2:23 PM Request throttles begin relaxing
4:50 PM All throttles removed, full recovery

⚠️ Error Symptoms

Phase 1: Total Outage (2:48-8:28 AM)

Customers received InsufficientCapacity and RequestLimitExceeded errors

Phase 2: Degraded (8:28 AM-2:23 PM)

Launches succeeded but instances had no network connectivity

Phase 3: Recovering (2:23-4:50 PM)

Gradual throttle relaxation, occasional RequestLimitExceeded

🔧 Technical Details

DWFM Congestive Collapse: The Droplet Workflow Manager (DWFM) queued work to re-establish droplet leases faster than it could complete it. Timed-out work was re-queued, creating a feedback loop.

Network Manager Backlog: Large backlog of delayed network state changes (accumulated during DWFM failure) overwhelmed Network Manager, causing propagation delays of 2-3 hours.
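The collapse-and-recovery dynamic lends itself to a toy queue model. Everything in the sketch below (arrival rate, capacity curve, timeout fraction, depth threshold) is invented for illustration; the only point is that re-queued timeouts outpace completions until admission is throttled on queue depth, which is the shape of the DWFM behavior described above.

```python
# Toy model of the feedback loop described above (illustrative only). The arrival
# rate, capacity curve, timeout fraction, and depth threshold are all invented.
from collections import deque

def simulate(throttle_depth=None, steps=200, arrivals=15):
    queue, completed = deque(), 0
    for _ in range(steps):
        # Admit new lease work only while the queue is shallow (the throttle).
        if throttle_depth is None or len(queue) < throttle_depth:
            queue.extend("new" for _ in range(arrivals))
        # Effective service rate collapses as the backlog grows (contention).
        capacity = max(1, 20 - len(queue) // 2)
        for _ in range(min(capacity, len(queue))):
            queue.popleft()
            completed += 1
        # A slice of the remaining work times out and is re-queued: the feedback loop.
        queue.extend("retry" for _ in range(len(queue) // 5))
    return completed, len(queue)

print("unthrottled:", simulate())                     # little work done, huge backlog
print("throttled:  ", simulate(throttle_depth=20))    # backlog bounded, work keeps completing
```

The unthrottled run ends with a large backlog and little completed work; the depth-throttled run keeps the backlog bounded and keeps completing work, which is roughly what the 7:14 AM throttling intervention achieved.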

🛡️ Prevention Measures

  • Enhanced test suite for DWFM recovery workflow
  • Improved throttling based on queue depth
  • Rate limiting for incoming work during high load
  • Automated circuit breakers for congestive collapse scenarios
  • Network Manager capacity scaling improvements

📋 Recovery Actions Taken

1. Engineers identified DynamoDB dependency at 3:38 AM
2. Throttled incoming DWFM work to prevent queue overflow at 7:14 AM
3. Performed selective DWFM host restarts to clear queues
4. Monitored Network Manager backlog processing from 8:28 AM
5. Gradually relaxed throttles as systems stabilized from 2:23 PM
Lambda
Total Outage
Degraded
Recovering

💥 Impact Overview

Function creation/updates failed. SQS/Kinesis processing stopped. Internal systems under-scaled due to NLB failures.

Total Duration
11h 24m
Outage Phase
6h 9m
Degraded Phase
5h 27m

🔗 Dependencies

DynamoDB, EC2, NLB

Impacts:

Amazon Connect, Customer Apps

⏱️ Detailed Timeline

2:51 AM Function creation/updates begin failing
2:51 AM SQS/Kinesis event processing delays start
3:00 AM Internal SQS polling subsystem fails
5:24 AM Service operations recover (except SQS)
7:40 AM SQS polling subsystem manually restored
9:00 AM All message backlogs processed
10:04 AM NLB failures cause instance terminations
11:27 AM Sufficient capacity restored
2:27 PM Throttling gradually reduced
5:15 PM Full recovery, all backlogs processed

⚠️ Customer Impact

  • Function creation and updates failed completely
  • SQS event sources stopped processing (large backlogs)
  • Kinesis event sources experienced delays
  • Synchronous invocations prioritized over async
  • Event Source Mappings throttled
  • Asynchronous invocations throttled

🛡️ Prevention & Mitigation

1. Enhanced monitoring for SQS polling subsystem to detect failures faster
2. Automated recovery for internal polling systems
3. Improved throttling strategies to protect synchronous invocations (see the sketch after this list)
4. Better coordination with NLB health checks during EC2 recovery
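A minimal sketch of the kind of admission policy item 3 describes, assuming a single utilization signal; InvokeKind, the thresholds, and admit() are hypothetical and not Lambda's internal API.

```python
# Hypothetical admission policy along the lines of item 3 above -- not Lambda's
# actual internals. Under capacity pressure, synchronous invocations are admitted
# first; event-source polling and asynchronous invocations are shed to protect them.
from enum import Enum

class InvokeKind(Enum):
    SYNC = 1            # request/response callers actively waiting on a result
    ASYNC = 2           # queued events that tolerate delay
    EVENT_SOURCE = 3    # SQS/Kinesis pollers that can back off and retry

def admit(kind: InvokeKind, utilization: float) -> bool:
    """Return True to run the invocation now, False to throttle it for later."""
    if utilization < 0.70:
        return True                                  # healthy: admit everything
    if utilization < 0.90:
        return kind is not InvokeKind.EVENT_SOURCE   # pressure: shed pollers first
    return kind is InvokeKind.SYNC                   # near saturation: sync only

# At 92% utilization only synchronous invocations get through.
for kind in InvokeKind:
    print(kind.name, admit(kind, 0.92))
```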
Network Load Balancer
Degraded
Recovering

💥 Impact Overview

Health check oscillations caused connection errors for some load balancers.

Total Duration
8h 39m
Degraded Phase
4h 6m
Recovery Phase
4h 33m

🔗 Dependencies

EC2, Network Manager

Impacts:

Lambda, Connect, STS

⏱️ Timeline

8:30 AM Health check failures begin
9:52 AM Monitoring detects issue
12:36 PM Automatic failover disabled
5:09 PM Failover re-enabled, full recovery

🔧 Root Cause

Health check subsystem brought new EC2 instances into service before Network Manager propagated network state. Health checks alternated pass/fail, causing nodes to be repeatedly removed and added.

🛡️ Prevention

1. Velocity control mechanism to limit capacity removal during health check failures (see the sketch after this list)
2. Coordination with Network Manager to prevent premature health checks
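A minimal sketch of the velocity control in item 1, under the assumption that removals are capped per evaluation window; apply_health_checks() and the 20% cap are illustrative, not the NLB implementation.

```python
# Hypothetical velocity control for item 1 above -- illustrative, not the NLB
# implementation. Regardless of how many targets fail health checks, no more than
# a fixed fraction of capacity may be removed per evaluation window.

def apply_health_checks(in_service: set, failed: set, max_removal_fraction: float = 0.2) -> set:
    """Remove failed targets, but never more than the allowed fraction per window."""
    allowed_removals = int(len(in_service) * max_removal_fraction)
    to_remove = list(failed & in_service)[:allowed_removals]
    return in_service - set(to_remove)

targets = {f"i-{n:04d}" for n in range(10)}
flapping = set(list(targets)[:7])                  # 7 of 10 targets report failing checks
remaining = apply_health_checks(targets, flapping)
print(len(remaining), "targets kept in service")   # 8 -- removal capped at 2 this window
```

With a cap in place, a flapping health check can mark most of the fleet unhealthy without actually removing most of its capacity in a single window.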
ECS / EKS / Fargate
Container Launch Failures

💥 Impact Overview

Container launch failures and cluster scaling delays across all three services from 2:45 AM to 5:20 PM EST (14 hours 35 minutes).

Duration
14h 35m
Impact
High
Scope
All 3

🔗 Dependency Chain

Depends On:

EC2, DynamoDB, Lambda

Services Affected:

ECS, EKS, Fargate

⚠️ Customer Impact

  • Container launch failures across all three platforms
  • Cluster scaling operations delayed or failed
  • Auto-scaling unable to provision new capacity
  • Fargate task launches completely blocked
  • EKS pod scheduling failures
  • ECS service deployments stuck in pending

🔧 Technical Root Cause

Primary Dependency: All three services rely on EC2 for underlying compute capacity.

Failure Chain: DynamoDB outage → EC2 DWFM failure → No new instances → Container launch failures

Impact Duration: Services remained impaired throughout the entire EC2 recovery period including the degraded and recovering phases.

🛡️ Prevention Measures

1. Enhanced graceful degradation when EC2 capacity is constrained
2. Improved error messaging to clearly indicate EC2 dependency failures
3. Better retry logic with exponential backoff during EC2 recovery periods
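Item 3 is the standard capped exponential backoff with jitter. The sketch below is generic: launch_task and the InsufficientCapacity exception are placeholders, not ECS/EKS/Fargate APIs.

```python
# Generic capped-backoff-with-jitter sketch for item 3 above. launch_task and
# InsufficientCapacity are placeholders, not ECS/EKS/Fargate APIs.
import random
import time

class InsufficientCapacity(Exception):
    """Stand-in for a capacity error surfaced while EC2 is recovering."""

def launch_with_backoff(launch_task, max_attempts=6, base=1.0, cap=60.0):
    for attempt in range(max_attempts):
        try:
            return launch_task()
        except InsufficientCapacity:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap,
            # so stalled launches do not retry in lockstep against a recovering EC2.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage with a stub that fails twice and then succeeds.
attempts = {"count": 0}
def flaky_launch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise InsufficientCapacity("no capacity yet")
    return "placeholder-task-id"

print(launch_with_backoff(flaky_launch, base=0.1))
```

Full jitter matters here because thousands of stuck launches retrying on the same schedule would otherwise hit EC2 in lockstep just as it recovers.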
Amazon Connect
Total Outage
Chat Errors
Elevated Errors

💥 Impact Overview

Two distinct impact periods: initial DynamoDB-related failures, then NLB/Lambda-related errors.

Total Duration
13h 24m
Phase 1
5h 4m
Phase 2
6h 16m

🔗 Dependency Chain

Depends On:

DynamoDB, NLB, Lambda

Customer-Facing Impacts:

Calls, Chats, Tasks, Emails, Cases

⏱️ Detailed Timeline

2:56 AM Initial errors across all channels
5:40 AM Most Connect features recovered
5:40-8:00 AM Continued elevated chat errors
10:04 AM Second wave: NLB & Lambda issues
4:20 PM Full service restoration

⚠️ Customer Impact

Inbound Calls:
  • Busy tones and error messages
  • Failed connections and dead air
  • Prompt playback failures
  • Agent routing failures
Outbound Calls:
  • Agent-initiated calls failed
  • API-initiated calls failed
Other Channels:
  • Chat, task, email, case handling errors
  • Agent sign-in failures
  • API and Contact Search errors

📊 Data Impact

  • Real-time dashboards experienced delays during the incident
  • Historical dashboard data updates were delayed
  • Data Lake updates were delayed and were backfilled by October 28
STS (Security Token Service)
API Errors
NLB Impact

💥 Impact Overview

Two brief impact periods: initial DynamoDB dependency, then NLB health check failures.

Period 1
1h 28m
Period 2
1h 28m
Total
2h 56m

🔗 Dependencies

DynamoDB (internal), NLB

Service Function:

STS provides temporary security credentials for AWS services and federated users.

⏱️ Timeline

Period 1: DynamoDB Dependency
2:51 AM API errors and latency begin
4:19 AM Recovered after DynamoDB restoration
Period 2: NLB Health Checks
11:31 AM API errors from NLB failures
12:59 PM Normal operations resumed

⚠️ Customer Impact

  • STS API error rates elevated
  • Increased latency for token generation
  • Temporary credential requests failing
  • Federation and role assumption impacted
  • Cross-service authentication delays

🔧 Technical Details

1. First period caused by internal DynamoDB endpoints becoming unavailable
2. Recovered automatically once DynamoDB internal endpoints restored at 4:19 AM
3. Second period triggered by NLB health check oscillations removing capacity
4. Recovered as NLB health check issues were mitigated
Redshift
Query Failures
Cluster Unavailable

💥 Impact Overview

Initial query processing failures, followed by prolonged cluster unavailability due to EC2 dependency.

Query Impact
2h 34m
Extended To
Oct 21
Scope
Global

🔗 Dependency Chain

Depends On:

DynamoDB, EC2, IAM (global)

⏱️ Detailed Timeline

2:47 AM API errors for create/modify clusters
2:47 AM Query processing fails (DynamoDB dep)
5:21 AM Query operations resume
9:45 AM Stopped workflow backlog growth
5:46 PM Replacement instances start launching
Oct 21, 7:05 AM All impaired clusters restored

⚠️ Error Symptoms

Phase 1: DynamoDB Impact (2:47-5:21 AM)

Query processing failed due to inability to read/write DynamoDB metadata

Phase 2: EC2 Impact (5:21 AM onwards)

Clusters stuck in "modifying" state as credential refresh triggered replacement workflows that couldn't complete due to EC2 launch failures

Global IAM Impact (2:47-4:20 AM)

IAM user credential queries failed globally due to Redshift calling IAM APIs in us-east-1

🔧 Technical Root Cause

Multiple Failure Modes:

1. Query Processing: Redshift uses DynamoDB to read/write cluster metadata. DynamoDB outage blocked all query operations.

2. Credential Refresh Loop: As node credentials expired, Redshift automation triggered EC2 instance replacements. EC2 launch failures caused workflows to stall, leaving clusters in "modifying" state.

3. Cross-Region Defect: Redshift hardcoded IAM API calls to us-east-1 for user group resolution, causing global IAM user authentication failures.
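Failure mode 3 is the classic hidden cross-region dependency. The sketch below is illustrative only; the endpoint URLs, call() helper, and cache are hypothetical stand-ins rather than Redshift or IAM interfaces. It shows why pinning identity lookups to us-east-1 turns a regional outage into a global one, and how a region-local path with a cached fallback avoids that.

```python
# Illustrative sketch of failure mode 3 and one alternative. The endpoint URLs,
# call() helper, and cache are hypothetical -- not Redshift or IAM code.
_GROUP_CACHE = {"alice": ["analysts"]}                 # last-known-good group membership

def call(endpoint: str, user: str) -> list:
    """Stand-in for an identity API call; pretend us-east-1 is down."""
    if "us-east-1" in endpoint:
        raise ConnectionError(f"{endpoint} unreachable")
    return ["analysts"]

def resolve_groups_hardcoded(user: str) -> list:
    # Anti-pattern: every region funnels identity lookups through us-east-1, so a
    # regional outage there blocks IAM-user queries everywhere.
    return call("https://identity.us-east-1.example.internal", user)

def resolve_groups_regional(user: str, home_region: str) -> list:
    # Prefer the cluster's own region, and degrade to cached membership instead of
    # failing the query outright when the identity backend is unreachable.
    try:
        return call(f"https://identity.{home_region}.example.internal", user)
    except ConnectionError:
        return _GROUP_CACHE.get(user, [])

try:
    resolve_groups_hardcoded("alice")
except ConnectionError as err:
    print("hardcoded path fails globally:", err)
print("regional path:", resolve_groups_regional("alice", "eu-west-1"))
```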

⚠️ Customer Impact

  • Create/modify cluster operations failed
  • Query processing completely blocked
  • Clusters became unavailable for workloads
  • Clusters stuck in "modifying" state
  • Global: IAM users couldn't execute queries in any region
  • Local users unaffected in other regions

🛡️ Prevention Measures

1. Fix cross-region IAM dependency - remove hardcoded us-east-1 API calls
2. Improved graceful degradation when EC2 capacity unavailable
3. Better handling of credential expiration during EC2 outages
4. Automated detection and recovery for stuck workflows
IAM
Auth Failures

💥 Impact Overview

Console authentication failures for IAM users and Identity Center in us-east-1.

Duration
1h 34m
Scope
Console
Impact
Login

🔗 Dependencies

DynamoDB

Affected User Types:

IAM Users, Identity Center, Root (other regions)

⏱️ Timeline

2:51 AM IAM user console auth failures begin
2:51 AM Identity Center sign-in blocked (us-east-1)
2:51 AM Root/federated user sign-in errors (other regions)
4:25 AM All authentication methods restored

⚠️ Customer Impact

AWS Management Console:
  • IAM users unable to sign in (us-east-1)
  • Identity Center sign-in failures (us-east-1)
  • Root credential sign-in errors (other regions)
  • Federated users unable to sign in (other regions)
Unaffected:
  • IAM role assumption via APIs continued working
  • Existing console sessions remained active

🔧 Technical Details

  • IAM console authentication uses DynamoDB for session state management in us-east-1
  • Cross-region console access (signin.aws.amazon.com) also depended on us-east-1 endpoints
  • Authentication recovered automatically at 4:25 AM once internal DynamoDB access was restored
Support Center
Case Access Blocked

💥 Impact Overview

Support case creation, viewing, and updates blocked despite successful failover to backup region.

Duration
2h 52m
Impact
100%
Channels
All

🔗 System Architecture

Components:

Support Console, Support API, Account Metadata

Failover Status:

✓ Regional failover successful
✗ Metadata subsystem returned invalid responses

⏱️ Timeline

2:48 AM Support Center case operations fail
2:48 AM Automatic failover to backup region
2:48 AM Metadata subsystem blocks access
5:40 AM Initial mitigation applied
5:58 AM Additional prevention measures added

🔧 Technical Root Cause

Design Flaw: Account metadata subsystem had three possible response states:

1. ✓ Valid response (access granted)

2. ✓ Error response (bypass enabled)

3. ✗ Invalid response (incorrectly blocked)

Failure: During the incident, the subsystem returned invalid responses instead of error responses, which prevented the bypass logic from activating.
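A minimal sketch of that bypass flaw, assuming the metadata check returns a dict (or None on an explicit error); the field names and both functions are hypothetical. The fix is simply to treat anything other than a well-formed valid response as grounds to fail open.

```python
# Hypothetical version of the bypass logic described above (names illustrative).
# The original design only engaged the bypass on an explicit error; an "invalid"
# response fell through to a hard block.
from typing import Optional

def allow_case_access_original(metadata: Optional[dict]) -> bool:
    if metadata is None:                        # explicit error: bypass engages
        return True
    if metadata.get("status") == "valid":
        return metadata.get("access_granted", False)
    return False                                # invalid/malformed response: blocked

def allow_case_access_fixed(metadata: Optional[dict]) -> bool:
    if isinstance(metadata, dict) and metadata.get("status") == "valid":
        return metadata.get("access_granted", False)
    return True                                 # error OR invalid: fail open to the bypass

garbled = {"status": "???", "payload": ""}      # the kind of response seen during the incident
print(allow_case_access_original(garbled))      # False -- customers locked out of Support
print(allow_case_access_fixed(garbled))         # True  -- bypass still works
```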

⚠️ Customer Impact

  • Unable to create new support cases
  • Unable to view existing cases
  • Unable to update or respond to cases
  • Support Console completely blocked
  • Support API calls failed
  • Critical during a major outage event

🛡️ Prevention Measures

  • Enhanced response validation for metadata subsystem
  • Improved bypass logic to handle invalid responses
  • Additional fallback mechanisms for metadata unavailability
  • Better monitoring for metadata response patterns
  • Graceful degradation when metadata is inconsistent

💡 Lessons Learned

1. A successful regional failover is not sufficient on its own; every subsystem must handle degraded dependencies
2. Bypass logic should treat "invalid" responses the same as "error" responses
3. Support Center availability is critical during outages and requires the highest level of resiliency

Recovery Actions & Timeline (EST)

Phase 1: Detection & Initial Response
3:38 AM Engineers identified DynamoDB DNS state as the outage source (50 minutes after impact began)
4:15 AM Temporary mitigations enabled internal services to connect, restoring critical tooling
5:25 AM Manual intervention restored all DNS information for DynamoDB
5:32 AM DynamoDB global tables replicas fully caught up
5:40 AM Customer DNS caches expired, DynamoDB endpoint fully resolved
Phase 2: EC2 Recovery
7:14 AM Throttled incoming DWFM work, began selective host restarts to clear queues
8:28 AM All droplet leases re-established, EC2 launches began succeeding
11:36 AM Network Manager backlog cleared, network propagation normalized
2:23 PM Began gradually relaxing API throttles as systems stabilized
4:50 PM All EC2 throttles removed, full service recovery achieved
Phase 3: Dependent Services
7:40 AM Lambda SQS polling subsystem manually restored
9:45 AM Redshift workflow backlog growth stopped
12:36 PM NLB automatic health check failover disabled to stabilize capacity
4:20 PM Amazon Connect fully restored
5:09 PM NLB health check failover re-enabled after EC2 recovery
5:15 PM Lambda fully recovered, all backlogs processed
5:20 PM ECS/EKS/Fargate fully recovered
Oct 21, 7:05 AM All impaired Redshift clusters restored (extended recovery)

🛡️ Prevention Measures Implemented

DynamoDB DNS: Disabled DNS Planner/Enactor automation worldwide. Will fix race condition and add protections against incorrect DNS plan application before re-enabling.
EC2 DWFM: Building additional test suite for recovery workflow scale testing. Improving throttling based on queue depth to prevent congestive collapse.
Network Manager: Enhanced capacity scaling and rate limiting for backlog scenarios.
NLB: Adding velocity control mechanism to limit capacity removal during health check failures.
Lambda: Enhanced monitoring and automated recovery for internal polling subsystems.
Redshift: Removing hardcoded us-east-1 IAM API calls to prevent cross-region impacts.