RESILIENCY ARCHITECT
INTELLECTUAL PROPERTY
© 2025 Resiliency Architect
All Rights Reserved
AWS N. Virginia Region Service Disruption
October 19-20, 2025
Region: us-east-1

Key Details

Start Time: Oct 20 @ 2:48 AM EST
End Time: Oct 20 @ 5:20 PM EST
Duration: 14h 32m
Severity: Critical
Region: US East (N. Virginia)
Scope: Regional (us-east-1)

Root Cause Analysis

Primary Trigger: A latent race condition in the DynamoDB DNS management system between two independent components: the DNS Planner (which monitors endpoint health and produces DNS plans) and the DNS Enactor (which applies those plans to Route 53). The system manages hundreds of thousands of DNS records across multiple endpoint types, including regional, FIPS, IPv6, and account-specific endpoints, using a single shared regional plan to simplify capacity management.

Race Condition Mechanism: One DNS Enactor experienced unusual processing delays while working through endpoint updates. During this delay, a second Enactor completed applying a newer generation plan and triggered the cleanup process to delete older plans. The delayed Enactor then applied its stale plan to the regional endpoint (dynamodb.us-east-1.amazonaws.com), which was immediately deleted by the cleanup process, leaving an empty DNS record. The initial staleness check had become invalid due to the processing delays, allowing the outdated plan to be applied.
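The ordering is easier to see in miniature. The sketch below is a toy model rather than AWS's DNS automation: the plan generations, endpoint table, and thread timings are invented, but the sequence mirrors the description above, where a staleness check that passed at pickup time is no longer valid by the time the plan is applied.

```python
# Toy model of the race described above (illustrative only, not AWS's DNS automation).
import threading
import time

plans = {1: ["10.0.0.1"], 2: ["10.0.0.2"]}   # plan generation -> DNS records
endpoint = {"active_plan": 1}                # dynamodb.us-east-1 points at generation 1
lock = threading.Lock()

def resolve() -> list:
    """What a lookup of the regional endpoint would return."""
    return plans.get(endpoint["active_plan"], [])   # empty if the plan was deleted

def delayed_enactor() -> None:
    my_plan = 2
    assert my_plan == max(plans)             # staleness check passes *at pickup time*
    time.sleep(0.3)                          # unusual processing delay
    with lock:
        endpoint["active_plan"] = my_plan    # stale plan applied to the endpoint

def newer_enactor() -> None:
    time.sleep(0.1)
    with lock:
        plans[3] = ["10.0.0.3"]
        endpoint["active_plan"] = 3          # a newer generation lands first

def cleanup() -> None:
    time.sleep(0.5)                          # cleanup of older plans, triggered after the newer plan
    with lock:
        for gen in [g for g in plans if g < max(plans)]:
            del plans[gen]                   # deletes generation 2, which the endpoint now references

threads = [threading.Thread(target=fn) for fn in (delayed_enactor, newer_enactor, cleanup)]
for t in threads: t.start()
for t in threads: t.join()
print(resolve())                             # [] -- an "empty" DNS record
```

Running the model prints an empty record set: the endpoint ends up referencing a plan generation that cleanup has already deleted, which is the empty-record state described above.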

Cascading Failure Chain: Empty DNS record → All new DynamoDB connections failed → EC2 DWFM lost state check capability → Droplet leases timed out → EC2 launches failed → DWFM congestive collapse → Network Manager backlog → NLB health check oscillations → Lambda, ECS/EKS/Fargate, Connect, STS, Redshift, IAM, and Support Center failures.

Recovery Challenge: The empty DNS state was inconsistent and prevented automatic recovery mechanisms from functioning, requiring manual operator intervention at 5:25 AM EST to restore DNS information.

Service Impact Timeline (EST)
[Timeline chart: hourly per-service status, 2 AM through 6 PM EST, October 19-20]
DynamoDB
Total Outage

💥 Impact Overview

Complete service outage from 2:48 AM to 5:40 AM EST (2 hours 52 minutes). DNS endpoint became unresolvable, blocking all new connections.

Duration
2h 52m
Impact
100%
Recovery
Manual

🔗 Dependency Chain

Direct Dependencies:

DNS Planner, DNS Enactor, Route 53

Cascading Impacts:

EC2 (DWFM), Lambda, ECS/EKS, STS, Redshift, IAM, Connect, Support Center

⏱️ Detailed Timeline

2:48 AM DNS record for dynamodb.us-east-1.amazonaws.com becomes empty
2:48 AM All new DynamoDB connections begin failing
3:38 AM Engineers identify DNS state as source of outage
4:15 AM Temporary mitigations applied for internal services
5:25 AM All DNS information restored
5:40 AM Full recovery as DNS caches expire

🔧 Technical Root Cause

Component: DNS Enactor race condition

Trigger: One DNS Enactor experienced unusual delays processing endpoints. While delayed, another Enactor completed a newer plan and triggered cleanup of older plans. The delayed Enactor then applied its stale plan to the regional endpoint, which was immediately deleted by cleanup.

Failure Mode: Empty DNS record left the system in an inconsistent state that prevented automatic recovery.

⚠️ Customer Impact

  • 100% of new connection attempts failed
  • Existing connections remained functional
  • Global tables experienced replication lag to/from us-east-1
  • Other regions remained accessible
  • All internal AWS services dependent on DynamoDB impacted

🛡️ Prevention Measures

  • DNS Planner and Enactor automation disabled worldwide
  • Race condition scenario will be fixed before re-enabling
  • Additional protections to prevent incorrect DNS plan application
  • Enhanced validation before plan deletion (see the sketch after this list)
  • Automated recovery mechanisms for DNS inconsistencies
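To make the "incorrect plan application" and "validation before deletion" bullets concrete, here is a hypothetical sketch of a guarded apply/delete path; PlanStore, its methods, and the reference check are illustrative assumptions, not AWS's implementation, and concurrency control is elided for brevity.

```python
# Hypothetical guards in the spirit of the bullets above -- not AWS's implementation.
# Apply re-validates the plan at write time; cleanup never deletes a generation that
# an endpoint still references. (Locking/atomicity is elided for brevity.)

class PlanStore:
    def __init__(self) -> None:
        self.plans: dict[int, list[str]] = {}      # generation -> DNS records
        self.endpoints: dict[str, int] = {}        # endpoint name -> active generation

    def apply(self, endpoint: str, generation: int) -> bool:
        active = self.endpoints.get(endpoint, -1)
        if generation <= active:
            return False                           # stale plan: refuse to apply
        if not self.plans.get(generation):
            return False                           # deleted or empty plan: never publish it
        self.endpoints[endpoint] = generation
        return True

    def cleanup(self, keep_latest: int) -> None:
        in_use = set(self.endpoints.values())
        for gen in list(self.plans):
            if gen < keep_latest and gen not in in_use:
                del self.plans[gen]                # only unreferenced generations are deleted

store = PlanStore()
store.plans = {2: ["10.0.0.2"], 3: ["10.0.0.3"]}
store.apply("dynamodb.us-east-1.amazonaws.com", 3)
print(store.apply("dynamodb.us-east-1.amazonaws.com", 2))   # False -- stale apply rejected
store.cleanup(keep_latest=3)
print(store.plans)                                          # generation 3 kept while referenced
```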
EC2
Total Outage
Degraded
Recovering

💥 Impact Overview

New instance launches failed for 5 hours 40 minutes, followed by 5 hours 55 minutes of network connectivity issues.

Total Duration
11h 35m
Outage Phase
5h 40m
Degraded Phase
5h 55m

🔗 Dependency Chain

Depends On:

DynamoDB (DWFM)

Cascading Impacts:

ECS/EKS/Fargate, NLB, Lambda, Redshift

⏱️ Detailed Timeline

2:48 AM DWFM state checks begin failing
2:48-5:24 AM Droplet leases slowly time out across the fleet
5:25 AM DWFM begins re-establishing leases
5:25-7:14 AM DWFM enters congestive collapse
7:14 AM Engineers throttle work, restart DWFM hosts
8:28 AM All droplet leases established
8:28-11:21 AM Network Manager backlog causes delays
11:36 AM Network propagation normalized
2:23 PM Request throttles begin relaxing
4:50 PM All throttles removed, full recovery

⚠️ Error Symptoms

Phase 1: Total Outage (2:48-8:28 AM)

Customers received InsufficientCapacity and RequestLimitExceeded errors

Phase 2: Degraded (8:28 AM-2:23 PM)

Launches succeeded but instances had no network connectivity

Phase 3: Recovering (2:23-4:50 PM)

Gradual throttle relaxation, occasional RequestLimitExceeded

🔧 Technical Details

DWFM Congestive Collapse: The Droplet Workflow Manager (DWFM) queued work to re-establish droplet leases faster than it could complete it. Timed-out work was re-queued, creating a feedback loop.

Network Manager Backlog: Large backlog of delayed network state changes (accumulated during DWFM failure) overwhelmed Network Manager, causing propagation delays of 2-3 hours.
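The collapse-and-recovery dynamic lends itself to a toy queue model. Everything in the sketch below (arrival rate, capacity curve, timeout fraction, depth threshold) is invented for illustration; the only point is that re-queued timeouts outpace completions until admission is throttled on queue depth, which is the shape of the DWFM behavior described above.

```python
# Toy model of the feedback loop described above (illustrative only). The arrival
# rate, capacity curve, timeout fraction, and depth threshold are all invented.
from collections import deque

def simulate(throttle_depth=None, steps=200, arrivals=15):
    queue, completed = deque(), 0
    for _ in range(steps):
        # Admit new lease work only while the queue is shallow (the throttle).
        if throttle_depth is None or len(queue) < throttle_depth:
            queue.extend("new" for _ in range(arrivals))
        # Effective service rate collapses as the backlog grows (contention).
        capacity = max(1, 20 - len(queue) // 2)
        for _ in range(min(capacity, len(queue))):
            queue.popleft()
            completed += 1
        # A slice of the remaining work times out and is re-queued: the feedback loop.
        queue.extend("retry" for _ in range(len(queue) // 5))
    return completed, len(queue)

print("unthrottled:", simulate())                     # little work done, huge backlog
print("throttled:  ", simulate(throttle_depth=20))    # backlog bounded, work keeps completing
```

The unthrottled run ends with a large backlog and little completed work; the depth-throttled run keeps the backlog bounded and keeps completing work, which is roughly what the 7:14 AM throttling intervention achieved.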

🛡️ Prevention Measures

  • Enhanced test suite for DWFM recovery workflow
  • Improved throttling based on queue depth
  • Rate limiting for incoming work during high load
  • Automated circuit breakers for congestive collapse scenarios
  • Network Manager capacity scaling improvements

📋 Recovery Actions Taken

1. Engineers identified DynamoDB dependency at 3:38 AM
2. Throttled incoming DWFM work to prevent queue overflow at 7:14 AM
3. Performed selective DWFM host restarts to clear queues
4. Monitored Network Manager backlog processing from 8:28 AM
5. Gradually relaxed throttles as systems stabilized from 2:23 PM
Lambda
Total Outage
Degraded
Recovering

💥 Impact Overview

Function creation/updates failed. SQS/Kinesis processing stopped. Internal systems under-scaled due to NLB failures.

Total Duration
11h 24m
Outage Phase
6h 9m
Degraded Phase
5h 27m

🔗 Dependencies

DynamoDB, EC2, NLB

Impacts:

Amazon Connect, Customer Apps

⏱️ Detailed Timeline

2:51 AM Function creation/updates begin failing
2:51 AM SQS/Kinesis event processing delays start
3:00 AM Internal SQS polling subsystem fails
5:24 AM Service operations recover (except SQS)
7:40 AM SQS polling subsystem manually restored
9:00 AM All message backlogs processed
10:04 AM NLB failures cause instance terminations
11:27 AM Sufficient capacity restored
2:27 PM Throttling gradually reduced
5:15 PM Full recovery, all backlogs processed

⚠️ Customer Impact

  • Function creation and updates failed completely
  • SQS event sources stopped processing (large backlogs)
  • Kinesis event sources experienced delays
  • Synchronous invocations prioritized over async
  • Event Source Mappings throttled
  • Asynchronous invocations throttled

🛡️ Prevention & Mitigation

1. Enhanced monitoring for SQS polling subsystem to detect failures faster
2. Automated recovery for internal polling systems
3. Improved throttling strategies to protect synchronous invocations (see the sketch after this list)
4. Better coordination with NLB health checks during EC2 recovery
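A minimal sketch of the kind of admission policy item 3 describes, assuming a single utilization signal; InvokeKind, the thresholds, and admit() are hypothetical and not Lambda's internal API.

```python
# Hypothetical admission policy along the lines of item 3 above -- not Lambda's
# actual internals. Under capacity pressure, synchronous invocations are admitted
# first; event-source polling and asynchronous invocations are shed to protect them.
from enum import Enum

class InvokeKind(Enum):
    SYNC = 1            # request/response callers actively waiting on a result
    ASYNC = 2           # queued events that tolerate delay
    EVENT_SOURCE = 3    # SQS/Kinesis pollers that can back off and retry

def admit(kind: InvokeKind, utilization: float) -> bool:
    """Return True to run the invocation now, False to throttle it for later."""
    if utilization < 0.70:
        return True                                  # healthy: admit everything
    if utilization < 0.90:
        return kind is not InvokeKind.EVENT_SOURCE   # pressure: shed pollers first
    return kind is InvokeKind.SYNC                   # near saturation: sync only

# At 92% utilization only synchronous invocations get through.
for kind in InvokeKind:
    print(kind.name, admit(kind, 0.92))
```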
Network Load Balancer
Degraded
Recovering

💥 Impact Overview

Health check oscillations caused connection errors for some load balancers.

Total Duration
8h 39m
Degraded Phase
4h 6m
Recovery Phase
4h 33m

🔗 Dependencies

EC2, Network Manager

Impacts:

Lambda, Connect, STS

⏱️ Timeline

8:30 AM Health check failures begin
9:52 AM Monitoring detects issue
12:36 PM Automatic failover disabled
5:09 PM Failover re-enabled, full recovery

🔧 Root Cause

Health check subsystem brought new EC2 instances into service before Network Manager propagated network state. Health checks alternated pass/fail, causing nodes to be repeatedly removed and added.

🛡️ Prevention

1. Velocity control mechanism to limit capacity removal during health check failures (see the sketch after this list)
2. Coordination with Network Manager to prevent premature health checks
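A minimal sketch of the velocity control in item 1, under the assumption that removals are capped per evaluation window; apply_health_checks() and the 20% cap are illustrative, not the NLB implementation.

```python
# Hypothetical velocity control for item 1 above -- illustrative, not the NLB
# implementation. Regardless of how many targets fail health checks, no more than
# a fixed fraction of capacity may be removed per evaluation window.

def apply_health_checks(in_service: set, failed: set, max_removal_fraction: float = 0.2) -> set:
    """Remove failed targets, but never more than the allowed fraction per window."""
    allowed_removals = int(len(in_service) * max_removal_fraction)
    to_remove = list(failed & in_service)[:allowed_removals]
    return in_service - set(to_remove)

targets = {f"i-{n:04d}" for n in range(10)}
flapping = set(list(targets)[:7])                  # 7 of 10 targets report failing checks
remaining = apply_health_checks(targets, flapping)
print(len(remaining), "targets kept in service")   # 8 -- removal capped at 2 this window
```

With a cap in place, a flapping health check can mark most of the fleet unhealthy without actually removing most of its capacity in a single window.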
ECS / EKS / Fargate
Container Launch Failures

💥 Impact Overview

Container launch failures and cluster scaling delays across all three services from 2:45 AM to 5:20 PM EST (14 hours 35 minutes).

Duration
14h 35m
Impact
High
Scope
All 3

🔗 Dependency Chain

Depends On:

EC2, DynamoDB, Lambda

Services Affected:

ECS, EKS, Fargate

⚠️ Customer Impact

  • Container launch failures across all three platforms
  • Cluster scaling operations delayed or failed
  • Auto-scaling unable to provision new capacity
  • Fargate task launches completely blocked
  • EKS pod scheduling failures
  • ECS service deployments stuck in pending

🔧 Technical Root Cause

Primary Dependency: All three services rely on EC2 for underlying compute capacity.

Failure Chain: DynamoDB outage → EC2 DWFM failure → No new instances → Container launch failures

Impact Duration: Services remained impaired throughout the entire EC2 recovery period including the degraded and recovering phases.

🛡️ Prevention Measures

1. Enhanced graceful degradation when EC2 capacity is constrained
2. Improved error messaging to clearly indicate EC2 dependency failures
3. Better retry logic with exponential backoff during EC2 recovery periods
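Item 3 is the standard capped exponential backoff with jitter. The sketch below is generic: launch_task and the InsufficientCapacity exception are placeholders, not ECS/EKS/Fargate APIs.

```python
# Generic capped-backoff-with-jitter sketch for item 3 above. launch_task and
# InsufficientCapacity are placeholders, not ECS/EKS/Fargate APIs.
import random
import time

class InsufficientCapacity(Exception):
    """Stand-in for a capacity error surfaced while EC2 is recovering."""

def launch_with_backoff(launch_task, max_attempts=6, base=1.0, cap=60.0):
    for attempt in range(max_attempts):
        try:
            return launch_task()
        except InsufficientCapacity:
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to an exponentially growing cap,
            # so stalled launches do not retry in lockstep against a recovering EC2.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))

# Usage with a stub that fails twice and then succeeds.
attempts = {"count": 0}
def flaky_launch():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise InsufficientCapacity("no capacity yet")
    return "placeholder-task-id"

print(launch_with_backoff(flaky_launch, base=0.1))
```

Full jitter matters here because thousands of stuck launches retrying on the same schedule would otherwise hit EC2 in lockstep just as it recovers.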
Amazon Connect
Total Outage
Chat Errors
Elevated Errors

💥 Impact Overview

Two distinct impact periods: initial DynamoDB-related failures, then NLB/Lambda-related errors.

Total Duration
13h 24m
Phase 1
5h 4m
Phase 2
6h 16m

🔗 Dependency Chain

Depends On:

DynamoDB, NLB, Lambda

Customer-Facing Impacts:

Calls, Chats, Tasks, Emails, Cases

⏱️ Detailed Timeline

2:56 AM Initial errors across all channels
5:40 AM Most Connect features recovered
5:40-8:00 AM Continued elevated chat errors
10:04 AM Second wave: NLB & Lambda issues
4:20 PM Full service restoration

⚠️ Customer Impact

Inbound Calls:
  • Busy tones and error messages
  • Failed connections and dead air
  • Prompt playback failures
  • Agent routing failures
Outbound Calls:
  • Agent-initiated calls failed
  • API-initiated calls failed
Other Channels:
  • Chat, task, email, case handling errors
  • Agent sign-in failures
  • API and Contact Search errors

📊 Data Impact

  • Real-time dashboards experienced delays during the incident
  • Historical dashboard data updates were delayed
  • Data Lake updates were delayed and were backfilled by October 28
STS (Security Token Service)
API Errors
NLB Impact

💥 Impact Overview

Two brief impact periods: initial DynamoDB dependency, then NLB health check failures.

Period 1
1h 28m
Period 2
1h 28m
Total
2h 56m

🔗 Dependencies

DynamoDB (internal), NLB

Service Function:

STS provides temporary security credentials for AWS services and federated users.

⏱️ Timeline

Period 1: DynamoDB Dependency
2:51 AM API errors and latency begin
4:19 AM Recovered after DynamoDB restoration
Period 2: NLB Health Checks
11:31 AM API errors from NLB failures
12:59 PM Normal operations resumed

⚠️ Customer Impact

  • STS API error rates elevated
  • Increased latency for token generation
  • Temporary credential requests failing
  • Federation and role assumption impacted
  • Cross-service authentication delays

🔧 Technical Details

1. First period caused by internal DynamoDB endpoints becoming unavailable
2. Recovered automatically once DynamoDB internal endpoints restored at 4:19 AM
3. Second period triggered by NLB health check oscillations removing capacity
4. Recovered as NLB health check issues were mitigated
Redshift
Query Failures
Cluster Unavailable

💥 Impact Overview

Initial query processing failures, followed by prolonged cluster unavailability due to EC2 dependency.

Query Impact
2h 34m
Extended To
Oct 21
Scope
Global

🔗 Dependency Chain

Depends On:

DynamoDB, EC2, IAM (global)

⏱️ Detailed Timeline

2:47 AM API errors for create/modify clusters
2:47 AM Query processing fails (DynamoDB dep)
5:21 AM Query operations resume
9:45 AM Stopped workflow backlog growth
5:46 PM Replacement instances start launching
Oct 21, 7:05 AM All impaired clusters restored

⚠️ Error Symptoms

Phase 1: DynamoDB Impact (2:47-5:21 AM)

Query processing failed due to inability to read/write DynamoDB metadata

Phase 2: EC2 Impact (5:21 AM onwards)

Clusters stuck in "modifying" state as credential refresh triggered replacement workflows that couldn't complete due to EC2 launch failures

Global IAM Impact (2:47-4:20 AM)

IAM user credential queries failed globally due to Redshift calling IAM APIs in us-east-1

🔧 Technical Root Cause

Multiple Failure Modes:

1. Query Processing: Redshift uses DynamoDB to read/write cluster metadata. DynamoDB outage blocked all query operations.

2. Credential Refresh Loop: As node credentials expired, Redshift automation triggered EC2 instance replacements. EC2 launch failures caused workflows to stall, leaving clusters in "modifying" state.

3. Cross-Region Defect: Redshift hardcoded IAM API calls to us-east-1 for user group resolution, causing global IAM user authentication failures.
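Failure mode 3 is the classic hidden cross-region dependency. The sketch below is illustrative only; the endpoint URLs, call() helper, and cache are hypothetical stand-ins rather than Redshift or IAM interfaces. It shows why pinning identity lookups to us-east-1 turns a regional outage into a global one, and how a region-local path with a cached fallback avoids that.

```python
# Illustrative sketch of failure mode 3 and one alternative. The endpoint URLs,
# call() helper, and cache are hypothetical -- not Redshift or IAM code.
_GROUP_CACHE = {"alice": ["analysts"]}                 # last-known-good group membership

def call(endpoint: str, user: str) -> list:
    """Stand-in for an identity API call; pretend us-east-1 is down."""
    if "us-east-1" in endpoint:
        raise ConnectionError(f"{endpoint} unreachable")
    return ["analysts"]

def resolve_groups_hardcoded(user: str) -> list:
    # Anti-pattern: every region funnels identity lookups through us-east-1, so a
    # regional outage there blocks IAM-user queries everywhere.
    return call("https://identity.us-east-1.example.internal", user)

def resolve_groups_regional(user: str, home_region: str) -> list:
    # Prefer the cluster's own region, and degrade to cached membership instead of
    # failing the query outright when the identity backend is unreachable.
    try:
        return call(f"https://identity.{home_region}.example.internal", user)
    except ConnectionError:
        return _GROUP_CACHE.get(user, [])

try:
    resolve_groups_hardcoded("alice")
except ConnectionError as err:
    print("hardcoded path fails globally:", err)
print("regional path:", resolve_groups_regional("alice", "eu-west-1"))
```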

⚠️ Customer Impact

  • Create/modify cluster operations failed
  • Query processing completely blocked
  • Clusters became unavailable for workloads
  • Clusters stuck in "modifying" state
  • Global: IAM users couldn't execute queries in any region
  • Local users unaffected in other regions

🛡️ Prevention Measures

1. Fix cross-region IAM dependency - remove hardcoded us-east-1 API calls
2. Improved graceful degradation when EC2 capacity unavailable
3. Better handling of credential expiration during EC2 outages
4. Automated detection and recovery for stuck workflows
IAM
Auth Failures

💥 Impact Overview

Console authentication failures for IAM users and Identity Center in us-east-1.

Duration
1h 34m
Scope
Console
Impact
Login

🔗 Dependencies

DynamoDB

Affected User Types:

IAM Users, Identity Center, Root (other regions)

⏱️ Timeline

2:51 AM IAM user console auth failures begin
2:51 AM Identity Center sign-in blocked (us-east-1)
2:51 AM Root/federated user sign-in errors (other regions)
4:25 AM All authentication methods restored

⚠️ Customer Impact

AWS Management Console:
  • IAM users unable to sign in (us-east-1)
  • Identity Center sign-in failures (us-east-1)
  • Root credential sign-in errors (other regions)
  • Federated users unable to sign in (other regions)
Unaffected:
  • IAM role assumption via APIs continued working
  • Existing console sessions remained active

🔧 Technical Details

  • IAM console authentication uses DynamoDB for session state management in us-east-1
  • Cross-region console access (signin.aws.amazon.com) also depended on us-east-1 endpoints
  • Authentication recovered automatically at 4:25 AM once internal DynamoDB access was restored
Support Center
Case Access Blocked

💥 Impact Overview

Support case creation, viewing, and updates blocked despite successful failover to backup region.

Duration
2h 52m
Impact
100%
Channels
All

🔗 System Architecture

Components:

Support Console, Support API, Account Metadata

Failover Status:

✓ Regional failover successful
✗ Metadata subsystem returned invalid responses

⏱️ Timeline

2:48 AM Support Center case operations fail
2:48 AM Automatic failover to backup region
2:48 AM Metadata subsystem blocks access
5:40 AM Initial mitigation applied
5:58 AM Additional prevention measures added

🔧 Technical Root Cause

Design Flaw: Account metadata subsystem had three possible response states:

1. ✓ Valid response (access granted)

2. ✓ Error response (bypass enabled)

3. ✗ Invalid response (incorrectly blocked)

Failure: During the incident, the subsystem returned invalid responses instead of error responses, which prevented the bypass logic from activating.
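A minimal sketch of that bypass flaw, assuming the metadata check returns a dict (or None on an explicit error); the field names and both functions are hypothetical. The fix is simply to treat anything other than a well-formed valid response as grounds to fail open.

```python
# Hypothetical version of the bypass logic described above (names illustrative).
# The original design only engaged the bypass on an explicit error; an "invalid"
# response fell through to a hard block.
from typing import Optional

def allow_case_access_original(metadata: Optional[dict]) -> bool:
    if metadata is None:                        # explicit error: bypass engages
        return True
    if metadata.get("status") == "valid":
        return metadata.get("access_granted", False)
    return False                                # invalid/malformed response: blocked

def allow_case_access_fixed(metadata: Optional[dict]) -> bool:
    if isinstance(metadata, dict) and metadata.get("status") == "valid":
        return metadata.get("access_granted", False)
    return True                                 # error OR invalid: fail open to the bypass

garbled = {"status": "???", "payload": ""}      # the kind of response seen during the incident
print(allow_case_access_original(garbled))      # False -- customers locked out of Support
print(allow_case_access_fixed(garbled))         # True  -- bypass still works
```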

⚠️ Customer Impact

  • Unable to create new support cases
  • Unable to view existing cases
  • Unable to update or respond to cases
  • Support Console completely blocked
  • Support API calls failed
  • Critical during a major outage event

🛡️ Prevention Measures

  • Enhanced response validation for metadata subsystem
  • Improved bypass logic to handle invalid responses
  • Additional fallback mechanisms for metadata unavailability
  • Better monitoring for metadata response patterns
  • Graceful degradation when metadata is inconsistent

💡 Lessons Learned

1. A successful regional failover is not sufficient on its own; every subsystem must handle degraded dependencies
2. Bypass logic should treat "invalid" responses the same as "error" responses
3. Support Center availability is critical during outages and requires the highest level of resiliency

Recovery Actions & Timeline (EST)

Phase 1: Detection & Initial Response
3:38 AM Engineers identified DynamoDB DNS state as the outage source (50 minutes after impact began)
4:15 AM Temporary mitigations enabled internal services to connect, restoring critical tooling
5:25 AM Manual intervention restored all DNS information for DynamoDB
5:32 AM DynamoDB global tables replicas fully caught up
5:40 AM Customer DNS caches expired, DynamoDB endpoint fully resolved
Phase 2: EC2 Recovery
7:14 AM Throttled incoming DWFM work, began selective host restarts to clear queues
8:28 AM All droplet leases re-established, EC2 launches began succeeding
11:36 AM Network Manager backlog cleared, network propagation normalized
2:23 PM Began gradually relaxing API throttles as systems stabilized
4:50 PM All EC2 throttles removed, full service recovery achieved
Phase 3: Dependent Services
7:40 AM Lambda SQS polling subsystem manually restored
9:45 AM Redshift workflow backlog growth stopped
12:36 PM NLB automatic health check failover disabled to stabilize capacity
4:20 PM Amazon Connect fully restored
5:09 PM NLB health check failover re-enabled after EC2 recovery
5:15 PM Lambda fully recovered, all backlogs processed
5:20 PM ECS/EKS/Fargate fully recovered
Oct 21, 7:05 AM All impaired Redshift clusters restored (extended recovery)

🛡️ Prevention Measures Implemented

DynamoDB DNS: Disabled DNS Planner/Enactor automation worldwide. Will fix race condition and add protections against incorrect DNS plan application before re-enabling.
EC2 DWFM: Building additional test suite for recovery workflow scale testing. Improving throttling based on queue depth to prevent congestive collapse.
Network Manager: Enhanced capacity scaling and rate limiting for backlog scenarios.
NLB: Adding velocity control mechanism to limit capacity removal during health check failures.
Lambda: Enhanced monitoring and automated recovery for internal polling subsystems.
Redshift: Removing hardcoded us-east-1 IAM API calls to prevent cross-region impacts.