Fault in DynamoDB system cascaded through AWS services, knocking major sites offline for hours
theregister.co.ukAmazon has published a detailed postmortem explaining how a critical fault in DynamoDB's DNS management system cascaded into a day-long outage that disrupted major websites and services across multiple brands – with damage estimates potentially reaching hundreds of billions of dollars.
The incident began at 11:48 PM PDT on October 19 (7.48 UTC on October 20), when customers reported increased DynamoDB API error rates in the Northern Virginia US-EAST-1 Region. The root cause was a race condition in DynamoDB's automated DNS management system that left an empty DNS record for the service's regional endpoint.
The DNS management system comprises two independent components (for availability reasons): a DNS Planner that monitors load balancer health and creates DNS plans, and a DNS Enactor that applies changes via Amazon Route 53.
Amazon's postmortem says the error rate was triggered by "a latent defect" within the service's automated ...
Copyright of this story solely belongs to theregister.co.uk . To see the full text click HERE

