December 16th, 10:41 - 11:20 PDT
Some customers experienced high wait times running tests in our PC, Mac, &/or Android clouds in our US-West datacenter.
Why it happened:
A failure within our DNS infrastructure resulted in a loss of connectivity from an availability zone to our control plane for 20 to 40 minutes at a time.
How we fixed it:
We’ve modified the configuration of our DNS servers to respond to the error condition and recover immediately.
What we are doing to prevent it from happening again:
- We’ve developed extensive playbooks that will allow us to diagnose and respond more rapidly to an event where our VMs have lost connectivity to our control plane.
- We’ve deployed new test infrastructure that validates our configuration and gives us rapid feedback to the conditions and quality of our environment
- We’ve deployed additional logging and monitoring of our Load Balancing and DNS infrastructure.
- We’ve configured additional network level logging to provide advanced diagnostics and replay of failure scenarios.