August 8th, 2019, 07:35–13:22 PDT
Intermittently, customers couldn’t start tests, and wait times on the Windows and Linux cloud exceeded normal durations.
A routine code deploy caused a short disruption to our service. Short disruptions like this are normal for any cloud service, and ours is designed to recover within seconds, with no impact on customer testing. This time, however, a combination of recent changes caused the normal recovery process to fail on the PC/Linux cloud: boot times on our VMs had become slightly slower, and the cloud had recently grown in size. The deploy also coincided with a high volume of tests, which made recovery even more difficult. As a result, the PC/Linux cloud was unable to boot enough VMs of the most requested images to keep up with demand.
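To illustrate the dynamic at a high level (this is a simplified model, not our production scheduler; all names and numbers are hypothetical), the sketch below shows how a VM pool that keeps up with demand at one boot time can fall steadily behind when boots slow down, because replacements come online more slowly than tests consume ready VMs:

```python
# Illustrative model only; the figures are made up to show the dynamic,
# not to reflect real capacity numbers.

def simulate(pool_size, boot_minutes, demand_per_min, minutes):
    """Track a pool of ready VMs over time.

    Each minute, `demand_per_min` tests consume ready VMs and the pool
    starts one replacement per consumed VM, but a replacement only
    becomes ready after `boot_minutes` have elapsed.
    """
    ready = pool_size
    booting = []  # minutes remaining until each replacement is ready
    backlog = 0   # tests that had to wait because no VM was available

    for _ in range(minutes):
        # Replacements that finish booting rejoin the ready pool.
        booting = [t - 1 for t in booting]
        ready += sum(1 for t in booting if t <= 0)
        booting = [t for t in booting if t > 0]

        # Serve demand; anything unserved becomes backlog.
        served = min(demand_per_min, ready)
        backlog += demand_per_min - served
        ready -= served

        # Start one replacement per consumed VM.
        booting.extend([boot_minutes] * served)

    return ready, backlog

# Fast boots keep up; slower boots drain the pool and backlog grows.
print(simulate(pool_size=100, boot_minutes=1, demand_per_min=50, minutes=30))
print(simulate(pool_size=100, boot_minutes=4, demand_per_min=50, minutes=30))
```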
We corrected this issue by throttling traffic and manually adjusting our scheduling logic until the system cleared the backlog of customer demand.
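For readers unfamiliar with this kind of mitigation, the sketch below shows one common way to throttle admissions, a token bucket that caps the rate of new test starts so booting VMs can catch up with demand. It is a minimal illustration, not our actual tooling; the class name and rates are hypothetical:

```python
import time

class TokenBucket:
    """Allow at most `rate` test starts per second, with small bursts."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate          # tokens added per second
        self.capacity = burst     # maximum burst size
        self.tokens = float(burst)
        self.updated = time.monotonic()

    def try_start(self) -> bool:
        now = time.monotonic()
        # Refill tokens in proportion to elapsed time, up to capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False  # caller re-queues the test start and retries

# A deliberately low rate during recovery lets the pool refill.
throttle = TokenBucket(rate=5.0, burst=10)
if throttle.try_start():
    pass  # dispatch the test to a ready VM
```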
We’ve improved the performance and logic of the key cloud state management service, addressed the slow boot times on the affected hypervisors, and added better alerting. In addition, we developed a new set of tools and procedures so we can respond much more quickly should a similar issue arise. We are also reviewing our test coverage for the scheduler to find any similar corner cases we may have missed.
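As an example of the kind of alerting this calls for (a hypothetical check, not our actual monitoring configuration; the threshold and sample count are illustrative), the key is to page on a sustained drain of the ready-VM pool for an image, rather than on a brief dip that normal recovery would absorb:

```python
LOW_POOL_RATIO = 0.10    # illustrative: pool is "low" below 10% ready
SUSTAINED_SAMPLES = 5    # illustrative: this many low samples in a row

def should_page(samples, pool_size):
    """samples: recent ready-VM counts for one image, oldest first."""
    recent = samples[-SUSTAINED_SAMPLES:]
    return (len(recent) == SUSTAINED_SAMPLES
            and all(s < pool_size * LOW_POOL_RATIO for s in recent))

# A sustained drain should page; a brief dip should not.
print(should_page([80, 60, 9, 8, 7, 6, 5], pool_size=100))   # True
print(should_page([80, 9, 60, 8, 70, 9, 80], pool_size=100)) # False
```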