July 31st, 2019 1:20 PM - 3:14 PM PT
At 12:21 PM we received alerts indicating high wait times for jobs in our PC, Mac, and Android clouds. Our on-call engineers quickly reviewed the alerts and noted that the elevated wait times did not correlate with real user load. We began triaging by restarting a core scheduling service and increasing the number of instances available to schedule jobs in our PC cloud. We also identified potential issues with the underlying infrastructure and relocated the service to healthier nodes, which improved job wait times.
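As an illustration of the triage check described above, the sketch below compares the job wait-time metric against concurrent usage to see whether the two move together. The metric names, data shapes, and threshold are hypothetical, not our actual monitoring pipeline.

```python
# Illustrative only: hypothetical metric names and data shapes,
# not our real monitoring stack.
from statistics import correlation  # Python 3.10+

def wait_times_track_usage(wait_times_p95, active_sessions, threshold=0.5):
    """Return True if job wait times rise and fall with real usage.

    wait_times_p95 and active_sessions are equal-length lists sampled at the
    same interval (e.g. one point per minute from the metrics store).
    """
    r = correlation(wait_times_p95, active_sessions)
    return r >= threshold

# During this incident a check like this would have come back False: wait
# times were climbing while usage stayed flat, pointing away from load and
# toward the scheduling path itself.
```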
Higher-than-usual test load, combined with the assignment of certain service instances to nodes in our control cluster, saturated a few network links. This degraded the cloud scheduling service, causing it to thrash on new boot requests.
We changed the distribution of service instances to spread the load more evenly across network links.
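Below is a simplified sketch of the kind of placement change involved: assign each instance to the node behind the currently least-loaded link, so no single control-cluster link carries a disproportionate share. The node, link, and traffic-model names are illustrative assumptions, not our scheduler's actual API.

```python
# Illustrative placement sketch; names and the load model are hypothetical.
from collections import defaultdict

def place_instances(instances, nodes, link_of):
    """Spread service instances across network links.

    instances: {instance_name: expected_traffic}
    nodes:     list of node names
    link_of:   {node: uplink shared with other nodes}
    Returns   {instance_name: node}.
    """
    link_load = defaultdict(float)
    placement = {}
    for name, traffic in instances.items():
        # Place on the node whose uplink currently carries the least traffic.
        node = min(nodes, key=lambda n: link_load[link_of[n]])
        placement[name] = node
        link_load[link_of[node]] += traffic
    return placement
```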
We significantly expanded the capacity of the links in the control cluster and modified instance assignment to spread load more evenly. We've also added alerting and technical documentation to assist in troubleshooting should this issue ever recur.
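The new alerting is conceptually along these lines: fire when utilization on any control-cluster link stays above a saturation threshold for a sustained window, since that was the leading indicator in this incident. The check below is a hypothetical sketch, not our production alerting configuration.

```python
# Illustrative alert check; thresholds and data shapes are assumptions.

def saturated_links(samples, threshold=0.85, sustained_points=5):
    """Flag links whose recent utilization indicates saturation.

    samples: {link_name: [utilization between 0 and 1, one per interval]}
    Returns the links whose most recent `sustained_points` samples all
    exceed `threshold`.
    """
    return [
        link
        for link, series in samples.items()
        if len(series) >= sustained_points
        and all(u > threshold for u in series[-sustained_points:])
    ]
```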