2019-July-31 Service Incident
Incident Report for Sauce Labs US West Data Center
Postmortem

Dates:

July 31st, 2019 1:20 PM - 3:14 PM PT

What happened:

At 12:21 PM we received alerts indicating high wait times for jobs in our PC, Mac, and Android clouds. Our on-call engineers promptly reviewed the alerts and found no correlation with real usage. We began triage by restarting a core scheduling service and increasing the number of instances available to schedule jobs in our PC cloud. We also identified potential issues with the underlying infrastructure and relocated the service to healthier nodes, which improved job wait times.

Why it happened:

Higher-than-usual test load, combined with the assignment of certain service instances to nodes in our control cluster, saturated a few network links. This degraded the cloud scheduling service and caused thrashing of new boot requests.

How we fixed it:

We changed the distribution of service instances to spread the load more evenly across network links.
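
For illustration only: this report does not describe the actual rebalancing mechanism, so the following is a minimal sketch of one common way to spread instances evenly, assuming each instance has a known load estimate. The function names, instance names, link names, and load figures are all hypothetical. It places the heaviest instances first, each on whichever link currently carries the least load.

```python
import heapq


def spread_instances(instance_loads, links):
    """Greedy balancing: heaviest instance first, onto the least-loaded link.

    instance_loads: dict of hypothetical instance name -> estimated load
    links: list of hypothetical network link names
    Returns a dict of instance name -> assigned link.
    """
    # Min-heap of (current total load, link name); all links start empty.
    heap = [(0.0, link) for link in links]
    heapq.heapify(heap)

    assignment = {}
    heaviest_first = sorted(instance_loads.items(), key=lambda kv: kv[1], reverse=True)
    for name, load in heaviest_first:
        total, link = heapq.heappop(heap)  # least-loaded link right now
        assignment[name] = link
        heapq.heappush(heap, (total + load, link))
    return assignment


def link_totals(instance_loads, assignment):
    """Sum the estimated load placed on each link by an assignment."""
    totals = {}
    for name, link in assignment.items():
        totals[link] = totals.get(link, 0.0) + instance_loads[name]
    return totals


# Hypothetical load estimates in arbitrary units.
loads = {"scheduler-a": 4, "scheduler-b": 3, "scheduler-c": 3, "scheduler-d": 2}
plan = spread_instances(loads, ["link-1", "link-2"])
totals = link_totals(loads, plan)
```

A greedy pass like this does not guarantee an optimal split, but it avoids concentrating several heavy instances on the same link.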

What we are doing to prevent it from happening again:

We significantly expanded the capacity of the links in the control cluster and modified the instance assignment to spread load more evenly. We’ve also added alerting and technical documentation to assist in troubleshooting should this issue recur.

Posted Aug 10, 2019 - 16:12 PDT

Resolved
After identifying the root cause of this incident, we took remedial action, and all services are now fully operational.
Posted Jul 31, 2019 - 03:14 PDT
Monitoring
Wait times in the PC Cloud have returned to normal. We are monitoring closely.
Posted Jul 31, 2019 - 02:42 PDT
Update
Wait times on our PC Cloud are still high but recovering. We are working to resolve the issue.
Posted Jul 31, 2019 - 02:04 PDT
Investigating
Wait times on our PC Cloud are high. We are taking remedial action.
Posted Jul 31, 2019 - 01:20 PDT
This incident affected: Automated VM Testing (Automated PC Testing).