Thursday, February 27, 8:06 AM - 9:00 AM PT
The Sauce Labs San Jose datacenter was non-operational for approximately 18 minutes starting at 8:06 AM PT. A job backlog persisted for another 22 minutes on both the Android and Mac clouds; the PC cloud took 36 minutes to work through its backlog and return to a stable and performant state.
A communication failure between the Sauce Labs platform and its database left us unable to process jobs.
We closed unnecessary connections to the database to assist in its recovery, and diverted cloud capacity to expedite processing of the job backlog.
We’re adding database resilience features that isolate failed nodes without disrupting the platform. We also have architectural work in progress that will reduce the time required to recover and work through a job queue.