2020-March-3 Service Incident
Incident Report for Sauce Labs US West Data Center
Postmortem

Dates:

Tue March 3 21:25 - Wed March 4 02:05 PT

What happened:

No tests were starting for our Virtual Device Clouds, including Mac, PC, Android and iOS. A series of servers in our control plane started rebooting. The control plane kept recovering, but because the reboots were rolling, they continued to cause problems until the issue was resolved.

Why it happened:

After an incident last week, we needed to reboot many of our control plane servers to completely resolve that incident. Those servers had been up since they were originally built (over 370 days). Although automated upgrades are disabled on control plane hosts, these hosts had installed a kernel upgrade before automated upgrades were disabled. A kernel upgrade does not take effect until the host reboots, which did not happen until March 2, 2020. The kernel that became active on these systems as a result has a bug affecting specific features that our control plane services use heavily, and that bug causes silent kernel panics.

When the first set of servers rebooted, they were taken out of service to run chassis diagnostics. This put more pressure on the remaining systems running the bad kernel. When the first of those remaining servers rebooted, it compounded the problem and caused a series of cascading reboots across our control plane.

How we fixed it:

We determined which systems had the affected kernel, then rolled those servers back to a known-stable kernel.
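
As an illustration of the first half of that fix, here is a minimal sketch of how hosts could be sorted by their running kernel release. It is not the tooling we used. The host names, the kernel release strings, and the function name `find_affected_hosts` are all hypothetical, and the inventory is assumed to have been collected beforehand (for example from the output of `uname -r` on each host).

```python
import platform


def find_affected_hosts(kernel_by_host, affected_releases):
    """Return, in sorted order, the hosts whose running kernel is affected.

    kernel_by_host maps a host name to its running kernel release string.
    affected_releases is an iterable of release strings known to be bad.
    """
    affected = set(affected_releases)
    return sorted(
        host for host, release in kernel_by_host.items() if release in affected
    )


if __name__ == "__main__":
    # Hypothetical inventory; real data would come from the hosts themselves.
    inventory = {
        "cp-01": "4.15.0-88-generic",
        "cp-02": "4.15.0-76-generic",
        "cp-03": "4.15.0-88-generic",
    }
    # Hypothetical affected release; the report does not name the kernel.
    print(find_affected_hosts(inventory, {"4.15.0-88-generic"}))
    # The release of the machine running this script, as reported by the OS.
    print(platform.release())
```

Hosts returned by a check like this would be the candidates for rolling back to a known-stable kernel.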

What we are doing to prevent it from happening again:

We are taking the following steps:

  1. Reviewing all control plane hosts to ensure they are not configured for a kernel upgrade upon reboot.
  2. Auditing all production systems to ensure the offending kernel is not present in any other systems.
  3. Initiating the process of pinning all production kernels to a known-stable kernel.
  4. Implementing a testing and review process for kernel upgrades.
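
The review in step 1 amounts to comparing the kernel a host is running with the newest kernel installed on it. A rough sketch of that comparison follows, under some assumptions: the release strings are hypothetical and Debian/Ubuntu-style (such as `4.15.0-88-generic`), the function names are made up for illustration, and the ordering is a simplification that looks only at the numeric parts of the release string. It is not a description of our actual review process.

```python
import re


def release_key(release):
    """Turn a release such as '4.15.0-88-generic' into (4, 15, 0, 88).

    Simplification: only the numeric fields are compared, which is an
    assumption that suits Debian/Ubuntu-style release strings.
    """
    return tuple(int(n) for n in re.findall(r"\d+", release))


def pending_kernel(running, installed):
    """Return the newest installed release if it is newer than the running
    one, meaning it would become active on the next reboot; else None."""
    newest = max(installed, key=release_key, default=running)
    if release_key(newest) > release_key(running):
        return newest
    return None


if __name__ == "__main__":
    # Hypothetical host: an upgrade was installed but the host never rebooted.
    running = "4.15.0-76-generic"
    installed = ["4.15.0-76-generic", "4.15.0-88-generic"]
    print(pending_kernel(running, installed))
```

A host for which this returns a release rather than `None` would switch kernels on its next reboot, which is the situation the review is meant to catch.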
Posted Mar 05, 2020 - 16:39 PST

Resolved
This incident has been resolved.
Posted Mar 04, 2020 - 02:05 PST
Monitoring
We have restored our Virtual Device Clouds, including Mac, PC, Android and iOS, to full capacity, and tests are running as normal. We continue to monitor.
Posted Mar 04, 2020 - 00:19 PST
Update
Our engineers have identified some of the causes of this incident. Virtual device and desktop tests are sporadically available. We consider the cloud to be in an unreliable state and are continuing to investigate.
Posted Mar 03, 2020 - 23:10 PST
Update
Customers are still unable to start virtual device or desktop tests of any kind. Our engineers are continuing to investigate.
Posted Mar 03, 2020 - 22:15 PST
Investigating
No tests are starting for our Virtual Device Clouds, including Mac, PC, Android and iOS. Our engineers are investigating.
Posted Mar 03, 2020 - 21:25 PST
This incident affected: Manual Testing (Manual VM Testing, Manual RDC Testing) and Automated VM Testing (Automated PC Testing, Automated Mac Testing, Automated iOS Simulator Testing, Automated Android Emulator Testing).