2020-December-01 Service Incident
Postmortem

Dates:

Monday, November 30, 23:00 - Tuesday, December 1, 02:30 PST

What happened:

Over a period of about 3 hours, the number of available devices decreased until ~40% of the devices in the US data center were unavailable. This affected both private and public devices in the US, reducing our overall capacity and the availability of some device models.

Why it happened:

An automated security upgrade of the 'containerd' package, a dependency of Docker Engine, started on some of our hosts. These hosts, to which the physical mobile devices are connected, became unresponsive as a result of the upgrade.

How we fixed it:

Restarting Docker on the affected hosts restored the connected devices to service.

What we are doing to prevent it from happening again:

We added monitoring that alerts us immediately when a host becomes unresponsive. Auto-updates are generally disabled on the hosts, and updates need to be triggered manually.
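
To illustrate the kind of check such monitoring might perform, here is a minimal sketch that flags hosts whose last heartbeat is older than a timeout. It is not our actual monitoring setup: the host names, the heartbeat mechanism, the two-minute threshold, and the function name find_unresponsive_hosts are all assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical threshold: how long a host may stay silent before it is flagged.
HEARTBEAT_TIMEOUT = timedelta(minutes=2)


def find_unresponsive_hosts(last_heartbeat, now, timeout=HEARTBEAT_TIMEOUT):
    """Return the sorted names of hosts whose last heartbeat is older than `timeout`."""
    return sorted(
        host for host, seen in last_heartbeat.items()
        if now - seen > timeout
    )


if __name__ == "__main__":
    # Illustrative data only; these host names are made up.
    now = datetime(2020, 12, 1, 7, 0, tzinfo=timezone.utc)
    heartbeats = {
        "host-a": now - timedelta(seconds=30),
        "host-b": now - timedelta(minutes=10),
    }
    for host in find_unresponsive_hosts(heartbeats, now):
        print(f"ALERT: {host} has been silent for longer than {HEARTBEAT_TIMEOUT}")
```

A check like this would run on a short interval and page the on-call engineer, rather than waiting for device availability to drop noticeably.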

We added instructions to our runbook for querying specific subsets of our hosts, so that commands can be run against only those hosts.
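
As a rough sketch of what such a runbook query could look like, the example below filters a host inventory by attributes and prints, without executing, the command that would be run on each match. The inventory structure, field names, host names, and the helpers select_hosts and build_ssh_commands are hypothetical, and the systemctl-based restart is an assumption about how Docker is managed on a host.

```python
import shlex

# Hypothetical inventory; the field names are assumptions for illustration.
HOSTS = [
    {"name": "us-host-01", "region": "us", "containerd_upgraded": True},
    {"name": "us-host-02", "region": "us", "containerd_upgraded": False},
    {"name": "us-host-03", "region": "us", "containerd_upgraded": True},
    {"name": "eu-host-01", "region": "eu", "containerd_upgraded": True},
]


def select_hosts(hosts, **criteria):
    """Return the names of hosts whose fields match every given criterion."""
    return [
        host["name"] for host in hosts
        if all(host.get(key) == value for key, value in criteria.items())
    ]


def build_ssh_commands(host_names, remote_command):
    """Build one ssh argument list per host; nothing is executed here."""
    return [["ssh", name, remote_command] for name in host_names]


if __name__ == "__main__":
    affected = select_hosts(HOSTS, region="us", containerd_upgraded=True)
    # Dry run: print the commands so an operator can review them first.
    for command in build_ssh_commands(affected, "sudo systemctl restart docker"):
        print(shlex.join(command))
```

Keeping the selection step separate from the execution step lets an operator confirm the list of hosts before any command is run.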

Posted Dec 08, 2020 - 08:24 PST

Resolved
This incident has been resolved.
Posted Dec 01, 2020 - 01:39 PST
Identified
A significant number of real devices (~40%) in our US data center are unavailable for testing. We are working to resolve the issue.
Posted Dec 01, 2020 - 01:27 PST