Monday, November 30, 23:00 - Tuesday, December 1, 02:30 PST
Over a period of 3 hours, the number of available devices decreased until ~40% of the devices in the US data center were unavailable. This affected both private and public devices in the US, reducing our overall capacity as well as the availability of some device models.
An automated security upgrade of the 'containerd' package, a dependency of Docker Engine, was started on some of our hosts. These hosts, which the physical mobile devices are connected to, became unresponsive as a result of that upgrade.
Restarting Docker on the affected hosts restored the functionality of the connected devices.
We added monitoring that alerts us immediately when a host becomes unresponsive. Auto-updates are generally disabled on the hosts, and updates need to be triggered manually.
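As an illustration of the kind of check behind an unresponsive-host alert, here is a minimal sketch. It assumes each host reports a periodic heartbeat and flags any host that has been silent for longer than a threshold. The function name, the host names, the two-minute threshold, and the heartbeat mechanism are all hypothetical and do not describe our actual monitoring setup.

```python
from datetime import datetime, timedelta, timezone


def unresponsive_hosts(last_seen, now, max_silence=timedelta(minutes=2)):
    """Return the sorted names of hosts whose last heartbeat is older than max_silence.

    last_seen maps a host name to the time of its most recent heartbeat.
    """
    return sorted(
        host for host, seen in last_seen.items() if now - seen > max_silence
    )


# Arbitrary example timestamp and hypothetical host names.
now = datetime(2000, 1, 1, 12, 0, tzinfo=timezone.utc)
last_seen = {
    "host-a": now - timedelta(seconds=30),   # recently seen
    "host-b": now - timedelta(minutes=10),   # silent for too long
}

print(unresponsive_hosts(last_seen, now))  # ['host-b']
```

In practice the returned list would be passed to whatever alerting channel is in use, so that a silent host is noticed within minutes rather than hours.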
We added instructions to our runbook for querying specific subsets of our hosts, so that commands can be run against only those hosts.
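The idea of selecting a subset of hosts before running a command can be sketched as follows. This is only an illustration under stated assumptions: the inventory structure, the label names (region, containerd_upgraded), and the host names are hypothetical and are not taken from our runbook.

```python
def select_hosts(inventory, **labels):
    """Return the sorted names of hosts whose attributes match every given label."""
    return sorted(
        name
        for name, attrs in inventory.items()
        if all(attrs.get(key) == value for key, value in labels.items())
    )


# Hypothetical inventory: host name -> attributes.
inventory = {
    "host-a": {"region": "us", "containerd_upgraded": True},
    "host-b": {"region": "us", "containerd_upgraded": False},
    "host-c": {"region": "eu", "containerd_upgraded": True},
}

# Select only the hosts a remediation command should target.
affected = select_hosts(inventory, region="us", containerd_upgraded=True)
print(affected)  # ['host-a']
```

The resulting list can then be handed to the remote-execution tooling of choice, so that a remediation step such as a Docker restart touches only the affected hosts.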