Tue March 3 21:25 - Wed March 4 02:05 PT
No tests were starting for our Virtual Device Clouds, including Mac, PC, Android and iOS. A series of servers in our control plane started rebooting. Our control plane was recovering, but these were rolling reboots that continued to cause problems until the issue was resolved.
After an incident last week, we needed to reboot many of our control plane servers to completely resolve that incident. Those servers had been up since they were originally built (over 370 days). Even though we have automated upgrades disabled on control plane hosts, these hosts had taken a kernel upgrade before automated upgrades were disabled. These kernel upgrades do not take effect until reboot, which did not happen until March 2, 2020. The kernel that was installed on these systems as a result has a bug surrounding specific features used heavily in our control plane services which causes quiet kernel panics. When the first set of boxes rebooted, they were taken out of service to run chassis diagnostics. This created more pressure on the remaining systems with the bad kernel. When the first of the remaining servers rebooted, this compounded the issue causing a series of cascading reboots across our control plane.
We determined which systems had the affected kernel, then rolled those servers back to a known-stable kernel.
We are taking the following steps: