Sunday February 23, 9:07 PM - Monday February 24, 1:17 AM PT
Our Mac, PC & Virtual Device clouds experienced elevated error rates. Sessions failed to start correctly, and the Web UI returned intermittent errors.
A Kubernetes node experienced a sudden spike in memory usage. As a result, the network proxy service on that node was unable to properly manage security rules, leading to inconsistent communication between it and other services used by the cluster.
We isolated and reset the unreliable node. Services were re-established on a healthy node, and error rates began to recover.
We are implementing stronger limits on memory and CPU utilization on a per-node basis. We are also investigating ways to protect key services such as kube-proxy against memory and CPU spikes.
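To illustrate what per-node resource limits can look like in practice, here is a minimal sketch of the "node allocatable" arithmetic that Kubernetes documents: the memory available to workloads is the node's capacity minus what is reserved for Kubernetes daemons, for the operating system, and for the eviction threshold. The class name, the helper function, and every number below are hypothetical and chosen for illustration only; they are not our actual configuration.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class NodeMemoryBudget:
    """Hypothetical per-node memory budget, in MiB."""
    capacity_mib: int            # total memory on the node
    kube_reserved_mib: int       # held back for Kubernetes daemons (e.g. kubelet)
    system_reserved_mib: int     # held back for OS daemons
    eviction_threshold_mib: int  # free memory below which pods are evicted

    def allocatable_mib(self) -> int:
        # Allocatable = capacity - kube-reserved - system-reserved - eviction threshold
        reserved = (
            self.kube_reserved_mib
            + self.system_reserved_mib
            + self.eviction_threshold_mib
        )
        return max(self.capacity_mib - reserved, 0)


def fits(node: NodeMemoryBudget, pod_requests_mib: Iterable[int]) -> bool:
    """Return True if the summed pod memory requests fit in the allocatable budget."""
    return sum(pod_requests_mib) <= node.allocatable_mib()


# Illustrative values only.
node = NodeMemoryBudget(
    capacity_mib=16384,
    kube_reserved_mib=1024,
    system_reserved_mib=512,
    eviction_threshold_mib=100,
)
print(node.allocatable_mib())
print(fits(node, [8192, 4096]))
print(fits(node, [8192, 4096, 4096]))
```

The point of reserving memory this way is that workloads are admitted against the reduced budget rather than the full capacity, which leaves headroom for node-level services when a workload's usage spikes.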