2020-April-27 Service Incident


Monday April 27 13:15 - 14:02 PDT

What happened:

Our dashboard and REST API were not responding, and users were unable to log in. Sauce Connect tunnels were down.

Why it happened:

In the process of decommissioning legacy Redis in-memory storage data clusters, a misconfiguration of our REST API surfaced when the final Redis node was shut down.

How we fixed it:

We redeployed the REST API services with the correct configuration so that they reference the new Redis cluster.

What we are doing to prevent it from happening again:

We have implemented a new policy that requires approval by a four-member committee before any production service is decommissioned. The committee consists of a systems engineer, a network engineer, a software engineer from the team that owns the service, and a software lead or manager from that same team. We are also implementing a network traffic review of every production service before it is turned down: we will trace all production traffic back to its source, determine whether it is still required, and decide whether it should be moved to a new location before decommissioning. Finally, we will first cut off each service with firewall rules and hold it in that state for 24 hours before fully decommissioning it, so that service can be restored quickly if something was missed.
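
The gating checks described above can be sketched in code. The following is a minimal Python illustration only, not the actual tooling: the class, field, and role names are all hypothetical, and it assumes approvals, the traffic review results, and the firewall cutoff time are recorded somewhere this check can read them.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Set

# The four roles named in the policy. The identifiers are hypothetical.
REQUIRED_ROLES = frozenset({
    "systems_engineer",
    "network_engineer",
    "owning_team_software_engineer",
    "owning_team_lead_or_manager",
})

# Hold period between the firewall cutoff and full decommissioning.
HOLD_PERIOD = timedelta(hours=24)


@dataclass
class DecommissionRequest:
    """Hypothetical record of one service's decommissioning status."""
    service: str
    approvals: Set[str] = field(default_factory=set)
    # Traffic sources still reaching the service that nobody has reviewed.
    unreviewed_traffic_sources: Set[str] = field(default_factory=set)
    # When the service was cut off by firewall rules, if it has been.
    firewalled_at: Optional[datetime] = None

    def blockers(self, now: datetime) -> List[str]:
        """Return the reasons the service cannot yet be fully decommissioned."""
        reasons = []
        missing = REQUIRED_ROLES - self.approvals
        if missing:
            reasons.append("missing approvals: " + ", ".join(sorted(missing)))
        if self.unreviewed_traffic_sources:
            reasons.append(
                "unreviewed traffic from: "
                + ", ".join(sorted(self.unreviewed_traffic_sources))
            )
        if self.firewalled_at is None:
            reasons.append("service not yet blocked via firewall rules")
        elif now - self.firewalled_at < HOLD_PERIOD:
            reasons.append("24-hour hold has not elapsed")
        return reasons


# Example: fully approved and firewalled, but only one hour into the hold.
request = DecommissionRequest(
    service="legacy-redis",
    approvals=set(REQUIRED_ROLES),
    firewalled_at=datetime(2020, 4, 27, 13, 0),
)
print(request.blockers(datetime(2020, 4, 27, 14, 0)))
```

An empty list from `blockers` would mean all three conditions hold: every required role has approved, no traffic source is left unreviewed, and the service has been firewalled for at least 24 hours.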

Posted Apr 29, 2020 - 09:43 PDT

Dashboard, REST API and Sauce Connect tunnels have recovered.
All services are now fully operational.
Posted Apr 27, 2020 - 14:02 PDT
We are taking remedial action. Sauce Connect tunnels and REST API are recovering.
We continue to investigate.
Posted Apr 27, 2020 - 13:52 PDT
Our dashboard and REST API are not responding, and logins are failing.
Sauce Connect tunnels are not starting.
We are investigating.
Posted Apr 27, 2020 - 13:15 PDT