Monday April 27 13:15 - 14:02 PDT
Our dashboard and REST API were not responding, and users were unable to log in. Sauce Connect Tunnels were down.
While we were decommissioning our legacy Redis in-memory data store clusters, a misconfiguration in our REST API surfaced when the final legacy Redis node was shut down.
We redeployed the REST API services with the correct configuration so that they reference the new Redis cluster.
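A configuration problem of this kind can in principle be caught before a node is shut down by scanning a service's settings for references to hosts that are about to be retired. The following is only a minimal sketch of that idea, not a description of our actual tooling; the setting names, host names and the `stale_settings` helper are hypothetical.

```python
from typing import Dict, Set


def stale_settings(settings: Dict[str, str], retired_hosts: Set[str]) -> Dict[str, str]:
    """Return the settings whose value mentions any host slated for retirement.

    Matching is a plain substring check, which is an assumption made for
    brevity; real host names may need stricter matching.
    """
    return {
        key: value
        for key, value in settings.items()
        if any(host in value for host in retired_hosts)
    }


# Hypothetical settings for illustration only.
example_settings = {
    "REDIS_URL": "redis://redis-legacy-1.internal:6379/0",
    "CACHE_URL": "redis://redis-new-1.internal:6379/0",
}
flagged = stale_settings(example_settings, {"redis-legacy-1"})
```

An empty result would suggest the configuration no longer references the retired hosts; any flagged key would need to be updated and redeployed first.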
We have implemented a new policy that requires approval by a four-member committee before any production service is decommissioned. The approvers required for a service turndown are a systems engineer, a network engineer, a software engineer from the team that owns the service, and a software lead or manager from the same team. We are also introducing a network traffic review for every production service before it is turned down. We will trace all production traffic back to its source, determine whether it is still required, and decide whether it should be moved to a new location before decommissioning. Finally, we will first cut off services using firewall rules and hold them in that state for 24 hours before fully decommissioning them, so that service can be restored quickly if something was missed.
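The traffic review described above can be sketched roughly as follows. This is an illustrative example under stated assumptions, not our production process: the `Connection` record, the host names and both helper functions are hypothetical, and the connection data is assumed to have already been collected from flow logs or a similar source.

```python
from collections import Counter
from typing import Iterable, NamedTuple


class Connection(NamedTuple):
    source: str       # client host or service name (hypothetical format)
    destination: str  # host receiving the traffic


def remaining_clients(connections: Iterable[Connection],
                      retiring_hosts: Iterable[str]) -> Counter:
    """Count observed connections per source that still target a retiring host."""
    retiring = set(retiring_hosts)
    return Counter(c.source for c in connections if c.destination in retiring)


def safe_to_decommission(connections: Iterable[Connection],
                         retiring_hosts: Iterable[str]) -> bool:
    """True only when no observed traffic still targets a retiring host."""
    return not remaining_clients(connections, retiring_hosts)


# Hypothetical observations for illustration only.
observed = [
    Connection("rest-api", "redis-legacy-3"),
    Connection("rest-api", "redis-new-1"),
    Connection("dashboard", "redis-new-1"),
]
print(remaining_clients(observed, ["redis-legacy-3"]))
```

Any source that appears in the result still depends on the retiring hosts and would need to be moved or confirmed as unnecessary before the turndown proceeds. Traffic that is infrequent may not show up in a short observation window, which is one reason for the 24-hour hold behind firewall rules.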