February 13th, 2019 1:35 - 5:00 pm PST
Some tests failed to run and new tunnels would not start reliably.
Our Sauce Connect Tunnel Cloud experienced a burst in peak usage shortly after 1 pm, which left it unable to boot new tunnels fast enough to meet demand. We also discovered a bug that increased boot times under extreme contention, which prolonged the recovery.
We throttled requests to start Sauce Connect to allow our Tunnel Cloud to stabilize.
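For readers unfamiliar with request throttling, the general idea can be sketched with a token bucket. This is an illustration only, not the mechanism used in our Tunnel Cloud: the `TokenBucket` class, its `rate` and `capacity` parameters, and the injectable `clock` are all hypothetical names chosen for the example.

```python
import time


class TokenBucket:
    """Admit roughly `rate` requests per second, with bursts up to `capacity`.

    Hypothetical sketch: names and parameters are assumptions for illustration.
    """

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate              # tokens added per second
        self.capacity = capacity      # maximum tokens the bucket can hold
        self.clock = clock            # injectable so the behaviour can be tested
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self):
        """Return True if a request may proceed now, False if it is throttled."""
        now = self.clock()
        # Refill in proportion to elapsed time, never above capacity.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

A caller would check `allow()` before admitting each tunnel start request and reject or queue the request when it returns `False`, which gives the backend time to drain work it has already accepted.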
Between February 13th and February 20th we added 50% more capacity to our Tunnel Cloud, and we are working on mitigations that will allow us to scale more effectively under peak usage. We’ve also prepared a number of new nodes with an updated technology stack, which we’ll use to validate improvements in performance and tunnel density. Additional monitoring and alerting are now in place, allowing us to respond rapidly to the kind of tunnel node degradation that has led to poor performance and higher wait times for affected customers over the last two weeks. We have not experienced any tunnel downtime since the outage on Wednesday, February 13th, and we continue to respond proactively to tunnel issues before they affect service availability. Longer term, we will continue tuning our code, updating our technology stack, and implementing better guard rails to prevent irregular behaviour. We’ll also be improving our tunnel load balancing implementation and capacity modelling to provide a more performant and reliable tunnel experience.