January 17th, 2020 6:29 am - 6:39 am PDT
Some customers intermittently couldn’t start tests and wait times exceeded normal durations on all clouds.
A configuration change resulted in a communication breakage between two services. The change affected all customers that are part of the extended team management platform. The scheduling system was unable to retrieve entitlements which resulted in queueing of customer sessions instead of cloud resources assignment.
We corrected this issue by issuing a revert to the previous stable service version immediately after detecting a problem. This resulted in rapid recovery of the system.
We are adding an extra automated step to the deployment pipeline that will confirm the validity of the configuration from the app perspective. In addition, the monitoring is being reviewed so that we can detect possible connection issues sooner, before it propagates to the rest of the system.