April 4th, 2019 21:21 - April 5th, 2019 00:03 PDT
Customers of the Real Device Cloud in the US may have experienced failures when starting Appium tests and Live Testing sessions.
A bug fix pushed to production increased the overall number of threads, which starved a key service. The issue was not apparent in pre-production environments and appeared only under full production load and high cloud capacity. We did not detect it earlier because our application monitoring was not set up to observe some key metrics and alert on anomalies.
We rolled back the change as soon as we identified the root cause.
We are reworking the bug fix to avoid the thread issue. We are enhancing application monitoring and alerting so that such issues can be detected, localized, and resolved more efficiently. We are also looking into ways to improve our load tests so that they catch such issues before changes are rolled out to production.