[Alerting] Task Manager doesn't automatically recover if polling fails #74785
Labels
Feature:Alerting
Feature:Task Manager
Team:ResponseOps
Label for the ResponseOps team (formerly the Cases and Alerting teams)
Task Manager doesn't have any built-in ability to recover if the polling cycle fails.
In the past we have identified specific failure cases where the polling cycle broke and addressed each of them, but ideally TM would recover on its own when such a case happens by restarting the broken poller.
For us to gain full confidence in mission-critical usage of alerting, a Nodemon-like ability to restart the internal poller seems paramount.
Alongside this change, we should expose metrics that can be collected on demand to aid in SDH support once we go GA.
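To make the proposal concrete, here is a minimal sketch of what a self-restarting poller with on-demand metrics might look like. Everything in it is an assumption for illustration: `PollerSupervisor`, `createPoller`, `maxConsecutiveFailures`, and the `PollerMetrics` fields are hypothetical names, not Task Manager's actual API, and the real poller may be structured quite differently.

```typescript
// Hypothetical sketch; none of these names come from Task Manager itself.
type Poll = () => Promise<void>;

interface PollerMetrics {
  successfulPolls: number;
  failedPolls: number;
  restarts: number;
  lastSuccessAt: number | null; // epoch millis of the last successful poll
  lastError: string | null; // message of the most recent failure
}

class PollerSupervisor {
  private readonly createPoller: () => Poll;
  private readonly maxConsecutiveFailures: number;
  private poll: Poll;
  private consecutiveFailures = 0;
  private timer: ReturnType<typeof setInterval> | undefined;
  private readonly metrics: PollerMetrics = {
    successfulPolls: 0,
    failedPolls: 0,
    restarts: 0,
    lastSuccessAt: null,
    lastError: null,
  };

  constructor(createPoller: () => Poll, maxConsecutiveFailures = 3) {
    this.createPoller = createPoller;
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.poll = createPoller();
  }

  // Runs one polling cycle. Failures are recorded instead of propagated, and
  // the poller is rebuilt once too many cycles have failed in a row.
  async tick(): Promise<void> {
    try {
      await this.poll();
      this.consecutiveFailures = 0;
      this.metrics.successfulPolls++;
      this.metrics.lastSuccessAt = Date.now();
    } catch (err) {
      this.consecutiveFailures++;
      this.metrics.failedPolls++;
      this.metrics.lastError = err instanceof Error ? err.message : String(err);
      if (this.consecutiveFailures >= this.maxConsecutiveFailures) {
        // Discard the broken poller and build a fresh one.
        this.poll = this.createPoller();
        this.consecutiveFailures = 0;
        this.metrics.restarts++;
      }
    }
  }

  start(intervalMs: number): void {
    if (this.timer === undefined) {
      this.timer = setInterval(() => void this.tick(), intervalMs);
    }
  }

  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  // Returns a copy of the current counters, so callers can collect them on demand.
  getMetrics(): PollerMetrics {
    return { ...this.metrics };
  }
}
```

A caller would construct the supervisor with a factory that builds a fresh poller, call `start(intervalMs)`, and read `getMetrics()` whenever a snapshot is needed. This sketch only covers polls that reject; a poll that hangs forever, or cycles that overlap because one runs longer than the interval, would need an additional timeout or in-flight guard.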