This repository has been archived by the owner on Jun 6, 2024. It is now read-only.
P2 Alert severity should be customizable per cluster. E.g., NodeNotReady is not a serious issue in the wu2 cluster. Admins should be able to modify the configmap easily.
P2 auto-set repeat_interval / group_interval by alert severity
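The two items above (per-cluster severity and intervals derived from severity) could be sketched roughly as follows. This is only an illustration: the severity names, the interval values, and the helper names `effective_severity` and `build_routes` are assumptions, not part of the existing PAI configuration. The `matchers`, `group_interval`, and `repeat_interval` keys follow the Alertmanager route format.

```python
from typing import Dict, List, Optional

# Assumed defaults; real values would live in the cluster's configmap.
DEFAULT_SEVERITY: Dict[str, str] = {"NodeNotReady": "error"}

# Assumed mapping from severity to Alertmanager timing settings.
INTERVALS_BY_SEVERITY: Dict[str, Dict[str, str]] = {
    "fatal": {"group_interval": "5m", "repeat_interval": "1h"},
    "error": {"group_interval": "10m", "repeat_interval": "4h"},
    "warn": {"group_interval": "30m", "repeat_interval": "24h"},
}


def effective_severity(alert_name: str,
                       cluster_overrides: Optional[Dict[str, str]] = None) -> str:
    """Return the severity for an alert, preferring the cluster's override."""
    overrides = cluster_overrides or {}
    # Unknown alerts fall back to "warn" (an arbitrary choice for this sketch).
    return overrides.get(alert_name, DEFAULT_SEVERITY.get(alert_name, "warn"))


def build_routes() -> List[dict]:
    """Build one Alertmanager-style child route per severity level."""
    return [
        {"matchers": [f'severity="{severity}"'], **intervals}
        for severity, intervals in INTERVALS_BY_SEVERITY.items()
    ]


if __name__ == "__main__":
    # A cluster such as wu2 could downgrade NodeNotReady without touching defaults.
    wu2_overrides = {"NodeNotReady": "warn"}
    print(effective_severity("NodeNotReady", wu2_overrides))
    for route in build_routes():
        print(route)
```

Keeping the overrides as a small per-cluster map means an admin only edits the entries that differ from the defaults.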
Webportal:
P2 Filter alerts by user/admin; show users only the alerts related to them
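A minimal sketch of the filtering idea, assuming each alert is a dict carrying a `labels` map with a `username` label that identifies the owner. That label name and the `visible_alerts` helper are hypothetical; the actual webportal data model may differ.

```python
from typing import Dict, List


def visible_alerts(alerts: List[Dict], username: str, is_admin: bool) -> List[Dict]:
    """Admins see every alert; other users see only alerts labelled with their name."""
    if is_admin:
        return list(alerts)
    return [
        alert for alert in alerts
        # "username" is an assumed label; alerts without it are hidden from users.
        if alert.get("labels", {}).get("username") == username
    ]


if __name__ == "__main__":
    sample = [
        {"labels": {"alertname": "JobGpuIdle", "username": "alice"}},
        {"labels": {"alertname": "NodeNotReady"}},
    ]
    print(visible_alerts(sample, "alice", is_admin=False))
```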
P2: NodeGPUCountChanged: the configured GPU count is read from layout.yml, but that configured count is not correct. Included in Installation script refinement #5100
Alert only on master node. Question: what about worker nodes?
P3: Issue @scarlett2018 reported in email: For a low priority VMSS bed, nodes get preempted by Azure and come back when available every day. “Node not ready” alerts are triggered too frequently (and there seem to be no actions ops can take for this type of alert). See [Azure Low Priority VM Support] Differentiate preemption type #3806. We may need to add an exporter for this.
P3 Add more alerts for PAI services, such as CPU usage, memory usage, and disk usage. Make sure PAI services do not consume too many resources (especially daemon services).
Other Candidates
P5 alert for multi-cluster
P3 Auto-resolve alerts, e.g. run a privileged diagnostic pod on the worker node and run commands to resolve the current alerts
Working items about Alerts:
Email clarity
Webportal:
Delete script
Misleading alerts:
Alerts Summary
Question: Where to save the statistics?
Alert coverage
Other Candidates
DONE:
Duplicated alerts: #5052
nodeNotReady -> PaiServicePodNotRunning | PaiServicePodNotReady (in the nodes)
Admin Experiment
Alerts Definition Not Accurate