[Monitor] Alert System Refine #4810

Binyang2014 · 2020-08-13T03:54:18Z

Work items for alerts:

Email clarity

P2 Alert severity should be customizable per cluster, e.g., NodeNotReady is not a serious issue in the wu2 cluster. Admins should be able to modify the configmap easily.
P2 Automatically set repeat_interval / group_interval according to alert severity
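
One way the severity-based timing could work is sketched below. This is only an illustration: the interval values, the `build_severity_routes` helper, and the receiver name are assumptions, not part of the current alert-manager deployment. The output uses the Alertmanager route fields `match`, `receiver`, `group_interval`, and `repeat_interval`, and a per-cluster override shows how an admin might adjust one severity without touching the rest.

```python
# Hypothetical default timings per severity; the values are illustrative only.
INTERVALS_BY_SEVERITY = {
    "critical": {"group_interval": "5m", "repeat_interval": "1h"},
    "error": {"group_interval": "10m", "repeat_interval": "4h"},
    "warn": {"group_interval": "30m", "repeat_interval": "12h"},
    "info": {"group_interval": "1h", "repeat_interval": "24h"},
}


def build_severity_routes(receiver, overrides=None):
    """Return one Alertmanager-style child route per severity level.

    `overrides` maps a severity to the timing fields a cluster wants to change.
    """
    # Copy the defaults so per-cluster overrides never mutate the shared table.
    table = {sev: dict(timing) for sev, timing in INTERVALS_BY_SEVERITY.items()}
    for sev, timing in (overrides or {}).items():
        table.setdefault(sev, {}).update(timing)
    return [
        {"match": {"severity": sev}, "receiver": receiver, **timing}
        for sev, timing in table.items()
    ]


# Example: a cluster that wants fewer repeated "warn" emails.
routes = build_severity_routes(
    "pai-email-admin",  # hypothetical receiver name
    overrides={"warn": {"repeat_interval": "48h"}},
)
```

The resulting list could be serialized to YAML and placed under `route.routes` in the Alertmanager configuration.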

Webportal:

P2 Filter alerts by user/admin; show users only the alerts related to them
P2 Colorize alerts by severity level; refer to https://hu.pinterest.com/pin/556194622715526398/ (default alert severity)
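
A rough sketch of the two Webportal items, written in Python for brevity rather than in the Webportal's own code. It assumes each alert carries a `severity` label and, for job-related alerts, a `username` label; both label names and the color choices are assumptions made for illustration.

```python
# Hypothetical severity -> display color mapping.
SEVERITY_COLORS = {
    "critical": "red",
    "error": "orange",
    "warn": "yellow",
    "info": "blue",
}


def visible_alerts(alerts, username, is_admin):
    """Admins see every alert; users see only alerts labeled with their name."""
    if is_admin:
        return list(alerts)
    return [a for a in alerts if a["labels"].get("username") == username]


def with_color(alert):
    """Attach a display color based on severity, defaulting to gray."""
    color = SEVERITY_COLORS.get(alert["labels"].get("severity"), "gray")
    return {**alert, "color": color}


sample_alerts = [
    {"labels": {"alertname": "NodeNotReady", "severity": "error"}},
    {"labels": {"alertname": "JobKilled", "severity": "warn", "username": "alice"}},
]
alice_view = [with_color(a) for a in visible_alerts(sample_alerts, "alice", False)]
admin_view = [with_color(a) for a in visible_alerts(sample_alerts, "root", True)]
```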

Delete script

P2 Scripts for deleting storage are missing

Misleading alerts:

P2: NodeGPUCountChanged: the configured GPU count is read from layout.yml, but that configured value is not always correct. Included in Installation script refinement #5100
P3 Misleading Worker Node Alerts:
- NodeMemoryUsage Alert: NodeMemoryUsage should be triggered when system's usage is high, not user's usage #2760;
- NodeCPUUsage Alert: NodeCPUUsage should be triggered when system's usage is high, not user's usage #2762
- NodeFilesystemUsage, NodeDiskPressure, NodeOutOfDisk, NodeNotReady, AzureAgentConsumeTooMuch
- Alert only on master node; Question: what about worker nodes?
P3: Issue @scarlett2018 reported in email: on a low-priority VMSS bed, nodes get preempted by Azure and come back when capacity is available, every day. "Node not ready" alerts are triggered too frequently (and there seems to be no action ops can take for this type of alert). [Azure Low Priority VM Support] Differentiate preemption type #3806. We may need to add an exporter for this.

Alerts Summary

P3 Alert logging and daily summary / email to admin. Consider: Using Grafana to generate alert report #4944
Question: where to save the statistics?

Alert coverage

P3 Add more alerts for PAI services, such as CPU usage, memory usage, and disk usage. Make sure PAI services (especially daemon services) do not consume too many resources.

Other Candidates

P5 Alerts for multi-cluster
P3 Auto-resolve alerts, e.g., run a privileged diagnostic pod on the worker node and run commands to resolve the current alerts

DONE:

Duplicated alerts: #5052

P1 Inhibition: some alerts are already covered by higher-level alerts:
NodeNotReady -> PaiServicePodNotRunning|PaiServicePodNotReady (on the affected nodes)
P1 PaiServicePodNotRunning|PaiServicePodNotReady and PaiServiceNotUp overlap
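
The inhibition idea can be illustrated with a small Python model of the matching logic. This is a sketch, not the rule shipped in #5052: the `node_name` label and the rule structure are assumptions. A target alert is suppressed when a source alert is firing and both agree on every label listed in `equal`.

```python
# Hypothetical rule: a NodeNotReady alert hides pod-level alerts on the same node.
INHIBIT_RULES = [
    {
        "source": "NodeNotReady",
        "targets": {"PaiServicePodNotRunning", "PaiServicePodNotReady"},
        "equal": ["node_name"],  # assumed label name
    }
]


def is_inhibited(alert, alerts, rules):
    """Return True if any rule's source alert is firing with matching labels."""
    labels = alert["labels"]
    for rule in rules:
        if labels.get("alertname") not in rule["targets"]:
            continue
        for src in alerts:
            src_labels = src["labels"]
            if src_labels.get("alertname") != rule["source"]:
                continue
            if all(src_labels.get(k) == labels.get(k) for k in rule["equal"]):
                return True
    return False


def apply_inhibition(alerts, rules=INHIBIT_RULES):
    """Keep only the alerts that are not inhibited by another firing alert."""
    return [a for a in alerts if not is_inhibited(a, alerts, rules)]


firing = [
    {"labels": {"alertname": "NodeNotReady", "node_name": "worker-a"}},
    {"labels": {"alertname": "PaiServicePodNotRunning", "node_name": "worker-a"}},
    {"labels": {"alertname": "PaiServicePodNotRunning", "node_name": "worker-b"}},
]
remaining = apply_inhibition(firing)
```

Here the pod alert on `worker-a` is dropped because the node itself is already reported, while the pod alert on `worker-b` is kept.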

Admin Experiment

P1 Alert severity level, group by level; critical/error/warn/info. Alert Severity #5055
P1 Email template refine: Alert Email Template Refine #5064
- different templates for users;
- Instruction on how to deal with the alert: add link to openpai handbook->troubleshooting;
- Kill job alert: alert handler stop-job notice not clear to end user #5021
- Allow users to add customized email templates without rebuilding alertmanager

Alerts Definition Not Accurate

P1 NodeGpuCountChanged: change this metric to compare current GPU count with configured GPU count. fix NodeGpuCountChanged definition issue #5072
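
The comparison described for NodeGpuCountChanged can be sketched as follows. The function name and the input shapes are hypothetical; in practice the configured counts would come from the cluster layout and the current counts from the GPU exporter metrics.

```python
def gpu_count_mismatches(configured, current):
    """Return {node: (configured_count, current_count)} for nodes that differ.

    A node missing from `current` is treated as reporting 0 GPUs.
    """
    return {
        node: (expected, current.get(node, 0))
        for node, expected in configured.items()
        if current.get(node, 0) != expected
    }


# Example inputs (made-up node names and counts).
configured_gpus = {"worker-a": 4, "worker-b": 8, "worker-c": 4}
current_gpus = {"worker-a": 4, "worker-b": 7}
mismatches = gpu_count_mismatches(configured_gpus, current_gpus)
```

Comparing against the configured value, instead of against the previously observed value, means the alert stays active until the node is back to its expected GPU count.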

Binyang2014 self-assigned this Aug 13, 2020

scarlett2018 added pai-dev ops-opt labels Sep 2, 2020

mydmdm assigned suiguoxin Sep 10, 2020

suiguoxin mentioned this issue Sep 29, 2020

2020 Sept ~ Oct release plan #4898

Closed

31 tasks

scarlett2018 mentioned this issue Oct 21, 2020

2020 Oct ~ Nov release plan #4988

Closed

38 tasks

suiguoxin mentioned this issue Nov 2, 2020

V1.4 Release Plan #5043

Closed

suiguoxin mentioned this issue Nov 16, 2020

V1.4 Bug Bash #5087

Closed

39 tasks

suiguoxin changed the title ~~[Monitor] Alert rules udpate~~ [Monitor] Alert System Refine Nov 30, 2020

This was referenced Nov 30, 2020

2020 Dec Release Plan #5134

Closed

2021 Jan release plan #5141

Closed

debuggy mentioned this issue Jan 4, 2021

2021 Jan. Release Test Plan #5218

Closed

14 tasks

debuggy mentioned this issue Jan 15, 2021

2021 Feb Release Plan #5253

Closed

55 tasks

fanyangCS mentioned this issue Mar 11, 2021

Alert: NodeMemoryUsage should be triggered when system's usage is high, not user's usage #2760

Closed
