This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

[Monitor] Alert System Refine #4810

Open
9 of 23 tasks
Binyang2014 opened this issue Aug 13, 2020 · 0 comments

Binyang2014 commented Aug 13, 2020

Work items for alerts:

Email clarity

  • P2 Alert severity should be customizable per cluster. E.g., NodeNotReady is not a serious issue in the wu2 cluster. Admins should be able to modify the ConfigMap easily.
  • P2 Automatically set repeat_interval / group_interval based on alert severity
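
A minimal sketch of how the two items above could fit together, assuming per-cluster overrides are stored as a simple alert-name-to-severity mapping and that notification routes are generated from a severity table. The severity names (`fatal`, `error`, `warn`), the interval values, the receiver name, and both helper functions are hypothetical; only `group_interval`, `repeat_interval`, `receiver`, and `match` are taken from the Alertmanager route format.

```python
# Hypothetical severity -> interval table; the issue does not specify values.
INTERVALS_BY_SEVERITY = {
    "fatal": {"group_interval": "5m", "repeat_interval": "1h"},
    "error": {"group_interval": "10m", "repeat_interval": "4h"},
    "warn": {"group_interval": "30m", "repeat_interval": "24h"},
}


def effective_severity(alert_name, default_severity, cluster_overrides):
    """Return the severity for an alert, honoring a per-cluster override."""
    return cluster_overrides.get(alert_name, default_severity)


def build_routes(receiver):
    """Build one Alertmanager-style route per severity level."""
    routes = []
    for severity, intervals in INTERVALS_BY_SEVERITY.items():
        routes.append(
            {
                "receiver": receiver,
                "match": {"severity": severity},
                **intervals,
            }
        )
    return routes


if __name__ == "__main__":
    # Example: a cluster (such as wu2) downgrades NodeNotReady to a warning.
    overrides = {"NodeNotReady": "warn"}
    print(effective_severity("NodeNotReady", "error", overrides))
    for route in build_routes("pai-email"):
        print(route)
```

With this layout, an admin would only edit the override mapping for a cluster, and the intervals would follow from the resulting severity.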

Webportal:

Delete script

  • P2 Scripts for deleting storage are missing

Misleading alerts:

Alerts Summary

Alert coverage

  • P3 Add more alerts for PAI services, such as CPU usage, memory usage, and disk usage, to make sure PAI services (especially daemon services) do not consume too many resources

Other Candidates

  • P5 Alerts for multi-cluster setups
  • P3 Auto-resolve alerts, for example by running a privileged diagnostic pod on the worker node and running commands to resolve the current alerts

DONE:

Duplicated alerts: #5052

  • P1 Inhibition: some alerts are already covered by higher-level alerts:
    NodeNotReady -> PaiServicePodNotRunning|PaiServicePodNotReady (on the affected nodes)
  • P1 PaiServicePodNotRunning|PaiServicePodNotReady and PaiServiceNotUp overlap
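
The inhibition idea in the first item can be illustrated with a small sketch: when a node-level alert fires, pod-level alerts on the same node are suppressed. This is only an illustration of the logic, not the change made in #5052. The `inhibit` function and the `node_name` label are hypothetical, and unlike Alertmanager's own `equal` matching, this simplified version never matches alerts that lack the label.

```python
def inhibit(alerts, source_name, target_names, equal_label):
    """Drop target alerts that share `equal_label` with a firing source alert."""
    # Collect the label values (e.g. node names) for which the source fires.
    source_values = {
        a["labels"].get(equal_label)
        for a in alerts
        if a["labels"].get("alertname") == source_name
    }
    source_values.discard(None)

    kept = []
    for alert in alerts:
        labels = alert["labels"]
        is_target = labels.get("alertname") in target_names
        if is_target and labels.get(equal_label) in source_values:
            continue  # covered by the higher-level alert on the same node
        kept.append(alert)
    return kept


if __name__ == "__main__":
    firing = [
        {"labels": {"alertname": "NodeNotReady", "node_name": "worker-1"}},
        {"labels": {"alertname": "PaiServicePodNotReady", "node_name": "worker-1"}},
        {"labels": {"alertname": "PaiServicePodNotRunning", "node_name": "worker-2"}},
    ]
    remaining = inhibit(
        firing,
        source_name="NodeNotReady",
        target_names={"PaiServicePodNotRunning", "PaiServicePodNotReady"},
        equal_label="node_name",
    )
    for alert in remaining:
        print(alert["labels"])
```

In the example, the pod alert on `worker-1` is dropped because `NodeNotReady` already fires there, while the pod alert on `worker-2` is kept.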

Admin Experiment

Alerts Definition Not Accurate

@Binyang2014 Binyang2014 self-assigned this Aug 13, 2020
@suiguoxin suiguoxin mentioned this issue Nov 16, 2020
39 tasks
@suiguoxin suiguoxin changed the title [Monitor] Alert rules udpate [Monitor] Alert System Refine Nov 30, 2020
This was referenced Nov 30, 2020
@debuggy debuggy mentioned this issue Jan 4, 2021
14 tasks
@debuggy debuggy mentioned this issue Jan 15, 2021
55 tasks