-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add AlertManager API #10308
Add AlertManager API #10308
Conversation
Jenkins results:
|
b2187b3
to
18bc5f3
Compare
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And, I don't have strong preference for time function to be part of this codebase. You can split it from this code into separate module, which will be easy to test and share with other codebase if necessary.
18bc5f3
to
44f7c7a
Compare
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Erik, the way I see this API being used is:
- we initialize it when initializing a service that can use it (thus, all the defaults that won't change, should be set as an object variable - including the RequestHandler object);
- when we want to trigger an alert, we call the already initialized AlertManagerAPI object and send the alert;
- in the alert itself, we could have a call to another function to put the timestamp in the required standard
- I think it would be important to also have some input data validation/sanitization and ensure that the user provided values supported by AM (like the severity in [low, medium, high]
In short, it would be a very similar usage compared to the current EmailAlert
module.
Please let me know if I'm missing something here. Once we converge on the logical changes, I'll come back here with some aesthetic suggestions ;)
f89015f
to
04706b4
Compare
Jenkins results:
|
Jenkins results:
|
04706b4
to
bda6976
Compare
Jenkins results:
|
+1 |
bda6976
to
170074e
Compare
Jenkins results:
|
170074e
to
141cb67
Compare
Jenkins results:
|
141cb67
to
915bf75
Compare
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Erik, code is looking good. However, I have made several suggestions in the code; and a few questions as well.
Documentation of the new class is good as well, but I think we can enhance it a bit more (like, we could mention which data type is expected for those args in the send() method). This is especially important because these classes under Utils are commonly shared with other DMWM projects.
src/python/Utils/Timers.py
Outdated
|
||
|
||
class LocalTimezone(tzinfo): | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Erik, can you please remove this blank line here?
|
||
def __init__(self): | ||
super(LocalTimezone, self).__init__() | ||
self.ZERO = timedelta(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I follow this logic.
ZERO is the current timezone
STDOFFSET would be UTC timezone
DSTOFFSET would be daylight saving time, so plus or minus 1h of UTC/ZERO.
Is that correct? How about putting a very basic comment in place (or a method docstring) to help others and myself looking at it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should never use this class except for alerts and we should forget about it once we can use the builtin functions from python3. I included a link to the python docs that include this class as an example.
self.mgr = RequestHandler() | ||
self.ltz = LocalTimezone() | ||
|
||
def send(self, alertName, severity, summary, description, tag="wmcore", service="", endSecs=600, generatorURL=""): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps, we could give an even more explicit name to this method, e.g. sendAlert
. What do you think?
The docstring is also missing the tag parameter.
Last comment here is actually a question, as I look at it, I wonder if we should simply make service
a mandatory argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Service being mandatory is probably a good idea. We could go one step further and make a list of "valid" services to check against, since the service name will be used to route alerts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the idea of validating the service against a list of known and expected services (properly capitalized or not). The only problem will be if others, e.g. CRAB, decide to use this same module.
Ideally the list of valid service names should be configurable, but I fail to see how to enforce that ... and the only feasible logic I see is hard-wiring it in the source code and updating it in the future, if needed. What are your thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now we should probably just trust ourselves to use the correct names instead of putting in additional checks. Something to think about if we find we are unable to handle that.
# sender's hostname is added as an annotation | ||
self.hostname = socket.gethostname() | ||
self.mgr = RequestHandler() | ||
self.ltz = LocalTimezone() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about creating another object attribute self.generatorURL
with the alertmanager front page as default? Then on the send
method, we default that to None, and if no value is provided, we fallback to the dafault value in the instance.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure about setting generatorURL
to the AlertManager page. Unless this is sent in the body of the email, it would just be a link to the AlertManager page that is clickable from the AlertManager page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this generatorURL
is optional then? We can simply pass the default empty string from the code below and everything would work, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tested with a blank string and things work. Alert shows up with no "Source" on the AM page.
915bf75
to
185badd
Compare
Jenkins results:
|
Jenkins results:
|
784040c
to
df6a19e
Compare
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good to me, Erik.
I still have a couple of questions/comments though:
- could you please file another GH issue asking for the removal of this LocalTimezone class and the one line update in the alert manager API (post python3 migration)? Just so we don't forget it.
- do we need Valentin's help in setting up things on the alerting system? We can also follow this up offline or on slack.
- I think we could likely persist some of the content in your gdoc in a WMCore wiki; also providing some guidelines for using this API and properly defining alerts within WMCore. If you prefer, I can draft the first version of such wiki, please let me know.
@amaltaro , I suggest that for all your monitoring needs to open CMS Monitoring Jira ticket. This way any of us (me, Ceyhun, Federica) can work on it. All of has access to configuration files for AM and can add appropriate settings for your use-cases. Once again, we only need to know which tag, emails, slack channels should be used. |
Just for the record. We do have very handy
Therefore, once you confirm your configuration and settle down on tags, etc., you may start using it for your alerts. The alerts can be viewed in plain (human friendly) and json formats. The tool can be used on any linux, and inside or outside of CERN network (for later I just committed the update and it will require a valid token). I always prefer CLI solution over web based one and our alerting infrastructure provides that. |
Thanks Valentin. I added this information to our WMCore wiki as well: https://github.com/dmwm/WMCore/wiki/Prometheus-AlertManager-wrapper-APIs#listing-alerts-with-cli-tools |
ok, and to clarify one common question. The AlertManager does not have API to delete alerts, instead it is achieved in different ways. If alert was fired by Prometheus and its rules come back to normal Prometheus takes care of its resolution. For manually inserted alerts (like your use case), the alert can be deleted by simply adjusting its EndsAt timestamp to current time or time in a past. Our Also, I included in my PR a new tool called token-manager, which will allow to handle (renew) access tokens. With such token AM can be accessed from outside of CERN, i.e. you can inject, inspect or update alerts from distributed workflows. But this is a topic of separate discussion. |
Fixes #10340
Status
API has been tested manually, need another commit (or PR) to replace email alerting into the microservices with AlertManager API calls
Description
This adds a module to provide alerting capabilities via the MONIT AlertManager API.
Is it backward compatible (if not, which system it affects?)
YES
Related PRs
NA
External dependencies / deployment changes
Specific documentation on this: https://github.com/dmwm/WMCore/wiki/Prometheus-AlertManager-wrapper-APIs
JIRA ticket for the CMS Monitoring team: https://its.cern.ch/jira/browse/CMSMONIT-363