Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add AlertManager API #10308

Merged
merged 1 commit into from
Mar 16, 2021
Merged

Add AlertManager API #10308

merged 1 commit into from
Mar 16, 2021

Conversation

goughes
Copy link
Contributor

@goughes goughes commented Feb 24, 2021

Fixes #10340

Status

API has been tested manually, need another commit (or PR) to replace email alerting into the microservices with AlertManager API calls

Description

This adds a module to provide alerting capabilities via the MONIT AlertManager API.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

NA

External dependencies / deployment changes

Specific documentation on this: https://github.com/dmwm/WMCore/wiki/Prometheus-AlertManager-wrapper-APIs
JIRA ticket for the CMS Monitoring team: https://its.cern.ch/jira/browse/CMSMONIT-363

@goughes goughes self-assigned this Feb 24, 2021
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 4 tests no longer failing
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 12 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: failed
    • fails python3 compatibility test

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11309/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 4 tests no longer failing
  • Pylint check: succeeded
    • 10 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11310/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@vkuznet vkuznet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And, I don't have strong preference for time function to be part of this codebase. You can split it from this code into separate module, which will be easy to test and share with other codebase if necessary.

src/python/WMCore/Services/AlertManager/AlertManagerAPI.py Outdated Show resolved Hide resolved
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 4 tests no longer failing
  • Pylint check: succeeded
    • 10 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11313/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erik, the way I see this API being used is:

  • we initialize it when initializing a service that can use it (thus, all the defaults that won't change, should be set as an object variable - including the RequestHandler object);
  • when we want to trigger an alert, we call the already initialized AlertManagerAPI object and send the alert;
    • in the alert itself, we could have a call to another function to put the timestamp in the required standard
    • I think it would be important to also have some input data validation/sanitization and ensure that the user provided values supported by AM (like the severity in [low, medium, high]

In short, it would be a very similar usage compared to the current EmailAlert module.
Please let me know if I'm missing something here. Once we converge on the logical changes, I'll come back here with some aesthetic suggestions ;)

@goughes goughes force-pushed the alertmanager-api branch 2 times, most recently from f89015f to 04706b4 Compare February 25, 2021 21:38
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 3 tests no longer failing
  • Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 20 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11331/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 3 tests no longer failing
  • Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 20 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 8 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11332/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 3 tests no longer failing
  • Pylint check: succeeded
    • 17 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 6 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11338/artifact/artifacts/PullRequestReport.html

@vkuznet
Copy link
Contributor

vkuznet commented Feb 26, 2021

+1

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 18 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11472/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 18 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11473/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 18 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 5 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11479/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erik, code is looking good. However, I have made several suggestions in the code; and a few questions as well.

Documentation of the new class is good as well, but I think we can enhance it a bit more (like, we could mention which data type is expected for those args in the send() method). This is especially important because these classes under Utils are commonly shared with other DMWM projects.



class LocalTimezone(tzinfo):

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Erik, can you please remove this blank line here?


def __init__(self):
super(LocalTimezone, self).__init__()
self.ZERO = timedelta(0)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to make sure I follow this logic.
ZERO is the current timezone
STDOFFSET would be UTC timezone
DSTOFFSET would be daylight saving time, so plus or minus 1h of UTC/ZERO.

Is that correct? How about putting a very basic comment in place (or a method docstring) to help others and myself looking at it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should never use this class except for alerts and we should forget about it once we can use the builtin functions from python3. I included a link to the python docs that include this class as an example.

src/python/WMCore/Services/AlertManager/AlertManagerAPI.py Outdated Show resolved Hide resolved
src/python/WMCore/Services/AlertManager/AlertManagerAPI.py Outdated Show resolved Hide resolved
self.mgr = RequestHandler()
self.ltz = LocalTimezone()

def send(self, alertName, severity, summary, description, tag="wmcore", service="", endSecs=600, generatorURL=""):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps, we could give an even more explicit name to this method, e.g. sendAlert. What do you think?
The docstring is also missing the tag parameter.
Last comment here is actually a question, as I look at it, I wonder if we should simply make service a mandatory argument?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Service being mandatory is probably a good idea. We could go one step further and make a list of "valid" services to check against, since the service name will be used to route alerts.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of validating the service against a list of known and expected services (properly capitalized or not). The only problem will be if others, e.g. CRAB, decide to use this same module.
Ideally the list of valid service names should be configurable, but I fail to see how to enforce that ... and the only feasible logic I see is hard-wiring it in the source code and updating it in the future, if needed. What are your thoughts here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now we should probably just trust ourselves to use the correct names instead of putting in additional checks. Something to think about if we find we are unable to handle that.

# sender's hostname is added as an annotation
self.hostname = socket.gethostname()
self.mgr = RequestHandler()
self.ltz = LocalTimezone()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about creating another object attribute self.generatorURL with the alertmanager front page as default? Then on the send method, we default that to None, and if no value is provided, we fallback to the dafault value in the instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about setting generatorURL to the AlertManager page. Unless this is sent in the body of the email, it would just be a link to the AlertManager page that is clickable from the AlertManager page.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume this generatorURL is optional then? We can simply pass the default empty string from the code below and everything would work, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I tested with a blank string and things work. Alert shows up with no "Source" on the AM page.

src/python/WMCore/Services/AlertManager/AlertManagerAPI.py Outdated Show resolved Hide resolved
src/python/WMCore/Services/AlertManager/AlertManagerAPI.py Outdated Show resolved Hide resolved
@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: failed
    • 1 new failures
    • 1 tests no longer failing
  • Pylint check: succeeded
    • 19 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 2 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11484/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
    • 1 tests no longer failing
  • Pylint check: succeeded
    • 17 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11485/artifact/artifacts/PullRequestReport.html

@cmsdmwmbot
Copy link

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: succeeded
    • 16 comments to review
  • Pylint py3k check: failed
    • 0 errors and warnings that should be fixed
    • 1 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
  • Python3 compatibility checks: succeeded

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11492/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code looks good to me, Erik.

I still have a couple of questions/comments though:

  • could you please file another GH issue asking for the removal of this LocalTimezone class and the one line update in the alert manager API (post python3 migration)? Just so we don't forget it.
  • do we need Valentin's help in setting up things on the alerting system? We can also follow this up offline or on slack.
  • I think we could likely persist some of the content in your gdoc in a WMCore wiki; also providing some guidelines for using this API and properly defining alerts within WMCore. If you prefer, I can draft the first version of such wiki, please let me know.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 16, 2021

@amaltaro , I suggest that for all your monitoring needs to open CMS Monitoring Jira ticket. This way any of us (me, Ceyhun, Federica) can work on it. All of has access to configuration files for AM and can add appropriate settings for your use-cases. Once again, we only need to know which tag, emails, slack channels should be used.

@vkuznet
Copy link
Contributor

vkuznet commented Mar 22, 2021

Just for the record. We do have very handy alert CLI tool which is available on cvmfs, see /cvmfs/cms.cern.ch/cmsmon/alert. You may use it to look-up all your alerts, just run it with -help option and you can see how it works. Here are just few examples:

# Get all alerts based on multi filters. Ex service=GGUS, severity=high:
            alert -service=GGUS -severity=high

# Get all alerts of specific service/severity/tag/name. Ex GGUS/high/cms/ssb-OTG0058113:
            alert -service=GGUS
            alert -severity=high
            alert -tag=cms
            alert -name=ssb-OTG0058113

Therefore, once you confirm your configuration and settle down on tags, etc., you may start using it for your alerts.

The alerts can be viewed in plain (human friendly) and json formats. The tool can be used on any linux, and inside or outside of CERN network (for later I just committed the update and it will require a valid token).

I always prefer CLI solution over web based one and our alerting infrastructure provides that.

@amaltaro
Copy link
Contributor

Thanks Valentin. I added this information to our WMCore wiki as well: https://github.com/dmwm/WMCore/wiki/Prometheus-AlertManager-wrapper-APIs#listing-alerts-with-cli-tools

@vkuznet
Copy link
Contributor

vkuznet commented Mar 22, 2021

ok, and to clarify one common question. The AlertManager does not have API to delete alerts, instead it is achieved in different ways. If alert was fired by Prometheus and its rules come back to normal Prometheus takes care of its resolution. For manually inserted alerts (like your use case), the alert can be deleted by simply adjusting its EndsAt timestamp to current time or time in a past. Our alert tool will do that (I submitted corresponding PR cms-sw/cmsdist#6750). Or, you can use your AM wrapper to do this job. The way it is done is following. First, you fetch the alert JSON, then adjust its EndsAt value and POST it back to AM. In other words AM only has GET and POST APIs.

Also, I included in my PR a new tool called token-manager, which will allow to handle (renew) access tokens. With such token AM can be accessed from outside of CERN, i.e. you can inject, inspect or update alerts from distributed workflows. But this is a topic of separate discussion.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Email alerts do not work from k8s pods
4 participants