Remake MSTransferor alert subject; increase alert expiration time to 1h #10475

Merged
merged 1 commit into from
May 6, 2021

Conversation

amaltaro
Contributor

@amaltaro amaltaro commented May 3, 2021

Fixes #10468

Status

ready

Description

Instead of providing all the Rucio transfer IDs in the email alert subject, provide only the workflow name.
Also removed the newlines (\n) from the description; they get escaped anyway in AlertManager.
Last but not least, increase the alert expiration time to 2 days (this could be made configurable as well), so that the link from Slack to AM brings us to real content.

UPDATE: expiration time set to only 1 hour; otherwise AM keeps sending a notification via both email and Slack every 2h, until the alert is gone from the system.
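
For illustration only, the three changes described above (workflow name in the subject, no newlines in the description, 1 hour expiration) could be sketched as follows. This is not the code in this PR: `build_transfer_alert` and its parameters are hypothetical, and the label names (`service`, `severity`, `tag`) are assumptions inferred from the email subject. Only the top-level fields (`labels`, `annotations`, `startsAt`, `endsAt`) follow the Alertmanager alert format.

```python
from datetime import datetime, timedelta, timezone


def build_transfer_alert(workflow, transfer_ids, data_size_tb,
                         service="ms-transferor", expire_secs=3600, now=None):
    """Hypothetical helper: build an Alertmanager-style alert dictionary."""
    now = now or datetime.now(timezone.utc)

    # Subject carries only the workflow name, not the Rucio transfer IDs
    alert_name = "{}: input data transfer over threshold: {}".format(service, workflow)

    # Transfer IDs stay in the description, where length is not a problem
    description = "Workflow: {}\nInput data size: {:.2f} TB\nTransfer IDs: {}".format(
        workflow, data_size_tb, ", ".join(transfer_ids))
    # Collapse newlines and repeated whitespace into single spaces
    description = " ".join(description.split())

    return {
        "labels": {"alertname": alert_name, "service": service,
                   "severity": "high", "tag": "wmcore"},
        "annotations": {"summary": alert_name, "description": description},
        "startsAt": now.isoformat(),
        # The alert expires expire_secs after it starts (1 hour by default)
        "endsAt": (now + timedelta(seconds=expire_secs)).isoformat(),
    }
```

Keeping the expiration as a keyword argument is one way to make it configurable later, as suggested in the description.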

Is it backward compatible (if not, which system it affects?)

yes

Related PRs

none

External dependencies / deployment changes

none

@cmsdmwmbot

Jenkins results:

  • Unit tests: failed
    • 1 new failures
    • 1 changes in unstable tests
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11712/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented May 3, 2021

I just tested it and now emails come with a subject like:

[FIRING:1] MSTransferor: input data transfer over threshold: amaltaro_ReRecoSkim_HG2105_Val_210503_131254_4340 (ms-transferor high wmcore)

Links to the alert in AM are also still alive with the new expiration time.

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
    • 1 changes in unstable tests
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11713/artifact/artifacts/PullRequestReport.html

Contributor

@todor-ivanov todor-ivanov left a comment

Thanks @amaltaro
Looks good to me

Contributor

@goughes goughes left a comment

This looks good to me. My only comment is on the change to L749, which I tried to keep consistent with L102.
self.alertServiceName = "ms-transferor"

It's fine to change this in the alertName variable, but it has the potential to break alerting/filtering if it's ever updated on L102.
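
A minimal sketch of the consistency concern, assuming the alert name is built from the same attribute that holds the service name. The `TransferorAlerts` class and `alert_name` method are hypothetical stand-ins, not the MSTransferor code; only `alertServiceName` and its value come from the comment above.

```python
class TransferorAlerts:
    """Hypothetical stand-in for the part of the service that names alerts."""

    def __init__(self, service_name="ms-transferor"):
        # Single place where the service name is defined
        self.alertServiceName = service_name

    def alert_name(self, workflow):
        # Derive the alert name from alertServiceName instead of repeating
        # the literal, so a rename cannot leave the two out of sync
        return "{}: input data transfer over threshold: {}".format(
            self.alertServiceName, workflow)
```

With this shape, any filter that matches on the service name keeps matching the alert name after a rename.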

@amaltaro
Contributor Author

amaltaro commented May 3, 2021

Thanks Todor and Erik.
@goughes I will keep the consistency then and change it back.

@vkuznet Valentin, does AlertManager keep retriggering the notifications for existing alerts? As you can see in this PR, I set the expiration time for this alert to 2 days, but I keep getting the same email notification every ~2h. The same goes for the Slack channel: the notification is pushed into #alerts-dmwm every ~2h.

The use case for this type of alert is to warn the team and/or computing operations that there is a very large input data transfer, which people should be aware of and/or review if needed. That's the only reason I increased its expiration time.
At the same time, I don't want to get the same notification multiple times, but I do want to have the ability to see details for this alert.
Would you suggest anything different?

@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11718/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented May 4, 2021

Thanks for your review, Todor and Erik. I have updated this PR to keep the consistency mentioned by Erik. I also decreased the expiration time to avoid getting the same notification over and over. Last but not least, I have just run another test to make sure everything is sound ;)

Could one of you have another look into it please? Thanks @goughes @todor-ivanov

@goughes
Contributor

goughes commented May 4, 2021

LGTM, although the commit message and PR title are now misleading since you changed the time.

@amaltaro amaltaro changed the title from "Remake MSTransferor alert subject; increase alert expiration to 2 days" to "Remake MSTransferor alert subject; increase alert expiration time to 1h" May 6, 2021
make it a warning log level

switch the subject back to the service name; expiry in an hour
@cmsdmwmbot

Jenkins results:

  • Unit tests: succeeded
  • Pylint check: failed
    • 1 warnings and errors that must be fixed
    • 5 warnings
    • 31 comments to review
  • Pylint py3k check: succeeded
    • 0 errors and warnings that should be fixed
    • 0 warnings
    • 0 comments to review
  • Pycodestyle check: succeeded
    • 19 comments to review
  • Python3 compatibility checks: succeeded
    • there are suggested fixes for newer python3 idioms

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/DMWM-WMCore-PR-test/11729/artifact/artifacts/PullRequestReport.html

@amaltaro
Contributor Author

amaltaro commented May 6, 2021

Thanks Erik! I have updated the PR and the commit message to reflect what's provided in this PR.

@amaltaro amaltaro merged commit fcca972 into dmwm:master May 6, 2021

Successfully merging this pull request may close these issues.

Email alerts coming without subject (and alert name too long)
4 participants