-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remake MSTransferor alert subject; increase alert expiration time to 1h #10475
Conversation
Jenkins results:
|
I just tested it and now emails come with a subject like:
links to the alert in AM are also still alive, with the new expiration time. |
Jenkins results:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @amaltaro
Looks good to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me. My only comment is on the change to L749, which I tried to keep consistent with L102.
self.alertServiceName = "ms-transferor"
It's fine to change this in the alertName
variable, but has the potential to break alerting/filtering if it's ever updated on L102.
Thanks Todor and Erik. @vkuznet Valentin, does AlertManager keep retriggering the notifications for existent alerts? As you can see in this PR, I set the expiration time for this alert to 2 days, but I keep getting an email notification - the same one - every ~2h. Same goes to the Slack channel, notification is pushed into #alerts-dmwm every ~2h. The use case for this type of alert is to warn the team and/or computing operations that there is a very large input data transfer, which people should be aware of and/or review if needed. That's the only reason I increased its expiration time. |
Jenkins results:
|
Thanks for your review, Todor and Erik. I have updated this PR to keep the consistency mentioned by Erik. I also decreased the expiration time to avoid getting the same notification over and over. Last but not least, I have just ran another test to make sure everything is sound ;) Could one of you have another look into it please? Thanks @goughes @todor-ivanov |
LGTM, although the commit message and PR title are now misleading since you changed the time. |
make it a warning log level switch the subject back to the service name; expiry in an hour
Jenkins results:
|
Thanks Erik! I have updated the PR and the commit message to reflect what's provided in this PR. |
Fixes #10468
Status
ready
Description
Instead of providing ALL the Rucio transfer ID in the email alert subject, provide only the workflow name.
Also removed the new lines
\n
from the description, they get escaped anyways in AlertManager.Last but not least, increase their expiration time to 2 days (could be something configurable as well). such that the link from Slack to AM can bring us to real content.
UPDATE: expiration time set to only 1 hour, otherwise AM keeps sending a notification both via email and slack every 2h, until the alert is gone from the system.
Is it backward compatible (if not, which system it affects?)
yes
Related PRs
none
External dependencies / deployment changes
none