MicroServices throw UnboundLocalError when sending email alerts from kubernetes/docker containers #10234
For reference, a similar issue has been reported here: which might be useful for debugging this problem.
I have temporarily disabled these alerts in the MSTransferor pod - in the production kubernetes cluster - by applying this change:
Of course, if that service/pod gets restarted, it will bring back the default MSTransferor configuration, with email notifications enabled again. Getting into the issue itself, I did some googling and tried to create an SMTP connection following a few suggestions, like:
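(A representative sketch of that kind of attempt; the hosts and ports below are illustrative assumptions, not the exact suggestions tried:)

```python
import smtplib

# Illustrative hosts/ports only; probing for a reachable SMTP endpoint
for host, port in [("localhost", 25), ("127.0.0.1", 25), ("smtp.cern.ch", 587)]:
    try:
        server = smtplib.SMTP(host, port, timeout=10)
        server.ehlo()
        print("Connected to %s:%s" % (host, port))
        server.quit()
        break
    except Exception as exc:
        print("Failed to connect to %s:%s (%s)" % (host, port, exc))
```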
but none of them worked. From what I understood, we would need our application to connect to the host network namespace (not the container one), and I have tried that as well, without success. @goughes Erik, would you have any suggestions here? Or perhaps, do you think you could work on this issue once MSOutput is deployed in production k8s?
Alan, did you verify from the pod itself that you can send mail from the CLI, e.g.
Good point, I forgot to mention it in my previous reply.
The problem with the WMCore source code has been resolved in #10244. However, we still have to recover the ability to send email notifications from the kubernetes infrastructure. I talked to Eric a couple of days ago, and they use a sendmail-based setup.
Did a bit of research to determine if there was a better way to get email alerts for this. Most examples require modifying the sendmail/postfix config on the host or installing your own SMTP server. The simplest method that is in our control is installing sendmail, similar to what Eric has done. I have tested this method via a reqmgr2ms container on my VM and it works fine. Don't these pod logs get scraped and sent off somewhere for parsing? MONIT? I'm not familiar with that, but I was thinking of a scenario where that system would be responsible for the alerting instead of WMCore and sendmail in a pod. Maybe @vkuznet can comment.
Once again (I think I commented on this issue in different places): please use amtool (it is available on CVMFS at /cvmfs/cms.cern.ch/cmsmon/amtool, or you can grab the executable directly from github: https://github.com/prometheus/alertmanager/releases/tag/v0.21.0). Then use it within a pod. It will send alerts to our AlertManager (meaning they will be visible in the CMS Monitoring infrastructure), and we can configure a channel for you to pass these alerts to either slack or email. amtool is a tiny static executable and does not require any configuration or pod/service tweaking. If you need an example of how to use it, please refer to our crons, e.g. https://github.com/dmwm/CMSKubernetes/blob/master/docker/sqoop/run.sh#L24
I'm sure there are many pros/cons to both approaches, but given that I don't know either of them, I'm afraid I have no preference at the moment. I'd say the most important thing is to pick the most robust and maintainable tool. Just some random thoughts; Erik, it's up to you ;)
We're kind of tied to CMS by definition, and I think CMS Monitoring has sufficient long-term support that it's not a concern. I think that we should use amtool.
Here are the pros/cons of each approach (in my view):

using sendmail tool(s):

using amtool:
Thanks @vkuznet
can easily turn out to be a positive feature in the long term.
Anything can be a point of failure: k8s, AlertManager (AM), SMTP, etc. It depends on what is critical for you and how you treat the infrastructure. The AM runs on k8s, therefore k8s will ensure it is restarted if necessary. I'm not sure what you are trying to solve here. If you care about the stability of the MS itself, again, k8s ensures it will be restarted in case of failure. If you need a notification about it, either sending email or amtool will do the job; if you save logs, you can manually check them if something happens. How critical it is for you is a different story. If you want to be paranoid, you may use both.
I think we could have it in the #alerts-dmwm channel as well.
Erik, for alert routing you should decide which labels to use in your alert. The labels may include tag, severity, service, etc. Please have a look at how alerts are defined; e.g., for reqmgr2 we have these rules. In the rules you'll see different labels. Therefore, when you use amtool you can define any set of labels; e.g., see how we define different labels in the CMSSpark crons (like severity, tag, etc.): https://github.com/dmwm/CMSSpark/blob/master/bin/cron4aggregation#L33 So, in AlertManager we'll use labels to identify your alerts. For example, if you use tag=bla, then you'll need to tell us this tag. And you should tell us to which channels you want the alerts propagated based on your tags. The channel can be email or slack. Therefore you can tell us: route alerts with
Thanks Valentin. I see three cases where the reqmgr2 microservices would need to send a mail. Does it make sense to make a single tag called microservices?
Then I can provide the additional error message contents through the alert annotations. Do I get any benefit from separating the three cases? For example:
tag: microservices
tag: microservices
tag: microservices
Erik, I'll go ahead and configure the necessary pieces in AlertManager with your information, then I'll ask you to test it from your end.
I put new changes in place for our AM instance. Therefore, you may now test your alerts using amtool. I suggest that you use the following script (with whatever adjustments you may want to have) for testing purposes, as I usually do when testing alerts:
Thanks for the script, Valentin. I just fired two alerts, one medium and one high, and can see them here: https://cms-monitoring.cern.ch/alertmanager/#/alerts. I'll work on adding an amtool wrapper to WMCore to replace the current email functionality.
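For illustration, a minimal sketch of what such a wrapper could look like, assuming the amtool binary from CVMFS (mentioned earlier in this thread) and its `alert add` subcommand; the function name and default labels are hypothetical:

```python
import subprocess

# Path mentioned earlier in this thread; adjust for your deployment
AMTOOL = "/cvmfs/cms.cern.ch/cmsmon/amtool"

def sendAlert(name, severity, summary, amUrl, tag="wmcore", service="ms-transferor"):
    """Fire an alert through amtool instead of SMTP email.
    Labels (tag, severity, service) follow the routing discussion above."""
    cmd = [
        AMTOOL, "alert", "add", name,
        "severity=%s" % severity,
        "tag=%s" % tag,
        "service=%s" % service,
        "--annotation=summary=%s" % summary,
        "--alertmanager.url=%s" % amUrl,
    ]
    return subprocess.call(cmd)

# e.g. sendAlert("ms-transferor-failure", "high", "failed to send notification",
#                "https://cms-monitoring.cern.ch/alertmanager")
```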
Erik, please confirm that you received emails for your test alerts and that you saw them in the Slack channel. As I wrote, if everything is fine, then I'll update our HA clusters with this configuration, and you can use either of them. My suggestion is that you keep the URL configurable in your stack, such that we can easily change it if we move things around.
Trying to compile many comments/questions in this reply.
Valentin, does it mean we need to create a new "*rules" template under that repository for every type of alert our service needs to generate? Valentin/Erik, does it make sense to group all our WMCore alerts under the same tag?
How about:
I must be doing something wrong over there, because I cannot filter any alerts with those strings. Is there any specific syntax for that (env=production does filter stuff, for instance)? Valentin, can you please clarify how the alert expiration flag works? If there is an alert that has not yet expired and our service generates another one, does it get fired? What happens when it expires? Does it get marked as RESOLVED in slack, or what? Anything important that we need to know about this property?
Alan, let's walk through all your questions:
which assigns tag cmsweb (the alert comes from cmsweb, service reqmgr2, and it belongs to kind dmwm). For alerts you generate within your code you may have a completely different set of values for those; it is not required to have identical values in the rules and in your own alerts. We use
Please note that you can use
And you can query your alerts using either amtool, as shown above, or the web interface; e.g., here is a query for a specific receiver:

and here is one for different filters:
You can go to our AM web page and play with filters to understand their behavior using different alerts.
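For scripted checks, the same kind of filtering can be done against the alerts endpoint; a sketch assuming the standard AlertManager v1 API query parameters (receiver and filter) and the instance URL from this thread:

```python
import json
import urllib
import urllib2

AM_HOST = "https://cms-monitoring.cern.ch/alertmanager"

def queryAlerts(receiver=None, matcher=None):
    """Fetch active alerts, optionally narrowed by receiver or by a
    label matcher such as tag="wmcore" (v1 API parameters assumed)."""
    params = {}
    if receiver:
        params["receiver"] = receiver
    if matcher:
        params["filter"] = "{%s}" % matcher
    url = "%s/api/v1/alerts?%s" % (AM_HOST, urllib.urlencode(params))
    return json.load(urllib2.urlopen(url))

print(queryAlerts(matcher='tag="wmcore"'))
```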
Hi @vkuznet, could you update the email associated with the two AM rules you created to point to my CERN mail (egough AT cern.ch) so I can test my wrapper without mailing the larger group?
Erik, instead of changing the existing dmwm channels, I created a new one for you. I can create as many different (individual) channels as necessary, including slack, but you should explicitly tell me what you want. So far I have only added an email channel.
The AM API endpoint is /api/v1/alerts, and you can post the following JSON to it:
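A sketch of such a payload and of posting it from python2; the label and annotation values here are illustrative, not the exact ones from the original comment:

```python
import json
import urllib2

# Illustrative alert payload for the AlertManager v1 API:
# a JSON array of alerts, each with labels, annotations, and timestamps
alerts = [{
    "labels": {"alertname": "ms-transferor-test", "tag": "wmcore",
               "severity": "medium", "service": "ms-transferor"},
    "annotations": {"summary": "test alert from a pod"},
    "startsAt": "2021-01-25T10:00:00Z",
    "endsAt": "2021-01-25T11:00:00Z",
}]

req = urllib2.Request("https://cms-monitoring.cern.ch/alertmanager/api/v1/alerts",
                      data=json.dumps(alerts),
                      headers={"Content-Type": "application/json"})
print(urllib2.urlopen(req).read())
```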
And I don't really see a problem with making a custom function for py2 to create RFC3339 timestamps.
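For instance, a minimal py2-compatible sketch of such a helper (the function name is hypothetical), producing the startsAt/endsAt format used above:

```python
from datetime import datetime, timedelta

def rfc3339(secondsFromNow=0):
    """UTC timestamp in RFC3339 format, compatible with python2."""
    ts = datetime.utcnow() + timedelta(seconds=secondsFromNow)
    return ts.strftime("%Y-%m-%dT%H:%M:%S") + "Z"

# e.g. "startsAt": rfc3339(), "endsAt": rfc3339(3600) for a one-hour alert
```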
Impact of the bug
MSTransferor (but it will apply to any system trying to send alerts via SMTP)
Describe the bug
This issue is meant to report/address two issues:
The second issue isn't really a problem at the moment, but it gives MSTransferor a "silent" behaviour, in the sense that: when the service fails to send that alert notification, the workflow in question is skipped for that cycle - even though the Rucio rule creation has already happened; then, in the next cycle, MSTransferor finds an existing rule (likely in INJECTED/REPLICATING status), so the service assumes the data is already available and simply moves that workflow to staging, without persisting any rule id to be monitored by MSMonitor. Thus, MSMonitor will bypass this workflow right away because there are no rules to be monitored.

How to reproduce it
Trigger an email notification in a pod.
Expected behavior
Microservices - or any other WMCore service - should be able to send email notifications.
Regarding the MSTransferor behaviour, I think we can log the exception, make a record with the content that was supposed to be sent via email (I think MSTransferor already does that), and move on with the workflow processing as if there were no problems sending the email.
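A rough sketch of that behaviour (function and names are illustrative, not actual MSTransferor code):

```python
import logging

logger = logging.getLogger(__name__)

def notifyAndContinue(sendFunc, subject, message):
    """Attempt the notification, but never let a failure to send it
    interrupt the workflow processing."""
    try:
        sendFunc(subject, message)
    except Exception:
        logger.exception("Failed to send alert notification")
        logger.info("Alert content that was meant to be sent: %s", message)
    # the caller proceeds with the workflow either way
```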
Additional context and error message
Some sections of the MSTransferor log under /cephfs/product/dmwm-logs:
amaltaro@vocms0750:/cephfs/product/dmwm-logs $ less ms-transferor-20210121-ms-transferor-7744c99cd8-5p6m8.log
and in the subsequent MSTransferor cycle, this is what happened to that workflow: