Thorough check of services #43
I added some functions I'm going to use for our monitoring setup. I shared them in the gist below: https://gist.github.com/owk/bab9d45cc60eba6295fe4a3cb8176c6f It's not the cleanest code, but it works :)
Thanks, that looks interesting. One thing I noticed is the logic for selecting the status. I don't work with swarm much, so I may be missing something, but if the service is not running we are in a critical state. The way I see it, OK means things are working, WARNING means things are working but with less performance margin than desired, and CRITICAL means things are not working. If a service is 'preparing', wouldn't it be non-functional? Normally services start quickly, but if one took a very long time it would clearly be critical until it was running.
Good point, maybe I ought to set 'preparing' as a CRITICAL status. I expect one would only see this status if the check interval is quite low, or if there is indeed a problem. Thanks for that!
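A rough sketch of the mapping being discussed, using the standard Nagios exit codes; the state names are swarm task states, and treating 'preparing' as CRITICAL follows the reasoning above:

```bash
# Map a swarm task state to a Nagios exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
state_to_exit_code() {
  case "$1" in
    running)                    echo 0 ;;  # OK: the task is actually serving
    preparing|starting|pending) echo 2 ;;  # CRITICAL: not functional yet
    failed|rejected|shutdown)   echo 2 ;;  # CRITICAL: not working at all
    *)                          echo 3 ;;  # UNKNOWN: anything unexpected
  esac
}
```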
If I understand the issue right, there's also the possibility to use

FWIW, after I'm done cleaning up these changes, I'm going to open another issue to review all the changes I've made, because they can be useful in general for Swarm setups.
From what I can see in the script, you're only determining service health by the HTTP return code, which in fact only signals swarm health (and even that is a rather limited swarm health check). Consider this example with a failed service:
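A minimal way to reproduce this kind of failure (the service name and node constraint here are made up for illustration):

```bash
# Create a service with a constraint no node can satisfy, so its task never
# reaches the 'running' state and the service stays down.
docker service create --detach --name web \
  --constraint 'node.labels.tier==does-not-exist' nginx:alpine

# The service object exists, but its only task stays pending/rejected.
docker service ps web
```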
(The container failed to start due to an unsatisfied node constraint, but it could have been anything.) You're doing a simple HTTP API check, similar to this:
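Something along these lines (a sketch; the service name and API version prefix are assumptions):

```bash
# Ask the engine API for the service object and look only at the HTTP status code.
# This returns 200 as long as the service definition exists in the swarm,
# regardless of whether any of its tasks are actually running.
curl -s -o /dev/null -w '%{http_code}' \
  --unix-socket /var/run/docker.sock \
  http://localhost/v1.30/services/web
```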
A 200 here doesn't mean the service is green (it is in fact completely down); it only means that swarm was able to serve your request. The service check is somewhat misleading the way it is now. Putting it in a production environment as the primary health check is obviously the technician's fault and not yours, but this will inevitably happen (as it did on the system I'm investigating right now). Perhaps it would be best to remove the service check option until it's fixed, to help people avoid getting hurt?
@nksupport So sorry to hear this slowed your investigation down; that is exactly what I had hoped it would not do. Indeed, I see the issue you mentioned, and it's not good. This was written for a previous job, so I don't actually use this code much anymore, but given the seriousness of the issue I will see if I can get a test environment up to test fixes. Thank you for such a detailed explanation of the problem, it really helps.
@nksupport, I have pushed an updated swarm check. It is not fully tested, but I figured I should get it to you quickly given the impact. Below is a direct link to the updated swarm check; you can run it by hand like this:
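A sketch of a manual run; the URL is the swarm_fix branch file linked in the follow-up below, and the exact command-line options are assumptions, so check the script's --help output first:

```bash
# Fetch the updated check from the swarm_fix branch and inspect its options.
curl -sLO https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py
python3 check_swarm.py --help
```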
@nksupport, update: that check had another bug, which I found when I added unit tests. Here is the updated version, which I intend to ship in the next few days. If you have a moment, I would love to know if it looks good to you. https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py
I believe version 2.2.1 should resolve this issue. Let me know if you see otherwise.
Hi @timdaman
I was wondering if it would be possible to expand your service checks in check_swarm. The script only checks whether the service is available in the swarm, but it doesn't take into account whether any containers are actually scheduled for the service. I took a quick look at the API reference, and I think it would be possible to use the tasks resource for this. Maybe add a 'tasks' argument where you can filter on the service or the container name? Something like the sketch below.
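For illustration, a query against the engine API along these lines; the service name is hypothetical and the API version prefix may differ on your daemon:

```bash
# List the tasks belonging to one service and look at their actual states.
# An "available" service with zero running tasks shows up clearly here.
curl -sG --unix-socket /var/run/docker.sock \
  --data-urlencode 'filters={"service":["web"]}' \
  http://localhost/v1.30/tasks | jq '.[] | {DesiredState, State: .Status.State}'
```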
Thanks for the great scripts!