
Thorough check of services #43

Open
owk opened this issue Jun 5, 2018 · 9 comments

Comments

@owk

owk commented Jun 5, 2018

Hi @timdaman

I was wondering if it would be possible to expand your service checks in check_swarm. The script only checks whether the service is available in the swarm, but it doesn't take into account whether any containers are actually scheduled for the service. I hastily glanced over the API reference, and I think it would be possible to use the tasks resource for this. Maybe add a 'tasks' argument where you can filter on the service or the container name?
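
For illustration, here is a rough sketch (not the actual check_swarm code, and the 'tasks' argument name above is only a suggestion) of what querying the tasks resource over the default unix socket could look like, filtered by service name:

import http.client
import json
import socket
import sys
import urllib.parse

DOCKER_SOCKET = '/var/run/docker.sock'

class UnixHTTPConnection(http.client.HTTPConnection):
    """HTTPConnection that talks to the Docker engine over its unix socket."""
    def __init__(self, socket_path):
        super().__init__('localhost')
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock

def tasks_for_service(service_name):
    # GET /tasks accepts a JSON 'filters' parameter, e.g. {"service": ["name"]}
    filters = urllib.parse.quote(json.dumps({'service': [service_name]}))
    conn = UnixHTTPConnection(DOCKER_SOCKET)
    conn.request('GET', '/tasks?filters=' + filters)
    return json.loads(conn.getresponse().read())

if __name__ == '__main__':
    tasks = tasks_for_service(sys.argv[1])
    running = [t for t in tasks if t.get('Status', {}).get('State') == 'running']
    print('{}/{} tasks running'.format(len(running), len(tasks)))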

Thanks for the great scripts!

@owk
Author

owk commented Jun 6, 2018

I added some functions I'm gonna use for our monitoring setup. I shared them in the gist below:

https://gist.github.com/owk/bab9d45cc60eba6295fe4a3cb8176c6f

It's not the cleanest code, but it works :)

@timdaman
Owner

Thanks, that looks interesting. One thing I noticed is the logic for selecting the status. I don't work with swarm much, so I may be missing something, but the way I see it, if the service is not running we are in a critical state: OK means things are working, WARNING means things are working but with less performance margin than desired, and CRITICAL means things are not working. If a service is 'preparing', would it not be non-functional? Normally services start quickly, but if one took a very long time it would clearly be critical until it was running.
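
In code terms, a minimal sketch of that reasoning (using the standard Nagios exit codes, and assuming we can count the running tasks and the desired replica count):

OK, WARNING, CRITICAL = 0, 1, 2

def service_status(running_tasks, desired_replicas):
    # CRITICAL: the service is not working at all
    if running_tasks == 0:
        return CRITICAL
    # WARNING: working, but with less performance margin than desired
    if running_tasks < desired_replicas:
        return WARNING
    # OK: everything that should be running is running
    return OK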

@owk
Author

owk commented Jun 10, 2018

Good point, maybe I ought to set 'preparing' as a CRITICAL status. I expect one would only see this status if the check interval is quite low, or if there is indeed a problem. Thanks for that!
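
A sketch of that mapping (a hypothetical table, not the gist's code; anything that is not actually running, including 'preparing', is treated as CRITICAL, and unknown states fail loudly):

OK, WARNING, CRITICAL = 0, 1, 2

TASK_STATE_STATUS = {
    'running': OK,
    'preparing': CRITICAL,
    'starting': CRITICAL,
    'failed': CRITICAL,
    'rejected': CRITICAL,
}

def status_for(task_state):
    # Default to CRITICAL so states not listed above are never silently OK.
    return TASK_STATE_STATUS.get(task_state, CRITICAL)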

@jkozera

jkozera commented Oct 2, 2018

If I understand the issue correctly, there's also the possibility of using the --running check from check_swarm.py, but only after a significant change of logic to allow old instances of the same service in a not-running state - see operasoftware@2b1fc15 for an example implementation.
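
To illustrate the idea (a sketch only, not the code from the linked commit): leftover tasks from previous deployments of the same service keep a DesiredState of 'shutdown', so the check would only look at tasks the orchestrator still wants running:

def current_tasks(tasks):
    # Ignore old task instances that the swarm no longer wants running.
    return [t for t in tasks if t.get('DesiredState') == 'running']

def running_count(tasks):
    return sum(1 for t in current_tasks(tasks)
               if t.get('Status', {}).get('State') == 'running')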

FWIW, after I'm done cleaning up these changes, I'm going to open another issue to review all the changes I've made, because they can be useful in general for Swarm setups.

@nksupport

nksupport commented Dec 25, 2019

From what I can see in the script, you're only determining service health by the HTTP return code, which in fact only signals swarm health (and even that is a rather limited swarm health check). Consider this example with a failed service:

$ docker service ls | grep testservice

wapxwn0h5i2l        testservice                         replicated          0/1                 testimage:latest

(the container failed to start due to an unsatisfied node constraint, but that could be anything).

You're doing a simple HTTP API check, similar to this:

$ curl -v --unix-socket /var/run/docker.sock http://localhost/services/testservice
> GET /services/testservice HTTP/1.1
< HTTP/1.1 200 OK
(normal service details follow)

A 200 here doesn't mean the service is green - it is in fact completely down - it only means that swarm was able to serve your request.
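
For contrast, a sketch of what a "green" check would need to do (shown with the docker Python SDK for brevity rather than the raw API the script uses; it assumes a replicated service): compare the number of actually running tasks against the desired replica count instead of trusting the HTTP status of /services/<name>:

import sys
import docker

client = docker.from_env()
service = client.services.get(sys.argv[1])  # e.g. 'testservice'

# Desired replica count from the service spec (replicated mode assumed).
desired = service.attrs['Spec']['Mode'].get('Replicated', {}).get('Replicas', 0)

# Count tasks that are actually running and still meant to be running.
running = sum(1 for t in service.tasks()
              if t['DesiredState'] == 'running'
              and t['Status']['State'] == 'running')

if running == 0:
    print('CRITICAL: 0/{} replicas running'.format(desired))
    sys.exit(2)
if running < desired:
    print('WARNING: {}/{} replicas running'.format(running, desired))
    sys.exit(1)
print('OK: {}/{} replicas running'.format(running, desired))
sys.exit(0)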

The service check is somewhat misleading the way it is now; putting it into a production environment as the primary health check is obviously the technician's fault and not yours, but it will inevitably happen (as it did on the system I'm investigating right now). Perhaps it would be best to remove the service check option until it's fixed, to help people avoid getting hurt?

@timdaman
Owner

@nksupport So sorry to hear this slowed your investigation down; that is exactly what I had hoped it would never do. Indeed, I see the issue you mentioned - not good.

This was written for a previous job, so I don't actually use this code much anymore, but given the seriousness of the issue I will see if I can get a test environment up to test fixes.

Thank you for such a detailed explanation of the problem; it really helps.

timdaman added a commit that referenced this issue Dec 28, 2019
Adding python 3.8 to tests
@timdaman
Owner

@nksupport, I have pushed an updated swarm check. It is not fully tested, but I figured I should get it to you quickly given the impact.
https://github.com/timdaman/check_docker/tree/swarm_fix

Below is a direct link to the updated swarm check; you can run it by hand like this: python3 check_swarm.py <Your options>

https://github.com/timdaman/check_docker/blob/edcacedef5ae8e6354962d151dce1cbe50483240/check_docker/check_swarm.py

@timdaman
Owner

@nksupport, update: that check had another bug, which I found when I added unit tests. Here is the updated version, which I intend to ship in the next few days. If you have a moment, I would love to know whether it looks good to you.

https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py

@timdaman
Owner

I believe version 2.2.1 should resolve this issue. Let me know if you see otherwise.
