Thorough check of services #43
I added some functions I'm going to use for our monitoring setup. I shared them in the gist below: https://gist.github.com/owk/bab9d45cc60eba6295fe4a3cb8176c6f It's not the cleanest code, but it works :)
Thanks, that looks interesting. One thing I noticed is the logic for selecting the status. I don't work with swarm much, so I may be missing something, but if the service is not running we are in a critical state. The way I see it, OK means things are working, WARNING means things are working but with less performance margin than desired, and CRITICAL means things are not working. If a service is 'preparing', wouldn't it be non-functional? Normally services start quickly, but if one took a very long time it would clearly be critical until it was running.
Good point, maybe I ought to set 'preparing' as a CRITICAL status. I expect one would only see this status if the check interval is quite low, or if there is indeed a problem. Thanks for that!
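A rough sketch of the mapping being discussed, using the standard Nagios exit codes; the state names are swarm task states, and treating 'preparing' as CRITICAL follows the reasoning above:

```bash
# Map a swarm task state to a Nagios exit code (0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN).
state_to_exit_code() {
  case "$1" in
    running)                    echo 0 ;;  # OK: the task is actually serving
    preparing|starting|pending) echo 2 ;;  # CRITICAL: not functional yet
    failed|rejected|shutdown)   echo 2 ;;  # CRITICAL: not working at all
    *)                          echo 3 ;;  # UNKNOWN: anything unexpected
  esac
}
```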
If I understand the issue right, there's also the possibility to use

FWIW, after I'm done cleaning up these changes, I'm going to open another issue to review all the changes I've made, because they can be useful in general for Swarm setups.
From what I can see in the script, you're only determining service health by the HTTP return code, which in fact only signals swarm health (and even that is a rather limited swarm health check). Consider this example with a failed service:
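A minimal way to reproduce this kind of failure (the service name and node constraint here are made up for illustration):

```bash
# Create a service with a constraint no node can satisfy, so its task never
# reaches the 'running' state and the service stays down.
docker service create --detach --name web \
  --constraint 'node.labels.tier==does-not-exist' nginx:alpine

# The service object exists, but its only task stays pending/rejected.
docker service ps web
```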
(The container failed to start due to an unsatisfied node constraint, but it could have been anything.) You're doing a simple HTTP API check, similar to this:
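Something along these lines (a sketch; the service name and API version prefix are assumptions):

```bash
# Ask the engine API for the service object and look only at the HTTP status code.
# This returns 200 as long as the service definition exists in the swarm,
# regardless of whether any of its tasks are actually running.
curl -s -o /dev/null -w '%{http_code}' \
  --unix-socket /var/run/docker.sock \
  http://localhost/v1.30/services/web
```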
A 200 here doesn't mean the service is green (it is in fact completely down); it only means that swarm was able to serve your request. The service check is somewhat misleading the way it is now. Putting it in a production environment as the primary health check is obviously the technician's fault and not yours, but this will inevitably happen (as it did on the system I'm investigating right now). Perhaps it would be best to remove the service check option until it's fixed, to help people avoid getting hurt?
@nksupport So sorry to hear this slowed your investigation down; that is exactly what I had hoped it would not do. Indeed, I see the issue you mentioned, and it's not good. This was written for a previous job, so I don't actually use this code much anymore, but given the seriousness of the issue I will see if I can get a test environment up to test fixes. Thank you for such a detailed explanation of the problem, it really helps.
@nksupport, I have pushed an updated swarm check. It is not fully tested, but I figured I should get it to you quickly given the impact. Below is a direct link to the updated swarm check; you can run it by hand like this:
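A sketch of a manual run; the URL is the swarm_fix branch file linked in the follow-up below, and the exact command-line options are assumptions, so check the script's --help output first:

```bash
# Fetch the updated check from the swarm_fix branch and inspect its options.
curl -sLO https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py
python3 check_swarm.py --help
```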
@nksupport, update: that check had another bug, which I found when I added unit tests. Here is the updated version, which I intend to ship in the next few days. If you have a moment, I would love to know if it looks good to you. https://github.com/timdaman/check_docker/raw/swarm_fix/check_docker/check_swarm.py
I believe version 2.2.1 should resolve this issue. Let me know if you see otherwise.
Hi @timdaman
I was wondering if it would be possible to expand your service checks in check_swarm. The script only checks whether the service is available in the swarm, but it doesn't take into account whether any containers are actually scheduled for the service. I took a quick look at the API reference, and I think it would be possible to use the tasks resource for this. Maybe add a 'tasks' argument where you can filter on the service or the container name? Something like the sketch below.
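For illustration, a query against the engine API along these lines; the service name is hypothetical and the API version prefix may differ on your daemon:

```bash
# List the tasks belonging to one service and look at their actual states.
# An "available" service with zero running tasks shows up clearly here.
curl -sG --unix-socket /var/run/docker.sock \
  --data-urlencode 'filters={"service":["web"]}' \
  http://localhost/v1.30/tasks | jq '.[] | {DesiredState, State: .Status.State}'
```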
Thanks for the great scripts!