liveness and readiness checks for kubernetes #390

Closed

nwest1 opened this issue Sep 11, 2018 · 6 comments

nwest1 (Contributor) commented Sep 11, 2018

FEATURE REQUEST
Hello!

I think this is a should-have: we can use the /version endpoints for liveness probes, but I'd like to discuss what makes sense for readiness (if anything).

The only adapter that lacks this is mqtt at this point.

drasko (Contributor) commented Sep 11, 2018

@nwest1 indeed. We had /status before, but removed it when we added /version. It is unclear exactly what this healthcheck endpoint should contain: probably uptime, maybe even some underlying machine/arch/OS info...
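Purely for illustration, a payload along those lines could look something like the sketch below; nothing here is agreed, and the package, type, and field names are made up:

```go
// Sketch only: a possible healthcheck payload exposing uptime and
// machine/arch/OS info, as mentioned above. Names are hypothetical.
package health

import (
	"runtime"
	"time"
)

// Status is a hypothetical healthcheck response body.
type Status struct {
	Service string `json:"service"`
	Version string `json:"version"`
	Uptime  string `json:"uptime"`
	OS      string `json:"os"`
	Arch    string `json:"arch"`
}

// NewStatus fills the payload from the process start time and runtime info.
func NewStatus(service, version string, start time.Time) Status {
	return Status{
		Service: service,
		Version: version,
		Uptime:  time.Since(start).String(),
		OS:      runtime.GOOS,
		Arch:    runtime.GOARCH,
	}
}
```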

Regarding MQTT: it is our only NodeJS microservice, as we could not find an adequate Go candidate for an MQTT broker. And it is indeed missing the /version endpoint.

We are open to all proposals regarding this healthcheck endpoint. Also, if you have some code ready for the JS microservice please send a PR; otherwise, someone from the Mainflux team will take a look at this early next week.

chombium (Collaborator) commented

The liveness and readiness probes are simple HTTP GET or TCP endpoints: any status code greater than or equal to 200 and less than 400 indicates success, and every other status code is a failure. The body of the response is optional; it is normally used to give the user more context about the service and describe what went wrong. More details can be found in the Kubernetes documentation on probes. Liveness means that the service is running properly (the probe should fail in case of a fatal error, for example) and readiness means that it can accept traffic (it is ready to process requests).

A simple example in our case would be the users service. When it is started it can return liveness 200, but it should not return readiness 200 until the connection to the DB is established. If the service ends up in an unrecoverable state, the liveness probe should fail. Another example: most of our services depend on NATS. If NATS is not there, the services end up in an unrecoverable state and cannot work. In such cases the service can try to reconnect a few times and, if that fails, make the liveness probe fail so that Kubernetes restarts the service. At the moment the services try to connect to NATS at startup and exit if they cannot. I think we don't cover the case where NATS is running when the service starts but shuts down while the service is running.
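To make the distinction concrete, here is a minimal sketch of separate liveness and readiness handlers using plain net/http; the /health/live and /health/ready paths, the port, and the readiness flag are hypothetical, not existing Mainflux endpoints:

```go
// Sketch only: separate liveness and readiness handlers. Paths, port,
// and the readiness condition are placeholders, not Mainflux APIs.
package main

import (
	"net/http"
	"sync/atomic"
)

func main() {
	var ready atomic.Value
	ready.Store(false)

	// Liveness: the process is up and able to serve HTTP.
	http.HandleFunc("/health/live", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: OK only once dependencies (e.g. the DB) are reachable.
	http.HandleFunc("/health/ready", func(w http.ResponseWriter, _ *http.Request) {
		if ok, _ := ready.Load().(bool); ok {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	go func() {
		// Placeholder for real dependency setup (e.g. connecting to the DB).
		ready.Store(true)
	}()

	http.ListenAndServe(":8180", nil)
}
```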

As a beginning, I suggest that we define in detail the dependencies (infrastructure and other Mainflux services) of each service and the ways we can check whether those dependencies are healthy. Most of this is already known (docker-compose, k8s service configs); we only need to think about how to check the health of the dependencies and combine those checks with the service's internal state to set the proper liveness and readiness status.
For the implementation we could have a common package for checking the infrastructure components (NATS, NGINX, Redis, ...), and each service would build its own checks on top of it (see the sketch below).
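As a rough sketch of what such a common package could look like (the package and identifier names are made up, and the go-nats import is only an assumption based on the client our services already use):

```go
// Sketch of a shared health-check package; the package and identifier
// names here are hypothetical, not something that exists in Mainflux.
package health

import (
	"fmt"

	"github.com/nats-io/go-nats"
)

// Checker reports whether one dependency (NATS, DB, Redis, ...) is healthy.
type Checker interface {
	Name() string
	Check() error
}

// NATSChecker wraps the NATS connection a service already holds.
type NATSChecker struct {
	Conn *nats.Conn
}

func (c NATSChecker) Name() string { return "nats" }

func (c NATSChecker) Check() error {
	if c.Conn == nil || !c.Conn.IsConnected() {
		return fmt.Errorf("nats: not connected")
	}
	return nil
}

// Ready runs all checks and returns the first failure, if any; a readiness
// handler can map a nil result to HTTP 200 and an error to 503.
func Ready(checks ...Checker) error {
	for _, c := range checks {
		if err := c.Check(); err != nil {
			return fmt.Errorf("%s check failed: %v", c.Name(), err)
		}
	}
	return nil
}
```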

@drasko I'll pick up this issue. It would be nice if someone else joined me, at least in defining and structuring the solution.

chombium self-assigned this Sep 12, 2018
drasko (Contributor) commented Sep 12, 2018

@chombium let's do this for a start:

  • Add k8s config that uses the existing /version endpoint for the healthcheck
  • Add the missing /version endpoint in mqtt.js

If this works, we can go on to define the healthcheck data we want to see for each of the services (dependencies, etc.).
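For the first item, a probe definition along these lines should be enough; this is a sketch against a hypothetical deployment, where the container name, image, and port are placeholders and only the /version path comes from the discussion above:

```yaml
# Sketch of liveness/readiness probes hitting the existing /version
# endpoint; container name, image, and port are placeholders.
containers:
  - name: users
    image: mainflux/users:latest
    ports:
      - containerPort: 8180
    livenessProbe:
      httpGet:
        path: /version
        port: 8180
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /version
        port: 8180
      initialDelaySeconds: 5
      periodSeconds: 5
```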

drasko (Contributor) commented Sep 21, 2018

nmarcetic (Collaborator) commented

Also related to #378. Closing this one.

anovakovic01 (Contributor) commented Nov 27, 2018

Resolved with #378.
