liveness and readiness checks for kubernetes #390

Closed

nwest1 opened this issue Sep 11, 2018 · 6 comments

nwest1 (Contributor) commented Sep 11, 2018

FEATURE REQUEST
Hello!

I think this is a should-have: we can use the /version endpoints for liveness probes, but I'd like to discuss what makes sense for readiness (if anything).

The only adapter that lacks this is mqtt at this point.

drasko (Contributor) commented Sep 11, 2018

@nwest1 indeed. We had /status before, but removed it when we added /version. It is unclear exactly what this healthcheck endpoint should contain: probably uptime, maybe even some underlying machine/arch/OS info...
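Purely for illustration, a payload along those lines could look something like the sketch below; nothing here is agreed, and the package, type, and field names are made up:

```go
// Sketch only: a possible healthcheck payload exposing uptime and
// machine/arch/OS info, as mentioned above. Names are hypothetical.
package health

import (
	"runtime"
	"time"
)

// Status is a hypothetical healthcheck response body.
type Status struct {
	Service string `json:"service"`
	Version string `json:"version"`
	Uptime  string `json:"uptime"`
	OS      string `json:"os"`
	Arch    string `json:"arch"`
}

// NewStatus fills the payload from the process start time and runtime info.
func NewStatus(service, version string, start time.Time) Status {
	return Status{
		Service: service,
		Version: version,
		Uptime:  time.Since(start).String(),
		OS:      runtime.GOOS,
		Arch:    runtime.GOARCH,
	}
}
```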

Regarding MQTT: it is our only NodeJS microservice, as we could not find an adequate Go candidate for an MQTT broker. And it is indeed missing the /version endpoint.

We are open to all proposals regarding this healthcheck endpoint. Also, if you have some code ready for the JS microservice please send a PR; otherwise, someone from the Mainflux team will take a look at this early next week.

chombium (Collaborator) commented

The liveness and readiness probes are simple HTTP GET or TCP endpoints: any status code greater than or equal to 200 and less than 400 indicates success, and every other status code is a failure. The body of the response is optional; it is normally used to give the user more context about the service and describe what went wrong. More details can be found in the Kubernetes documentation on probes. Liveness means that the service is running properly (the probe should fail in case of a fatal error, for example) and readiness means that it can accept traffic (it is ready to process requests).

A simple example in our case would be the users service. When it is started it can return liveness 200, but it should not return readiness 200 until the connection to the DB is established. If the service ends up in an unrecoverable state, the liveness probe should fail. Another example: most of our services depend on NATS. If NATS is not there, the services end up in an unrecoverable state and cannot work. In such cases the service can try to reconnect a few times and, if that fails, make the liveness probe fail so that Kubernetes restarts the service. At the moment the services try to connect to NATS at startup and exit if they cannot. I think we don't cover the case where NATS is running when the service starts but shuts down while the service is running.
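To make the distinction concrete, here is a minimal sketch of separate liveness and readiness handlers using plain net/http; the /health/live and /health/ready paths, the port, and the readiness flag are hypothetical, not existing Mainflux endpoints:

```go
// Sketch only: separate liveness and readiness handlers. Paths, port,
// and the readiness condition are placeholders, not Mainflux APIs.
package main

import (
	"net/http"
	"sync/atomic"
)

func main() {
	var ready atomic.Value
	ready.Store(false)

	// Liveness: the process is up and able to serve HTTP.
	http.HandleFunc("/health/live", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
	})

	// Readiness: OK only once dependencies (e.g. the DB) are reachable.
	http.HandleFunc("/health/ready", func(w http.ResponseWriter, _ *http.Request) {
		if ok, _ := ready.Load().(bool); ok {
			w.WriteHeader(http.StatusOK)
			return
		}
		w.WriteHeader(http.StatusServiceUnavailable)
	})

	go func() {
		// Placeholder for real dependency setup (e.g. connecting to the DB).
		ready.Store(true)
	}()

	http.ListenAndServe(":8180", nil)
}
```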

As a beginning, I suggest that we define in detail the dependencies (infrastructure and other Mainflux services) of each service and the ways we can check whether those dependencies are healthy. Most of this is already known (docker-compose, k8s service configs); we only need to think about how to check the health of the dependencies and combine those checks with the service's internal state to set the proper liveness and readiness status.
For the implementation we could have a common package for checking the infrastructure components (NATS, NGINX, Redis, ...), and each service would build its own checks on top of it (see the sketch below).
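As a rough sketch of what such a common package could look like (the package and identifier names are made up, and the go-nats import is only an assumption based on the client our services already use):

```go
// Sketch of a shared health-check package; the package and identifier
// names here are hypothetical, not something that exists in Mainflux.
package health

import (
	"fmt"

	"github.com/nats-io/go-nats"
)

// Checker reports whether one dependency (NATS, DB, Redis, ...) is healthy.
type Checker interface {
	Name() string
	Check() error
}

// NATSChecker wraps the NATS connection a service already holds.
type NATSChecker struct {
	Conn *nats.Conn
}

func (c NATSChecker) Name() string { return "nats" }

func (c NATSChecker) Check() error {
	if c.Conn == nil || !c.Conn.IsConnected() {
		return fmt.Errorf("nats: not connected")
	}
	return nil
}

// Ready runs all checks and returns the first failure, if any; a readiness
// handler can map a nil result to HTTP 200 and an error to 503.
func Ready(checks ...Checker) error {
	for _, c := range checks {
		if err := c.Check(); err != nil {
			return fmt.Errorf("%s check failed: %v", c.Name(), err)
		}
	}
	return nil
}
```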

@drasko I'll pick up this issue. It would be nice if someone else joined me, at least in defining and structuring the solution.

chombium self-assigned this Sep 12, 2018
drasko (Contributor) commented Sep 12, 2018

@chombium let's do this for a start:

  • Add k8s config that uses the existing /version endpoint for the healthcheck
  • Add the missing /version endpoint in mqtt.js

If this works, we can go on to define the healthcheck data we want to see for each of the services (dependencies, etc.).
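For the first item, a probe definition along these lines should be enough; this is a sketch against a hypothetical deployment, where the container name, image, and port are placeholders and only the /version path comes from the discussion above:

```yaml
# Sketch of liveness/readiness probes hitting the existing /version
# endpoint; container name, image, and port are placeholders.
containers:
  - name: users
    image: mainflux/users:latest
    ports:
      - containerPort: 8180
    livenessProbe:
      httpGet:
        path: /version
        port: 8180
      initialDelaySeconds: 10
      periodSeconds: 10
    readinessProbe:
      httpGet:
        path: /version
        port: 8180
      initialDelaySeconds: 5
      periodSeconds: 5
```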

drasko (Contributor) commented Sep 21, 2018

nmarcetic (Collaborator) commented

Also related to #378. Closing this one.

anovakovic01 (Contributor) commented Nov 27, 2018

Resolved with #378.
