Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Readiness check #50187

Open
jpountz opened this issue Dec 13, 2019 · 6 comments
Open

Readiness check #50187

jpountz opened this issue Dec 13, 2019 · 6 comments
Labels
:Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown >feature high hanging fruit Team:Core/Infra Meta label for core/infra team

Comments

@jpountz
Copy link
Contributor

jpountz commented Dec 13, 2019

It is a common need to check whether a node has started up. Common approaches today include calling existing APIs, like GET / and checking the HTTP response code. However @jasontedor noted that the fact that none of our APIs is designed with this goal is mind means that we might break this use-case inadvertently, which wouldn't be the case if we had a dedicated API.

We'd need to first understand what exact semantics are needed for a readiness check. For instance do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

Then should we make it its own dedicated API, or should we recommend using an existing API for this, like GET /? If the latter, then we will need to document it.

@jpountz jpountz added >feature :Core/Infra/Core Core issues without another label labels Dec 13, 2019
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-core-infra (:Core/Infra/Core)

@dakrone
Copy link
Member

dakrone commented Dec 13, 2019

should we recommend using an existing API for this, like GET /?

My preference would be a separate API. I think it's a feature that GET / is the lightest weight API we have, and I've used in in the past as way to tell whether a node is really in trouble (ie, does it respond to GET / okay).

@rjernst
Copy link
Member

rjernst commented Dec 13, 2019

do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

I think we need to have connected to/formed a cluster, and also async initialization like security/watcher services need to have completed once they get cluster state.

Additionally, IMO there is no need for an API. We can connect to and form a cluster with only the transport port bound, and then only bind to http once this is complete. This would allow users/tests to wait on the http port being bound. I started experimenting with this last year to improve how integ tests wait for ES to be ready, and would be happy to pick this back up. Last I remember, there were some issues in security, but I think those may now be solved with transport client being removed from master.

@ywelsch
Copy link
Contributor

ywelsch commented Dec 16, 2019

I think we need to have connected to/formed a cluster, and also async initialization like security/watcher services need to have completed once they get cluster state.

I wonder how the gateway settings such as gateway.expected_data_nodes should be treated (i.e. whether the cluster should be treated as ready / not ready in that case).

Additionally, IMO there is no need for an API. We can connect to and form a cluster with only the transport port bound, and then only bind to http once this is complete

The main issue I see with this is that you can't get any diagnostic output from the cluster through the APIs as to why the cluster is not forming or why some of the components did not initialize.

@jasontedor
Copy link
Member

In the context of Kubernetes, we need to distinguish a liveness check from a readiness check.

A liveness check is used to determine when to restart a container because it's sick.

A readiness check is used to determine when a container is ready to receive requests.

The distinction is important because we don't necessarily want to restart a container because it's temporarily partitioned from the cluster. A failing liveness check is used to restart the container, a failing readiness check is used to stop routing traffic.

In this context, I don't think we need to think about

do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

rather we have dedicated checks for each use case.

In the context of a Docker HEALTHCHECK it seems that using the above language, a liveness check is more appropriate (to restart unhealthy containers). In the case of docker-compose healthcheck, it seems to me that readiness is the more appropriate check as otherwise the check can not be relied up to indicate when the service is ready for dependent services to startup (e.g., Kibana).

@sebgl
Copy link

sebgl commented Jan 30, 2020

A related ECK user question: https://discuss.elastic.co/t/does-elastic-have-a-healthcheck-endpoint-that-does-not-require-username-and-password/217090

The default AWS EKS LoadBalancer implementation has its own healthcheck logic. It can be configured with http port and path, but not with authentication details. I would have expected it to be bound to the Pod readiness instead, as this is usually what happens with Kubernetes Services.

This also stands true for GCP default Ingress.

We may want to simplify how a user can setup unauthenticated access to the healthcheck endpoint?

@rjernst rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020
@rjernst rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020
@williamrandolph williamrandolph added high hanging fruit and removed needs:triage Requires assignment of a team area label labels Jan 7, 2021
@gwbrown gwbrown added :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown and removed :Core/Infra/Core Core issues without another label labels Mar 24, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown >feature high hanging fruit Team:Core/Infra Meta label for core/infra team
Projects
None yet
Development

No branches or pull requests

9 participants