Readiness check #50187

jpountz · 2019-12-13T15:00:33Z

It is a common need to check whether a node has started up. Common approaches today include calling existing APIs, like GET / and checking the HTTP response code. However @jasontedor noted that the fact that none of our APIs is designed with this goal is mind means that we might break this use-case inadvertently, which wouldn't be the case if we had a dedicated API.

We'd need to first understand what exact semantics are needed for a readiness check. For instance do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

Then should we make it its own dedicated API, or should we recommend using an existing API for this, like GET /? If the latter, then we will need to document it.

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-12-13T15:00:36Z

Pinging @elastic/es-core-infra (:Core/Infra/Core)

dakrone · 2019-12-13T16:40:00Z

should we recommend using an existing API for this, like GET /?

My preference would be a separate API. I think it's a feature that GET / is the lightest weight API we have, and I've used in in the past as way to tell whether a node is really in trouble (ie, does it respond to GET / okay).

rjernst · 2019-12-13T18:43:58Z

do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

I think we need to have connected to/formed a cluster, and also async initialization like security/watcher services need to have completed once they get cluster state.

Additionally, IMO there is no need for an API. We can connect to and form a cluster with only the transport port bound, and then only bind to http once this is complete. This would allow users/tests to wait on the http port being bound. I started experimenting with this last year to improve how integ tests wait for ES to be ready, and would be happy to pick this back up. Last I remember, there were some issues in security, but I think those may now be solved with transport client being removed from master.

ywelsch · 2019-12-16T09:02:54Z

I think we need to have connected to/formed a cluster, and also async initialization like security/watcher services need to have completed once they get cluster state.

I wonder how the gateway settings such as gateway.expected_data_nodes should be treated (i.e. whether the cluster should be treated as ready / not ready in that case).

Additionally, IMO there is no need for an API. We can connect to and form a cluster with only the transport port bound, and then only bind to http once this is complete

The main issue I see with this is that you can't get any diagnostic output from the cluster through the APIs as to why the cluster is not forming or why some of the components did not initialize.

jasontedor · 2019-12-16T23:56:04Z

In the context of Kubernetes, we need to distinguish a liveness check from a readiness check.

A liveness check is used to determine when to restart a container because it's sick.

A readiness check is used to determine when a container is ready to receive requests.

The distinction is important because we don't necessarily want to restart a container because it's temporarily partitioned from the cluster. A failing liveness check is used to restart the container, a failing readiness check is used to stop routing traffic.

In this context, I don't think we need to think about

do we only need to check whether the node has started up, or do we also need to know whether it has formed a cluster that has an elected master?

rather we have dedicated checks for each use case.

In the context of a Docker HEALTHCHECK it seems that using the above language, a liveness check is more appropriate (to restart unhealthy containers). In the case of docker-compose healthcheck, it seems to me that readiness is the more appropriate check as otherwise the check can not be relied up to indicate when the service is ready for dependent services to startup (e.g., Kibana).

sebgl · 2020-01-30T09:29:23Z

A related ECK user question: https://discuss.elastic.co/t/does-elastic-have-a-healthcheck-endpoint-that-does-not-require-username-and-password/217090

The default AWS EKS LoadBalancer implementation has its own healthcheck logic. It can be configured with http port and path, but not with authentication details. I would have expected it to be bound to the Pod readiness instead, as this is usually what happens with Kubernetes Services.

This also stands true for GCP default Ingress.

We may want to simplify how a user can setup unauthenticated access to the healthcheck endpoint?

jpountz added >feature :Core/Infra/Core Core issues without another label labels Dec 13, 2019

jpountz mentioned this issue Dec 13, 2019

Elasticsearch readiness probe might fail if a single node is stuck elastic/cloud-on-k8s#2248

Closed

tylersmalley mentioned this issue Dec 19, 2019

Create health check / readiness endpoint elastic/kibana#46984

Closed

sebgl mentioned this issue Mar 16, 2020

Create default role for health checks elastic/cloud-on-k8s#2708

Open

rjernst added the Team:Core/Infra Meta label for core/infra team label May 4, 2020

rjernst added the needs:triage Requires assignment of a team area label label Dec 3, 2020

williamrandolph added high hanging fruit and removed needs:triage Requires assignment of a team area label labels Jan 7, 2021

gwbrown added :Core/Infra/Node Lifecycle Node startup, bootstrapping, and shutdown and removed :Core/Infra/Core Core issues without another label labels Mar 24, 2021

jasontedor mentioned this issue Nov 30, 2021

Introduce an unauthenticated endpoint for readiness checks #81168

Closed

pebrc mentioned this issue May 23, 2024

Use Elasticearch readiness port instead of / for Pod readiness elastic/cloud-on-k8s#7841

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Readiness check #50187

Readiness check #50187

jpountz commented Dec 13, 2019

elasticmachine commented Dec 13, 2019

dakrone commented Dec 13, 2019

rjernst commented Dec 13, 2019

ywelsch commented Dec 16, 2019

jasontedor commented Dec 16, 2019

sebgl commented Jan 30, 2020 •

edited

Loading

Readiness check #50187

Readiness check #50187

Comments

jpountz commented Dec 13, 2019

elasticmachine commented Dec 13, 2019

dakrone commented Dec 13, 2019

rjernst commented Dec 13, 2019

ywelsch commented Dec 16, 2019

jasontedor commented Dec 16, 2019

sebgl commented Jan 30, 2020 • edited Loading

sebgl commented Jan 30, 2020 •

edited

Loading