Introduce grpc health check in etcd client #16276

Open
chaochn47 opened this issue Jul 20, 2023 · 17 comments

Comments

@chaochn47
Member

chaochn47 commented Jul 20, 2023

What would you like to be added?

Background

gRPC Health checks are used to probe whether the server is able to handle RPCs. A server may choose to reply “unhealthy” because it is not ready to take requests, is shutting down, or for some other reason. The client can act accordingly if a response is not received within some time window or the response reports the server as unhealthy.

ref.

  1. https://github.com/grpc/grpc/blob/master/doc/health-checking.md
  2. https://github.com/grpc/proposal/blob/master/A17-client-side-health-checking.md

#8121 added a basic gRPC health service on the server side only, starting with etcd v3.3.

import (
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// server should register all the services manually
// use empty service name for all etcd services' health status,
// see https://github.com/grpc/grpc/blob/master/doc/health-checking.md for more
hsrv := health.NewServer()
hsrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
healthpb.RegisterHealthServer(grpcServer, hsrv)
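
The client-side counterpart described in gRFC A17 is enabled in grpc-go through the channel's service config together with the round_robin balancer. Below is a minimal sketch of a raw grpc-go client opting in (the address is a placeholder; wiring an equivalent config through etcd's clientv3 balancer is exactly the open question of this issue):

package main

import (
    "log"

    "google.golang.org/grpc"
    "google.golang.org/grpc/credentials/insecure"
    _ "google.golang.org/grpc/health" // registers the client-side health checking function
)

func main() {
    // Sketch only: a plain grpc-go client that skips subchannels whose
    // Health.Watch stream reports anything other than SERVING.
    conn, err := grpc.Dial("127.0.0.1:2379",
        grpc.WithTransportCredentials(insecure.NewCredentials()),
        grpc.WithDefaultServiceConfig(`{
            "loadBalancingConfig": [{"round_robin":{}}],
            "healthCheckConfig": {"serviceName": ""}
        }`))
    if err != nil {
        log.Fatal(err)
    }
    defer conn.Close()
}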

Problem

With multiple etcd server endpoints, the etcd client only fails over to another endpoint when the existing connection/channel is not in the Ready state. However, the etcd client does not know whether the etcd server can actually handle RPCs.

For example

  1. defrag stops the process completely for noticeable duration #9222
  2. Removed etcd member failed to stop on stuck disk #14338

This needs a comprehensive design and thorough testing.

Placeholder Google doc: etcd client grpc health check (copied from the KEP template).

Why is this needed?

Improve etcd availability by failing over to other healthy etcd endpoints

@chaochn47
Member Author

@serathius @ahrtr @wenjiaswe @ptabor @etcd-io/members for discussion if this feature is wanted.

@fuweid
Member

fuweid commented Jul 23, 2023

Hi @chaochn47, based on your PR #16278, it seems that the PR marks the NOT_SERVING status because of defrag. Should it be applied to all maintenance APIs as well? My only concern is that, based on https://github.com/grpc/grpc-go/blob/d7f45cdf9ae720256d328fcdb6356ae58122a7f6/health/client.go#L104, it might impact on-going requests (still transferring data) and force the connection to be recreated. (I haven't verified it yet; I will update later.)

@serathius
Member

For me this duplicates #16007. We need a single comprehensive design for etcd probing. As part of that work we can update the gRPC probes.

@chaochn47
Member Author

chaochn47 commented Jul 24, 2023

Thanks @fuweid for the review!

based on your PR #16278, it seems that the PR marks the NOT_SERVING status because of defrag. Should it be applied to all maintenance APIs as well?

Online defrag is one of the use cases where the server reports "unhealthy" to the client so that requests fail over. I do not plan to support any other maintenance APIs for now. #16278 is just a POC; ideally the health server should not be exposed to the whole maintenance server, only to the defrag API.
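
To illustrate the shape of the POC (all names below are hypothetical, not the actual #16278 diff): the defrag path would flip the shared health server to NOT_SERVING before defragmentation starts and restore SERVING afterwards.

import (
    "google.golang.org/grpc/health"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// defragWithHealthSignal is a hypothetical helper: hs is the *health.Server
// registered on the gRPC server, defrag is the actual defragmentation work.
func defragWithHealthSignal(hs *health.Server, defrag func() error) error {
    // Tell health-checking clients to fail over before the backend blocks.
    hs.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
    defer hs.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
    return defrag()
}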

My only concern is that, based on https://github.com/grpc/grpc-go/blob/d7f45cdf9ae720256d328fcdb6356ae58122a7f6/health/client.go#L104, it might impact on-going requests (still transferring data) and force the connection to be recreated.

I'm not too concerned about this failure mode since it is not called out in https://github.com/grpc/proposal/blob/master/A17-client-side-health-checking.md or grpc/proposal#97. But I will validate it by adding some test cases to https://github.com/grpc/grpc-go/blob/master/test/healthcheck_test.go in case it's an unexpected Go-specific implementation issue.

@chaochn47
Member Author

chaochn47 commented Jul 24, 2023

Thanks @serathius for taking a look!

For me this duplicates #16007. We need a single comprehensive design for etcd probing. As part of that work we can update the gRPC probes.

I don't agree that it's a duplicate of #16007. I believe the /livez and /readyz concepts are borrowed from https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/ and form a polling health check mechanism. They are designed for etcd administration, to determine when to restart/replace etcd with a new process/container or when to route client traffic to an etcd server.

However, gRPC client-side health checking relies on the etcd (gRPC) server pushing its health status to the client via a server-side streaming API so that requests can fail over. It works only with the round-robin policy and will improve etcd availability.
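
Concretely, the push happens over the standard Health.Watch server-streaming RPC. A hand-rolled consumer would look roughly like the sketch below (watchHealth is a hypothetical helper; the grpc-go balancer does the equivalent internally when client-side health checking is enabled):

import (
    "context"
    "log"

    "google.golang.org/grpc"
    healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// watchHealth logs serving-status changes pushed by the server over the
// Health.Watch stream; an empty service name covers all etcd services.
func watchHealth(ctx context.Context, conn *grpc.ClientConn) error {
    stream, err := healthpb.NewHealthClient(conn).Watch(ctx,
        &healthpb.HealthCheckRequest{Service: ""})
    if err != nil {
        return err
    }
    for {
        resp, err := stream.Recv()
        if err != nil {
            return err
        }
        log.Printf("serving status changed to %s", resp.Status)
    }
}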

I do agree that some of the server health status determination logic can be reused, but the two proposals don't share the same motivation. For example, we definitely do not want to terminate an etcd server while an online defrag is active, but we do expect the client to route requests to other etcd servers.

@wenjiaswe @logicalhan please drop a comment in case I misunderstood the proposal #16007, thx~

@serathius
Member

serathius commented Jul 25, 2023

However, gRPC client-side health checking relies on the etcd (gRPC) server pushing its health status to the client via a server-side streaming API so that requests can fail over. It works only with the round-robin policy and will improve etcd availability.

Pull vs. push just changes who has control over the validation period: the server or the client.

What's important here is the definition of etcd health. I expect both types of health checking will end up with pretty similar definitions of health.

@chaochn47
Member Author

Thanks. I think we are on the same page now.

Do you still think it's a duplicate of #16007? Will it cover how to consume the health status change?

I would like the etcd client to fail over to other healthy endpoints based on etcd health, not just on whether the connection is up or down.

@chaochn47
Member Author

/cc @jyotima

@wenjiaswe
Contributor

gRPC Health checks are used to probe whether the server is able to handle RPCs.

That sounds like one case of "not ready" for /readyz; is my understanding correct?

@chaochn47
Member Author

Yeah, I think so, but I want this thread to focus on etcd client failover by introducing gRPC health checks.

I suggest we move the conversation about classifying etcd failure modes into the different probes over to #16007.

@serathius
Member

This issue can be left open as a follow-up to introduce gRPC client health checks once we figure out proper health definitions.

@ahrtr
Member

ahrtr commented Jul 26, 2023

It's a valid feature to me, and I agree we should design it together with the livez/readyz feature, because the key is how to define health (e.g. being stuck on a write for longer than some threshold).

Also two detailed comments:

@chaochn47
Member Author

Thanks @ahrtr for the review,

We definitely want any feature request to be future-proof and not thrown away. However, unless we have a foreseeable approach to resolving the above two items in the upcoming year, this issue can be left on the table.

@chaochn47
Member Author

chaochn47 commented Aug 1, 2023

My only concern is that, based on https://github.com/grpc/grpc-go/blob/d7f45cdf9ae720256d328fcdb6356ae58122a7f6/health/client.go#L104, it might impact on-going requests (still transferring data) and force the connection to be recreated. (I haven't verified it yet; I will update later.)

Hi @fuweid, I think the on-going RPCs can still proceed; please check the simulated test case chaochn47/grpc-go#1. Thank you ~

I will polish it up and try to merge it into the gRPC-go tests.

One caveat: if the watch stream is not interrupted, a client that builds a cache on top of the watch may serve stale data. To terminate the watch stream, I guess the etcd server needs to be stopped.

@chaochn47
Member Author

One caveat: if the watch stream is not interrupted, a client that builds a cache on top of the watch may serve stale data. To terminate the watch stream, I guess the etcd server needs to be stopped.

FWIW, when disk I/O is blocked/stalled, the partitioned node stays in the state s.Leader() != types.ID(raft.None). The watch is not cancelled because monitorLeader depends on s.Leader() == types.ID(raft.None). This is explained in #13527, with reproduction steps provided.

The kube-apiserver watch cache built on top of the etcd watch will not receive updates and could potentially serve stale data. Instead, the watch should be cancelled by the server, requiring the client to re-create the watch on a different etcd node that is connected to the quorum.

So after etcd health status is properly defined, it should be consumed by monitorLeader as well.
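
A rough sketch of that idea (all names hypothetical, not etcd's actual interceptor code): a monitor goroutine that consumes a broader health signal and cancels tracked watch streams when the member becomes unhealthy.

import (
    "context"
    "time"
)

// monitorHealth cancels tracked watch streams once healthy() reports false,
// forcing clients to re-create their watches on a healthy member.
// The probe interval is an arbitrary assumption.
func monitorHealth(ctx context.Context, healthy func() bool, cancelStreams func()) {
    ticker := time.NewTicker(5 * time.Second)
    defer ticker.Stop()
    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            if !healthy() {
                cancelStreams()
                return
            }
        }
    }
}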

@fuweid
Member

fuweid commented Aug 2, 2023

I think the on-going RPCs can still proceed; please check the simulated test case chaochn47/grpc-go#1. Thank you ~

@chaochn47 Thanks for the confirmation. I also verified it locally; it won't impact streaming or ongoing unary requests. Side note: if a client such as etcdctl doesn't use the health profile, it can still get the connection, so I think the existing pod exec liveness probe won't be impacted.

@lavacat

lavacat commented Aug 5, 2023

I think this is very useful, specifically for defrag. In clusters with a large db size and high traffic this is a real problem: when maxGapBetweenApplyAndCommitIndex is reached and defrag is still blocking applies, etcd will fail requests.
