Introduce grpc health check in etcd client #16276
cc @serathius @ahrtr @wenjiaswe @ptabor @etcd-io/members for discussion on whether this feature is wanted.
Hi @chaochn47, based on your PR #16278, it seems that the PR marks the NOT_SERVING status because of defrag. Should it be applied to all maintenance APIs? My only concern is that, based on https://github.com/grpc/grpc-go/blob/d7f45cdf9ae720256d328fcdb6356ae58122a7f6/health/client.go#L104, it might impact on-going requests (still transferring data) and force the connection to be recreated. (I haven't verified it yet, will update later.)
For me this duplicates #16007. We need a single comprehensive design for etcd probing. As part of that work we can update the gRPC probes.
Thanks @fuweid for the review!
Online defrag is one of the use cases where the server reports "unhealthy" to the client so that requests fail over. I do not plan to support any other maintenance APIs for now. #16278 is just a POC; ideally the health server should not be exposed to the whole maintenance server, only to the defrag API.
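A minimal sketch of that idea, assuming a `*health.Server` is already registered on etcd's gRPC server (the helper name and wiring here are hypothetical, not the actual #16278 code):

```go
package sketch

import (
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// defragWithHealthSignal flips the server-wide gRPC health status to
// NOT_SERVING while an online defrag runs, so that clients with client-side
// health checking enabled fail over to other endpoints. hsrv is assumed to be
// the *health.Server already registered on etcd's gRPC server; defrag stands
// in for the actual defragmentation call.
func defragWithHealthSignal(hsrv *health.Server, defrag func() error) error {
	hsrv.SetServingStatus("", healthpb.HealthCheckResponse_NOT_SERVING)
	// Restore SERVING once defrag finishes, whether or not it succeeded.
	defer hsrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	return defrag()
}
```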
I won't be too concerned about this failure mode since it is not called out in https://github.com/grpc/proposal/blob/master/A17-client-side-health-checking.md and grpc/proposal#97. But I will validate it by adding some test cases to https://github.com/grpc/grpc-go/blob/master/test/healthcheck_test.go in case it's an unexpected Go-specific implementation issue.
Thanks @serathius for taking a look!
I don't agree that it's a duplicate of #16007. gRPC client-side health checking relies on the etcd (gRPC) server pushing its health status to the client via a server-side streaming API so that requests can fail over. I do agree that some of the server health status determination logic can be reused, but the two proposals don't share the same motivation. For example, we definitely do not want to terminate the etcd server when online defrag is active, but we do expect the client to route requests to other etcd servers. @wenjiaswe @logicalhan please drop a comment in case I misunderstood proposal #16007, thx~
Pull versus push just changes who has control over the validation period: the server or the client. What's important here is the definition of etcd health. I expect both types of health checking will end up with a pretty similar definition of health.
Thanks. I think we are on the same page now. Do you still think it's a duplicate of #16007? Will it cover how to consume the health status change? I would like the etcd client to fail over to other healthy endpoints based on etcd health, not just on whether the connection is up or down.
/cc @jyotima |
That sounds like one case of "not ready" for /readyz; is my understanding correct?
Yeah, I think so, but I want this thread to focus on etcd client fail-over by introducing gRPC health checks. I suggest we move the conversation about classifying etcd failure modes into specific probes to #16007.
This issue can be left open, as a follow-up to introduce gRPC client health checks once we figure out proper health definitions.
It's a valid feature to me, and I agree we should design it together with the Livez/Readyz feature, because the key is how to define health (e.g. a write stuck for longer than some threshold). Also two detailed comments:
Thanks @ahrtr for the review. Definitely we want any feature request to be future-proof and not thrown away. However, unless we have a foreseeable approach to resolving the above two items in the upcoming year, this issue can be left on the table.
Hi @fuweid, I think the on-going RPCs can still proceed; please check the simulated test case chaochn47/grpc-go#1. Thank you~ I will polish it up and try to merge it into the gRPC-go tests. One caveat: if the watch stream is not interrupted, a client built on top of the watch as a cache may become stale. In order to terminate the watch stream, I guess the etcd server needs to be stopped.
FWIW, when disk I/O is blocked/stalled, the connection to the partitioned node stays in the Ready state, so a kube-apiserver watch cache built on top of the etcd watch will not receive updates and could potentially serve stale data. Instead, the watch should be cancelled by the server, which requires the client to re-create the watch on a different etcd node that is connected to quorum. So after etcd health status is properly defined, it should be consumed by the etcd client as well.
@chaochn47 Thanks for the confirmation. I also verified it locally. It won't impact streaming or ongoing unary requests. Side note: if a client, like etcdctl, doesn't use the health check config, it can still get the connection. I think the existing pod exec liveness probe won't be impacted.
I think this is very useful, specifically for defrag. In clusters with a large db size and high traffic, this is a real problem.
What would you like to be added?
Background
Ref: #8121 added a basic gRPC health service, server-side only, since etcd v3.3.
etcd/server/etcdserver/api/v3rpc/grpc.go, lines 81 to 86 at 6220174
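For reference, registering the standard gRPC health service on a grpc-go server typically looks like the following; this is a sketch, not necessarily the exact code at the permalink above:

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
)

// newGRPCServerWithHealth shows the usual grpc-go pattern for exposing the
// standard health service; the empty service name reports overall server health.
func newGRPCServerWithHealth() *grpc.Server {
	srv := grpc.NewServer()
	hsrv := health.NewServer()
	hsrv.SetServingStatus("", healthpb.HealthCheckResponse_SERVING)
	healthpb.RegisterHealthServer(srv, hsrv)
	return srv
}
```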
Problem
In a scenario with multiple etcd server endpoints, the etcd client only fails over to another endpoint when the existing connection/channel is not in the Ready state. However, the etcd client does not know whether the etcd server can actually handle RPCs: for example, the server may be running an online defrag, or it may be partitioned from quorum while its connection stays Ready.
This needs a comprehensive design and testing.
Placeholder Google doc "etcd client grpc health check", copied from the KEP template.
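For context, client-side health checking in grpc-go (per gRFC A17, referenced above) is enabled through the channel's service config plus a blank import of the health package. A rough sketch of what the etcd client could do (the target and dial options here are illustrative, not the actual clientv3 wiring):

```go
package sketch

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"

	// The blank import registers grpc-go's client-side health checking function.
	_ "google.golang.org/grpc/health"
)

// dialWithClientHealthCheck dials an endpoint with client-side health checking
// enabled. The empty serviceName checks the server's overall health; a
// balancing policy other than pick_first (e.g. round_robin) is required.
func dialWithClientHealthCheck(target string) (*grpc.ClientConn, error) {
	serviceConfig := `{
		"loadBalancingConfig": [{"round_robin":{}}],
		"healthCheckConfig": {"serviceName": ""}
	}`
	return grpc.Dial(target,
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(serviceConfig),
	)
}
```

With this in place, a connection whose server reports NOT_SERVING is taken out of rotation by the balancer, which is exactly the fail-over behavior this issue asks for.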
Why is this needed?
Improve etcd availability by failing over to other healthy etcd endpoints