Specify health-check for each service #97
The new API we design should take care of two things:
This was also discussed during the Ingress v1beta -> v1 transition but was punted because of the scope of the change required.
Would it make sense to apply these to BackendPolicy? Also, BackendPolicy has a […]. We also need to think about the protocol of the service to know if we need, say, a TCP-level check instead of an HTTP one. This idea would add a health-check configuration block to that resource.
Note: these are in […]. Example YAML:
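For illustration, a minimal sketch of what such a block could look like, assuming the v1alpha1 BackendPolicy shape; the healthCheck field and everything under it are hypothetical and were never part of the Gateway API spec:

```yaml
# Hypothetical sketch only; healthCheck is not part of the Gateway API.
apiVersion: networking.x-k8s.io/v1alpha1
kind: BackendPolicy
metadata:
  name: store-backend
spec:
  targetRef:
    group: core          # assumed group naming for a core Service
    kind: Service
    name: store
  healthCheck:           # invented field, for illustration only
    protocol: HTTP       # a TCP-only Service would need a TCP-level check instead
    path: /healthz
    interval: 10s
    timeout: 2s
    healthyThreshold: 2
    unhealthyThreshold: 3
```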
One concern with backend policies is that it's not explicit when we should check it. This could be solvable by a comment in the spec, but we need to consider it. For example, if I have 10 Gateways, which ones should health check policy1? Is it all of them, any of them with routes to policy1, etc.? If all Gateways have routes to it, we now get 10x the health check load. The reason I care is that we will most likely have lots of Gateways, so I want to make sure the spec is clear here.
Yeah, having multiple Gateways makes it tricky; I'd expect each Gateway to do its own health checks against the resource.
Interesting. I think "who" does the health-checking is up to the implementation. Whether you use a health-checking service, a proxy, or some more complicated distributed mechanism is up to you and should be transparent to the end user. What matters more here is what happens when health checks are failing and how differently implementations react to that. As I think more about this, it feels like BackendPolicy is a resource for defining configuration of different types, not behavior. Making behavior consistent in this area seems infeasible.
Another question, which may be obvious but not to me: why do we need this at the gateway level instead of using Pod readinessProbes?
To follow up on this, I've written up a quick doc to try to summarize the portability of health check config across implementations. There's a spreadsheet that covers some of the most commonly supported features. I likely got at least some things wrong here, so please correct me if I did. At a high level, most implementations support HTTP health checks with the option to specify Path, Timeout, and Interval. The next most supported fields are Hostname and Healthy/Unhealthy Thresholds.
This seems to match what the Pod readiness probe supports. Any reason to not just use the exact same thing (as spec and API), and default to the HTTP readiness probe in the pod, if it exists? I assume that, like all K8s services, gateways will be required to respect the K8s readiness semantics as well, i.e. if the K8s kubelet marks a pod as not ready, the gateway should stop sending it traffic. The network view may be different from the kubelet view, but that doesn't mean users need to maintain two different endpoints. I am also a bit concerned about the scalability of such a system: distributed health checking in a mesh with multiple networks, security, etc. is pretty difficult. If we are worried about the feedback loop from K8s, we should also worry about the load and feedback loop of the health check system if it is mandated by the Gateway API.
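For comparison, this is roughly how the same fields (path, interval, timeout, thresholds, and hostname) are expressed today as a Pod readinessProbe in the core Kubernetes API; the names and values below are only illustrative:

```yaml
# Standard Kubernetes readiness probe; values are illustrative.
apiVersion: v1
kind: Pod
metadata:
  name: store
spec:
  containers:
  - name: app
    image: example.com/store:latest   # placeholder image
    readinessProbe:
      httpGet:
        path: /healthz                # "Path"
        port: 8080
        httpHeaders:
        - name: Host                  # roughly "Hostname"
          value: store.example.com
      periodSeconds: 10               # "Interval"
      timeoutSeconds: 2               # "Timeout"
      successThreshold: 1             # "Healthy Threshold"
      failureThreshold: 3             # "Unhealthy Threshold"
```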
This actually is a great point and requires some more discussion and probably an issue of its own. Some Ingress Controllers route traffic to the VIP of the k8s Service while others route traffic directly to the endpoints. Which behavior do we expect from implementations in Gateway API? cc @mark-church @bowei @robscott @danehans
Users can maintain the same endpoint if that's what they want. We are not asking users to maintain a different endpoint. Is that acceptable, @costinm?
I think this point is similar to John's point above.
I think the thing that we would probably need to be a little prescriptive about is that the Gateway (however it's implemented) should be doing its own active checking if health checks are specified. That way we can be clear about how this is distinct in behavior from adding a readiness check to a Pod, where not-ready endpoints simply stop being visible.
I don't think it's clear what "the Gateway" is, given the API is implementation-agnostic. Does it mean all instances of the gateway (pods, for in-cluster workloads) must send requests to the pods to determine if they're healthy? Does it mean one of them can and share the information with the others? Can I spin up a distinct workload that does the health checks and reports them back? Stretching this really far, can that distinct workload be "kubelet" and "reporting it back" mean "put it in the pod readiness status"?
That's a great point. I was thinking of it more in terms of saying something like "if you specify a health check here, it must be done by some mechanism other than a readiness check". You may use a readiness check as well (to filter endpoints or something), but asking here implies having the Gateway controller do something more active than that. Note that this isn't taking a stance on active health checks vs. passive ones, just that it's something that's not the kubelet. I do think it's fair to talk about "the Gateway" as a thing, since, at its heart, a Gateway describes some piece of network infrastructure that translates traffic that doesn't know about cluster networking internals into something that does. (I've been working on this as a Gateway definition, but I don't think it's all the way there.)
In searching for a solution, we were unable to come up with a portable way to support this feature. If there are more portable ways to tackle this, we would love to hear about them. For now, I'll close this issue.
@hbagdi: Closing this issue.
What would you like to be added:
As a Service owner, when I'm exposing my service to other users/services outside the k8s cluster (or even inside), I want to define active health-checking behavior for my service.
Why is this needed:
If an instance of a service becomes unhealthy, the proxy can skip sending requests to that specific instance and instead route traffic to other instances. This is also useful during rolling upgrades, where a health-check endpoint can stop responding so the instance stops receiving new connections before the pod is replaced.
/kind feature
/kind user-story