Adding KEP 2433: Topology Aware Hints #2434
Conversation
A couple of small comments for the PRR - please address.
And I will wait with my approval for SIG approval.
- The `topology.kubernetes.io/topology-aware-routing` annotation is set to
  `Auto` for the Service.
- At least one endpoint for the Service has a hint pointing to the zone
  Kube-Proxy is running within.
At least one with a hint to our zone, or also endpoints without any hint?
I think the safest approach here might actually be to ignore all hints if some are missing. I think the most likely cause for hints missing would be a transition between states, and that transition exposes us to risk if we don't fallback to all endpoints immediately. For example, if we have a 500 endpoint Service, updates are going to come to kube-proxy for each EndpointSlice individually.
It's possible that at one point only one EndpointSlice will have hints for each endpoint. Without an immediate fallback, we'd risk significantly overloading these endpoints.
The other alternative is also not great. Let's say we always send traffic to endpoints with matching hints + endpoints without any hints. In the above scenario, we'd overload endpoints without hints if we were weighting all endpoints equally.
If kube-proxy could reliably determine how many zones were in a cluster, we could do a variation of that second approach that gave endpoints without a zone hint a 1/n weight, where n == the number of zones in the cluster. I think that could still result in imbalanced traffic during a transition depending on the composition of slices with/without hints.
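To make the conservative option concrete, here is a minimal Go sketch of the "ignore all hints if any are missing" fallback; the types, names, and structure are hypothetical, not the actual kube-proxy implementation.

```go
package main

import "fmt"

// Endpoint is a hypothetical, simplified view of an endpoint as kube-proxy
// might see it after merging EndpointSlices.
type Endpoint struct {
	Address  string
	ZoneHint string // empty if the EndpointSlice carried no hint for this endpoint
}

// filterByZoneHint returns the endpoints kube-proxy should program for its
// zone. Following the conservative approach described above, it falls back to
// the full endpoint set as soon as any endpoint is missing a hint.
func filterByZoneHint(endpoints []Endpoint, myZone string) []Endpoint {
	var matching []Endpoint
	for _, ep := range endpoints {
		if ep.ZoneHint == "" {
			// Likely a transition between states: ignore all hints rather
			// than risk overloading the few endpoints that do have them.
			return endpoints
		}
		if ep.ZoneHint == myZone {
			matching = append(matching, ep)
		}
	}
	if len(matching) == 0 {
		// No endpoint is hinted for this zone (one of the KEP's conditions
		// is not met), so fall back to everything.
		return endpoints
	}
	return matching
}

func main() {
	eps := []Endpoint{
		{Address: "10.1.2.1", ZoneHint: "zone-a"},
		{Address: "10.1.2.2", ZoneHint: "zone-b"},
		{Address: "10.1.2.3", ZoneHint: ""}, // hint not yet populated
	}
	fmt.Println(filterByZoneHint(eps, "zone-a")) // all three endpoints: fallback triggered
}
```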
Let's be safe.
Easy: if all endpoints have the hint, then respect the hint
More nuanced: note how many unique values we see and only use the hint if ~ at least 1/N of all endpoints are for this zone (with some fudge factor).
I'm not sure the more nuanced answer is useful.
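For comparison, a hypothetical sketch of the "more nuanced" 1/N rule; the function name and the fudge factor value are assumptions, not part of the proposal.

```go
package main

import "fmt"

// useHintsForZone counts the unique hint values (≈ number of zones, N) and
// only honors hints if roughly at least 1/N of all endpoints are hinted for
// this zone. zoneHints holds one hint per endpoint ("" if missing).
func useHintsForZone(zoneHints []string, myZone string) bool {
	zones := map[string]bool{}
	hintedHere := 0
	for _, hint := range zoneHints {
		if hint == "" {
			return false // incomplete hints: be safe and ignore them all
		}
		zones[hint] = true
		if hint == myZone {
			hintedHere++
		}
	}
	if len(zones) == 0 {
		return false
	}
	const fudge = 0.5 // assumed fudge factor: accept down to half of an even 1/N share
	share := float64(hintedHere) / float64(len(zoneHints))
	return share >= fudge/float64(len(zones))
}

func main() {
	// 3 zones, 6 endpoints, 2 hinted for zone-a: share 1/3 >= 0.5 * 1/3, so use hints.
	hints := []string{"zone-a", "zone-a", "zone-b", "zone-b", "zone-c", "zone-c"}
	fmt.Println(useHintsForZone(hints, "zone-a")) // true
}
```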
Yeah, every nuanced approach here feels like it could lead to more trouble than it's worth. I'd lean towards the safest approach, at least for alpha.
I think we should explore the more nuanced approach, but I agree not for Alpha. Maybe we can explicitly add it as a beta graduation criterion so we have an answer for it by then.
* **Can the feature be disabled once it has been enabled (i.e. can we roll back
  the enablement)?**
  Yes. It can easily be disabled universally by turning off the feature gate or
When it goes GA, will it always be enabled?
Good question. Although the feature gate would be enabled by default, it would still require per-Service opt-in. I think we could potentially change the default value of this annotation with a follow up KEP. This is hidden a bit further up in the KEP:
A future KEP will explore changing the default behavior from `Disabled` to `Auto` with a new feature gate.
👍
### Kube-Proxy

When the `TopologyAwareHints` feature gate is enabled, Kube-Proxy will be
I don't remember why, but for slices I think you ended up having two independent feature gates for kube-proxy and the slice controller - can this happen again?
Yeah I almost did that here. In that case I actually split the features up into different KEPs entirely because they felt sufficiently different. This feels more like a single feature, but I'm open to a separate feature gate here. Any naming recommendations?
I was just asking because I don't remember the cause. If we don't know, I don't think it is a big deal to amend it later if we find out it is really needed.
#### Future API Expansion |
So your future safety here is "add a new field and treat them as mutually exclusive" ? It seems to me that we'll need multiple zones (and probably weight) pretty quickly in order to handle small services.
e.g. 2 replicas (in zones A and B) and 3 node zones (A, B, C). You'll need to set up zone A to send to replica A 100%, Zone B to replica B 100% and zone C to replicas A and B at 50%. That's a simple case where weight doesn't matter but you simply can't represent an endpoint in 2 zones with the above proposal.
e.g. 4 replicas (2 in zone A, 1 each in B and C) and 3 node-zones. You'll want something like:
zone A: 50% to replica a1; 50% to replica a2
zone B: 75% to replica b1; 25% to replica a1
zone C: 75% to replica c1; 25% to replica a2
I think it will be important quickly - am I wrong? To that end, I'd suggest we start with `zones []struct` now and, as the impl builds up, consider weight.
Note this is different than general per-endpoint weight (which we should still consider on its own), because this is a hint to consumers how to split for capacity (based on the set of endpoints), not an indicator of capacity (which is more intrinsic to the endpoint). I'm not sure that holds water yet, ergo let's not add weight yet :)
The answer for the current proposal is that we would simply fall back to slices without hints if we could not reach a reasonable balance otherwise. When we tested weighting with the previous proposals it looked like it would significantly increase the number of EndpointSlice updates. Weight is most useful when there are a small number of endpoints. This is also when topology aware routing includes the most risk since it would be very easy for an individual endpoint to be overloaded. So even if we had weight and could accomplish topology aware routing with only 2 endpoints, the result might not be great. Agree that it's worth expanding the API to make weight easy to add in the future and also agree that we shouldn't add weight yet.
My argument is to start now with `zones []struct` with no weight, and when (inevitably, IMO) we get complaints that small-N services don't work, we can revisit weight. It might be, as you say, that weight only matters at the very low end (N < 12), but that's a lot of Services. All your harm-reduction heuristics will have to consider it, so it's definitely a different discussion.
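A sketch of what such a plural zones field might look like, purely as an illustration of the shape being argued for; the package, type, and field names here are hypothetical, not the API as proposed or merged.

```go
// Hypothetical API shape only: a plural zones field with room for weight later.
package discoveryv1beta1 // illustrative package name

// ZoneHint names a zone an endpoint should be consumed from.
type ZoneHint struct {
	// Name is the zone this endpoint is hinted for.
	Name string `json:"name"`
	// A Weight field could be added here later for small-N Services;
	// deliberately omitted for now per the discussion above.
}

// EndpointHints carries topology hints for a single endpoint.
type EndpointHints struct {
	// ForZones lists the zones this endpoint is hinted for; a slice keeps
	// room for representing one endpoint "donated" to two zones without a
	// breaking change.
	ForZones []ZoneHint `json:"forZones,omitempty"`
}
```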
- The `topology.kubernetes.io/topology-aware-routing` annotation is set to
  `Auto` for the Service.
- At least one endpoint for the Service has a hint pointing to the zone
  Kube-Proxy is running within.
Let's pin down the basics and finish the API argument in the code PR(s).
#### Overload

Overload is a key concept for this proposal. This occurs when there are less
This seems to be a property of a zone only - or will it be extended to the other things that you mention, like regions?
I think the same concepts could apply to regions; it's just less common in my experience to see individual clusters spanning regions. This may be something that becomes more relevant with multicluster Services.
👍
  Kube-Proxy is running within.
- All endpoints for the Service have zone hints.

When the above conditions are true, kube-proxy will only route traffic to
Based on the example you've added:

```yaml
- addresses: ["10.1.2.1"]
  zone: "zone-a"
  hints:
    zone: "zone-a"
- addresses: ["10.1.2.2"]
  zone: "zone-b"
  hints:
    zone: "zone-b"
- addresses: ["10.1.2.3"]
  zone: "zone-a"
  hints:
    zone: "zone-c"
```

kube-proxies in zone A will never use

```yaml
- addresses: ["10.1.2.3"]
  zone: "zone-a"
  hints:
    zone: "zone-c"
```

despite it being in the same zone - is that correct?
yeah, in this case C didn't have any so it was "donated" from A, and therefore all zones are ~equally provisioned.
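As a toy illustration of that donation arithmetic, here is a sketch that assumes hints are distributed in proportion to per-zone node counts; that capacity proxy and the function names are assumptions, not necessarily the controller's actual heuristic.

```go
package main

import "fmt"

// expectedEndpoints computes how many endpoint hints each zone "should"
// receive, in proportion to its share of nodes (assumed capacity proxy).
// Zones with surplus endpoints then donate hints to zones with a deficit.
func expectedEndpoints(nodesPerZone map[string]int, totalEndpoints int) map[string]float64 {
	totalNodes := 0
	for _, n := range nodesPerZone {
		totalNodes += n
	}
	want := map[string]float64{}
	for zone, n := range nodesPerZone {
		want[zone] = float64(totalEndpoints) * float64(n) / float64(totalNodes)
	}
	return want
}

func main() {
	nodes := map[string]int{"zone-a": 3, "zone-b": 3, "zone-c": 3}
	// 3 endpoints total, but zone-a has 2 and zone-c has 0: zone-a's surplus
	// endpoint gets a hint for zone-c, as in the EndpointSlice example above.
	fmt.Println(expectedEndpoints(nodes, 3)) // map[zone-a:1 zone-b:1 zone-c:1]
}
```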
Then it is possible that pods on the same node as the endpoint will not use that local endpoint, because it was donated to another zone. I don't know how much you are into soccer/football, but there is a legendary sentence from a Brazilian coach nicknamed Tim: "Playing soccer is like covering yourself with a short blanket." - either you cover your head, or you wrap your feet 😄
Haha yep, that is definitely an unfortunate downside of this approach. The previous subsetting approach was incompatible with `ExternalTrafficPolicy=Local` for the same reason.
/lgtm
Some super minor comments.
```go
const SubSystem = "endpoint_slice_controller"

// This metric observes overload per zone for endpoints in EndpointSlices
EPSOverloadByZone = metrics.NewHistogramVec(
```
How are we going to report this metric? Whenever you process a particular Service, do you report it for each of the zones?
I'm definitely not against this metric, but I'm having some trouble intuitively explaining to myself what it means.
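For reference, a hypothetical sketch of the "report it for each zone on every Service sync" reading of this question, based on the `EPSOverloadByZone` histogram quoted above; the metric name, buckets, helper, and package are illustrative assumptions, and registration via legacyregistry is omitted.

```go
package endpointslicemetrics

import "k8s.io/component-base/metrics"

const subSystem = "endpoint_slice_controller"

var epsOverloadByZone = metrics.NewHistogramVec(
	&metrics.HistogramOpts{
		Subsystem: subSystem,
		Name:      "endpoints_overload_by_zone", // illustrative name
		Help:      "Overload per zone for endpoints in EndpointSlices",
		Buckets:   []float64{0.1, 0.25, 0.5, 1, 2, 4},
	},
	[]string{"zone"},
)

// recordOverload would be called whenever a Service syncs, observing one
// sample per zone - which is what makes the resulting distribution hard to
// interpret, as discussed below.
func recordOverload(overloadByZone map[string]float64) {
	for zone, overload := range overloadByZone {
		epsOverloadByZone.WithLabelValues(zone).Observe(overload)
	}
}
```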
Yeah I'm really not sure, I was hoping that reporting every value when a Service syncs would give us somewhat helpful data as far as distribution. I think some visibility into overload would be really helpful but I still haven't found a great way to communicate that with a metric.
I see two possible use cases:
- help users understand how well or badly distributed their deployments are, i.e. they are overloading zoneA, ...
- understand the distribution algorithm's convergence; I think that in a dynamic environment this is going to fluctuate, and if you see it changing constantly, that means you are constantly redistributing endpoints, which will impact workloads
Sure - I fully buy the use cases (those are the same ones I had in mind).
My concern was more: "would this metric allow us to answer those questions?"
And I'm not convinced (at least not intuitively...)
Yeah, I'm not sure it would either. I've been trying to think of alternatives but haven't so far. To avoid overcommitting to anything, I've removed this from the KEP for now, will keep on thinking of potential metrics we could add here. Any ideas would be very appreciated.
This feature is not yet implemented but per-Service enablement/disablement is
covered in depth as part of the test plan.

### Rollout, Upgrade and Rollback Planning
Technically neither of the below is needed for Alpha, but I'm really happy to see that already filled in :)
### Configuration

A new `service.kubernetes.io/topology-aware-routing` annotation will be
Instead of an annotation, I wonder if `TopologyAware` (or just `Topology`) can be one of the possible values for `internalTrafficPolicy`. That way we can validate and reject it when `externalTrafficPolicy: Local` or when `internalTrafficPolicy: Local|PreferLocal`.
I guess we could also validate against the annotation, but I'm not sure what the precedent for that is.
Good call. Based on the suggestion here and in that KEP (#2441 (comment)) and some follow up discussion in Slack, I think it makes sense to integrate this with the `trafficPolicy` field instead of adding a new annotation. I've updated this KEP to reflect that. I think you provided a good summary in Slack of what we're thinking here:
- if `externalTrafficPolicy=Cluster`, fall back to `trafficPolicy` for external sources
- if `externalTrafficPolicy=Local`, this rule only takes precedence for external sources, with internal traffic following `trafficPolicy`
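A small sketch of the precedence summarized in the two bullets above; the `trafficPolicy` field and the example value names are part of the proposal being discussed here, not shipped API, and the function is hypothetical.

```go
package main

import "fmt"

type Policy string

const (
	Cluster Policy = "Cluster"
	Local   Policy = "Local"
)

// effectivePolicy picks which policy governs a given connection, per the
// two rules above: externalTrafficPolicy=Local wins only for external
// sources; everything else follows the Service's trafficPolicy.
func effectivePolicy(externalTrafficPolicy, trafficPolicy Policy, externalSource bool) Policy {
	if externalSource && externalTrafficPolicy == Local {
		return Local
	}
	return trafficPolicy
}

func main() {
	// "Topology" stands in for a topology-aware trafficPolicy value.
	fmt.Println(effectivePolicy(Local, "Topology", true))   // Local
	fmt.Println(effectivePolicy(Local, "Topology", false))  // Topology
	fmt.Println(effectivePolicy(Cluster, "Topology", true)) // Topology
}
```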
@thockin @andrewsykim I've updated this KEP to use the `trafficPolicy` field.
/lgtm
/approve
This functionality will be guarded by the `TopologyAwareHints` feature gate.
This gate will be dependent on the `ServiceInternalTrafficPolicy` feature gate
since it uses the `TrafficPolicy` guarded by that gate. This gate should not be
enabled if the deprecated `ServiceTopology` gate is enabled.
LOL - can we get a table or something?
:)
Haha, cleaned this up so it's hopefully a bit easier to understand.
for a Service, the EndpointSlice controller will add hints to EndpointSlices.
These hints will indicate where an endpoint should be consumed by proxy
implementations to enable topology aware routing.
When the `TopologyAwareHints` feature gate is enabled and the `trafficPolicy`
Intersectional issue - for all of the trafficPolicy values that involve a fallback - what do they fall back to? I'm a bit anxious about this (commented on both KEPs).
Good question - wouldn't they fall back to `Cluster` to match the current default behavior?
Yeah, but is that what we really want?
/approve
/hold
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: robscott, thockin, wojtek-t.
/lgtm
/hold
This is a follow-up to #2355 and #2354 and represents what I hope will be a simpler and better approach for topology aware routing.
Enhancement issue: #2433
/cc @andrewsykim @bowei @dcbw
/assign @thockin @wojtek-t
/sig network