[WIP] KEP-3903: Unknown Version Interoperability Proxy #3903
Conversation
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:
> ## Proposal
>
> API changes:
> * To the apiservices API, add an "alternates" clause, a list of
I prefer a "serviceable" or "serviceableBy" clause as opposed to "alternates".
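For illustration only, a minimal sketch of how such a clause might look on the APIService spec as a Go type; the field name (ServiceableBy here) and its exact shape are assumptions taken from this comment, not the KEP's settled API:

```go
// Hypothetical sketch: the field name ("serviceableBy" vs. "alternates") and
// its shape are assumptions from this thread, not the finalized KEP API.
package uvipsketch

// APIServiceSpec is abbreviated to the parts relevant to this discussion.
type APIServiceSpec struct {
	// Group and Version identify the API this APIService serves.
	Group   string `json:"group,omitempty"`
	Version string `json:"version,omitempty"`

	// ServiceableBy ("alternates" in the original wording) lists the
	// identities of apiservers that can also serve this group/version, so
	// requests the local apiserver cannot handle can be proxied to one of them.
	ServiceableBy []string `json:"serviceableBy,omitempty"`
}
```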
> ## Summary
>
> When a cluster has multiple apiservers at mixed versions (such as during an
> upgrade or downgrate), not every apiserver can serve every resource at every
Suggested change:
- upgrade or downgrate), not every apiserver can serve every resource at every
+ upgrade or downgrade), not every apiserver can serve every resource at every
(resolved review thread on keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/README.md)
> * Ensure discovery reports the same set of resources everywhere (not just group
>   versions, as it does today)
> * Proxy client traffic to an apiserver that can service the request
Have you considered instead disabling resources which are not served by all API servers?
Then you end up with an intersection of the resources of both sets of apiservers, which is still going to lead to wonky behavior from the GC and NLC controllers.
amend UVIP KEP with some additional details
> By routing all discovery requests to the newest apiserver, we can ensure that namespace and gc
> controllers do what they would be doing if the upgrade happened instantaneously.
>
> Alternatively, we can use the storage version objects to reconstruct a merged discovery
@alexzielenski @apelisse @Jefftree Thoughts on this?
IIUC, each aggregator will have to know all the resources, even those that aren't handled by its own apiserver (so that it can proxy them somewhere else). So I think it needs to have its own, synced representation of what is available, and it should be able to serve that representation in the form of a discovery document.
Just because we can, doesn't mean we should. Yes, we should be able to serve a synced representation of what resources are available, but what does supporting this intermediate state actually buy us?
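For what it's worth, a rough sketch of what "a synced representation of what is available" could look like if it were assembled by unioning per-apiserver resource lists; the types and function here are invented for illustration, not actual aggregator code:

```go
// Illustration only: types and names are invented for this discussion and do
// not correspond to real aggregator code.
package uvipsketch

// resourceKey identifies a resource by group, version, and resource name.
type resourceKey struct {
	Group, Version, Resource string
}

// mergeDiscovery unions the resources reported by each apiserver into a single
// view, remembering which servers can serve each resource. The merged keys are
// what a combined discovery document would advertise; the server lists double
// as a routing table for proxying requests the local apiserver cannot handle.
func mergeDiscovery(perServer map[string][]resourceKey) map[resourceKey][]string {
	merged := make(map[resourceKey][]string)
	for server, resources := range perServer {
		for _, r := range resources {
			merged[r] = append(merged[r], server)
		}
	}
	return merged
}
```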
(outdated, resolved review thread on keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/README.md)
> This might be a good place to talk about core concepts and how they relate.
> -->
>
> ### Risks and Mitigations
In a cluster of N apiservers under balanced load, proxying requests to a single apiserver could increase the load by a factor of N (worst case)? Is it reasonable to expect 1 apiserver will be able to serve the load previously served by N apiservers? Could this make upgrades more dangerous?
If so, should "Risks and Mitigations" include a section about this?
yes, it's coming.
The answer is that heavily trafficked resources will not be added or removed.
I mean, there are zonal clusters that handle loads which vary from very small to very big, so why would we expect a single apiserver instance not to be able to handle the load?
Yeah, I think this is probably safe. It's just good to write some reasoning down. My thinking was that you'll maybe have some new APIs in an upgrade, but clients won't be using them yet during the upgrade. During a downgrade, maybe some clients will be using the APIs, but the vast majority of traffic is for things that can be served by all API servers. Combined with the fact that you don't lose any etcd capacity and you're required to over-provision to handle N-1 apiservers, I think this is OK.
> so why would we expect a single apiserver instance not to be able to handle the load?

We definitely have clusters in the fleet where a single instance can't handle the load.
That said, I agree with Daniel here: the answer that we shouldn't be removing heavily-loaded resources is a reasonable one, and I think it's already true in general.
The rollback is an interesting case that should be mentioned explicitly though (but it should generally also be true, because the same problem that appears during rollback can actually appear during upgrade).
Yeah, I think the logic is symmetrical -- if you do a rollback that removes a now-heavily-trafficked API, you're gonna have a very bad time; this is already true today IMO
> We definitely have clusters in the fleet where a single instance can't handle the load.
Those should be pretty exceptional cases. For most clusters, I would imagine most of the traffic coming from CM and scheduler, which are leader elected and don't have any anti-affinity rules by default, so a single apiserver should normally be able to handle that sort of traffic.
> The garbage collector makes decisions about deleting objects when all
> referencing objects are deleted. A discovery gap / apiserver mismatch, as
> described above, could result in GC seeing a 404 and assuming an object has been
If we make the 404 a 503, then this point is moot.
I updated the goals and proposal section to make it clearer why this doesn't work.
Yeah, but even UVIP as currently stated doesn't really solve the problem: letting the apiserver act on resources which end up disappearing in the newer version is kinda moot without transition hooks that allow us to clean up our resource mess once the upgrade completes. And, in that same situation, the actions that the apiserver takes on those resources may not even make sense once the upgrade completes, because those resources will essentially be orphaned.
Therefore my contention is that 503ing here is not any less correct than letting the apiserver act on old resources which will then disappear once the upgrade completes, unless we also have the ability to transition between resources being enabled/disabled.
As stated I think this solves GC and namespace lifecycle controller, and doing less than this proposal would not (see all the paragraphs I added in the most recent commit).
It also enables a hypothetical clear-before-resource-removal controller, but that's not a story because it's not necessary to motivate any of the changes proposed here. GC and NLC already exercise the cases.
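To make the intended behavior concrete, here is a hedged sketch of a handler-chain filter along these lines; the helper names (servesLocally, findPeer) are hypothetical and do not come from the KEP itself:

```go
// Hypothetical sketch of the proposed aggregator/apiserver filter; helper
// names are invented, and the real design lives in the KEP's handler-chain section.
package uvipsketch

import "net/http"

// withUnknownVersionProxy serves a request locally when possible, proxies it
// to a peer apiserver that can serve it otherwise, and returns 503 (never a
// fabricated 404) when no apiserver in the cluster can serve the resource, so
// GC and the namespace lifecycle controller don't conclude the object is gone.
func withUnknownVersionProxy(
	next http.Handler,
	servesLocally func(*http.Request) bool,
	findPeer func(*http.Request) (http.Handler, bool),
) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if servesLocally(r) {
			next.ServeHTTP(w, r) // this apiserver knows the resource
			return
		}
		if peer, ok := findPeer(r); ok {
			peer.ServeHTTP(w, r) // proxy to an apiserver that can serve it
			return
		}
		http.Error(w, "no apiserver is able to serve this resource", http.StatusServiceUnavailable)
	})
}
```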
(resolved review thread on keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/README.md)
> * Note that merely serving 503s at the right times does not solve the problem,
>   for two reasons: controllers might get an incomplete discovery and therefore
>   not ask about all the correct resources; and when they get 503 responses,
>   although the controller can avoid doing something destructive, it also can't
>   make progress and is stuck for the duration of the upgrade.
If a controller is trying to make progress it falls in one of two camps:
- it's trying to actuate on a resource which is removed in a newer version. In that case, whatever it does (besides deletions) is going to end up moot, because those objects will be orphaned once the upgrade completes.
- it's trying to actuate on a resource which was just introduced in the newer version. In that case, if we are routing to the newest apiserver, this just actually does the right thing.

So really, we're only talking about the first scenario, and those resources are going to get orphaned regardless of whether or not we end up merging the resources for discovery.
I don't agree for case 1: the controller could be trying to delete the object, in which case we want it to succeed.
and anyway, we can't be sure the upgrade is going to complete.
Then the correct thing to do is to route discovery to the newest apiserver and for resources which cannot be served by the newest apiserver, we should return a 404 instead of a 503, in order to force deletions.
How does "the upgrade might not complete" imply that?
I was responding mostly to this:
> I don't agree for case 1: the controller could be trying to delete the object, in which case we want it to succeed.
But anyhow, it's cool, we chatted offline and I'm okay with merged discovery for beta. I just really wanted to avoid it for alpha.
> it's trying to actuate on a resource which is removed in a newer version

It's much more subtle than that. Suppose that the new version is just removing the v1beta1 version (and v1 still exists). If we are routing to newer apiservers then clearly we will fail this request. It will probably eventually succeed, but it's an unnecessary delay.
Well, that depends on the client. And during an upgrade, depending on ordering, we may see both clients that only understand v1beta1, and clients that only understand v1.
Add a note about large request volume proxied to the apiserver
Add design details for new filter in handler chain
@lavalamp: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.
> For the mTLS between source and destination apiservers, we will do the following
>
> 1. For server authentication by the client (source apiserver): the client needs to validate the server certs (presented by the destination apiserver), for which it needs to know the CA bundle of the authority that signed those certs. We should be able to reuse the bundle given to all pods to verify whatever kube-apiserver instance they talk to (currently passed to kube-controller-manager as --root-ca-file)
>
> 2. For client authentication by the server (destination apiserver): destination apiserver will check the source apiserver certs to determine that the proxy request is from an authenticated client. The destination apiserver will use requestheader authentication (and NOT client cert authentication) for this using the kube-aggregator proxy client cert/key and the --requestheader-client-ca-file passed to the apiserver upon bootstrap
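As a sketch of the client side of point 1 and the client-cert half of point 2 (the --root-ca-file bundle and the kube-aggregator proxy client cert/key come from the text above; the function itself is illustrative, not the actual implementation):

```go
// Illustrative sketch of the source apiserver's client TLS setup for proxying
// to a destination apiserver: validate the server against the cluster's root
// CA bundle, and present the kube-aggregator proxy client cert/key so the
// destination can apply requestheader authentication.
package uvipsketch

import (
	"crypto/tls"
	"crypto/x509"
	"fmt"
	"os"
)

func proxyClientTLS(rootCAFile, proxyClientCertFile, proxyClientKeyFile string) (*tls.Config, error) {
	caPEM, err := os.ReadFile(rootCAFile) // e.g. the bundle also passed to KCM as --root-ca-file
	if err != nil {
		return nil, err
	}
	pool := x509.NewCertPool()
	if !pool.AppendCertsFromPEM(caPEM) {
		return nil, fmt.Errorf("no CA certificates found in %s", rootCAFile)
	}
	cert, err := tls.LoadX509KeyPair(proxyClientCertFile, proxyClientKeyFile)
	if err != nil {
		return nil, err
	}
	return &tls.Config{
		RootCAs:      pool,                    // server authentication of the destination apiserver
		Certificates: []tls.Certificate{cert}, // client authentication via the proxy client cert
	}, nil
}
```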
etcd has configuration for both client-to-server and peer-to-peer traffic. Long term, I expect that we'll need to converge to configuration options that are similar, only we also have kubelet traffic, so there are actually three "directions" of traffic.
Can we model this after etcd (https://etcd.io/docs/v3.2/op-guide/security/)? I'm OK with the peer-to-peer traffic defaulting to the approach defined here (for the nice out-of-the-box experience), but I think we should also offer a dedicated set of optional flags to configure this new direction of traffic (e.g. --peer-advertise-address, --peer-bind-address, --peer-client-ca-file, ...) for cluster administrators that want or require the ability to configure this direction of traffic differently.
cc @deads2k
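For concreteness, a small sketch of what such dedicated flags might look like; the flag names are the hypothetical ones suggested in the comment above, not options kube-apiserver currently exposes:

```go
// Sketch only: these flag names are the hypothetical ones suggested in this
// thread, not existing kube-apiserver options.
package uvipsketch

import "flag"

// PeerOptions groups the suggested knobs for apiserver-to-apiserver traffic.
type PeerOptions struct {
	AdvertiseAddress string
	BindAddress      string
	ClientCAFile     string
}

// AddFlags registers the suggested peer flags on the given FlagSet.
func (o *PeerOptions) AddFlags(fs *flag.FlagSet) {
	fs.StringVar(&o.AdvertiseAddress, "peer-advertise-address", "",
		"Address advertised to peer apiservers for proxied traffic.")
	fs.StringVar(&o.BindAddress, "peer-bind-address", "",
		"Address this apiserver listens on for traffic proxied from peers.")
	fs.StringVar(&o.ClientCAFile, "peer-client-ca-file", "",
		"CA bundle used to verify client certificates presented by peer apiservers.")
}
```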
/close
@richabanker: Closed this PR. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
(not at all ready for review, just getting this started)