
[WIP] KEP-3903: Unknown Version Interoperability Proxy #3903

Closed · wants to merge 14 commits

Conversation

@lavalamp (Member) commented Mar 9, 2023

(not at all ready for review, just getting this started)

  • One-line PR description: Adds a KEP.
  • Issue link:
  • Other comments:

@k8s-ci-robot added the do-not-merge/work-in-progress and cncf-cla: yes labels (Mar 9, 2023)
@k8s-ci-robot requested a review from deads2k (March 9, 2023, 23:18)
@k8s-ci-robot added the kind/kep label (Mar 9, 2023)
@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: lavalamp

The full list of commands accepted by this bot can be found here.

The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot requested a review from jpbetz (March 9, 2023, 23:18)
@k8s-ci-robot added the sig/api-machinery, approved, and size/XL labels (Mar 9, 2023)
@lavalamp changed the title from "[WIP] KEP-NNNN: Unknown Version Interoperability Proxy" to "[WIP] KEP-3903: Unknown Version Interoperability Proxy" (Mar 9, 2023)
## Proposal

API changes:
* To the apiservices API, add an "alternates" clause, a list of

Member:

I prefer a "serviceable" or "serviceableBy" clause as opposed to "alternates".
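
For illustration, a minimal sketch of what such a clause could look like as a Go API field. The field name and placement here are assumptions (the naming is still being debated in this thread), not the KEP's final API:

```go
// Hypothetical sketch only: the field name ("alternates" vs. "serviceableBy")
// and its exact shape are still under discussion above.
package v1

// APIServiceSpecFragment shows only the proposed addition, not the full spec.
type APIServiceSpecFragment struct {
	// ServiceableBy would list the identities of the apiservers that are able
	// to serve this API, so the aggregation layer knows where it can proxy
	// requests that the local apiserver cannot handle itself.
	ServiceableBy []string `json:"serviceableBy,omitempty"`
}
```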

## Summary

When a cluster has multiple apiservers at mixed versions (such as during an
upgrade or downgrate), not every apiserver can serve every resource at every

Member:

Suggested change:
- upgrade or downgrate), not every apiserver can serve every resource at every
+ upgrade or downgrade), not every apiserver can serve every resource at every


* Ensure discovery reports the same set of resources everywhere (not just group
versions, as it does today)
* Proxy client traffic to an apiserver that can service the request

Member:

Have you considered instead disabling resources which are not served by all API servers?

Member:

Then you end up with an intersection of the resources of both sets of apiservers, which is still going to lead to wonky behavior from the GC and NLC controllers.

By routing all discovery requests to the newest apiserver, we can ensure that namespace and gc
controllers do what they would be doing if the upgrade happened instantaneously.

Alternatively, we can use the storage version objects to reconstruct a merged discovery

lavalamp (PR author):

@alexzielenski @apelisse @Jefftree Thoughts on this?

apelisse (Member), Mar 27, 2023:

IIUC, each aggregator will have to know all the resources, even those that aren't handled by its own apiserver (so that it can proxy them somewhere else). So I think it needs to have its own synced representation of what is available, and it should be able to serve that representation in the form of a discovery document.

Member:

Just because we can, doesn't mean we should. Yes, we should be able to serve a synced representation of what resources are available, but what does supporting this intermediate state actually buy us?
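
As a rough illustration of the "merged discovery" idea being debated here (a sketch with made-up types, not the aggregator's actual implementation), the merge amounts to a union of the resource lists reported by each apiserver:

```go
package main

import "fmt"

// gvr is a minimal stand-in for a group/version/resource key; the real
// aggregator would use schema.GroupVersionResource and richer discovery types.
type gvr struct{ group, version, resource string }

// mergeDiscovery unions the resources reported by each apiserver, so the
// aggregation layer can advertise (and proxy) everything that at least one
// member of the control plane serves.
func mergeDiscovery(perServer ...[]gvr) []gvr {
	seen := map[gvr]bool{}
	var merged []gvr
	for _, resources := range perServer {
		for _, r := range resources {
			if !seen[r] {
				seen[r] = true
				merged = append(merged, r)
			}
		}
	}
	return merged
}

func main() {
	oldServer := []gvr{{"apps", "v1", "deployments"}, {"flowcontrol.apiserver.k8s.io", "v1beta2", "flowschemas"}}
	newServer := []gvr{{"apps", "v1", "deployments"}, {"flowcontrol.apiserver.k8s.io", "v1beta3", "flowschemas"}}
	fmt.Println(mergeDiscovery(oldServer, newServer))
}
```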


### Risks and Mitigations

jpbetz (Contributor), Mar 27, 2023:

In a cluster of N apiservers under balanced load, proxying requests to a single apiserver could increase the load by a factor of N (worst case)? Is it reasonable to expect 1 apiserver will be able to serve the load previously served by N apiservers? Could this make upgrades more dangerous?

If so, should "Risks and Mitigations" include a section about this?

lavalamp (PR author):

yes, it's coming.

The answer is that heavily trafficked resources will not be added or removed.

logicalhan (Member), Mar 28, 2023:

I mean, there are zonal clusters that handle loads which vary from very small to very big, so why would we expect a single apiserver instance not to be able to handle the load?

jpbetz (Contributor), Mar 28, 2023:

Yeah, I think this is probably safe. It's just good to write some reasoning down. My thinking was that you'll maybe have some new APIs in an upgrade but clients won't be using them yet during the upgrade. During downgrade, maybe some clients will be using the APIs, but the vast majority of traffic is for things that can be served by all API servers. Combined with the fact that you don't lose any etcd capacity and you're required to over-provision to handle N-1 apiservers, I think this is OK.

Member:

> so why would we expect a single apiserver instance not to be able to handle the load?

We definitely have clusters in the fleet where a single instance can't handle the load.

That said, I agree with Daniel here: the answer that we shouldn't be removing heavily-loaded resources is a reasonable one, and I think it's already true in general.
The rollback is an interesting case that should be mentioned explicitly, though (but it should generally also be true, because the same problem that appears during rollback can also appear during upgrade).

lavalamp (PR author):

Yeah, I think the logic is symmetrical -- if you do a rollback that removes a now-heavily-trafficked API, you're gonna have a very bad time; this is already true today IMO

Member:

> We definitely have clusters in the fleet where a single instance can't handle the load.

Those should be pretty exceptional cases. For most clusters, I would imagine most of the traffic coming from CM and scheduler, which are leader elected and don't have any anti-affinity rules by default, so a single apiserver should normally be able to handle that sort of traffic.


The garbage collector makes decisions about deleting objects when all
referencing objects are deleted. A discovery gap / apiserver mismatch, as
described above, could result in GC seeing a 404 and assuming an object has been

Member:

If we make the 404 a 503, then this point is moot.

lavalamp (PR author):

I updated the goals and proposal section to make it clearer why this doesn't work.

Member:

Yeah, but even UVIP as currently stated doesn't really solve the problem, because the case where the apiserver is able to act on resources which end up disappearing in the newer version is kinda moot without the transition hooks which allow us to clean up our resource mess once the upgrade completes. And, in that same situation, the actions that the apiserver takes on those resources may not even make sense once the upgrade completes, because those resources will essentially be orphaned.

Therefore my contention is that 503ing here is not any less correct than letting the apiserver act on old resources which will then disappear once the upgrade completes, unless we also have the ability to transition between resources being enabled/disabled.

lavalamp (PR author):

As stated I think this solves GC and namespace lifecycle controller, and doing less than this proposal would not (see all the paragraphs I added in the most recent commit).

It also enables a hypothetical clear-before-resource-removal controller, but that's not a story because it's not necessary to motivate any of the changes proposed here. GC and NLC already exercise the cases.

Comment on lines +215 to +219
* Note that merely serving 503s at the right times does not solve the problem,
for two reasons: controllers might get an incomplete discovery and therefore
not ask about all the correct resources; and when they get 503 responses,
although the controller can avoid doing something destructive, it also can't
make progress and is stuck for the duration of the upgrade.

Member:

If a controller is trying to make progress, it falls into one of two camps:

  1. it's trying to actuate on a resource which is removed in a newer version. In that case, whatever it does (besides deletions) is going to end up moot, because those objects will be orphaned once the upgrade completes.

  2. it's trying to actuate on a resource which was just introduced in the newer version. In that case, if we are routing to the newest apiserver, this just actually does the right thing.

So really, we're actually only talking about the first scenario and those resources are going to get orphaned, regardless of whether or not we end up merging the resources for discovery.

lavalamp (PR author):

I don't agree for case 1, the controller could be trying to delete the object in which case we want it to succeed.

lavalamp (PR author):

and anyway, we can't be sure the upgrade is going to complete.

Member:

Then the correct thing to do is to route discovery to the newest apiserver and for resources which cannot be served by the newest apiserver, we should return a 404 instead of a 503, in order to force deletions.

lavalamp (PR author):

How does "the upgrade might not complete" imply that?

Member:

I was responding mostly to this:

> I don't agree for case 1, the controller could be trying to delete the object in which case we want it to succeed.

But anyhow, it's cool, we chatted offline and I'm okay with merged discovery for beta. I just really wanted to avoid it for alpha.

Member:

> it's trying to actuate on a resource which is removed in a newer version

It's much more subtle than that. Suppose that the new version is just removing the v1beta1 version (and v1 still exists). If we will be routing to newer apiservers, then clearly we will fail this request. It will probably eventually succeed, but it's an unnecessary delay.

lavalamp (PR author):

Well, that depends on the client. And during an upgrade, depending on ordering, we may see both clients that only understand v1beta1, and clients that only understand v1.

@k8s-ci-robot (Contributor):

@lavalamp: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-enhancements-verify | 8aa71c9 | link | true | /test pull-enhancements-verify |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

Comment on lines +318 to +322
For the mTLS between source and destination apiservers, we will do the following:

1. For server authentication by the client (source apiserver): the client needs to validate the server certs (presented by the destination apiserver), for which it needs to know the CA bundle of the authority that signed those certs. We should be able to reuse the bundle given to all pods to verify whatever kube-apiserver instance they talk to (currently passed to kube-controller-manager as --root-ca-file)

2. For client authentication by the server (destination apiserver): the destination apiserver will check the source apiserver certs to determine that the proxy request is from an authenticated client. The destination apiserver will use requestheader authentication (and NOT client cert authentication) for this, using the kube-aggregator proxy client cert/key and the --requestheader-client-ca-file passed to the apiserver upon bootstrap

jpbetz (Contributor), May 12, 2023:

etcd has configuration for both client-to-server and peer-to-peer traffic. Long term, I expect that we'll need to converge to similar configuration options, only we also have kubelet traffic, so there are actually three "directions" of traffic.

Can we model this after etcd (https://etcd.io/docs/v3.2/op-guide/security/)? I'm OK with the peer-to-peer traffic defaulting to the approach defined here (for the nice out-of-the-box experience), but I think we should also offer a dedicated set of optional flags to configure this new direction of traffic (e.g. --peer-advertise-address, --peer-bind-address, --peer-client-ca-file, ...) for cluster administrators that want or require the ability to configure this direction of traffic differently.

cc @deads2k
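
To make the two directions of authentication from the excerpt above concrete, here is a minimal client-side sketch in Go. The certificate paths are assumptions, and the real wiring lives in the aggregator/proxy machinery rather than in standalone code like this:

```go
package main

import (
	"crypto/tls"
	"crypto/x509"
	"log"
	"net/http"
	"os"
)

func main() {
	// 1. Server authentication: the source apiserver verifies the destination's
	//    serving certs against the cluster's root CA bundle (the same bundle
	//    kube-controller-manager receives via --root-ca-file).
	caPEM, err := os.ReadFile("/etc/kubernetes/pki/ca.crt") // path is an assumption
	if err != nil {
		log.Fatal(err)
	}
	roots := x509.NewCertPool()
	if !roots.AppendCertsFromPEM(caPEM) {
		log.Fatal("failed to parse root CA bundle")
	}

	// 2. Client authentication: the source apiserver presents the kube-aggregator
	//    proxy client cert/key. The destination validates it against the CA given
	//    by --requestheader-client-ca-file (requestheader authentication), rather
	//    than doing client-cert authentication of the end user.
	proxyCert, err := tls.LoadX509KeyPair(
		"/etc/kubernetes/pki/front-proxy-client.crt", // paths are assumptions
		"/etc/kubernetes/pki/front-proxy-client.key",
	)
	if err != nil {
		log.Fatal(err)
	}

	transport := &http.Transport{TLSClientConfig: &tls.Config{
		RootCAs:      roots,
		Certificates: []tls.Certificate{proxyCert},
	}}
	_ = &http.Client{Transport: transport} // the proxy would forward the user's request with this client
}
```

On the destination side, the apiserver would validate the presented client certificate against --requestheader-client-ca-file and take the end user's identity from the forwarded request headers, which is how requestheader authentication differs from ordinary client-cert authentication.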

@k8s-ci-robot added the cncf-cla: no label and removed the cncf-cla: yes label (May 17, 2023)
@richabanker (Contributor):

/close
in favor of #4015

@k8s-ci-robot (Contributor):

@richabanker: Closed this PR.

In response to this:

/close
in favor of #4015

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
