diff --git a/keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/README.md b/keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/README.md
new file mode 100644
index 00000000000..664641286c7
--- /dev/null
+++ b/keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/README.md
@@ -0,0 +1,884 @@
+# KEP-3903: Unknown Version Interoperability Proxy
+
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories (Optional)](#user-stories-optional)
+    - [Garbage Collector](#garbage-collector)
+    - [Namespace Lifecycle Controller](#namespace-lifecycle-controller)
+  - [Notes/Constraints/Caveats (Optional)](#notesconstraintscaveats-optional)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Aggregation Layer](#aggregation-layer)
+    - [StorageVersion enhancement needed](#storageversion-enhancement-needed)
+    - [Identifying destination apiserver's network location](#identifying-destination-apiservers-network-location)
+    - [Proxy transport between apiservers and authn](#proxy-transport-between-apiservers-and-authn)
+  - [Discovery Merging](#discovery-merging)
+  - [Test Plan](#test-plan)
+    - [Prerequisite testing updates](#prerequisite-testing-updates)
+    - [Unit tests](#unit-tests)
+    - [Integration tests](#integration-tests)
+    - [e2e tests](#e2e-tests)
+  - [Graduation Criteria](#graduation-criteria)
+    - [Alpha](#alpha)
+    - [Beta](#beta)
+    - [GA](#ga)
+  - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
+  - [Version Skew Strategy](#version-skew-strategy)
+- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
+  - [Feature Enablement and Rollback](#feature-enablement-and-rollback)
+  - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning)
+  - [Monitoring Requirements](#monitoring-requirements)
+  - [Dependencies](#dependencies)
+  - [Scalability](#scalability)
+  - [Troubleshooting](#troubleshooting)
+- [Implementation History](#implementation-history)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Infrastructure Needed (Optional)](#infrastructure-needed-optional)
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
+  - [ ] e2e Tests for all Beta API Operations (endpoints)
+  - [ ] (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+  - [ ] (R) Minimum Two Week Window for GA e2e tests to prove flake free
+- [ ] (R) Graduation criteria is in place
+  - [ ] (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md)
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+- [ ] "Implementation History" section is up-to-date for milestone
+- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
+- [ ] Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+[kubernetes.io]: https://kubernetes.io/
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
+[kubernetes/website]: https://git.k8s.io/website
+
+## Summary
+
+When a cluster has multiple apiservers at mixed versions (such as during an
+upgrade or downgrade), not every apiserver can serve every resource at every
+version.
+
+To fix this, we will add a filter to the handler chain in the aggregator that
+proxies client requests to an apiserver capable of handling them.
+
+## Motivation
+
+When an upgrade or downgrade is performed on a cluster, for some period of time
+the apiservers are at differing versions and are able to serve different sets
+of built-in resources (different groups, versions, and resources are all
+possible).
+
+In an ideal world, clients would be able to know about the entire set of
+available resources and perform operations on those resources without regard to
+which apiserver they happened to connect to. Currently this is not the case.
+
+Today, these things potentially differ:
+* Resources available somewhere in the cluster
+* Resources known by a client (i.e. read from some apiserver's discovery)
+* Resources that can be actuated by a client
+
+This can have serious consequences, such as namespace deletion being blocked
+incorrectly or objects being garbage collected mistakenly.
+
+### Goals
+
+* Ensure discovery reports the same set of resources everywhere (not just group
+  versions, as it does today)
+* Ensure that every resource in discovery can be accessed successfully
+* In the failure case (e.g. network not routable between apiservers), ensure
+  that requests for unreachable resources get a 503 and not a 404
+
+### Non-Goals
+
+* Change cluster installation procedures (no new certs etc)
+* Lock particular clients to particular versions
+
+## Proposal
+
+We will use the existing `StorageVersion` API to figure out which groups,
+versions, and resources an apiserver can serve.
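+
+For reference, a `StorageVersion` object records, per apiserver instance, the
+version that instance encodes a resource to in etcd and the versions it can
+decode. An abbreviated, purely illustrative example (the name and ID below are
+hypothetical):
+
+```yaml
+apiVersion: internal.apiserver.k8s.io/v1alpha1
+kind: StorageVersion
+metadata:
+  name: apps.deployments
+status:
+  storageVersions:
+  - apiServerID: kube-apiserver-abc123  # identity of one apiserver instance
+    encodingVersion: apps/v1            # version written to etcd
+    decodableVersions:                  # versions this instance can read
+    - apps/v1
+    - apps/v1beta1
+  commonEncodingVersion: apps/v1
+```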
+
+API server change:
+* A new handler is added to the stack:
+
+  - If the request is for a group/version/resource the apiserver doesn't have
+    locally (we can use the StorageVersion API to determine this), it will
+    proxy the request to one of the apiservers listed in the object. If that
+    apiserver fails to respond or is not available, we will return a 503.
+    (There is a small possibility of a race between the controller registering
+    the apiserver with the resources it can serve and that apiserver receiving
+    a request for a resource that is not yet available on it.)
+
+* Discovery merging.
+
+  - During upgrade or downgrade, it may be the case that no apiserver has a
+    complete list of available resources. To fix the problems mentioned, it is
+    necessary that discovery exactly match the capability of the system. So,
+    we will use the storage version objects to reconstruct a merged discovery
+    document and serve that from all apiservers.
+
+Why so much work?
+* Note that merely serving 503s at the right times does not solve the problem,
+  for two reasons: controllers might get an incomplete discovery document and
+  therefore not ask about all the correct resources; and when they get 503
+  responses, although a controller can avoid doing something destructive, it
+  also can't make progress and is stuck for the duration of the upgrade.
+* Likewise, proxying without merging the discovery document, or merging the
+  discovery document but serving 503s instead of proxying, doesn't fix the
+  problem completely. We need both safety against destructive actions and the
+  ability for controllers to proceed and not block.
+
+### User Stories (Optional)
+
+#### Garbage Collector
+
+The garbage collector makes decisions about deleting objects when all
+referencing objects are deleted. A discovery gap / apiserver mismatch, as
+described above, could result in GC seeing a 404 and assuming an object has
+been deleted; this could result in it deleting a subsequent object that it
+should not.
+
+This proposal will cause the GC to see the complete list of resources in
+discovery, and, when it requests specific objects, to see either the correct
+object or a 503 (which it handles safely).
+
+#### Namespace Lifecycle Controller
+
+This controller seeks to empty all objects from a namespace when it is deleted.
+Discovery failures leave the NLC unable to tell whether objects of a given
+resource are present in a namespace. It fails safe, meaning it refuses to
+delete the namespace until it can verify it is empty: this causes slow
+namespace deletion, which is a common source of complaints.
+
+Additionally, if the NLC knows about a resource that the apiserver it is
+talking to does not, it may incorrectly get a 404, assume the collection is
+empty, and delete the namespace too early, leaving garbage behind in etcd.
+This is a correctness problem: the garbage will reappear if a namespace of the
+same name is recreated.
+
+This proposal addresses both problems.
+
+### Notes/Constraints/Caveats (Optional)
+
+### Risks and Mitigations
+
+Cluster admins might not read the release notes, and so might not realize they
+should enable network/firewall connectivity between apiservers. In this case
+clients will receive 503s instead of transparently being proxied. A 503 is
+still safer than today's behavior.
+
+Requests will consume egress bandwidth on two apiservers when proxied. We can
+cap the number of such requests if needed, but upgrades aren't that frequent
+and few resources change between releases, so these requests should not be
+common. We will count them with a metric.
+
+There could be a large volume of requests for a specific resource, which might
+leave the identified apiserver unable to serve the proxied requests. This
+scenario should not occur frequently, since resource types with a large request
+volume should not be added or removed during an upgrade -- that would cause
+other problems, too.
+
+We should ensure a request is proxied at most once, rather than over and over
+again (which could happen if the source apiserver has an incorrect
+understanding of what the destination apiserver can serve).
+
+To prevent server-side request forgery, we will not give users control, via
+REST APIs, over the apiserver IP/endpoint information or the trust bundle used
+to authenticate the destination server while proxying.
+
+## Design Details
+
+### Aggregation Layer
+
+1. A new filter will be added to the [handler chain] of the aggregation layer.
+   This filter will maintain an internal map with the key being the
+   group-version-resource and the value being a list of server IDs of the
+   apiservers capable of serving that group-version-resource.
+   1. This internal map is populated using an informer on StorageVersion
+      objects. An event handler will be added to this informer that will get
+      the apiserver IDs for the relevant group-version-resource and update the
+      internal map accordingly.
+
+2. This filter will not proxy, but will instead pass the request on to the
+   next handler in the local aggregator chain (or serve an error directly, as
+   noted), if:
+   1. It is a non-resource request.
+   2. The StorageVersion informer cache hasn't synced yet, or
+      `StorageVersionManager.Completed()` has returned false. We will serve a
+      503 in this case.
+   3. The request has a header indicating that it has already been proxied
+      once. If for some reason the resource is not found locally, we will
+      serve a 503.
+   4. No StorageVersion was retrieved for it, meaning the request is for an
+      aggregated API or a custom resource.
+   5. The local apiserver ID is found in the list of serviceable-by server IDs
+      from the internal map.
+
+3. If the local apiserver ID is not found in the list of serviceable-by server
+   IDs, a random apiserver ID will be selected from the retrieved list and the
+   request will be proxied to that apiserver.
+
+4. If no apiserver ID is retrieved for the requested GVR, we will serve a 404
+   with error `GVR is not served by anything in this cluster`.
+
+5. If the proxy call fails due to network issues or for any other reason, we
+   will serve a 503 with error `Error while proxying request to destination
+   apiserver`.
+
+A sketch of this decision logic is given below.
+
+[handler chain]:https://github.com/kubernetes/kubernetes/blob/fc8f5a64106c30c50ee2bbcd1d35e6cd05f63b00/staging/src/k8s.io/apiserver/pkg/server/config.go#L639
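+
+This minimal Go sketch is illustrative only, not the real implementation: the
+names `gvrToServers`, `cacheSynced`, `resolveGVR`, `proxyTo`, and the
+`X-Kubernetes-APIServer-Proxied` header are hypothetical placeholders.
+
+```go
+package uvipsketch
+
+import (
+	"math/rand"
+	"net/http"
+)
+
+type gvr struct{ Group, Version, Resource string }
+
+// Hypothetical plumbing, populated elsewhere (e.g. by a StorageVersion informer).
+var (
+	gvrToServers map[gvr][]string // GVR -> server IDs able to serve it
+	cacheSynced  func() bool      // informer synced && StorageVersionManager.Completed()
+	resolveGVR   func(*http.Request) (gvr, bool)
+	proxyTo      func(serverID string, w http.ResponseWriter, r *http.Request)
+)
+
+const proxiedHeader = "X-Kubernetes-APIServer-Proxied" // hypothetical header name
+
+func WithUnknownVersionProxy(next http.Handler, localServerID string) http.Handler {
+	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
+		rgvr, isResource := resolveGVR(r)
+		if !isResource {
+			next.ServeHTTP(w, r) // case 2.1: non-resource request, serve locally
+			return
+		}
+		if !cacheSynced() {
+			// case 2.2: cannot yet tell who serves what
+			http.Error(w, "storage version cache not synced", http.StatusServiceUnavailable)
+			return
+		}
+		if r.Header.Get(proxiedHeader) != "" {
+			next.ServeHTTP(w, r) // case 2.3: already proxied once; never proxy again
+			return
+		}
+		servers, ok := gvrToServers[rgvr]
+		if !ok {
+			next.ServeHTTP(w, r) // case 2.4: aggregated API or custom resource
+			return
+		}
+		if len(servers) == 0 {
+			// step 4: nothing in the cluster serves this GVR
+			http.Error(w, "GVR is not served by anything in this cluster", http.StatusNotFound)
+			return
+		}
+		for _, id := range servers {
+			if id == localServerID {
+				next.ServeHTTP(w, r) // case 2.5: we can serve this ourselves
+				return
+			}
+		}
+		// step 3: proxy to a randomly chosen capable peer; proxyTo is expected
+		// to set proxiedHeader and to respond 503 if the call fails (step 5).
+		proxyTo(servers[rand.Intn(len(servers))], w, r)
+	})
+}
+```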
+
+#### StorageVersion enhancement needed
+
+The StorageVersion API currently tells us whether a particular storage version
+can be read from etcd by the listed apiserver. We will enhance this API to also
+include the apiserver ID of the server that can serve this StorageVersion.
+
+#### Identifying destination apiserver's network location
+
+* TODO: We need to find a place to store and retrieve the destination
+  apiserver's host and port information given the server's ID.
+
+We do not want to store this information in:
+
+* StorageVersion: because we do not want to expose the network identity of the
+  apiservers in this API, which can be listed in multiple places where doing
+  so may be unnecessary/redundant
+* The endpoint reconciler lease: because the IP present there could be that of
+  a load balancer for the apiservers, but we need to know the definite address
+  of the identified destination apiserver
+
+#### Proxy transport between apiservers and authn
+
+For the mTLS between source and destination apiservers, we will do the
+following:
+
+1. For server authentication by the client (the source apiserver): the client
+   needs to validate the server certs (presented by the destination
+   apiserver), for which it needs to know the CA bundle of the authority that
+   signed those certs. We should be able to reuse the bundle given to all pods
+   to verify whichever kube-apiserver instance they talk to (currently passed
+   to kube-controller-manager as --root-ca-file).
+
+2. For client authentication by the server (the destination apiserver): the
+   destination apiserver will check the source apiserver's certs to determine
+   that the proxy request is from an authenticated client. The destination
+   apiserver will use requestheader authentication (and NOT client cert
+   authentication) for this, using the kube-aggregator proxy client cert/key
+   and the --requestheader-client-ca-file passed to the apiserver at
+   bootstrap.
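+
+As a rough sketch of the transport described above (file paths and wiring are
+assumptions for illustration; the real implementation would reuse the existing
+apiserver cert machinery rather than reading files directly):
+
+```go
+package uvipsketch
+
+import (
+	"crypto/tls"
+	"crypto/x509"
+	"errors"
+	"net/http"
+	"os"
+)
+
+// newProxyTransport builds the mTLS client used when proxying to a peer
+// apiserver: the aggregator proxy client cert identifies us to the peer
+// (verified against its --requestheader-client-ca-file), and the root CA
+// bundle (the --root-ca-file bundle) verifies the peer's serving cert.
+func newProxyTransport(clientCertFile, clientKeyFile, rootCAFile string) (*http.Transport, error) {
+	cert, err := tls.LoadX509KeyPair(clientCertFile, clientKeyFile)
+	if err != nil {
+		return nil, err
+	}
+	caPEM, err := os.ReadFile(rootCAFile)
+	if err != nil {
+		return nil, err
+	}
+	roots := x509.NewCertPool()
+	if !roots.AppendCertsFromPEM(caPEM) {
+		return nil, errors.New("no CA certificates parsed from root CA file")
+	}
+	return &http.Transport{
+		TLSClientConfig: &tls.Config{
+			Certificates: []tls.Certificate{cert},
+			RootCAs:      roots,
+		},
+	}, nil
+}
+```
+
+Because the destination uses requestheader authentication, the proxied request
+would also carry the original user's identity in `X-Remote-User` /
+`X-Remote-Group` headers, as with today's aggregator-to-extension-apiserver
+proxying.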
+
+### Discovery Merging
+
+TODO: detailed description of discovery merging. (Not scheduled until beta.)
+
+### Test Plan
+
+[ ] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- `<package>`: `<date>` - `<test coverage>`
+
+##### Integration tests
+
+- <test>: <link to test coverage>
+
+##### e2e tests
+
+- <test>: <link to test coverage>
+
+### Graduation Criteria
+
+#### Alpha
+
+- Proxying implemented (behind feature flag)
+- mTLS or other secure system used for proxying
+
+#### Beta
+
+- Discovery document merging implemented
+
+#### GA
+
+- TODO: wait for beta to determine any further criteria
+
+### Upgrade / Downgrade Strategy
+
+### Version Skew Strategy
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- [x] Feature gate (also fill in values in `kep.yaml`)
+  - Feature gate name: UnknownVersionInteroperabilityProxy
+  - Components depending on the feature gate: kube-apiserver
+- [ ] Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+
+###### Does enabling the feature change any default behavior?
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+
+###### What happens if we reenable the feature if it was previously rolled back?
+
+###### Are there any tests for feature enablement/disablement?
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+
+###### What specific metrics should inform a rollback?
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+
+###### How can someone using this feature know that it is working for their instance?
+
+- [ ] Events
+  - Event Reason:
+- [ ] API .status
+  - Condition name:
+  - Other field:
+- [ ] Other (treat as last resort)
+  - Details:
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+
+This feature depends on the `StorageVersion` feature, which generates objects
+with a `StorageVersion.status.storageVersions[*].apiServerID` field that is
+used to find the destination apiserver's network location.
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+
+- [X] Metrics
+  - Metric name: `kubernetes_uvip_count`
+  - Components exposing the metric: kube-apiserver
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+
+### Dependencies
+
+###### Does this feature depend on any specific services running in the cluster?
+
+No, but it does depend on the `StorageVersion` feature in kube-apiserver.
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+
+###### Will enabling / using this feature result in introducing new API types?
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+
+###### What are other known failure modes?
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+
+## Implementation History
+
+## Drawbacks
+
+## Alternatives
+
+## Infrastructure Needed (Optional)
diff --git a/keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/kep.yaml b/keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/kep.yaml
new file mode 100644
index 00000000000..0aa93ee957d
--- /dev/null
+++ b/keps/sig-api-machinery/3903-unknown-version-interoperability-proxy/kep.yaml
@@ -0,0 +1,48 @@
+title: Unknown Version Interoperability Proxy
+kep-number: 3903
+authors:
+  - "@lavalamp"
+owning-sig: sig-api-machinery
+status: implementable
+creation-date: yyyy-mm-dd
+reviewers:
+  - TBD
+approvers:
+  - TBD
+
+see-also:
+  - apiserver identity kep, link TODO
+replaces: []
+
+# The target maturity stage in the current dev cycle for this KEP.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: "v1.28"
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: "v1.28"
+  beta: "v1.29"
+  stable: "v1.30"
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: UnknownVersionInteroperabilityProxy
+    components:
+      - kube-apiserver
+disable-supported: true
+
+# The following PRR answers are required at beta release
+metrics:
+  - kubernetes_uvip_count