KEP-2593: update risks and mitigation, rename object and target releases #4174
Conversation
/hold
Yes, and the
I'm not worried about "vendor lock-in". I'm worried about... exactly what they did in that commit. The KEP clearly states that "This KEP assumes that the only consumer of the --cluster-cidr value is the NodeIPAM controller." And yet, flannel was modified so that it now also consumes the …
So, clarifying that some more: what everyone wants is "pod network autoscaling", because if you don't have that, then node autoscaling stops being useful at some point. In GKE, it happens to be the case that the only additional feature you need in order to be able to do pod network autoscaling is the ability to reconfigure the kcm NodeIPAM controller; nothing else in GKE cares about the size/shape of the cluster network. In most other cluster configurations, this is not the case; for example, when using flannel as your network plugin, flannel itself needs to know the extent of the cluster network, so just having the ability to reconfigure the NodeIPAM controller does not give you pod network autoscaling if you are using flannel.

The PR linked above "fixes" this by having flannel sniff the config objects that KEP-2593 added in order to configure NodeIPAM, which KEP-2593 explicitly says is not what those objects are intended to be used for. There is not actually any way for them to implement the feature that they want with the current KEP. And, IMO, that suggests that the KEP is not actually solving the right problem; having the ability to reconfigure NodeIPAM is one of the things you need in order to have pod network autoscaling, but it's not the only thing (or even, for most clusters, the primary thing).
Or, OTOH, if we are only going to solve the problem for GKE, then there is no need for a Kubernetes-level feature; GKE could just implement its own alternate NodeIPAM controller and use that rather than the kcm one (as several other network plugins already do). Then external components like flannel would not be tricked into misusing the API.
SUSE Rancher wanted this feature as described in the KEP and is not misusing it; a consequence of using this feature is that the objects used to configure the IPAM match the pod CIDRs on the nodes, not the other way around. Cilium also has a mode of operation that depends on NodeIPAM and will benefit from this feature. There are multiple projects that depend on NodeIPAM and benefit from this feature; this is not a GKE-only problem.
@danwinship I think there is a misunderstanding about what this feature targets: ALL clusters that depend on NodeIPAM (some modes of Cilium, flannel, kubeadm, ...) CANNOT scale once the number of nodes reaches 2^(node-pod-mask-size − ClusterCIDR mask). GKE has bigger demand and customers reach this limit more often than in other places, so I think we can drop the GKE argument, as there are at least 3 OSS projects that use NodeIPAM and the problem is demonstrated. I don't find it legitimate to push back on the argument that people may use the APIs for something they are not supposed to represent; people can already sneak into the object and assume it is the ClusterCIDR for the whole cluster, just as they can list all nodes and read node.Spec.PodCIDRs. I don't disagree that the name can be misleading, but I completely disagree that the feature is not useful outside of GKE.
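(Editor's note: a minimal sketch of the scaling limit described above, with illustrative CIDR values; the function name is hypothetical, not part of any Kubernetes API.)

```python
import ipaddress

def max_nodes(cluster_cidr: str, node_pod_mask_size: int) -> int:
    """Number of per-node pod ranges a single cluster CIDR can provide.

    With a fixed --cluster-cidr and a fixed per-node mask, NodeIPAM can
    allocate at most 2^(node-pod-mask-size - cluster CIDR mask) ranges,
    after which no new node can be assigned a PodCIDR.
    """
    net = ipaddress.ip_network(cluster_cidr)
    return 2 ** (node_pod_mask_size - net.prefixlen)

# e.g. a /16 cluster CIDR split into /24 per-node ranges:
print(max_nodes("10.0.0.0/16", 24))  # -> 256; past 256 nodes, autoscaling stalls
```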
I'm going to rename the object to prevent people from assuming these objects configure the PodCIDR of the cluster; I think we never disagreed on that. But we both reviewed the original KEP back in 2021 (#2594), and I don't think it is legitimate at this point to revisit the intent of the KEP ...
The name ClusterCIDR for the object may cause confusion for users, since it will only represent the cluster CIDR IF the cluster is using the specific MultiNodeIPAM controller that uses these objects.
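(Editor's note: for reference, a ClusterCIDR object from the `networking.k8s.io/v1alpha1` API added by KEP-2593 looks roughly like the sketch below; field values are illustrative. Note that nothing in the object itself signals that it only configures the NodeIPAM allocator rather than the cluster's pod network as a whole, which is the source of the naming confusion.)

```yaml
apiVersion: networking.k8s.io/v1alpha1
kind: ClusterCIDR
metadata:
  name: cidr-set-example        # illustrative name
spec:
  perNodeHostBits: 8            # /24 per-node IPv4 ranges
  ipv4: 10.0.0.0/16             # illustrative range
  nodeSelector:                 # only matching nodes get ranges from this block
    nodeSelectorTerms:
    - matchExpressions:
      - key: kubernetes.io/hostname
        operator: Exists
```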
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: aojea The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
@caseydavenport I can see calico has a CRD …
(FWIW, I didn't like this approach then either: #2594 (comment). I just got busy with other stuff and stopped arguing against it.)
Then what do we do? Do we revert the KEP?
No, SUSE wanted "pod network scaling". That is not the feature described in the KEP. The KEP describes "extending NodeIPAM to support pod network scaling", which is part of the "pod network scaling" feature, but not all of it. SUSE wanted all of it, and so they also made changes to flannel, and those changes reuse parts of the NodeIPAM scaling feature in a way which the KEP does not anticipate and probably intends to forbid. (The KEP says that "
Yes, I understand that. My point is that most clusters, whether they use NodeIPAM or not, also require other changes in order to support pod network scaling. And this KEP makes no effort to solve those other problems. I would love for Kubernetes to support pod network scaling, but if we want it to support that, then we need to actually write a pod network scaling KEP, which will include dynamic NodeIPAM reconfiguration as one of its pieces, but will also have to include other things.
I'm confused now, why is the other KEP blocking this KEP? |
Addressed in the last commit, in the "Risks and Mitigations" section
I don't think that addresses the problem. The "flannel config vs kcm config" point was just one example. My point is that the KEP process exists to make sure that we answer various hard questions when adding a new feature, and in the case of KEP-2593, we explicitly decided not to answer any of the hard questions about pod network scaling in the general case (despite the fact that the KEP's first Goal is "Support multiple discontiguous IP CIDR blocks for Cluster CIDR" and its first User Story is "Add more pod IPs to the cluster"). But now people are trying to use KEP-2593 to implement pod network scaling in other contexts anyway, and potentially shooting themselves (and other people) in the foot because we didn't think about any of the possible difficulties with doing this... The KEP has become an attractive nuisance. (E.g., another problem with the flannel patch: if the admin deletes a …
Agreed, it is a very bespoke solution: it does not solve the real problem and, foremost, it is going to be confusing for users ... I think we should also remove it from the codebase @thockin @danwinship. It has a lot of overlap with the multinetwork KEP https://docs.google.com/document/d/17LhyXsEgjNQ0NWtvqvtgJwVqdJWreizsgAZHWflgP-A/edit
During the development of KEP-1880 (Multiple Service CIDRs), the intersection with KEP-2593 (Enhanced NodeIPAM to support Discontiguous Cluster CIDR) led SIG Network to revisit the state of Kubernetes networking configuration, especially the configuration of the Pod and Service networks.
This triggered several discussions, and different alternatives were evaluated and discussed in depth, concluding that not all possible problems of the network intersection can be mitigated, and that it would be very difficult (if not impossible) at this stage of the project to make any significant changes while preserving backwards compatibility and without disrupting the whole ecosystem that depends on the current state.
During these conversations, some concerns were raised by the community about KEP-2593:
One part that may cause friction is the name of the object used, ClusterCIDR, but there are several points, as previously mentioned, that make keeping the same object name preferable.

The proposal is to move this KEP forward to beta in 1.29 in its current state.
/sig network
/assign @thockin @danwinship