KEP-3070: Reserve Service IP Ranges For Dynamic and Static IP Allocation #3071
Conversation
aojea commented Dec 2, 2021 (edited)
- One-line PR description: KEP-3070: adding KEP
- Issue link: Reserve Service IP Ranges For Dynamic and Static IP Allocation #3070
- Other comments: implementation in kubernetes#106792

/assign @thockin @danwinship
> The proposal is to implement a new strategy on the Service IP allocation bitmap that avoids using the first 32 values
> if there are free values in the upper block of the range.
That seems weird. I would expect that if there was a reserved range, it would be reserved unconditionally, not "reserved until we run out of unreserved addresses".
Also, it's weird to hardcode the reserved range to 32 addresses regardless of the size of the service CIDR...
> That seems weird. I would expect that if there was a reserved range, it would be reserved unconditionally, not "reserved until we run out of unreserved addresses".

This keeps the behavior backwards compatible: you can create Services as long as there are free IPs. If we kept the lower range reserved only for static IPs, we would be changing the behavior.

> Also, it's weird to hardcode the reserved range to 32 addresses regardless of the size of the service CIDR...

@thockin I knew we would need the history of why 32 😄
Heh. We went back and forth on this a lot in discussion.
If your service CIDR is /27, you only get 32 service IPs (IIRC, it's 30 actually; we avoid the first and last). At that point we can debate whether it's OK to have so few and whether, at that point, we can make many guarantees.
If your CIDR is bigger than that, do we need to reserve more than 32 IPs? If so, do we need a "cascade" of blocks?
E.g. something like `min(32, cidrSize/2)` is attractive, but only useful if we think there are a lot of users with really small ranges.
Something more dynamic like "always prefer the upper half unless it is full" is more general, I guess. Is it worth the extra effort?
E.g. assume a /22 service range (1024 IPs). First look at [512-1023]. If that's full, look at [256-511]. If that's full, look at [128-255]. It seems like that gives the caller influence where it isn't needed, which could be a vector for some sort of attack.
`min(max(16, cidrSize/16), 256)`? At /28 it would be 16 (the full range). At /27 - /24 it would be 16 (part of the range). At /23 it would be 32. /22 -> 64. /21 -> 128. /20 and bigger -> 256.
Again, worth the effort?
What about making it a fixed percentage? We subdivide into two ranges:

```
┌─────────────┬─────────────────────────────────────────────┐
│             │                                             │
│             │                                             │
│             │                                             │
│   static    │                   dynamic                   │
│             │                                             │
│             │                                             │
│             │                                             │
└─────────────┴─────────────────────────────────────────────┘
◄────────────► ◄───────────────────────────────────────────►
     15%                          85%
```
"15%" must be a typo, since it is not a power of two. I think you meant to type "12.5%" :)
Seriously though, I think percents are OK for humans to get a big-picture, but harder to be precise about.
min(max($min, cidrSize/$step), $max)
is easy to describe as "never less than $min or more than $max, with a graduated step function between them. We can argue about values for those variables (or even tweak them over time if we get feedback), but the logic still seems appropriate?
I proposed:
$min = 16
$max = 256
$step = 16
But I am fine to haggle :)
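For concreteness, here is a minimal Go sketch of that step function; the helper name and the printed table are illustrative, not the KEP's actual code:

```go
package main

import "fmt"

// reservedSize implements min(max($min, cidrSize/$step), $max)
// with the proposed values $min=16, $step=16, $max=256.
func reservedSize(cidrSize int) int {
	size := cidrSize / 16
	if size < 16 {
		size = 16
	}
	if size > 256 {
		size = 256
	}
	return size
}

func main() {
	// Reproduce the values discussed above for a few prefix lengths.
	for _, bits := range []int{28, 27, 24, 23, 22, 21, 20} {
		cidrSize := 1 << (32 - bits)
		fmt.Printf("/%d (%4d IPs) -> %3d reserved\n", bits, cidrSize, reservedSize(cidrSize))
	}
}
```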
I'm fine with anything, as long as users can understand it.
updated with the suggested formula
FTR, I meant that 32 seemed like too many reserved IPs for small service CIDRs, not that I thought we should reserve more for large CIDRs.
But I guess it's reasonable to assume that users with larger clusters are likely to have more use cases for fixed-IP Services, so giving them a larger range makes sense. And if we want the reserved size to (a) always be a power of 2, and (b) always include the .10 address, then 16 has to be the minimum value. And given that the "reserved" range is more like a "preferred" range, even 16 isn't too large for a /27.
After discussion a few hours ago: should we cover NodePorts in the same way?
Implementation-wise it is simpler if we use the same reserved range sizes for each; however, the math we are doing here doesn't apply, since we are always assuming binary arithmetic for the range sizes. Not a big deal if we assume this will always remain as is.
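As a worked example (assuming the default NodePort range of 30000-32767, i.e. 2768 ports): the step function above would give min(max(16, 2768/16), 256) = min(173, 256) = 173 reserved ports, which is not a power of two, so the power-of-two reasoning used for CIDR sizes does not carry over directly.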
OK. In the future we can do a KEP for NodePorts, if needed. I bet the same formula applies, just with different coefficients. This LGTM - @danwinship do you want a word on it? I'll approve but hold for now.
/approve
Oh, it needs PRR anyway. I did not read for that. Will do.
> #### GA
>
> - 2 examples of real-world usage
What does this mean? That 2 providers have switched to this mode as the default? That's unlikely while it is beta. Let's get specific here. I think time and lack of trouble is the only real signal.
When do we make this mode the default? Is there a reason not to do that?
There is interest in enabling it in OpenShift by default to solve some races at installation time. I was expecting more installers to face a similar issue, but as you say, this is more a matter of time and signal.
> ###### What specific metrics should inform a rollback?
>
> The allocation logic already has the following metrics:
Do we want new metrics to indicate "forced to use reserved range"?
I've tried to add new metrics but it's complicated, since this behavior is deep in the bitmap allocation and the metrics are one layer above, on the IP allocator... inheritance problems 🤷
I'll add a comment explaining it.
> I've tried to add new metrics but it's complicated, since this behavior is deep in the bitmap allocation and the metrics are one layer above, on the IP allocator... inheritance problems

Hmm - can't we report them somewhat post factum? When we're returning the actual result, we already know what IP we allocated and whether it was statically requested or not.
Do you mean something like a `randomly_allocated_ips` gauge and a `randomly_allocation_total` counter? That is easy to do; will that work for you?
> Do you mean something like a `randomly_allocated_ips` gauge and a `randomly_allocation_total` counter? That is easy to do; will that work for you?

Can you clarify? I think these can be interpreted differently.
We have these metrics now:
- allocated_ips
- available_ips
- allocation_total
- allocation_errors_total

We can add two new ones:
- randomly_allocated_ips
- randomly_allocation_total

`randomly_allocated_ips` reports the current number of IPs in the allocator that were allocated "randomly", i.e. with ClusterIP empty. `randomly_allocation_total` is a counter: every time a ClusterIP is allocated randomly it increases. The user can combine the current metrics and these new ones to obtain the number of IPs allocated statically.
Thanks for the clarification.
Yes - those metrics will be very useful.
What I had in mind (please let me know if this doesn't make sense for some reason) is actually an extension of that: adding a label to that metric (like "from-static" or something) to show whether those random IPs were allocated from the static range.
Does that make sense?
Yeah, that is much simpler and easier; I will update the KEP with that.
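As a rough sketch of what such a labeled counter could look like with the Prometheus Go client (the metric name, label values, and helper here are hypothetical; the real metrics live in the apiserver's allocator code):

```go
package allocator

import "github.com/prometheus/client_golang/prometheus"

// clusterIPAllocationsTotal counts ClusterIP allocations, labeled by
// whether the IP was requested statically or chosen dynamically, so a
// dashboard can show when dynamic allocation spills into the block
// reserved for static use.
var clusterIPAllocationsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "clusterip_allocations_total",
		Help: "Number of ClusterIP allocations, by allocation type.",
	},
	[]string{"type"}, // "static" or "dynamic"
)

func init() {
	prometheus.MustRegister(clusterIPAllocationsTotal)
}

// recordAllocation is called after a successful allocation.
func recordAllocation(static bool) {
	allocationType := "dynamic"
	if static {
		allocationType = "static"
	}
	clusterIPAllocationsTotal.WithLabelValues(allocationType).Inc()
}
```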
/lgtm
lgtm except for the "reserved block" comment
I'd like to avoid using "Reserved" in the KEP title since neither range is actually reserved for a single use, but I can't think of any way to describe the behavior well in few enough words...
> There are situations where users need to rely on well-known, predefined IP addresses.
>
> The best example is the Kubernetes DNS: the IP is hardcoded in the kubelet flag `--cluster-dns`.
(Well, no, the best example is the apiserver...)
Is it possible to race with the apiserver .1 address? The loop that creates the `kubernetes.default` Service runs at startup; I honestly don't know whether clients are able to create Services at that point, but it is worth adding it.
> // forward looking for the next available address. The big.Int range is subdivided so it will try
> // to allocate first from the reserved block of addresses (it will wrap the reserved subrange if necessary).
> // If there is no free address it will try to allocate one from the reserved block too.
"to allocate first from the unreserved block", right? (You're referring to both ranges as "the reserved block" here.)
and "(it will wrap around back to the start of the unreserved subrange if necessary)"
I will word it as "reserved for static" and "reserved for dynamic" allocation; that sounds less ambiguous.
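A simplified sketch of the allocation order being described, assuming a plain boolean bitmap (the real allocator randomizes its starting offset and works on a big.Int bitmap; the function name is hypothetical):

```go
// allocateNextDynamic implements the dynamic-allocation path: scan the
// block reserved for dynamic allocation first, and only fall back to the
// block reserved for static allocation when the dynamic block is full.
// offset is the size of the static block; the slice covers the whole range.
func allocateNextDynamic(allocated []bool, offset int) (int, bool) {
	// Prefer the block reserved for dynamic allocation: [offset, len(allocated)).
	for i := offset; i < len(allocated); i++ {
		if !allocated[i] {
			allocated[i] = true
			return i, true
		}
	}
	// Fall back to the block reserved for static allocation: [0, offset).
	for i := 0; i < offset; i++ {
		if !allocated[i] {
			allocated[i] = true
			return i, true
		}
	}
	return 0, false // range exhausted
}
```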
/hold cancel
Comments addressed, ready for final review.
> - allocated_ips
> - available_ips
@wojtek-t these two metrics cannot track the type of allocation, because when an IP is released we don't have that information.
However, the counters can have it, because they are only incremented:
- allocation_total
- allocation_errors_total
Fair - that sounds good enough to me.
Can you clarify that here?
Couple nits, but I'm overall fine with it. Will approve once addressed.
@aojea - can you please squash the commits?
This is more than enough for Alpha from a PRR perspective - thanks!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aojea, thockin, wojtek-t
@aojea @thockin - requires adding to the tracking spreadsheet: