KEP-3070: Reserve Service IP Ranges For Dynamic and Static IP Allocation #3071
Conversation
aojea commented Dec 2, 2021 (edited)
- One-line PR description: KEP-3070: adding KEP
- Issue link: Reserve Service IP Ranges For Dynamic and Static IP Allocation #3070
- Other comments: implementation in kubernetes#106792

/assign @thockin @danwinship
> The proposal is to implement a new strategy on the Service IP allocation bitmap that avoids using the first 32 values
> if there are free values in the upper block of the range.
That seems weird. I would expect that if there was a reserved range, it would be reserved unconditionally, not "reserved until we run out of unreserved addresses".
Also, it's weird to hardcode the reserved range to 32 addresses regardless of the size of the service CIDR...
> That seems weird. I would expect that if there was a reserved range, it would be reserved unconditionally, not "reserved until we run out of unreserved addresses".

This keeps the behavior backwards compatible: you can create Services as long as there are free IPs. If we kept the lower range reserved only for static IPs, we would be changing the behavior.

> Also, it's weird to hardcode the reserved range to 32 addresses regardless of the size of the service CIDR...

@thockin I knew we would need the history of why 32 😄
Heh. We went back and forth on this a lot in discussion.
If your service CIDR is /27, you only get 32 service IPs (IIRC, it's 30 actually; we avoid the first and last). At that point we can debate whether it's OK to have so few and whether, at that point, we can make many guarantees.
If your CIDR is bigger than that, do we need to reserve more than 32 IPs? If so, do we need a "cascade" of blocks?
E.g. something like `min(32, cidrSize/2)` is attractive, but only useful if we think there are a lot of users with really small ranges.
Something more dynamic like "always prefer the upper half unless it is full" is more general, I guess. Is it worth the extra effort?
E.g. assume a /22 service range (1024 IPs). First look at [512-1023]. If that's full, look at [256-511]. If that's full, look at [128-255]. It seems like that gives the caller influence where it isn't needed, which could be a vector for some sort of attack.
`min(max(16, cidrSize/16), 256)`? At /28 it would be 16 (the full range). At /27 - /24 it would be 16 (part of the range). At /23 it would be 32. /22 -> 64. /21 -> 128. /20 and bigger -> 256.
Again, worth the effort?
What about making it a fixed percentage? We subdivide into two ranges:

```
┌─────────────┬─────────────────────────────────────────────┐
│             │                                             │
│             │                                             │
│             │                                             │
│   static    │                   dynamic                   │
│             │                                             │
│             │                                             │
│             │                                             │
└─────────────┴─────────────────────────────────────────────┘
◄────────────► ◄───────────────────────────────────────────►
     15%                          85%
```
"15%" must be a typo, since it is not a power of two. I think you meant to type "12.5%" :)
Seriously though, I think percents are OK for humans to get a big-picture, but harder to be precise about.
min(max($min, cidrSize/$step), $max)
is easy to describe as "never less than $min or more than $max, with a graduated step function between them. We can argue about values for those variables (or even tweak them over time if we get feedback), but the logic still seems appropriate?
I proposed:
$min = 16
$max = 256
$step = 16
But I am fine to haggle :)
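For concreteness, here is a minimal Go sketch of that step function; the helper name and the printed table are illustrative, not the KEP's actual code:

```go
package main

import "fmt"

// reservedSize implements min(max($min, cidrSize/$step), $max)
// with the proposed values $min=16, $step=16, $max=256.
func reservedSize(cidrSize int) int {
	size := cidrSize / 16
	if size < 16 {
		size = 16
	}
	if size > 256 {
		size = 256
	}
	return size
}

func main() {
	// Reproduce the values discussed above for a few prefix lengths.
	for _, bits := range []int{28, 27, 24, 23, 22, 21, 20} {
		cidrSize := 1 << (32 - bits)
		fmt.Printf("/%d (%4d IPs) -> %3d reserved\n", bits, cidrSize, reservedSize(cidrSize))
	}
}
```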
I'm fine with anything, as long as users can understand it.
updated with the suggested formula
FTR, I meant that 32 seemed like too many reserved IPs for small service CIDRs, not that I thought we should reserve more for large CIDRs.
But I guess it's reasonable to assume that users with larger clusters are likely to have more use cases for fixed-IP Services, so giving them a larger range makes sense. And if we want the reserved size to (a) always be a power of 2, and (b) always include the .10 address, then 16 has to be the minimum value. And given that the "reserved" range is more like a "preferred" range, even 16 isn't too large for a /27.
After discussion a few hours ago: should we cover NodePorts in the same way?
Implementation-wise it is simpler if we use the same reserved range sizes for each; however, the math we are doing here doesn't apply, since we are always assuming binary arithmetic for the range sizes. Not a big deal if we assume this will always remain as is.
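As a worked example (assuming the default NodePort range of 30000-32767, i.e. 2768 ports): the step function above would give min(max(16, 2768/16), 256) = min(173, 256) = 173 reserved ports, which is not a power of two, so the power-of-two reasoning used for CIDR sizes does not carry over directly.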
OK. In the future we can do a KEP for NodePorts, if needed. I bet the same formula applies, just with different coefficients. This LGTM - @danwinship do you want a word on it? I'll approve but hold for now.
/approve
Oh, it needs PRR anyway. I did not read for that. Will do.
> #### GA
>
> - 2 examples of real-world usage
What does this mean? That 2 providers have switched to this mode as the default? That's unlikely while it is beta. Let's get specific here. I think time and lack of trouble is the only real signal.
When do we make this mode the default? Is there a reason not to do that?
There is interest in enabling it in OpenShift by default to solve some races at installation time. I was expecting more installers to face a similar issue, but as you say, this is more a matter of time and signal.
> ###### What specific metrics should inform a rollback?
>
> The allocation logic already has the following metrics:
Do we want new metrics to indicate "forced to use reserved range"?
I've tried to add new metrics but it's complicated, since this behavior is deep in the bitmap allocation and the metrics are one layer above, on the IP allocator... inheritance problems 🤷
I'll add a comment explaining it.
> I've tried to add new metrics but it's complicated, since this behavior is deep in the bitmap allocation and the metrics are one layer above, on the IP allocator... inheritance problems

Hmm - can't we report them somewhat post factum? When we're returning the actual result, we already know what IP we allocated and whether it was statically requested or not.
Do you mean something like a `randomly_allocated_ips` gauge and a `randomly_allocation_total` counter? That is easy to do; will that work for you?
> Do you mean something like a `randomly_allocated_ips` gauge and a `randomly_allocation_total` counter? That is easy to do; will that work for you?

Can you clarify? I think these can be interpreted differently.
We have these metrics now:
- allocated_ips
- available_ips
- allocation_total
- allocation_errors_total

We can add two new ones:
- randomly_allocated_ips
- randomly_allocation_total

`randomly_allocated_ips` reports the current number of IPs in the allocator that were allocated "randomly", i.e. with ClusterIP empty. `randomly_allocation_total` is a counter: every time a ClusterIP is allocated randomly it increases. The user can combine the current metrics and these new ones to obtain the number of IPs allocated statically.
Thanks for the clarification.
Yes - those metrics will be very useful.
What I had in mind (please let me know if this doesn't make sense for some reason) is actually an extension of that: adding a label to that metric (like "from-static" or something) to show whether those random IPs were allocated from the static range.
Does that make sense?
Yeah, that is much simpler and easier; I will update the KEP with that.
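As a rough sketch of what such a labeled counter could look like with the Prometheus Go client (the metric name, label values, and helper here are hypothetical; the real metrics live in the apiserver's allocator code):

```go
package allocator

import "github.com/prometheus/client_golang/prometheus"

// clusterIPAllocationsTotal counts ClusterIP allocations, labeled by
// whether the IP was requested statically or chosen dynamically, so a
// dashboard can show when dynamic allocation spills into the block
// reserved for static use.
var clusterIPAllocationsTotal = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "clusterip_allocations_total",
		Help: "Number of ClusterIP allocations, by allocation type.",
	},
	[]string{"type"}, // "static" or "dynamic"
)

func init() {
	prometheus.MustRegister(clusterIPAllocationsTotal)
}

// recordAllocation is called after a successful allocation.
func recordAllocation(static bool) {
	allocationType := "dynamic"
	if static {
		allocationType = "static"
	}
	clusterIPAllocationsTotal.WithLabelValues(allocationType).Inc()
}
```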
/lgtm
lgtm except for the "reserved block" comment
I'd like to avoid using "Reserved" in the KEP title since neither range is actually reserved for a single use, but I can't think of any way to describe the behavior well in few enough words...
> There are situations where users need to rely on well-known, predefined IP addresses.
>
> The best example is the Kubernetes DNS: the IP is hardcoded in the kubelet flag `--cluster-dns`.
(Well, no, the best example is the apiserver...)
Is it possible to race with the apiserver .1 address? The loop that creates the `kubernetes.default` Service runs at startup; I honestly don't know whether clients are able to create Services at that point, but it is worth adding it.
> // forward looking for the next available address. The big.Int range is subdivided so it will try
> // to allocate first from the reserved block of addresses (it will wrap the reserved subrange if necessary).
> // If there is no free address it will try to allocate one from the reserved block too.
"to allocate first from the unreserved block", right? (You're referring to both ranges as "the reserved block" here.)
and "(it will wrap around back to the start of the unreserved subrange if necessary)"
I will word it as "reserved for static" and "reserved for dynamic" allocation; that sounds less ambiguous.
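A simplified sketch of the allocation order being described, assuming a plain boolean bitmap (the real allocator randomizes its starting offset and works on a big.Int bitmap; the function name is hypothetical):

```go
// allocateNextDynamic implements the dynamic-allocation path: scan the
// block reserved for dynamic allocation first, and only fall back to the
// block reserved for static allocation when the dynamic block is full.
// offset is the size of the static block; the slice covers the whole range.
func allocateNextDynamic(allocated []bool, offset int) (int, bool) {
	// Prefer the block reserved for dynamic allocation: [offset, len(allocated)).
	for i := offset; i < len(allocated); i++ {
		if !allocated[i] {
			allocated[i] = true
			return i, true
		}
	}
	// Fall back to the block reserved for static allocation: [0, offset).
	for i := 0; i < offset; i++ {
		if !allocated[i] {
			allocated[i] = true
			return i, true
		}
	}
	return 0, false // range exhausted
}
```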
/hold cancel
Comments addressed, ready for final review.
> - allocated_ips
> - available_ips
@wojtek-t these two metrics cannot track the type of allocation, because when an IP is released we don't have that information.
However, the counters can have it, because they are only incremented:
- allocation_total
- allocation_errors_total
Fair - that sounds good enough to me.
Can you clarify that here?
Couple nits, but I'm overall fine with it. Will approve once addressed.
@aojea - can you please squash the commits?
This is more than enough for Alpha from a PRR perspective - thanks!
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: aojea, thockin, wojtek-t
@aojea @thockin - requires adding to the tracking spreadsheet: