-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP-3066: subsecond probes #3067
base: master
Are you sure you want to change the base?
Conversation
Skipping CI for Draft Pull Request. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some late thoughts on the discussion / early feedback on the draft.
Allow for 0 second Probe, with setting millisecond. | ||
|
||
Cons: | ||
Loosens validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This sounds similar to tv_sec
and tv_nsec
from POSIX time structs. If so, we could mention that analogy.
<!-- | ||
What other approaches did you consider, and why did you rule them out? These do | ||
not need to be as detailed as the proposal, but should include enough | ||
information to express the idea and why it was not acceptable. | ||
--> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should mention microseconds and nanoseconds too (as these are plausible alternatives to milliseconds).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing microseconds might be useful if we support future healthcheck mechanisms.
### v2 api for probe. | ||
|
||
I think this means a v2 API for Container therefore Pod. | ||
This seems too invasive. | ||
|
||
All Seconds fields become resource.Quantity instead of int32. This supports subdivision in a single field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The bigger problem is how to support round-tripping to the (still GA) v1 API. Even if we did a v2 API, we'd need a way to represent the subsecond timeout in the v1 API, right?
(I'm personally all for adopting a v2 Pod and v2 PodTemplate API, but lots of clients would expect to still create v1 Deployments, etc).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod
|
||
V2 API for existing objects. | ||
Converting fields from int32 to resource.Quantity. | ||
Subsecond resolution less than one millisecond. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd omit this non-goal; it sounds more like a design choice than a goal we're not pursuing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk
If the Milliseconds variant of a field is set, | ||
*use it in preference* to the existing Seconds field, | ||
and **completely ignore** the value of the existing field. | ||
This behavior makes it opt-in on setting a non-zero Milliseconds field. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is an existing admission control mechanism that enforces constraints on timeoutSeconds
, does this (draft) change break any API contract that the admission control mechanism is entitled to assume?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ehashman recommended in the sig node call that we forgo the alternate fields idea and move to adding a new flag to indicate seconds should be processed as milliseconds.. etc.
great point on we will need to check for admission control impacts..
Overall, I think my concerns about these API changes are around changes we might want to make a few years in the future. If we specify probe timeouts in microseconds, that allows for two things:
If we ever have to support second-granularity, millisecond-granularity and microsecond-granularity probes, I hope the API for that won't look as ugly as I fear it might. The approach I'd still promote is to have an extra field to specify “early failure”: an integer quantity of milliseconds (or microseconds - we should leave room for that). That early failure offset is something that legacy clients can ignore without a big problem and would fully support round-tripping to a future Pod v2 API that unifies these fields. Using a negative offset makes the distinction between (1s - 995ms) more obvious to a legacy client that's unaware of the new fields. A positive offset (0s + 5ms, or maybe 0s + 5000μs) could get interpreted quite differently. If we do that, we should absolutely disallow a negative offset ≥ 1.0s seconds. Otherwise we have a challenge around unambiguous round-tripping (eg from a future Pod v2 API back to Pod v1). |
For |
New Field: int32 PeriodOffsetMilliseconds | ||
|
||
If I want to set to 0.5 seconds, 500 milliseconds, | ||
PeriodSeconds <= 10 & PeriodOffsetmilliseconds <= -9500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an example of a pair of values that could be ambiguous for round-tripping, forcing future APIs to store and track both pieces of data.
By contrast, if we had:
....
periodSeconds: 1
earlyRecheckOffset:
- unit: Millisecond
value: 500
...
and the v1 Pod API doesn't allow ambiguous values, then a future v2 Pod API can store a single field value behind the scenes
(eg: periodMicroseconds = 500000u64
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another option, still clamped to less than 1s:
...
periodSeconds: 0
extraRecheckOffset:
- unit: Millisecond
value: 500
...
|
||
All Seconds fields become resource.Quantity instead of int32. This supports subdivision in a single field. | ||
|
||
### OffsetMilliseconds |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikebrow can remove this section from the alternatives since it's the main proposal now
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod let's focus in on the offset milliseconds option since and fix the naming above..
Allow for 0 second Probe, with setting millisecond. Similar to what is done with `tv_sec` and `tv_nsec` from POSIX time structs. | ||
|
||
Cons: | ||
Loosens validation. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So this proposal would just add a new field for each existing *Seconds field, like PeriodSeconds
and PeriodMilliseconds
and then the actual time is the sum? This seems like a way more intuitive and understandable API to me. It's also simpler to enter - just one new field instead of a new struct. And follows the existing *Seconds pattern. I am not seeing the downside? Validation can be done on the total duration as well as the individual fields, can't it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Clients that don't recognise the new field would be liable to trigger much too early, right? How do we avoid that?
Also, how do we avoid setting a precedent that this riskier approach is the one to take.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's true. The lack of versioning at this granularity is a problem. Of course the "offset" has a similar problem (inaccurate interval from old clients). Here's another (maybe bad) idea:
on write:
- if *Milliseconds is set, then set *Seconds = ceil(*Milliseconds/1000)
- otherwise, set *Milliseconds = *Seconds * 1000
Then, on read, both will be set, and old clients will have a slightly longer interval.
|
||
Pros: | ||
Cons: | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another alternative is to use "Duration", like PeriodDuration: 1.5s
or PeriodDuration: 1500ms
or PeriodDuration: 1s500ms
. But AFAIK we don't do that anywhere else, it would be a new type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The existing API guidelines say
Duration fields must be represented as integer fields with units being part of the field name
so if for some of those options, if we adopt them I imagine that we also need to change the API guidelines.
```yaml | ||
... | ||
periodSeconds: 1 | ||
periodSecondsOffset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find this very unintuitive and verbose as well.
Subsecond Probe OptionsThanks @sftim and @johnbelamaric for your reviews! Trying to summarize the options in one place (since it's a bit messy in the KEP at the moment), here's four options that have been discussed for configuring a probe to run on a subsecond interval: (edit: given that this discussion is still ongoing, probably better to use hackmd rather than having to continue to edit a comment in a PR...) |
(I've got the above in a hackmd ... if it's easier to edit there vs. doing this in issue comments, let me know and I'll add you both to it) |
@psschwei I suggested using I saw my proposal as distinct because it did fully avoid the inability to round-trip with a combined field. |
You're right, let me add it to the hackmd now. Sorry for the oversight; that's why I thought it would be good to put all the options in a single place... |
This KEP was read and discussed at this month's session of the KEP reading club, commenting here for discoverability purposes. There is a slack thread on |
9b71650
to
513660c
Compare
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
@mikebrow: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
/remove-lifecycle rotten |
@mikebrow Please sign the CLA and fix failing CI tests, thanks. |
/cc |
Is this KEP still alive? |
Hi. --- a/keps/sig-node/3066-subsecond-probes/README.md
+++ b/keps/sig-node/3066-subsecond-probes/README.md
@@ -48,10 +48,10 @@ Probe timeouts are limited to seconds and that does NOT work well for clients lo
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- - [<code>early*Offset</code>](#)
- - [<code>*Duration</code>](#-1)
- - [A new struct, <code>ProbeOffset</code>](#a-new-struct-)
- - [Setting the time units in a different field, <code>ReadSecondsAs</code>](#setting-the-time-units-in-a-different-field-)
+ - [<code>early*Offset</code>](#earlyoffset)
+ - [<code>*Duration</code>](#duration)
+ - [A new struct, <code>ProbeOffset</code>](#a-new-struct-probeoffset)
+ - [Setting the time units in a different field, <code>ReadSecondsAs</code>](#setting-the-time-units-in-a-different-field-readsecondsas)
- [v2 api for probe.](#v2-api-for-probe)
- [OffsetMilliseconds](#offsetmilliseconds)
- [Reconcile seconds field to nearest whole second.](#reconcile-seconds-field-to-nearest-whole-second) |
was paused for 1.31 will bring it back to life for 1.32! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have a feeling this may end up in the same place that fast-restart are approaching -- some node-level "budget" for probes, with an HTB to provide some "fairness", such that the probe's period is a request which will be satisfied when possible, but may be delayed when the node is under contention. E.g. if every pod asks for sub-second probes, we could end up doing hundreds of QPS of traffic just on probing.
|
||
There are a few corner cases around default (effective period) values: | ||
|
||
`periodSeconds: 0` and `periodMilliseconds: 500` would be 10.5s / 10500ms (as 0 => 10 for `periodSeconds` via [defaults.go](https://github.com/kubernetes/kubernetes/blob/3b13e9445a3bf86c94781c898f224e6690399178/pkg/apis/core/v1/defaults.go#L213-L215)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should fail in that case, no? Sorry, you can't use MS unless you also specify S. Any existing pod will still be OK, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod will update.. should fail, if one wants mills offset must say from any int except 0, because...
|
||
`periodSeconds: 2` and `periodMilliseconds: -500` would be 1.5s / 1500ms. | ||
|
||
`periodSeconds: 1` and `periodMilliseconds: -500` would be 0.5s / 500ms. (*** To reduce un-necessary resource usage, because periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer becoming 1second after success.***) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we draw out the state-machine for probes and consider how this changes that, specifically in the context of startup, readiness, and liveness actually being different?
In the past we have added fields to probes that don't apply to all modes and it's gross.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related, how does this interact with SuccessThreshold?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kep does not intend to modify success/failure threshold counts / process.
I think a budget makes sense... we had proposed a type of throttling as a mitigation, but should be able to change that to a node-level budget |
I think it makes sense to have both a sane lower-bound (which can depend on probe type) and an overall budget. I think that if we can describe a "budget" semantic that balances requirements, we can use that same pattern in other places. I think requirements include:
@lauralorenz we had a very similar discussion wrt restarts. |
Hi @psschwei @mikebrow and any other authors -- yes as @thockin mentions, I'm trending towards needing a budget and something that can admit a restart request that is determined in budget for CrashLoopBackOff (#4603) longer term. The restart I'm trying to budget for would be queuing against kubelet directly as the operation is done inline of the kubelet runtime's SyncPod loop. Yours is a little different in that the restart of interest is queueing into to the probe worker's runtime, or perhaps earlier as to whether to allocate a probe worker in the first place (?). Either way it's still taking up resources from the kubelet component so I think we would use somewhat the same kubelet stats to hand out budget and should work together to define it! Let's hang |
@@ -0,0 +1,3 @@ | |||
kep-number: 3066 | |||
alpha: | |||
approver: "@wojtek-t" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mikebrow when updating please switch this over to me.
/assign |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this feature have a designated API reviewer? I feel like there are still too many open questions around the API, and unfortunately don't see this making the v1.32 deadline.
TLDR: | ||
Add two new optional fields (of type `*int32`) to the Probe struct, which would be used to offset the second based time values in the Probe struct: | ||
|
||
- `PeriodMilliseconds` // How often (in milliseconds) to offset PeriodSeconds when performing the probe (*** to reduce un-necessary resource usage, when periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to reduce un-necessary resource usage, when periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer
How does this work in practice? Use a negative offset to get a sub-second initial readiness, and then drop the offset after that? If I wanted .5s initial, and 5s steady state probes, would I say periodSeconds: 5
and periodMilliseconds: -4500
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initialDelaySeconds 0 initialDelayMilliseconds 500
periodSeconds 5
first probe should be at 500mills .. then wait 5sec for second probe
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought this was describing a period of rapid polling until the probe succeeds? Initial delay is something different?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure we are in sync yet.. maybe a call?
The idea here is the initial delay and period, being defined as whole seconds, is too coarse. Without creating a v2 api for probes and to keep probes backwards compatible.. this proposal is to introduce fine grained offsets for the these course fields. E.g. where it it known the initial delay needs to be at least 1.5 seconds the admin/developer must choose 2seconds today.. and with this feature may define 2seconds minus 500mills resulting in 1.5seconds with the gate on and 2seconds with the gate off..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WRT rapid polling before success.. a limit must be set on the period to avoid the probe DOSing the node. This would apply to the budget discussion, as opposed to the overtly simple measure of limiting arguably expensive exec probes to up to twice as fast as they are today until success.. and cheaper probe types to 200mils (or up to 5x faster.) Similarly noting that it's not necessarily "faster" the author may simply prefer a value between two seconds instead of a first whole second or the first +1second.
TLDR: | ||
Add two new optional fields (of type `*int32`) to the Probe struct, which would be used to offset the second based time values in the Probe struct: | ||
|
||
- `PeriodMilliseconds` // How often (in milliseconds) to offset PeriodSeconds when performing the probe (*** to reduce un-necessary resource usage, when periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the term offset here is a little confusing. I would just stick with the explanation that the two periods will be summed, and drop the term offset. If the two periods are being summed, then the milliseconds aren't any more of an offset than periodSeconds, right?
|
||
`periodSeconds: 2` and `periodMilliseconds: -500` would be 1.5s / 1500ms. | ||
|
||
`periodSeconds: 1` and `periodMilliseconds: -500` would be 0.5s / 500ms. (*** To reduce un-necessary resource usage, because periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer becoming 1second after success.***) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMO the sub-second rule is not explicit enough, and we should add a new field for "initialPeriodMilliseconds". Why can't I have an initial period of 5s and steady state of 30s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
initialDelaySeconds 5
periodSeconds 30
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggest modifying the datastructure comments to reflect the documentation and implementation better.
|
||
`periodSeconds: 2` and `periodMilliseconds: -500` would be 1.5s / 1500ms. | ||
|
||
`periodSeconds: 1` and `periodMilliseconds: -500` would be 0.5s / 500ms. (*** To reduce un-necessary resource usage, because periodMilliseconds is used to reduce the period to less than a second the offset is only used until success is reached then no longer becoming 1second after success.***) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related, how does this interact with SuccessThreshold?
`periodSeconds: 0` and `periodMilliseconds: 500` would be 10.5s / 10500ms (as 0 => 10 for `periodSeconds` via [defaults.go](https://github.com/kubernetes/kubernetes/blob/3b13e9445a3bf86c94781c898f224e6690399178/pkg/apis/core/v1/defaults.go#L213-L215)) | ||
|
||
Each optional millisecond field will be restricted to [-999,999], and the effective sum will never be allowed | ||
to be less than 0 (if the effective sum is less than 0, it will fail at the validation stage and block deployments). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really want to allow 1ms probes?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no. a min will be selected prob .5sec... see line 178 below
```go | ||
PeriodMilliseconds *int32 //note these are signed offset | ||
InitialDelayMilliseconds *int32 | ||
TimeoutMilliseconds *int32 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought timeout was out of scope for the alpha?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod.. this was left over.. to describe how it could be implemented... in a follow up.
} | ||
``` | ||
|
||
#### Validation of fields |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a lot more detail. Somewhere above you said that the millis fields are capped in +/-999, but I think that also conflicts with the approach to the initial fast polling. Also the validation based on probe types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nod +- 999 && with the additional bottom limit based on probe type
I appreciate the possible alternate implementation pattern you were thinking about, would like to discuss that in the context of gate off and/or backwards compatible nodes ignoring the mills offsets.
// Do it only if the kubelet has started recently. | ||
if probeTickerPeriod > time.Since(w.probeManager.start) { | ||
time.Sleep(time.Duration(rand.Float64() * float64(probeTickerPeriod))) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
which period does this use for jitter? The fast initial one, or the steady state?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
period is after initial.. thus jitter is being applied to steady state..
To follow up to @tallclair - I am happy to think about this KEP, but it's exceedingly unlikely that I can do it this cycle. |
Propose we put together a temporary WG to cover the proposed probe keps. |
Signed-off-by: Mike Brown brownwm@us.ibm.com