-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[KEP-3022] Write the production readiness requirements to graduate to beta #3338
Conversation
sanposhiho
commented
Jun 5, 2022
- One-line PR description: Write the production readiness requirements to graduate to beta
- Issue link: Min domains in PodTopologySpread #3022
- Other comments:
@@ -292,13 +292,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement | |||
will rollout across nodes. | |||
--> | |||
|
|||
It shouldn't impact already running workloads. It's an opt-in feature, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will review tomorrow - in the meantime, if you're targeting beta, please update the kep.yaml and corresponding prr file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure. will update.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are new requirements for the Test Plan. Please check the updated template.
###### What specific metrics should inform a rollback? | ||
|
||
<!-- | ||
What signals should users be paying attention to when the feature is young | ||
that might indicate a serious problem? | ||
--> | ||
|
||
A spike on metric `schedule_attempts_total{result="error|unschedulable"}` when pods using this feature are added. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also include a spike on scheduling_algorithm_duration_seconds
, which would suggest that the feature is too slow.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
will add a mention of plugin_execution_duration_seconds
as well.
@@ -307,12 +316,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we | |||
are missing a bunch of machinery and tooling and can't do that now. | |||
--> | |||
|
|||
Yes. The behavior is changed as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Describe the manual testing that was done :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this manual testing. Although, a test that most closely follows the question is when you actually do an upgrade.
So you start with an apiserver in version 1.24 (with feature disabled) and then upgrade to 1.25 with feature enabled and back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, I'll do another manual test.
@@ -327,6 +340,8 @@ checking if there are objects with field X set) may be a last resort. Avoid | |||
logs or events for this purpose. | |||
--> | |||
|
|||
The operator can query pods with `pod.spec.topologySpreadConstraints.minDomains` field set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we could extend kubernetes/kubernetes#107556 to produce a metric.
But I wouldn't block beta graduation to this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We definitely need kubernetes/kubernetes#107556...
I have not had enough time to work on this, but will make it a priority...
to produce a metric.
So, what metrics do you imagine like?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you create an issue so we can discuss there?
But it would be something about which filter plugins influenced a pod scheduling decision. One counter increment for each plugin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will create that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Opened: kubernetes/kubernetes#110643
@@ -363,18 +378,19 @@ These goals will help you determine what you need to measure (SLIs) in the next | |||
question. | |||
--> | |||
|
|||
- 99% of `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` are within x milliseconds. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have to provide an actual value instead of x
@@ -383,6 +399,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co | |||
implementation difficulties, etc.). | |||
--> | |||
|
|||
No. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, as described in the How can an operator determine if the feature is in use by workloads?
question
@@ -466,8 +486,13 @@ details). For now, we leave it here. | |||
|
|||
###### How does this feature react if the API server and/or etcd is unavailable? | |||
|
|||
The feature doesn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The feature doesn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd | |
The feature isn't affected because Pod Topology Spread plugin doesn't communicate with kube-apiserver or etcd |
@@ -483,6 +508,9 @@ For each of them, fill in the following information by copying the below templat | |||
|
|||
###### What steps should be taken if SLOs are not being met to determine the problem? | |||
|
|||
- Check `plugin_execution_duration_seconds{plugin="PodTopologySpread"}` to see if latency increased. | |||
- Check `schedule_attempts_total{result="error|unschedulable"}` to see if the number of attempts increased. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What should I do if I see problems in either of the metrics?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
@alculquicondor @wojtek-t |
stage: alpha | ||
latest-milestone: "v1.24" | ||
stage: beta | ||
latest-milestone: "v1.25" | ||
milestone: | ||
alpha: "v1.24" | ||
beta: "v1.25" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you remove the stable line?
We don't know yet if it will be done in 1.26 or not, but most likely 1.27+ (after 2 beta releases).
@@ -292,13 +292,22 @@ rollout. Similarly, consider large clusters and how enablement/disablement | |||
will rollout across nodes. | |||
--> | |||
|
|||
It shouldn't impact already running workloads. It's an opt-in feature, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are new requirements for the Test Plan. Please check the updated template.
keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md
Outdated
Show resolved
Hide resolved
keps/sig-scheduling/3022-min-domains-in-pod-topology-spread/README.md
Outdated
Show resolved
Hide resolved
@@ -307,12 +316,16 @@ Longer term, we may want to require automated upgrade/rollback tests, but we | |||
are missing a bunch of machinery and tooling and can't do that now. | |||
--> | |||
|
|||
Yes. The behavior is changed as expected. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this manual testing. Although, a test that most closely follows the question is when you actually do an upgrade.
So you start with an apiserver in version 1.24 (with feature disabled) and then upgrade to 1.25 with feature enabled and back.
@@ -327,6 +340,8 @@ checking if there are objects with field X set) may be a last resort. Avoid | |||
logs or events for this purpose. | |||
--> | |||
|
|||
The operator can query pods with `pod.spec.topologySpreadConstraints.minDomains` field set. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you create an issue so we can discuss there?
But it would be something about which filter plugins influenced a pod scheduling decision. One counter increment for each plugin.
@alculquicondor Please retake a look again. 🙏 |
extending the production code to implement this enhancement. | ||
--> | ||
|
||
- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/podtopologyspread`: `2020-06-17` - `86%` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
also mention the pod validation packages
We expect no non-infra related flakes in the last month as a GA graduation criteria. | ||
--> | ||
|
||
N/A |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
explain why:
possible explanations is that there are no new API endpoints and that the feature doesn't interact with other components, so E2E doesn't add extra value to integration tests.
@@ -383,6 +469,8 @@ Describe the metrics themselves and the reasons why they weren't added (e.g., co | |||
implementation difficulties, etc.). | |||
--> | |||
|
|||
Yes. It would be useful if we could see more details related to scheduler's decisions in metrics. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Link to the issue.
Maybe be more specific about decisions. In this case, we would like to know which Filters had impact on the scheduling of the pod.
Applied your new suggestions. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
/approve
I just updated the upgrade/rollback manual test section. |
@wojtek-t Could you take a look at this PR as well? 🙏 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm satisfied with the PRR answers, awaiting SIG approval.
@@ -4,3 +4,5 @@ | |||
kep-number: 3022 | |||
alpha: | |||
approver: "@wojtek-t" | |||
beta: | |||
approver: "@wojtek-t" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wojtek is out, please change to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, thanks for taking over! 🙏
Oh, I see SIG approval. Just change the PRR approver to me, so if all hell breaks loose Wojtek doesn't get blamed :-P |
@johnbelamaric Updated. |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: alculquicondor, johnbelamaric, sanposhiho The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@johnbelamaric @alculquicondor |
/lgtm Thanks! |