-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add PRR questionnaire section to New Event API KEP #1772
Conversation
Hi @chelseychen. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/assign @wojtek-t Please take a look and feel free to add new reviewers, thanks! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made a first, high-level pass. I will have more comments further, but those are the high-level ones.
- [x] Other | ||
- Describe the mechanism: | ||
|
||
The new Event API can be enabled / disabled by switching to use the new / old Event proto and libs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to separate two things here:
(1) the API itself
(2) the use of the API.
(1) We've never used feature-gates for the API itself. Though, you can enable/disabled specific APIs at kube-apiserver level using --runtime-config flag.
(2) Regarding using the library. We should have added a feature-gate a gate it based on that, but we didn't. My fault.
So for now, there is no way to disable it.
[We only have a fallback, that if the API is disabled, we simply fallback to the old libraries. And we probably should continue with this pattern, given we're already graduating to GA, so featurage-gate will already be locked to true.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should put something along those lines in the doc. And then answering the questions below is not needed (as it's kind of answered in what I described above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, let me pull in others opinions.
@johnbelamaric - given my explanation above (ignoring for a moment a fact that it was a bad decision and we shouldn't do this that way), I'm willing towards saying that "rollback is not supported". Or should we interpret it as "if you rollback to the release where component X wasn't migrated to use the new API, it will be rolled back".
[FTR - currently only scheduler is migrated to the new API]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few things. I think, "rollback supported" is probably a bad term. Using "rollback supported" to describe the ability to disable the feature (i.e, roll back the enablement as described below) without rolling back the version I think is a little confusing. We probably should call it "disable-supported".
In this case, it sounds like you are saying, the way to disable this is not via a feature gate but instead by disabling the API, in which case the library is smart enough to use the prior API. IIUC, this causes no data loss on existing events created with the new API, and presumably we round-trip those events so you'll still be able to query events created with the new API after disabling it, by using the old API. If that's correct, then I would say "rollback supported" is true. It's not feature gate based, but it is possible and works.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should call it "disable-supported".
+1
If that's correct, then I would say "rollback supported" is true. It's not feature gate based, but it is possible and works.
Yes - what yo wrote is the right summary.
Fair enough - I think that makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liggitt - IIRC we discussed a while ago fallbacking to core. once we move move to GA I think we should still fallback for safety. might be worth considering for other APIs (if we're not already) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It depends on the use. the event recorder is used by a broad range of clients, many out of tree, with potentially different skew support statements than kube-scheduler/kube-controller-manager, etc.
The broader the use, the more we should accomodate running against a range of server versions, so fallback for event recorder makes sense to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so fallback for event recorder makes sense to me.
Agree (at least for the next couple releases).
Also - currently this logic is purely baked into scheduler:
https://github.com/kubernetes/kubernetes/blob/96e13de777a9eb57f87889072b68ac40467209ac/cmd/kube-scheduler/app/server.go#L281
We should probably make that part of the library.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I updated the description of the mechanism, not sure if I got it correct. Let me know if it needs any changes.
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
Try to be as paranoid as possible - e.g. what if some components will restart | ||
in the middle of rollout? | ||
|
||
A rollout could fail if some components restart in the middle of the rollout. Then those components will continue using the old Event API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I don't understand this one - can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Friendly ping - I think I still don't understand it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking maybe this should also be separated into two cases? (1) The rollout of API won't affect already running workloads; (2) The rollout of the API migration of individual components might fail and impact running workloads. e.g., the component restarts causes the rollout didn't finish, so new Event API didn't take effect. Not sure if I understand this correctly.
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
/ok-to-test |
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
|
||
* **Will enabling / using this feature result in introducing new API types?** | ||
|
||
Yes, a new API type "eventsv1.Event" is being introduced. The migration of Event API will cause creation of new types of Event objects. The number of Event objects depends on cluster state, which theoretically won't be too large due to deduplication logic and reasonable-cardinality of objects in the system. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note to myself - we need better answer here.
309816c
to
26fb915
Compare
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
Try to be as paranoid as possible - e.g. what if some components will restart | ||
in the middle of rollout? | ||
|
||
A rollout could fail if some components restart in the middle of the rollout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't really understand this one - can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I replied in your previous comment but you might miss it:
I'm thinking maybe this should also be separated into two cases? (1) The rollout of API won't affect already running workloads; (2) The rollout of the API migration of individual components might fail and impact running workloads. e.g., the component restarts causes the rollout didn't finish, so new Event API didn't take effect. Not sure if I understand this correctly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with (1).
Re (2) - I still don't fully understand it. So let me maybe rephrase it. Is that what you meant for (2):
The rollout of the API migration of individual components may cause components fail to initialize in case of bug. It will not affect running workloads though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's right. Thanks for rewording it more clearly. I've updated the answer.
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
keps/sig-instrumentation/383-new-event-api-ga-graduation/kep.yaml
Outdated
Show resolved
Hide resolved
d36e697
to
9071e38
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@chelseychen - thanks a lot, two more comments, but I think we're pretty close
Try to be as paranoid as possible - e.g. what if some components will restart | ||
in the middle of rollout? | ||
|
||
A rollout could fail if some components restart in the middle of the rollout. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree with (1).
Re (2) - I still don't fully understand it. So let me maybe rephrase it. Is that what you meant for (2):
The rollout of the API migration of individual components may cause components fail to initialize in case of bug. It will not affect running workloads though.
keps/sig-instrumentation/383-new-event-api-ga-graduation/README.md
Outdated
Show resolved
Hide resolved
/lgtm /assign @dashpole |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: chelseychen, dashpole The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
A new PRR questionnaire section was added to the KEP template as a required section in #1620, so this pull request is to add this section to align with that requirement and answer all questions regarding the fact that we're targeting GA graduation in 1.19.