🌱 Source should retry to get informers until timeout expires #1678
Conversation
/assign @sbueringer @alvaroaleman @fabriziopandini
This seems like a bit of a niche use case. What is the advantage of this over implementing the retrying inside your runnables?
@alvaroaleman This issue actually comes from the fact that when you have a manager that bundles caches, webhooks, and controllers, you might end up in a situation where the caches can't be populated yet because the api-server is still reading the new CRDs (new API versions, for example). The controller would then crash and take the webhooks down with it, which can cause reads or operations on objects to fail in a chain reaction. The overall system is eventually consistent, and this changeset introduces a way to optionally retry a runnable in case there are errors. If those errors persist, the manager should definitely crash at that point.
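For context, the alternative suggested above (retrying inside the runnable itself) could look something like the sketch below. It assumes only controller-runtime's manager.Runnable interface and wait.ExponentialBackoff from apimachinery; the retryingRunnable wrapper is hypothetical and not part of either library.

```go
package runnableretry

import (
	"context"

	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/manager"
)

// retryingRunnable is a hypothetical wrapper that retries the inner
// Runnable's Start with exponential backoff instead of failing the
// whole manager on the first error.
type retryingRunnable struct {
	inner   manager.Runnable
	backoff wait.Backoff
}

func (r *retryingRunnable) Start(ctx context.Context) error {
	var lastErr error
	err := wait.ExponentialBackoff(r.backoff, func() (bool, error) {
		if lastErr = r.inner.Start(ctx); lastErr != nil {
			// Not done yet: swallow the error so the backoff loop retries.
			return false, nil
		}
		// Start returned cleanly, stop retrying.
		return true, nil
	})
	if err != nil {
		// Retry budget exhausted; surface the last error from Start.
		return lastErr
	}
	return nil
}
```

The drawback, as discussed above, is that every downstream project has to write this wrapper itself for each runnable, rather than configuring the behavior once on the manager.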
force-pushed from 851eccd to 6104588
So the scenario you are describing is if the manager has a conversion webhook for a version the cache wants to read (because some controller uses it)? Shouldn't we simply start conversion webhooks first to fix that?
Not just conversion, this applies to all webhooks. Yes, we start webhooks first today. Consider this scenario:
The problem becomes even more complicated if the two controllers watch each other's CRDs, in which case the two controllers never stay up and running long enough to allow a proper cache start. We've seen this problem in Cluster API during a version upgrade, which led us here. Happy to go over the use case a bit more if that helps.
lgtm for me
I don't think so, because mutating and validating webhooks only cover mutating calls, hence they cannot be in the path of a read-only request.
This should result in minimal delay, no? Establishing a CRD takes, what, two seconds?
By "each others CRD" you mean a crd version that is provided through a conversion webhook in the other controller? Don't we give the cache two minutes to get up and ready? |
Fair, the error we've seen was mostly related to conversion: kubernetes-sigs/cluster-api#5327
It really depends; api-server load is definitely something that can affect it, but from what I've been able to test, changing the webhook service, for example, can take a bit.
No, in this case I mean two managers controlling two independent sets of CRDs but also watching each other's CRDs to act on changes.
force-pushed from 6104588 to b627bce
Signed-off-by: Vince Prignano <vincepri@vmware.com>
force-pushed from b627bce to 4af39e6
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alvaroaleman, vincepri
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment.
/cherrypick release-0.10
@vincepri: new pull request created: #1682
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
This changeset adds the ability for a Manager to not fail immediately if
a wait.Backoff parameter is given as RunnableRetryBackoff in Options.
Currently, if a runnable fails to run, the Start operation is never
retried, which can cause the manager and all webhooks to stop and the
deployment to go into CrashLoopBackOff. Given the eventual consistency
of controllers and managers cooperating with other controllers or the
api-server, allow some backoff by trying to start runnables a number of
times before giving up.
Signed-off-by: Vince Prignano <vincepri@vmware.com>
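For illustration, wiring up the option this commit message describes might look like the sketch below. The RunnableRetryBackoff field is the one proposed by this changeset and is not part of controller-runtime's released Options; its exact shape here (a plain wait.Backoff value) is an assumption based on the wording above.

```go
package main

import (
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
	ctrl "sigs.k8s.io/controller-runtime"
)

func main() {
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		// RunnableRetryBackoff is the field proposed by this changeset
		// (assumed shape): when set, failed runnables are restarted with
		// backoff instead of crashing the manager on the first error.
		RunnableRetryBackoff: wait.Backoff{
			Duration: 1 * time.Second, // initial delay between attempts
			Factor:   2.0,             // double the delay each attempt
			Jitter:   0.1,             // add up to 10% random jitter
			Steps:    5,               // give up (and crash) after 5 attempts
		},
	})
	if err != nil {
		panic(err)
	}
	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		panic(err)
	}
}
```

With a bounded number of steps, the behavior described in the discussion above is preserved: transient failures (such as CRDs still being established) are retried, while persistent errors still exhaust the backoff and crash the manager.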