Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explicitly Queue a Reconcile Request if a Shared Provider has Expired #241

Merged
merged 1 commit into from
Aug 1, 2023

Conversation

ulucinar
Copy link
Collaborator

@ulucinar ulucinar commented Jul 27, 2023

Description of your changes

Fixes #192

We currently return an error if a terraform.SharedProvider has already expired and an external client requests service from the expired shared provider. This results in the MR being reconciled by the external client to acquire a Synced: False status condition. This is a temporary situation that's resolved once the expired shared provider is drained and the terraform.SharedProviderScheduler replaces it.

However, until the shared provider is replaced, the MRs entering into the Synced: False state become confusing. Before the handler.EventHandler implementation from #231, we had to return an error to the managed reconciler to requeue a timely reconciliation request. However, we can now explicitly requeue an exponentially backed-off reconciliation request so that the external client will retry to get service from the provider in a timely manner without having to wait for the poll period (default 10 min for the official providers) and without forcing the MR falling into the Synced: False state.

The external client will retry to schedule a shared runner up to 20 times without erroring, exponentially backing-off after each failure. And after this initial 20 retries, it will report the error to the managed reconciler and keep trying.

I have:

  • Read and followed Crossplane's contribution process.
  • Run make reviewable to ensure this PR is ready for review.
  • Added backport release-x.y labels to auto-backport this PR if necessary.

How has this code been tested

Tested via crossplane-contrib/provider-upjet-aws#805.

Copy link
Member

@sergenyalcin sergenyalcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ulucinar LGTM. I left a nit comment for further discussions

Comment on lines +184 to +188
return managed.ExternalObservation{
ResourceExists: true,
ResourceUpToDate: true,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I do not have a strong opinion, but I think the case of a resource that has never been scheduled and that of a resource that has been scheduled and reconciled several times (at least within the application) can be evaluated differently.

So, we may consider returning an error for a resource that has never been scheduled.

Copy link
Collaborator Author

@ulucinar ulucinar Jul 28, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @sergenyalcin for considering this. I agree with your point but I also feel a bit uncomfortable with continuing to produce unsync'ed MRs even for certain cases since this change is a UX change.

As we have discussed, I also feel uncomfortable for returning a direct success from the external client's Create function to the managed reconciler but the situation is similar to what we do in the async mode of operation, i.e., we already return an immediate success by just starting a goroutine that may very well fail to initiate a creation request for the external resource.

I propose we continue with suppressing the unsync'ed state of MRs and if folks complain about prolonging wait times than we already have a solution for this: If you don't want to wait, please give more memory to the native provider and increase the provider-ttl command-line parameter. Btw, this PR is not expected to prolong any wait times on its own. No change in the wait times when a runner has expired, and we should just be preventing the MRs from being unsync'ed when the runner has expired and new scheduling requests cannot be admitted until the old one is drained and replaced. (One thing regarding these statements is the max delay configured for the rate limiter, if it's higher than the managed reconciler's, then wait times may change, or any other aspect of the rate limiter).

Maybe we will just end up implementing a scheduler that will not deny requests coming to an expired runner but will fork a new one as needed at the cost of running multiple runners in parallel and increasing the memory footprint. Let's try the approach implemented here first as we already have the provider-ttl parameter.

…ng errors

- The external client will requeue at most 20 times before reporting an error

Signed-off-by: Alper Rifat Ulucinar <ulucinar@users.noreply.github.com>
Copy link
Member

@sergenyalcin sergenyalcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @ulucinar LGTM!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Confusing "budget exceeded (ttl)" message in shared schdeuler
2 participants