Add retry loop for client.get of replicaset as that sometimes fails #1072

Merged
pavolloffay merged 6 commits into open-telemetry:main from flaky-tests-fix-1 on Sep 5, 2022

Conversation

kevinearls
Member

@kevinearls kevinearls commented Sep 2, 2022

Signed-off-by: Kevin Earls <kearls@redhat.com>

This fixes some of the test failures we occasionally see in CI, as described in #959.

There are others which I believe are caused by acquiring a leader lease taking more than 2.5 minutes. I will add more information about that to #959.

Resolves #959

Signed-off-by: Kevin Earls <kearls@redhat.com>
@kevinearls kevinearls requested a review from a team September 2, 2022 12:40
}

// use a retry loop to get the Deployment. A single call to client.get fails occasionally
err := retry.OnError(backOff, checkError, getReplicaSet)
Member

Should we use the same retry approach for other objects as well?

Member Author

We could. Do you have specific places where you'd want to do this? So far this is the only place I've seen failures.

Member

Oh right, this is the only place that uses the k8s client in this function. I don't know of any other place.

Name: owner.Name,
}, &rs)
nsn := types.NamespacedName{Namespace: ns.Name, Name: owner.Name}
backOff := wait.Backoff{Duration: 10 * time.Millisecond, Factor: 1.5, Jitter: 0.1, Steps: 20, Cap: 30 * time.Second} // TODO decide which of these we need and what they should be set to
Member

What is the pod webhook timeout? A cap of 30 seconds seems too much, TBH.

Could we extract the retry functionality to a separate function? It will be useful if we want to reuse it for other objects.

Member Author

Member

A max wait of 10 seconds is too much. Could we use e.g. 2-3 seconds?


checkError := func(err error) bool {
// if the error looks like 'ReplicaSet.apps "my-deployment-with-sidecar-f46b479f" not found' ignore it
if strings.HasPrefix(err.Error(), "ReplicaSet.apps") && strings.HasSuffix(err.Error(), "not found") {
Member

Isn't there a better way to compare the error? e.g. apierrors.IsNotFound(err)

Member Author

Let me try that. It may take a while to make sure I get an error...

Signed-off-by: Kevin Earls <kearls@redhat.com>
Signed-off-by: Kevin Earls <kearls@redhat.com>
Signed-off-by: Kevin Earls <kearls@redhat.com>
@kevinearls
Member Author

@pavolloffay if you're still around...

@pavolloffay
Member

@kevinearls the PR looks good to me! :) But the cap of 10s seems too high. The webhook timeout is 10s, so retrying for 10s and blocking pod creation does not feel right. Could we start with e.g. 2-3s and see if that works?

Signed-off-by: Kevin Earls <kearls@redhat.com>
@pavolloffay pavolloffay merged commit 4655c15 into open-telemetry:main Sep 5, 2022
@kevinearls kevinearls deleted the flaky-tests-fix-1 branch September 5, 2022 08:16
ItielOlenick pushed a commit to ItielOlenick/opentelemetry-operator that referenced this pull request May 1, 2024
…pen-telemetry#1072)

* Add retry loop for client.get of replicaset as that sometimes fails

Signed-off-by: Kevin Earls <kearls@redhat.com>

* Reduce total timeout, remove TODO

Signed-off-by: Kevin Earls <kearls@redhat.com>

* Use apierrors to check error

Signed-off-by: Kevin Earls <kearls@redhat.com>

* Appease the linter

Signed-off-by: Kevin Earls <kearls@redhat.com>

* Lower maximum wait time to 2 seconds

Signed-off-by: Kevin Earls <kearls@redhat.com>

Signed-off-by: Kevin Earls <kearls@redhat.com>
Development

Successfully merging this pull request may close these issues.

Fix flaky auto-instrumentation multi-container test
2 participants