-
Notifications
You must be signed in to change notification settings - Fork 218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ktesting: tone down warning about leaked test goroutine #401
Conversation
This issue is currently awaiting triage. If klog contributors determine this is a relevant issue, they will accept it by applying the The Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
Let's fix the linter issues first, see #402 |
logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false))) | ||
l.shared.goroutineWarningDone = true | ||
duration := time.Since(l.shared.stopTime) | ||
if duration > stopGracePeriod { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems will bury the real goroutine leakage if the duration is less than stopGracePeriod, do I understand right here?
Another point I can't get here is if the test completes, why will it wait for another 10s?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the "leaked" goroutine stops after less than stopGracePeriod
, then it wasn't really a leak. It was "just" a test that didn't block and wait for goroutines at the end. This is very common.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Totally misunderstood the meaning here. I'm not quite understand how it works but I still have a silly question like: Will we wait and detect the goroutine leakage after test completes? Or why the duration from stop will be greater than stopGracePeriod
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, we don't wait. That is something that uber/goleak does, but not this code. This code is merely reporting on leaked goroutines if and when it detects them, which is when they keep logging.
The stopGracePeriod
is the grace period that goroutines have before their log output triggers this warning. I can rename it if you have a better suggestion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recalled goleak has maxRetries, then can't we just configure the goleak? It's configurable now. Both the retry times and sleep duration. I think somehow the same as stopGracePeriod
. Will this lead to other problems?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
keep the warning for goroutines that are more likely to be real leaks while not printing it for goroutines which terminate quickly after test completion.
Definitely agree.
We cannot depend on goleak here
Yes for klog... I mean for callers, like test framework. I suggested in this way is just because the goroutine leakage detection duration is mostly decided by goleak, the maxRetries(default to 20), maxSleep(default to 100ms) specifically, the duration ~= maxRetries * (maxSleep + filterStacks)
, have no idea whether 10s fits here. However, the feature is reverted... uber-go/goleak#116, so no way to configure now.
As a user of klog, happy with either way to avoid confusion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Definitely agree.
So LGTM?
I mean for callers, like test framework.
Leak detection with goleak only works when tests run sequentially or in TestMain. I'm not sure how useful it is in an E2E test suite, because the actual code under test runs elsewhere. For integration and unit tests, individual test owners have to integrate goleak. Some apiserver-based tests have it.
have no idea whether 10s fits here
So this is about how klog will interact with goleak. I don't have a strong opinion here. I picked 10 seconds because it should be long enough to no produce "false positives" for the "I am shutting down after test completion" goroutines and short enough to give feedback for others. Because it's not causing a test failure, it can be lower than the goleak timeouts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
are these 10s added to the test duration? 10s sounds like a lot to me
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, they are not added. At the time that this gets called, the test has already completed.
/assign @dims For approval. A new release for Kubernetes 1.31 also makes sense. |
Hmm, we want a few more things for a new release, like #404 |
/retest Does that work for GitHub actions? The linter wasn't happy, for some reason. |
Linter failures fixed in main. |
@@ -250,8 +266,12 @@ func (l tlogger) fallbackLogger() logr.Logger { | |||
logger = logger.WithCallDepth(l.shared.callDepth + 1) | |||
|
|||
if !l.shared.goroutineWarningDone { | |||
logger.WithCallDepth(1).Error(nil, "WARNING: test kept at least one goroutine running after test completion", "callstack", string(dbg.Stacks(false))) | |||
l.shared.goroutineWarningDone = true | |||
duration := time.Since(l.shared.stopTime) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does this access need to be locked?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Good catch, thanks.
/lgtm cancel
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrrm, no. fallbackLogger
gets called while the mutex is locked. This wasn't clear here, so I've added a comment explaining that (https://github.com/kubernetes/klog/compare/83145e470293dc67bede46353012cededd97e031..ed62027e604261af3e9c42c7a50092a0db074f19).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@aojea: Okay for another LGTM?
83145e4
to
ed62027
Compare
/hold For second approval. |
Not terminating immediately is fairly normal: many controllers log "terminating" messages while they shut down, which often is right after test completion, if that is when the test cancels the context and then doesn't wait for goroutines (which is often not possible). Therefore the warning now only gets printed if that happens after more than 10 seconds since test completion, and then the warning is only printed as an info message, not an error.
ed62027
to
cd24012
Compare
/lgtm |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: aojea, dims, kerthcet The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/hold cancel |
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [k8s.io/klog/v2](https://github.com/kubernetes/klog) | `v2.120.1` -> `v2.130.0` | [![age](https://developer.mend.io/api/mc/badges/age/go/k8s.io%2fklog%2fv2/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/k8s.io%2fklog%2fv2/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/k8s.io%2fklog%2fv2/v2.120.1/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/k8s.io%2fklog%2fv2/v2.120.1/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>kubernetes/klog (k8s.io/klog/v2)</summary> ### [`v2.130.0`](https://github.com/kubernetes/klog/releases/tag/v2.130.0): Prepare klog release for Kubernetes v1.31 (Take 1) [Compare Source](https://github.com/kubernetes/klog/compare/v2.120.1...v2.130.0) #### What's Changed - chore(\*): fix spelling of Intel Corporation by [@​jsoref](https://github.com/jsoref) in [https://github.com/kubernetes/klog/pull/399](https://github.com/kubernetes/klog/pull/399) - build: fix some linter warnings by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/402](https://github.com/kubernetes/klog/pull/402) - examples: fix linter warning by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/406](https://github.com/kubernetes/klog/pull/406) - Do not acquire lock for file.Sync() fsync call by [@​1978629634](https://github.com/1978629634) in [https://github.com/kubernetes/klog/pull/404](https://github.com/kubernetes/klog/pull/404) - ktesting: tone down warning about leaked test goroutine by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/401](https://github.com/kubernetes/klog/pull/401) #### New Contributors - [@​jsoref](https://github.com/jsoref) made their first contribution in [https://github.com/kubernetes/klog/pull/399](https://github.com/kubernetes/klog/pull/399) - [@​1978629634](https://github.com/1978629634) made their first contribution in [https://github.com/kubernetes/klog/pull/404](https://github.com/kubernetes/klog/pull/404) **Full Changelog**: kubernetes/klog@v2.120.1...v2.130.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/anza-labs/lke-operator). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4zOTMuMCIsInVwZGF0ZWRJblZlciI6IjM3LjM5My4wIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJhcmVhL2RlcGVuZGVuY3kiXX0=--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [k8s.io/klog/v2](https://github.com/kubernetes/klog) | `v2.120.1` -> `v2.130.0` | [![age](https://developer.mend.io/api/mc/badges/age/go/k8s.io%2fklog%2fv2/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/k8s.io%2fklog%2fv2/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/k8s.io%2fklog%2fv2/v2.120.1/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/k8s.io%2fklog%2fv2/v2.120.1/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>kubernetes/klog (k8s.io/klog/v2)</summary> ### [`v2.130.0`](https://github.com/kubernetes/klog/releases/tag/v2.130.0): Prepare klog release for Kubernetes v1.31 (Take 1) [Compare Source](https://github.com/kubernetes/klog/compare/v2.120.1...v2.130.0) #### What's Changed - chore(\*): fix spelling of Intel Corporation by [@​jsoref](https://github.com/jsoref) in [https://github.com/kubernetes/klog/pull/399](https://github.com/kubernetes/klog/pull/399) - build: fix some linter warnings by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/402](https://github.com/kubernetes/klog/pull/402) - examples: fix linter warning by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/406](https://github.com/kubernetes/klog/pull/406) - Do not acquire lock for file.Sync() fsync call by [@​1978629634](https://github.com/1978629634) in [https://github.com/kubernetes/klog/pull/404](https://github.com/kubernetes/klog/pull/404) - ktesting: tone down warning about leaked test goroutine by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/401](https://github.com/kubernetes/klog/pull/401) #### New Contributors - [@​jsoref](https://github.com/jsoref) made their first contribution in [https://github.com/kubernetes/klog/pull/399](https://github.com/kubernetes/klog/pull/399) - [@​1978629634](https://github.com/1978629634) made their first contribution in [https://github.com/kubernetes/klog/pull/404](https://github.com/kubernetes/klog/pull/404) **Full Changelog**: kubernetes/klog@v2.120.1...v2.130.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/open-telemetry/opentelemetry-go-contrib). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40MTAuMSIsInVwZGF0ZWRJblZlciI6IjM3LjQxMC4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJTa2lwIENoYW5nZWxvZyIsImRlcGVuZGVuY2llcyJdfQ==--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [k8s.io/klog/v2](https://github.com/kubernetes/klog) | `v2.120.1` -> `v2.130.0` | [![age](https://developer.mend.io/api/mc/badges/age/go/k8s.io%2fklog%2fv2/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/go/k8s.io%2fklog%2fv2/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/go/k8s.io%2fklog%2fv2/v2.120.1/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/go/k8s.io%2fklog%2fv2/v2.120.1/v2.130.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- > [!WARNING] > Some dependencies could not be looked up. Check the Dependency Dashboard for more information. --- ### Release Notes <details> <summary>kubernetes/klog (k8s.io/klog/v2)</summary> ### [`v2.130.0`](https://github.com/kubernetes/klog/releases/tag/v2.130.0): Prepare klog release for Kubernetes v1.31 (Take 1) [Compare Source](https://github.com/kubernetes/klog/compare/v2.120.1...v2.130.0) #### What's Changed - chore(\*): fix spelling of Intel Corporation by [@​jsoref](https://github.com/jsoref) in [https://github.com/kubernetes/klog/pull/399](https://github.com/kubernetes/klog/pull/399) - build: fix some linter warnings by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/402](https://github.com/kubernetes/klog/pull/402) - examples: fix linter warning by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/406](https://github.com/kubernetes/klog/pull/406) - Do not acquire lock for file.Sync() fsync call by [@​1978629634](https://github.com/1978629634) in [https://github.com/kubernetes/klog/pull/404](https://github.com/kubernetes/klog/pull/404) - ktesting: tone down warning about leaked test goroutine by [@​pohly](https://github.com/pohly) in [https://github.com/kubernetes/klog/pull/401](https://github.com/kubernetes/klog/pull/401) #### New Contributors - [@​jsoref](https://github.com/jsoref) made their first contribution in [https://github.com/kubernetes/klog/pull/399](https://github.com/kubernetes/klog/pull/399) - [@​1978629634](https://github.com/1978629634) made their first contribution in [https://github.com/kubernetes/klog/pull/404](https://github.com/kubernetes/klog/pull/404) **Full Changelog**: kubernetes/klog@v2.120.1...v2.130.0 </details> --- ### Configuration 📅 **Schedule**: Branch creation - "on tuesday" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/open-telemetry/opentelemetry-collector-contrib). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy40MTAuMSIsInVwZGF0ZWRJblZlciI6IjM3LjQxMC4xIiwidGFyZ2V0QnJhbmNoIjoibWFpbiIsImxhYmVscyI6WyJkZXBlbmRlbmNpZXMiLCJyZW5vdmF0ZWJvdCJdfQ==--> --------- Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com> Co-authored-by: opentelemetrybot <107717825+opentelemetrybot@users.noreply.github.com>
What this PR does / why we need it:
Not terminating immediately is fairly normal: many controllers log "terminating" messages while they shut down, which often is right after test completion, if that is when the test cancels the context and then doesn't wait for goroutines (which is often not possible).
Therefore the warning now only gets printed if that happens after more than 10 seconds since test completion, and then the warning is only printed as an info message, not an error.
Release note:
/cc @aojea @soltysh @kerthcet