-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
tracing,testutils: detect span leaks #58902
tracing,testutils: detect span leaks #58902
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/sql/txn_state.go, line 161 at r2 (raw file):
// // XXX: We'll need do address the above with always-on tracing; we're // leaking spans.
One thing we could do is finish all the spans registered with the tracer on server shutdown. But... that's exactly destroying any way to check that we're not leaking them.
@asubiotto can you chime in here and suggest/make a change? We need reliable semantics about span finishing in all places, or the registry for long-running operations won't be very useful. My naive assumption is that a txnState only exists in the context of a conn executor, so if that had a close method that is called reliably (which should be possible) that would take care of the issue.
@irfansharif to get this PR merged, you could introduce a (temporary) tracing.WithBypassRegistry()
option to StartSpan{,Ctx}
that would simply avoid putting the Span
in the registry, and use it here. Then we can file an issue against SQL and move on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @knz)
pkg/sql/txn_state.go, line 161 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
One thing we could do is finish all the spans registered with the tracer on server shutdown. But... that's exactly destroying any way to check that we're not leaking them.
@asubiotto can you chime in here and suggest/make a change? We need reliable semantics about span finishing in all places, or the registry for long-running operations won't be very useful. My naive assumption is that a txnState only exists in the context of a conn executor, so if that had a close method that is called reliably (which should be possible) that would take care of the issue.
@irfansharif to get this PR merged, you could introduce a (temporary)
tracing.WithBypassRegistry()
option toStartSpan{,Ctx}
that would simply avoid putting theSpan
in the registry, and use it here. Then we can file an issue against SQL and move on.
I agree with adding a temporary way to bypass adding spans to the registry (maybe with a more explicit you-shouldn't-do-this indicator in the name) to get this PR merged.
With respect to the specific problem of leaking spans on server shutdown I'm assuming this is specifically about txns that don't respond to a cancellation request, otherwise I would expect finishSQLTxn
to be called. If not, we'd probably need to do something in the connExecutor
's close method, although I'm not sure it gets called in these cases. It's been a while since I took a look at the code, but I think the best thing is to add this bypass and create an issue with a reproduction so that we can tackle it on the SQL exec team.
e2787bc
to
ef8ff5a
Compare
I'll wait for the discussion on #58897 before landing this. I added the |
...in the import processor. The contract of `RowSource.Next` requires that at most one of the return values will be non-empty. This wasn't the case here, and caused opaque failures in cockroachdb#58897. cockroachdb#58897 tries to enable background tracing by default, which for us means that the trailing meta can be non-empty (they'll contain span recordings). That behaviour ends up tickling this bug, tripping up TestCSVImportCanBeResumed. Release note: None
ef8ff5a
to
a01d118
Compare
This is follow-up work from cockroachdb#58712, where we measured the overhead for always-on tracing and found it to be minimal/acceptable. Lets switch this on by default to shake the implications of doing so. We can reasonably expect two kinds of fallout: (1) Unexpected blow up in memory usage due to resource leakage (which is a can be problem now that we're always maintaining open spans in an internal registry, see cockroachdb#58721) (2) Performance degradataion due to tracing overhead per-request (something cockroachdb#58712) was spot checking for. For (1) we'll introduce a future test in a separate PR. For (2), we'll monitor roachperf over the next few weeks. --- Also moved some of the documentation for the cluster setting into a comment form above. Looking at what's rendered in our other cluster settings (`SHOW ALL CLUSTER SETTINGS`), we default to a very pithy, unwrapped description. Release note: None
6ee172d
to
bc7ba4d
Compare
With always-on tracing, we're maintaining in-memory registry of active spans (cockroachdb#58490). Spans are added and removed from this registry when they're `Start()`-ed and `Finish()`-ed. Spans that not explicitly finished (typically using `defer sp.Finish()`) are now a resource-leak, as they take up room in the registry. We'll want to find instances of this leak as soon as they crop up[*]. To that end we add a check in our `TestCluster.Stop` codepaths that asserts against the registry being empty. This should give us wide coverage given it's usage throughout. We expect this change to capture the cases described in cockroachdb#58721. --- [*] For known issues like `txnState.resetForNewSQLTxn`, we'll temporarily introduce a `WithBypassRegistry` option. Fixing that doesn't look simple, it ties into questions around draining SQL and rolling back open txns. Release note: None
bc7ba4d
to
b8b888f
Compare
Given how difficult CI has been recently, I'm just going to pile on and optimistically bors it. The second commit is from #58897. So either that fails and this goes through, or vice versa, or the commits get de-duped away. bors r+ |
Build succeeded: |
With always-on tracing, we're maintaining in-memory registry of active
spans (#58490). Spans are added and removed from this registry when
they're Start()-ed and Finish()-ed. Spans that not explicitly finished
(typically using
defer sp.Finish()
) are now a resource-leak, as theytake up room in the registry. We'll want to find instances of this leak
as soon as they crop up.
To that end we add a check in our TestCluster.Stop codepaths that
asserts against the registry being empty. This should give us wide
coverage given it's usage throughout. We expect this change to capture
the cases described in #58721.
This check is currently failing, given we're actually leaking spans in
txnState.resetForNewSQLTxn
. Fixing that doesn't look simple, it tiesinto questions around draining SQL and rolling back open txns.
Release note: None