[dnm] tracing: investigate perf regression under sql.txn_stats.sample_rate=1 #61328
Conversation
Reviewed 4 of 4 files at r8, 5 of 6 files at r9, 5 of 5 files at r10, 5 of 5 files at r11.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/sql/instrumentation.go, line 181 at r11 (raw file):
```go
// If we need to collect stats, create a non-verbose child span. Stats
// will be added as structured metadata and processed in Finish.
sp := tracing.SpanFromContext(ctx)
```
This is not quite an equivalent change. In `connExecutor.execCmd` we create a span for every statement, so `sp` is always non-nil, but with the current change we tell the `instrumentationHelper` not to "finish", which means that it will not process the execution stats attached to the trace. I prototyped the proper usage of that already-present span in yuzefovich@58ab90b, but it doesn't seem to change the microbenchmark numbers.
pkg/sql/colflow/stats.go, line 257 at r9 (raw file):
```go
// is not the way they are supposed to be used, so this should be fixed.
if false {
	_, span := tracing.ChildSpan(ctx, opName)
```
After addressing the span creation in #61380, the performance gains of this commit in the benchmark went away. What this change essentially does is drop all of the execution statistics we collected on the floor, so there is less information to propagate via the trace.
Updated the PR on top of your recent work, @yuzefovich. @irfansharif, as we discussed yesterday, mind taking a look here with me? The branch should be relatively clean now. I added a microbenchmark that tries to capture exactly the overhead that would be incurred from setting the sampling rate from zero to one. It incurs a cost, but I don't see anything that can explain the slowdown I see via the macro-level benchmark. @yuzefovich, could you scrutinize the microbenchmark? Is there any significant work done as a result of propagating component stats that it doesn't capture?
Fixes cockroachdb#59308.

This commit adds support for range merge transactions to refresh. This is necessary to allow the merge transaction to have its write timestamp be bumped and still commit without retrying. In such cases, the transaction must refresh its reads, including its original read on the RHS's local range descriptor. If we were to block this refresh on the frozen RHS range, the merge would deadlock.

On the surface, it seems unsafe to permit Refresh requests on an already subsumed RHS range, because the refresh's effect on the timestamp cache will never make it to the LHS leaseholder. This risks the future joint range serving a write that invalidates the Refresh. However, in this specific situation, we can be sure that such a serializability violation will not occur because the Range merge also writes to (deletes) this key. This means that if the Range merge transaction commits, its intent on the key will be resolved to the timestamp of the refresh and no future write will ever be able to violate the refresh. Conversely, if the Range merge transaction does not commit, then the merge will fail and the update to the RHS's timestamp cache will not be lost (not that this particularly matters in cases of aborted transactions).

The same line of reasoning as the one above has motivated us to explore removing keys from a transaction's refresh spans when they are written to by the transaction, as the intents written by the transaction act as a form of pessimistic lock that obviate the need for the optimistic refresh. Such an improvement would eliminate the need for this special case, but until we generalize the mechanism to prune refresh spans based on intent spans, we're forced to live with this.

See cockroachdb#59308 (comment) for why the original fix, which attempted to manually refresh the range merge transaction before it entered its critical phase, was not sufficient.

Release justification: needed for new functionality.
This commit adds support for a stream client that can connect to a cluster and start replicating changes that happen on that stream. Release note: None Release justification: low-risk, high-benefit change
Previously, creating interleaved partitioned indexes panicked. This commit disallows their creation to prevent panicking. Fixes cockroachdb#60699 Release justification: This is a low risk change that prevents panics when attempting to create interleaved partitioned tables. Release note (bug fix): Creating interleaved partitioned indexes is now disallowed. Previously, the database would crash when trying to create one.
This commit removes the redundant context argument from `trailingMetaCallback` in favor of just using `ProcessorBase.Ctx` when needed. This was already effectively the case, since the callback was always called with `ProcessorBase.Ctx`.

We also had the wrong sequence of actions when collecting trailing metadata in the inverted joiner, join reader, and zigzag joiner, where we first called `InternalClose` and then generated the metadata. That is wrong because `InternalClose` replaces the context used throughout the operation of a processor with the "original" context (the one passed into `StartInternal`). This is now fixed.

Additionally, a few unnecessary method overrides are removed and `ProcessorBase` now provides the default implementation of `ConsumerClosed` (which simply calls `InternalClose`). This commit also makes a slight adjustment to the context management of the change aggregator.

Release justification: low-risk cleanup.

Release note: None
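A schematic of the ordering fix described above, for illustration only: the type and method bodies here are stand-ins, with only the method names (`InternalClose`, `ConsumerClosed`, `Ctx`) taken from the commit message.

```go
package example

import "context"

// proc is a stand-in for a processor embedding ProcessorBase; only the
// pieces relevant to the ordering are sketched.
type proc struct {
	Ctx     context.Context // ProcessorBase.Ctx
	origCtx context.Context // the context passed to StartInternal
}

// InternalClose restores the original context, as described above.
func (p *proc) InternalClose() { p.Ctx = p.origCtx }

// trailingMeta drains stats/metadata using p.Ctx (elided here).
func (p *proc) trailingMeta() []string { return nil }

// ConsumerClosed, with the fixed ordering: generate the trailing metadata
// while p.Ctx is still the processor's own context, and only then call
// InternalClose. The old, buggy order did these two steps the other way
// around, so the metadata was generated against the wrong context.
func (p *proc) ConsumerClosed() {
	_ = p.trailingMeta()
	p.InternalClose()
}
```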
As of 1982047, this log line is formatted with Go's templating system, which doesn't require that % is escaped.

Before:
```
I210303 09:56:17.459802 311 2@server/status/runtime.go:553 ⋮ [n1] 76 runtime stats: 106 MiB RSS, 261 goroutines (stacks: 5.0 MiB), 29 MiB/63 MiB Go alloc/total (heap fragmentation: 6.8 MiB, heap reserved: 11 MiB, heap released: 12 MiB), 7.6 MiB/13 MiB CGO alloc/total (0.6 CGO/sec), 1.8/1.2 %%(u/s)time, 0.0 %%gc (0x ), 17 KiB/22 KiB (r/w)net
```

After:
```
I210303 09:59:17.607021 59 2@server/status/runtime.go:553 ⋮ [n1] 71 runtime stats: 92 MiB RSS, 255 goroutines (stacks: 3.7 MiB), 27 MiB/48 MiB Go alloc/total (heap fragmentation: 4.8 MiB, heap reserved: 1.3 MiB, heap released: 27 MiB), 10 MiB/15 MiB CGO alloc/total (0.0 CGO/sec), 0.0/0.0 %(u/s)time, 0.0 %gc (0x), 30 KiB/34 KiB (r/w)net
```

Release justification: Negligible risk.

Release note: None
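For illustration, a standalone example of the escaping difference (not the `runtime.go` code itself): `fmt`-style formatting treats `%` as a directive marker and needs `%%` for a literal percent sign, while `text/template` does not.

```go
package main

import (
	"fmt"
	"os"
	"text/template"
)

func main() {
	// printf-style formatting: '%' is special, so a literal percent
	// sign has to be written as "%%".
	fmt.Printf("%.1f %%gc\n", 0.0) // prints: 0.0 %gc

	// text/template: '%' has no special meaning, so no escaping is needed.
	t := template.Must(template.New("runtime").Parse("{{.GC}} %gc\n"))
	_ = t.Execute(os.Stdout, struct{ GC float64 }{0.0}) // prints: 0 %gc
}
```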
Fix a goroutine leak caused by gRPC streams. It turns out that receiving an error from an RPC's stream is not enough to clean up the stream; you need to do something else, as described in https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream. This patch starts canceling the stream's context whenever the stream is discarded.

Release note: None

Release justification: Bug fix for new functionality.
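A minimal sketch of the cleanup pattern the patch adopts, assuming a generic receive-only stream; the `recvStream` interface and the `open` and `handle` callbacks are illustrative placeholders, not the actual stream-client code.

```go
package example

import "context"

// recvStream is a stand-in for a grpc-go generated client stream type.
type recvStream interface {
	Recv() (interface{}, error)
}

// consumeStream shows the pattern: the stream is bound to its own
// cancelable context, and that context is canceled whenever the stream is
// discarded, so its resources are released even if we stop using it
// before fully draining it.
func consumeStream(
	parent context.Context,
	open func(context.Context) (recvStream, error), // e.g. a generated client method
	handle func(interface{}) error, // hypothetical message handler
) error {
	ctx, cancel := context.WithCancel(parent)
	defer cancel() // cancel the stream's context when we are done with it

	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		msg, err := stream.Recv()
		if err != nil {
			return err // the deferred cancel cleans up the stream
		}
		if err := handle(msg); err != nil {
			return err // abandoning the stream early: cancel releases it
		}
	}
}
```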
Previously, in order to propagate execution statistics we were creating temporary tracing spans, setting the stats on them, and finishing the spans right away. This allowed for using (to be more precise, abusing) the existing infrastructure. The root of the problem is that in the vectorized engine we do not start a per-operator span even if stats collection is enabled at the moment, so we had to get around that limitation. However, this is not how tracing spans are intended to be used and it creates some performance slowdown in the hot path, so this commit refactors the situation.

Now we ensure that there is always a tracing span available at the "root" components (either a root materializer or an outbox), so when the root components finish the vectorized stats collectors for their subtree of operators, there is a span to record the stats into. This required the following minor adjustments:
- in the materializer, we now delegate attachment of the stats to the tracing span to the drainHelper (which does so on `ConsumerDone`). Note that the drainHelper doesn't get the recording from the span and leaves that to the materializer (this is needed in order to avoid collecting duplicate trace data).
- in the outbox, we now start a "remote child span" (if there is a span in the parent context) at the beginning of the `Run` method, and we attach the stats in `sendMetadata`.

Release justification: low-risk update to existing functionality.

Release note: None
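Schematically, the change amounts to something like the following sketch. It is not the actual materializer/outbox code: `recordStats` stands in for "attach the stats to the span as structured metadata", and only the `tracing` calls are real API names.

```go
package example

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// recordStats is a placeholder for attaching the collected stats to the
// span as structured metadata.
func recordStats(sp *tracing.Span, stats interface{}) {}

// Before: a temporary span existed only to carry the stats and was
// finished immediately, in the hot path.
func propagateStatsBefore(ctx context.Context, stats interface{}) {
	_, sp := tracing.ChildSpan(ctx, "flow-stats")
	recordStats(sp, stats)
	sp.Finish()
}

// After: the root component (materializer or outbox) guarantees that a
// span is already open, so the stats collected for its subtree of
// operators are recorded into that span and no throwaway span is created.
func propagateStatsAfter(ctx context.Context, stats interface{}) {
	if sp := tracing.SpanFromContext(ctx); sp != nil {
		recordStats(sp, stats)
	}
}
```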
One could run the relevant benchmark with `make bench PKG=./pkg/sql/tests BENCHES=BenchmarkKV//SQL`. Release note: None
Release justification: low-risk update to new functionality. Release note: None
Release justification: low-risk change and prerequisite for metrics improvement. Release note: None
Release justification: low-risk, high benefit change to existing functionality Release note (ops change): Added the `trace.spans_created.count` and `trace.registry.size` metrics that account for the number of created trace spans and tracked local root spans, respectively.
```
$ ./cmp.sh
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16  267µs ± 9%   313µs ± 8%   +17.32%  (p=0.000 n=22+22)
```

Release justification:

Release note (<category, see below>): <what> <show> <why>
```
$ go test -run - -bench SpanStats -count 1 ./pkg/util/tracing
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/util/tracing
BenchmarkSpanStats/autoCollection=false,stats=0-16    280952    4326 ns/op
BenchmarkSpanStats/autoCollection=false,stats=1-16    158040    7901 ns/op
BenchmarkSpanStats/autoCollection=true,stats=0-16     471117    2631 ns/op
BenchmarkSpanStats/autoCollection=true,stats=1-16     242348    5919 ns/op
PASS
ok      github.com/cockroachdb/cockroach/pkg/util/tracing    8.414s
```

The case `autoCollection=false,stats=1` should roughly reflect the SQL usage in which stats are created, transported via a recording, ingested into a parent, and then retrieved and unmarshaled. The case `false,stats=0` should correspond to the non-sampled case. We are thus expecting an overhead of roughly 4.3µs/op when not sampled and 8µs/op when sampled, i.e. changing the sampling rate from 0 to 1 should add ~4µs/op. However, the macro-level benchmark shows much more:

```
KV/Scan/SQL/rows=1/sample_rate=X-16  268µs ± 9%  314µs ± 7%  +17.03%  (p=0.000 n=21+21)
```

That's a slowdown of 46µs incurred by changing the sampling rate from zero to one, far beyond what can be explained by this microbenchmark.

The autoCollection=true numbers avoid an extra unmarshal-remarshal cycle of the recording. This is a possible optimization for SQL to investigate in the case in which the flow lives wholly on the gateway. Unfortunately, it is finicky to get right and causes lots of special casing, as SQL currently relies on all trace data arriving at the DistSQL RowReceiver via TrailingMeta (i.e. marshalled recordings).

Release justification:

Release note (<category, see below>): <what> <show> <why>
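Spelling out the gap between the micro- and macro-level numbers above (values copied from the benchmark output; this just restates the commit message's arithmetic):

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Microbenchmark, autoCollection=false: per-op cost without and with stats.
	notSampled := 4326 * time.Nanosecond // stats=0
	sampled := 7901 * time.Nanosecond    // stats=1
	fmt.Println(sampled - notSampled) // ≈3.6µs expected per op from sampling

	// Macro benchmark KV/Scan/SQL/rows=1: old (rate=0) vs new (rate=1) time/op.
	oldOp := 268 * time.Microsecond
	newOp := 314 * time.Microsecond
	fmt.Println(newOp - oldOp) // 46µs observed, more than 10x the expected amount
}
```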
This shows that in this benchmark, we're creating one additional span: 7 without sampling, 8 with sampling. This one additional trace span is unlikely to cost ~40us, but avoiding it would establish more parity between the tests. Can we?
I tracked down the extra span, and it's this one: `pkg/sql/colfetcher/colbatch_scan.go`, lines 92 to 96 at a9b582f.
I optimized it out 😎 but as expected it didn't make it or break it:

```diff
diff --git a/pkg/sql/colfetcher/colbatch_scan.go b/pkg/sql/colfetcher/colbatch_scan.go
index cae59582b8..6a69ff7f96 100644
--- a/pkg/sql/colfetcher/colbatch_scan.go
+++ b/pkg/sql/colfetcher/colbatch_scan.go
@@ -92,7 +92,12 @@ func (s *ColBatchScan) Next(ctx context.Context) coldata.Batch {
 		if execinfra.ShouldCollectStats(s.ctx, s.flowCtx) {
 			// We need to start a child span so that the only contention events
 			// present in the recording would be because of this cFetcher.
-			s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
+			sp := tracing.SpanFromContext(s.ctx)
+			if sp == nil {
+				panic("oops")
+			}
+			s.tracingSpan = sp
+			// s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
 		}
 	}
 	bat, err := s.rf.NextBatch(s.ctx)
```
So after all of this, it seems to me that there's evidence that something in SQL is slower when stats collection is enabled, even when tracing just drops all payloads on the floor and is thus definitely not incurring overhead for them, and even when the number of trace spans created is identical.
Nice experiment!
Thanks Tobi. I think the benchmark you wrote is a reasonable representation of how SQL propagates stats.
These are the questions I would like to know the answer to as well :) As you found, we are creating an extra span in `ColBatchScan`. I'll keep on looking at things to optimize in SQL land (I already found one, #61529, which is responsible for a third or so of the performance hit). What puzzles me a bit is that usually I use profiles to guide code optimization, but this time I'm not able to do that. I tried increasing the CPU sampling rate (from the default 100hz to 40000hz), and still the profiles aren't very helpful. The only strategy that appears to be working is simply commenting out parts that run once sampling is enabled and seeing whether that has any impact on the microbenchmark.
I share your confusion. I was trying to at least retroactively find the slowdown in a CPU profile, with this mod:

```diff
@@ -317,12 +320,17 @@ func BenchmarkKV(b *testing.B) {
 				b.Fatal(err)
 			}
 			numSpansBefore := kv.(*kvSQL).tr.Metrics().SpanCreated.Count()
+			f, err := os.Create(fmt.Sprintf("%.2f.prof", sampleRate))
+			require.NoError(b, err)
+			defer f.Close()
 			b.ResetTimer()
+			pprof.StartCPUProfile(f)
 			for i := 0; i < b.N; i++ {
 				if err := opFn(kv, rows, i); err != nil {
 					b.Fatal(err)
 				}
 			}
+			pprof.StopCPUProfile()
 			b.StopTimer()
 			numSpansAfter := kv.(*kvSQL).tr.Metrics().SpanCreated.Count()
 			b.ReportMetric(float64(numSpansAfter-numSpansBefore)/float64(b.N), "spans/op")
```

in the hope that the resulting CPU profiles would show where the extra time goes.
PS @yuzefovich you may also want to rebase on top of #61359. There is a random allocation in …
Found as part of the investigation in cockroachdb#61328, where these happened to jump out at me in the CPU profile's top10 for a simple read-only workload. Release justification: low-risk perf improvements Release note: None
61535: base,kv: fix two error instantiations in the hot path r=stevendanna a=tbg Found as part of the investigation in #61328, where these happened to jump out at me in the CPU profile's top10 for a simple read-only workload. cc @cockroachlabs/kv Release justification: low-risk perf improvements Release note: None Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
Old is `sample_rate=0`, new is `sample_rate=1`.

Touches #59379.