
[dnm] tracing: investigate perf regression under sql.txn_stats.sample_rate=1 #61328

Closed · wants to merge 17 commits

@tbg (Member) commented Mar 2, 2021

Old is sample_rate=0, new is sample_rate=1.

```
$ ./cmp.sh
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   282µs ± 7%   336µs ± 9%  +18.97%  (p=0.000 n=10+10)
```

Touches #59379.

@tbg requested a review from yuzefovich March 2, 2021 11:10
@cockroach-teamcity (Member): This change is Reviewable

@yuzefovich (Member) left a comment

Reviewed 4 of 4 files at r8, 5 of 6 files at r9, 5 of 5 files at r10, 5 of 5 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/sql/instrumentation.go, line 181 at r11 (raw file):

```go
// If we need to collect stats, create a non-verbose child span. Stats
// will be added as structured metadata and processed in Finish.
sp := tracing.SpanFromContext(ctx)
```

This is not quite an equivalent change. In connExecutor.execCmd we create a span for every statement, so sp is always non-nil, but with the current change we tell the instrumentationHelper not to "finish", which means that it will not process the execution stats attached to the trace. I prototyped the proper usage of that already-present span in yuzefovich@58ab90b, but it doesn't seem to change the microbenchmark numbers.
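For illustration, a minimal sketch of the two variants under discussion, using the `tracing.SpanFromContext` / `tracing.ChildSpan` helpers quoted above (the function, flag, and op name are hypothetical stand-ins, not the actual `instrumentationHelper` code):

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// statsSpan returns the span on which execution stats would be recorded.
func statsSpan(
	ctx context.Context, reuseStatementSpan bool,
) (_ context.Context, _ *tracing.Span, needFinish bool) {
	if reuseStatementSpan {
		// Reuse the per-statement span that connExecutor.execCmd already
		// opened; it is always non-nil there. Nothing finishes it here, which
		// is why execution stats attached to it are not processed.
		return ctx, tracing.SpanFromContext(ctx), false
	}
	// Previous behavior: a dedicated non-verbose child span whose structured
	// metadata (the stats) is processed when the helper finishes it.
	childCtx, sp := tracing.ChildSpan(ctx, "consume-stats")
	return childCtx, sp, true
}
```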


pkg/sql/colflow/stats.go, line 257 at r9 (raw file):

```go
// is not the way they are supposed to be used, so this should be fixed.
if false {
	_, span := tracing.ChildSpan(ctx, opName)
```

After addressing the span creation in #61380, the performance gains of this commit in the benchmark went away. What this change essentially does is drop all of the execution statistics we collected on the floor, so there is less information to propagate via the trace.

@tbg (Member, Author) commented Mar 4, 2021

Updated the PR on top of your recent work, @yuzefovich

@irfansharif as we discussed yesterday, mind taking a look here with me? The branch should be relatively clean now. I added a microbenchmark that tries to capture exactly the overhead that would be incurred by changing the sampling rate from zero to one. It incurs a cost, but I don't see anything that can explain the slowdown I see via ./cmp.sh.

@yuzefovich could you scrutinize the microbenchmark? Is there any significant work that is done as a result of propagating component stats that it doesn't capture?

```
$ go test -run - -bench SpanStats -count 1 ./pkg/util/tracing
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/util/tracing
BenchmarkSpanStats/autoCollection=false,stats=0-16         	 280952	     4326 ns/op
BenchmarkSpanStats/autoCollection=false,stats=1-16         	 158040	     7901 ns/op

BenchmarkSpanStats/autoCollection=true,stats=0-16          	 471117	     2631 ns/op
BenchmarkSpanStats/autoCollection=true,stats=1-16          	 242348	     5919 ns/op
PASS
ok  	github.com/cockroachdb/cockroach/pkg/util/tracing	8.414s
```

The case autoCollection=false,stats=1 should roughly reflect the SQL
usage in which stats are created, transported via a recording, ingested
into a parent, and then retrieved and unmarshaled. The case false,stats=0
should correspond to the non-sampled case. We are thus expecting roughly
an overhead of 4.3µs/op when not sampled, 8µs/op when sampled, i.e.
changing the sampling rate from 0 to 1 should add ~4µs/op.
However, the macro-level benchmark shows much more:

```
KV/Scan/SQL/rows=1/sample_rate=X-16   268µs ± 9%   314µs ± 7%  +17.03%  (p=0.000 n=21+21)
```

That's an added 46µs/op incurred by changing the sampling rate from
zero to one, far more than this microbenchmark can explain.

The autoCollection=true numbers avoid an extra unmarshal-remarshal cycle
of the recording. This is a possible optimization for SQL to investigate
in the case in which the flow lives wholly on the gateway.
Unfortunately, it is finicky to get right and causes lots of special
casing, as SQL currently relies on all trace data arriving at the
DistSQL RowReceiver via TrailingMeta (i.e. marshalled recordings).

@tbg (Member, Author) commented Mar 4, 2021

This is also consistent with the following.

With RecordStructured turned into a noop:

```
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   261µs ±10%   292µs ±10%  +12.08%  (p=0.000 n=20+21)
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   267µs ±11%   293µs ± 6%  +9.40%  (p=0.000 n=21+21)
```

With RecordStructured doing its job:

```
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   271µs ± 9%   310µs ± 8%  +14.69%  (p=0.000 n=21+21)
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   276µs ± 7%   316µs ± 8%  +14.48%  (p=0.000 n=21+20)
```

If we throw away the ComponentStats offered to the Span, they won't get
marshalled, unmarshalled, or even put on the wire. And yet, there's a 10%
performance hit for the setting. What is the expensive thing we're doing?
Creating more spans?

nvanbenschoten and others added 17 commits March 4, 2021 11:09
Fixes cockroachdb#59308.

This commit adds support for range merge transactions to refresh. This
is necessary to allow the merge transaction to have its write timestamp
be bumped and still commit without retrying. In such cases, the
transaction must refresh its reads, including its original read on the
RHS's local range descriptor. If we were to block this refresh on the
frozen RHS range, the merge would deadlock.

On the surface, it seems unsafe to permit Refresh requests on an already
subsumed RHS range, because the refresh's effect on the timestamp cache
will never make it to the LHS leaseholder. This risks the future joint
range serving a write that invalidates the Refresh. However, in this
specific situation, we can be sure that such a serializability violation
will not occur because the Range merge also writes to (deletes) this
key. This means that if the Range merge transaction commits, its intent
on the key will be resolved to the timestamp of the refresh and no
future write will ever be able to violate the refresh. Conversely, if the
Range merge transaction does not commit, then the merge will fail and
the update to the RHS's timestamp cache will not be lost (not that this
particularly matters in cases of aborted transactions).

The same line of reasoning as the one above has motivated us to explore
removing keys from a transaction's refresh spans when they are written
to by the transaction, as the intents written by the transaction act as
a form of pessimistic lock that obviate the need for the optimistic
refresh. Such an improvement would eliminate the need for this special
case, but until we generalize the mechanism to prune refresh spans based
on intent spans, we're forced to live with this.

See cockroachdb#59308 (comment)
for why the original fix, which attempted to manually refresh the range
merge transaction before it entered its critical phase, was not sufficient.

Release justification: needed for new functionality.
This commit adds support for a stream client that can connect to a
cluster and start replicating changes that happen on that stream.

Release note: None

Release justification: low-risk, high-benefit change
Previously, creating interleaved partitioned indexes panicked. This
commit disallows their creation to prevent panicking.

Fixes cockroachdb#60699

Release justification: This is a low risk change that prevents panics
when attempting to create interleaved partitioned tables.

Release note (bug fix): Creating interleaved partitioned indexes is now
disallowed. Previously, the database would crash when trying to create
one.
This commit removes redundant context argument from
`trailingMetaCallback` in favor of just using `ProcessorBase.Ctx` when
needed. This was already actually the case since the callback was always
called with `ProcessorBase.Ctx`. We also had the wrong sequences of
actions when collecting trailing metadata in inverted joiner, join
reader, and zigzag joiner, where we first called `InternalClose` and
then generated the metadata. That is wrong because `InternalClose`
replaces the context used throughout the operation of a processor with
the "original" context (the one passed into `StartInternal`). This is
now fixed.

Additionally, a few unnecessary method overrides are removed and
`ProcessorBase` now provides the default implementation of
`ConsumerClosed` (which simply calls `InternalClose`).

This commit also makes a slight adjustment to context management of the
change aggregator.

Release justification: low-risk cleanup.

Release note: None
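As a side note, the ordering fix described above can be sketched roughly as follows (the processor type and its metadata helper are hypothetical stand-ins for the joiners touched by the commit):

```go
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/sql/execinfra"
	"github.com/cockroachdb/cockroach/pkg/sql/execinfrapb"
)

// exampleProcessor is a placeholder processor embedding ProcessorBase.
type exampleProcessor struct {
	execinfra.ProcessorBase
}

// trailingMetaCallback generates the trailing metadata before closing, so
// that it still observes the context (and tracing span) installed by
// StartInternal; InternalClose restores the original context afterwards.
func (p *exampleProcessor) trailingMetaCallback() []execinfrapb.ProducerMetadata {
	meta := p.generateMeta() // hypothetical helper that reads p.Ctx
	p.InternalClose()
	return meta
}

func (p *exampleProcessor) generateMeta() []execinfrapb.ProducerMetadata {
	return nil // placeholder
}
```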
As of 1982047, this log line is formatted with Go's
templating system, which doesn't require that % be escaped.

Before:

```
I210303 09:56:17.459802 311 2@server/status/runtime.go:553 ⋮ [n1] 76
runtime stats: 106 MiB RSS, 261 goroutines (stacks: 5.0 MiB), 29
MiB/63 MiB Go alloc/total (heap fragmentation: 6.8 MiB, heap reserved:
11 MiB, heap released: 12 MiB), 7.6 MiB/13 MiB CGO alloc/total (0.6
CGO/sec), 1.8/1.2 %%(u/s)time, 0.0 %%gc (0x ), 17 KiB/22 KiB (r/w)net
```

After:

```
I210303 09:59:17.607021 59 2@server/status/runtime.go:553 ⋮ [n1] 71
runtime stats: 92 MiB RSS, 255 goroutines (stacks: 3.7 MiB), 27 MiB/48
MiB Go alloc/total (heap fragmentation: 4.8 MiB, heap reserved: 1.3
MiB, heap released: 27 MiB), 10 MiB/15 MiB CGO alloc/total (0.0
CGO/sec), 0.0/0.0 %(u/s)time, 0.0 %gc (0x), 30 KiB/34 KiB (r/w)net
```

Release justification: Negligible risk.
Release note: None
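The escaping difference comes down to Printf-style formatting treating `%` as a verb marker; a small standalone illustration (not the actual logging code):

```go
package main

import "fmt"

func main() {
	// With a format string, a literal percent sign must be doubled.
	fmt.Printf("1.8/1.2 %%(u/s)time\n") // prints: 1.8/1.2 %(u/s)time
	// A non-formatting call emits the string verbatim, no escaping needed.
	fmt.Print("1.8/1.2 %(u/s)time\n") // prints: 1.8/1.2 %(u/s)time
}
```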
Fix a goroutine leak caused by gRPC streams. It turns out that
receiving an error from an RPC's stream is not enough to clean up the
stream; you need to do something else, as described at
https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

This patch starts canceling the stream's context whenever the stream is
discarded.

Release note: None
Release justification: Bug fix for new functionality.
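A minimal sketch of that pattern (the stream interface and open callback are illustrative placeholders, not the actual client code from this change); per the `grpc.ClientConn.NewStream` documentation, canceling the stream's context is what releases its resources:

```go
package sketch

import (
	"context"
	"io"
)

// streamClient stands in for a generated gRPC server-streaming client; Recv
// mirrors the usual generated signature.
type streamClient interface {
	Recv() (*msg, error)
}

type msg struct{}

func consumeStream(
	ctx context.Context, open func(context.Context) (streamClient, error),
) error {
	// Tie the stream to a cancelable context so it can always be torn down.
	ctx, cancel := context.WithCancel(ctx)
	// Canceling the context is what actually cleans up the stream; Recv
	// returning an error is not enough on its own.
	defer cancel()

	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil // clean end of stream
			}
			return err
		}
	}
}
```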
Previously, in order to propagate execution statistics we were creating
temporary tracing spans, setting the stats on them, and finishing the
spans right away. This allowed for using (to be more precise, abusing)
the existing infrastructure. The root of the problem is that in the
vectorized engine we do not start a per-operator span even if stats
collection is currently enabled, so we had to get around that limitation.

However, this way is not how tracing spans are intended to be used and
creates some performance slowdown in the hot path, so this commit
refactors the situation. Now we are ensuring that there is always
a tracing span available at the "root" components (either root
materializer or an outbox), so when root components are finishing the
vectorized stats collectors for their subtree of operators, there is
a span to record the stats into.

This required the following minor adjustments:
- in the materializer, we now delegate attachment of the stats to the
tracing span to the drainHelper (which does so on `ConsumerDone`). Note
that the drainHelper doesn't get the recording from the span and leaves
that to the materializer (this is needed in order to avoid collecting
duplicate trace data).
- in the outbox, we now start a "remote child span" (if there is a span
in the parent context) at the beginning of the `Run` method, and we attach
the stats in `sendMetadata`.

Release justification: low-risk update to existing functionality.

Release note: None
One could run the relevant benchmark with
`make bench PKG=./pkg/sql/tests BENCHES=BenchmarkKV//SQL`.

Release note: None
Release justification: low-risk update to new functionality.

Release note: None
Release justification: low-risk change and prerequisite for metrics
improvement.
Release note: None
Release justification: low-risk, high benefit change to existing
functionality
Release note (ops change): Added the `trace.spans_created.count` and
`trace.registry.size` metrics that account for the number of created
trace spans and tracked local root spans, respectively.
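A short sketch of how such counters can be consulted from a benchmark, reusing the `Tracer.Metrics().SpanCreated` accessor that appears in the benchmark diff later in this thread (the workload closure is a placeholder):

```go
package sketch

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// reportSpansPerOp reports how many spans a workload creates per iteration.
func reportSpansPerOp(b *testing.B, tr *tracing.Tracer, workload func()) {
	before := tr.Metrics().SpanCreated.Count()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		workload()
	}
	b.StopTimer()
	after := tr.Metrics().SpanCreated.Count()
	b.ReportMetric(float64(after-before)/float64(b.N), "spans/op")
}
```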
```
$ ./cmp.sh
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   267µs ± 9%   313µs ± 8%  +17.32%  (p=0.000 n=22+22)
```

Release justification:
Release note (<category, see below>): <what> <show> <why>
```
$ go test -run - -bench SpanStats -count 1 ./pkg/util/tracing
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/util/tracing
BenchmarkSpanStats/autoCollection=false,stats=0-16         	 280952	     4326 ns/op
BenchmarkSpanStats/autoCollection=false,stats=1-16         	 158040	     7901 ns/op

BenchmarkSpanStats/autoCollection=true,stats=0-16          	 471117	     2631 ns/op
BenchmarkSpanStats/autoCollection=true,stats=1-16          	 242348	     5919 ns/op
PASS
ok  	github.com/cockroachdb/cockroach/pkg/util/tracing	8.414s
```

The case `autoCollection=false,stats=1` should roughly reflect the SQL
usage in which stats are created, transported via a recording, ingested
into a parent, and then retrieved and unmarshaled. The case `false,stats=0`
should correspond to the non-sampled case. We are thus expecting roughly
an overhead of 4.3µs/op when not sampled, 8µs/op when sampled, i.e.
changing the sampling rate from 0 to 1 should add ~4µs/op.
However, the macro-level benchmark shows much more:

```
KV/Scan/SQL/rows=1/sample_rate=X-16   268µs ± 9%   314µs ± 7%  +17.03%  (p=0.000 n=21+21)
```

That's an added 46µs/op incurred by changing the sampling rate from
zero to one, far more than this microbenchmark can explain.

The autoCollection=true numbers avoid an extra unmarshal-remarshal cycle
of the recording. This is a possible optimization for SQL to investigate
in the case in which the flow lives wholly on the gateway.
Unfortunately, it is finicky to get right and causes lots of special
casing, as SQL currently relies on all trace data arriving at the
DistSQL RowReceiver via TrailingMeta (i.e. marshalled recordings).

Release justification:
Release note (<category, see below>): <what> <show> <why>
This shows that in this benchmark, we're creating one additional span: 7
without sampling, 8 with sampling.

This one additional trace span is unlikely to cost ~40us, but avoiding
it would establish more parity between the tests. Can we?
@tbg (Member, Author) commented Mar 4, 2021

@yuzefovich I added spans/op counts to the micro and SELECT benchmarks I'm running. This shows that in the ./cmp.sh benchmark, we're creating 8 spans when sampling, 7 when not sampling.

That one additional trace span is unlikely to cost ~40us, but avoiding
it would establish more parity between the tests. Can we do that, at least as a hack to see what the effect would be?

@tbg (Member, Author) commented Mar 4, 2021

I tracked down the extra span, and it's this one:

```go
if execinfra.ShouldCollectStats(s.ctx, s.flowCtx) {
	// We need to start a child span so that the only contention events
	// present in the recording would be because of this cFetcher.
	s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
}
```

I optimized it out 😎 but, as expected, it didn't make or break it:

```diff
diff --git a/pkg/sql/colfetcher/colbatch_scan.go b/pkg/sql/colfetcher/colbatch_scan.go
index cae59582b8..6a69ff7f96 100644
--- a/pkg/sql/colfetcher/colbatch_scan.go
+++ b/pkg/sql/colfetcher/colbatch_scan.go
@@ -92,7 +92,12 @@ func (s *ColBatchScan) Next(ctx context.Context) coldata.Batch {
                if execinfra.ShouldCollectStats(s.ctx, s.flowCtx) {
                        // We need to start a child span so that the only contention events
                        // present in the recording would be because of this cFetcher.
-                       s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
+                       sp := tracing.SpanFromContext(s.ctx)
+                       if sp == nil {
+                               panic("oops")
+                       }
+                       s.tracingSpan = sp
+                       // s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
                }
        }
        bat, err := s.rf.NextBatch(s.ctx)
```

```
name                                 old time/op   new time/op   delta
KV/Scan/SQL/rows=1/sample_rate=X-16    266µs ± 7%    316µs ± 7%  +18.68%  (p=0.000 n=21+21)

name                                 old spans/op  new spans/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16     7.00 ± 0%     7.00 ± 0%     ~     (all equal)
```

@tbg (Member, Author) commented Mar 4, 2021

So after all of this, it seems to me that there's evidence that something in SQL is slower when stats collection is enabled, even when tracing just drops all payloads on the floor and is thus definitely not incurring overhead for them, and even when the number of trace spans created is identical.

@awoods187 (Contributor):

Nice experiment!

@yuzefovich (Member):

Thanks Tobi. I think the benchmark you wrote is a reasonable representation of how SQL propagates stats.

> And yet, there's a 10% performance hit for the setting. What is the expensive thing we're doing? Creating more spans?

These are the questions I would like to know the answer to as well :)

As you found, we are creating an extra span in colbatch_scan, and I don't think we can get around that (without it, I think we could confuse contention events from different table readers running on the same node; to be confirmed), yet it doesn't appear to be the culprit.

I'll keep looking for things to optimize in SQL land (I already found one, #61529, which is responsible for about a third of the performance hit). What puzzles me a bit is that I usually use profiles to guide code optimization, but this time I'm not able to do that. I tried increasing the CPU sampling rate (from the default 100Hz to 40000Hz), and the profiles still aren't very helpful. The only strategy that appears to work is simply commenting out parts that run once sampling is enabled and seeing whether that has any impact on the microbenchmark.
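For reference, raising the sampling rate can be requested along these lines (a sketch of the usual Go hack; `pprof.StartCPUProfile` will log a complaint that the rate is already set, but the higher rate is the one that takes effect):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Request the higher frequency before starting the profile; the later
	// internal SetCPUProfileRate(100) call is ignored (with a logged warning).
	runtime.SetCPUProfileRate(40000)
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the workload being profiled ...
}
```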

@tbg (Member, Author) commented Mar 5, 2021

I share your confusion. I was trying to at least retroactively find the flowSpecsToDiagram regression you just fixed (on this branch, which does not have the fix), but not a chance. I tried

```bash
#!/bin/bash

go test -c ./pkg/sql/tests
for rate in 0.0 1.0; do
  env GOGC=off ./tests.test -test.run - -test.bench "BenchmarkKV/Scan/SQL/rows=1/sample_rate=${rate}" -test.benchtime=50000x
done
```

with this mod

```diff
@@ -317,12 +320,17 @@ func BenchmarkKV(b *testing.B) {
                                                                        b.Fatal(err)
                                                                }
                                                                numSpansBefore := kv.(*kvSQL).tr.Metrics().SpanCreated.Count()
+                                                               f, err := os.Create(fmt.Sprintf("%.2f.prof", sampleRate))
+                                                               require.NoError(b, err)
+                                                               defer f.Close()
                                                                b.ResetTimer()
+                                                               pprof.StartCPUProfile(f)
                                                                for i := 0; i < b.N; i++ {
                                                                        if err := opFn(kv, rows, i); err != nil {
                                                                                b.Fatal(err)
                                                                        }
                                                                }
+                                                               pprof.StopCPUProfile()
                                                                b.StopTimer()
                                                                numSpansAfter := kv.(*kvSQL).tr.Metrics().SpanCreated.Count()
                                                                b.ReportMetric(float64(numSpansAfter-numSpansBefore)/float64(b.N), "spans/op")
```

in the hope that `go tool pprof -http :6060 -diff_base 0.00.prof 1.00.prof` would at least have that method show up, but nope. I did ultimately find it in the `top100 -cum` output at the very end of the 1.00 profile with focus=PlanAndRun:

```
     0.01s 0.061%  8.91%      0.58s  3.56%  github.com/cockroachdb/cockroach/pkg/sql.(*PlanningCtx).getDefaultSaveFlowsFunc.func1
         0     0%  8.91%      0.57s  3.50%  github.com/cockroachdb/cockroach/pkg/sql.(*PlanningCtx).flowSpecsToDiagram
```

PS @yuzefovich you may also want to rebase on top of #61359. There is a rando allocation in startSpanGeneric that this PR fixes, which currently accounts for the bulk of the time spent there:

```
.      150ms    379:	helper.crdbSpan.mu.structured.Reserve(maxStructuredEventsPerSpan)
```

tbg added a commit to tbg/cockroach that referenced this pull request Mar 5, 2021
Found as part of the investigation in cockroachdb#61328, where these happened to
jump out at me in the CPU profile's top10 for a simple read-only
workload.

Release justification: low-risk perf improvements
Release note: None
craig bot pushed a commit that referenced this pull request Mar 5, 2021
61535: base,kv: fix two error instantiations in the hot path r=stevendanna a=tbg

Found as part of the investigation in #61328, where these happened to
jump out at me in the CPU profile's top10 for a simple read-only
workload.

cc @cockroachlabs/kv

Release justification: low-risk perf improvements
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@irfansharif self-assigned this Mar 10, 2021
@tbg closed this Apr 30, 2021