
[dnm] tracing: investigate perf regression under sql.txn_stats.sample_rate=1 #61328

Closed · wants to merge 17 commits

@tbg (Member) commented Mar 2, 2021

Old is sample_rate=0, new is sample_rate=1.

```
$ ./cmp.sh
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   282µs ± 7%   336µs ± 9%  +18.97%  (p=0.000 n=10+10)
```

Touches #59379.

@tbg requested a review from yuzefovich March 2, 2021 11:10
@cockroach-teamcity (Member): This change is Reviewable

@yuzefovich (Member) left a comment

Reviewed 4 of 4 files at r8, 5 of 6 files at r9, 5 of 5 files at r10, 5 of 5 files at r11.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @tbg)


pkg/sql/instrumentation.go, line 181 at r11 (raw file):

```go
// If we need to collect stats, create a non-verbose child span. Stats
// will be added as structured metadata and processed in Finish.
sp := tracing.SpanFromContext(ctx)
```

This is not quite an equivalent change. In connExecutor.execCmd we create a span for every statement, so sp is always non-nil, but with the current change we tell the instrumentationHelper not to "finish", which means that it will not process the execution stats attached to the trace. I prototyped the proper usage of that already-present span in yuzefovich@58ab90b, but it doesn't seem to change the microbenchmark numbers.
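For illustration, a minimal sketch of the two variants under discussion, using the `tracing.SpanFromContext` / `tracing.ChildSpan` helpers quoted above (the function, flag, and op name are hypothetical stand-ins, not the actual `instrumentationHelper` code):

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// statsSpan returns the span on which execution stats would be recorded.
func statsSpan(
	ctx context.Context, reuseStatementSpan bool,
) (_ context.Context, _ *tracing.Span, needFinish bool) {
	if reuseStatementSpan {
		// Reuse the per-statement span that connExecutor.execCmd already
		// opened; it is always non-nil there. Nothing finishes it here, which
		// is why execution stats attached to it are not processed.
		return ctx, tracing.SpanFromContext(ctx), false
	}
	// Previous behavior: a dedicated non-verbose child span whose structured
	// metadata (the stats) is processed when the helper finishes it.
	childCtx, sp := tracing.ChildSpan(ctx, "consume-stats")
	return childCtx, sp, true
}
```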


pkg/sql/colflow/stats.go, line 257 at r9 (raw file):

```go
// is not the way they are supposed to be used, so this should be fixed.
if false {
	_, span := tracing.ChildSpan(ctx, opName)
```

After addressing the span creation in #61380, the performance gains of this commit in the benchmark went away. What this change essentially does is drop all of the execution statistics we collected on the floor, so there is less information to propagate via the trace.

@tbg (Member, Author) commented Mar 4, 2021

Updated the PR on top of your recent work, @yuzefovich

@irfansharif as we discussed yesterday, mind taking a look here with me? The branch should be relatively clean now. I added a microbenchmark that tries to capture exactly the overhead that would be incurred by changing the sampling rate from zero to one. It incurs a cost, but I don't see anything that can explain the slowdown I see via ./cmp.sh.

@yuzefovich could you scrutinize the microbenchmark? Is there any significant work that is done as a result of propagating component stats that it doesn't capture?

```
$ go test -run - -bench SpanStats -count 1 ./pkg/util/tracing
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/util/tracing
BenchmarkSpanStats/autoCollection=false,stats=0-16         	 280952	     4326 ns/op
BenchmarkSpanStats/autoCollection=false,stats=1-16         	 158040	     7901 ns/op

BenchmarkSpanStats/autoCollection=true,stats=0-16          	 471117	     2631 ns/op
BenchmarkSpanStats/autoCollection=true,stats=1-16          	 242348	     5919 ns/op
PASS
ok  	github.com/cockroachdb/cockroach/pkg/util/tracing	8.414s
```

The case autoCollection=false,stats=1 should roughly reflect the SQL
usage in which stats are created, transported via a recording, ingested
into a parent, and then retrieved and unmarshaled. The case false,stats=0
should correspond to the non-sampled case. We are thus expecting roughly
an overhead of 4.3µs/op when not sampled, 8µs/op when sampled, i.e.
changing the sampling rate from 0 to 1 should add ~4µs/op.
However, the macro-level benchmark shows much more:

```
KV/Scan/SQL/rows=1/sample_rate=X-16   268µs ± 9%   314µs ± 7%  +17.03%  (p=0.000 n=21+21)
```

That's an added 46µs/op incurred by changing the sampling rate from
zero to one, far more than this microbenchmark can explain.

The autoCollection=true numbers avoid an extra unmarshal-remarshal cycle
of the recording. This is a possible optimization for SQL to investigate
in the case in which the flow lives wholly on the gateway.
Unfortunately, it is finicky to get right and causes lots of special
casing, as SQL currently relies on all trace data arriving at the
DistSQL RowReceiver via TrailingMeta (i.e. marshalled recordings).

@tbg (Member, Author) commented Mar 4, 2021

This is also consistent with the following.

With RecordStructured turned into a noop:

```
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   261µs ±10%   292µs ±10%  +12.08%  (p=0.000 n=20+21)
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   267µs ±11%   293µs ± 6%  +9.40%  (p=0.000 n=21+21)
```

With RecordStructured doing its job:

```
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   271µs ± 9%   310µs ± 8%  +14.69%  (p=0.000 n=21+21)
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   276µs ± 7%   316µs ± 8%  +14.48%  (p=0.000 n=21+20)
```

If we throw away the ComponentStats offered to the Span, they won't get
marshalled, unmarshalled, or even put on the wire. And yet, there's a 10%
performance hit for the setting. What is the expensive thing we're doing?
Creating more spans?

nvanbenschoten and others added 17 commits March 4, 2021 11:09
Fixes cockroachdb#59308.

This commit adds support for range merge transactions to refresh. This
is necessary to allow the merge transaction to have its write timestamp
be bumped and still commit without retrying. In such cases, the
transaction must refresh its reads, including its original read on the
RHS's local range descriptor. If we were to block this refresh on the
frozen RHS range, the merge would deadlock.

On the surface, it seems unsafe to permit Refresh requests on an already
subsumed RHS range, because the refresh's effect on the timestamp cache
will never make it to the LHS leaseholder. This risks the future joint
range serving a write that invalidates the Refresh. However, in this
specific situation, we can be sure that such a serializability violation
will not occur because the Range merge also writes to (deletes) this
key. This means that if the Range merge transaction commits, its intent
on the key will be resolved to the timestamp of the refresh and no
future write will ever be able to violate the refresh. Conversely, if the
Range merge transaction does not commit, then the merge will fail and
the update to the RHS's timestamp cache will not be lost (not that this
particularly matters in cases of aborted transactions).

The same line of reasoning as the one above has motivated us to explore
removing keys from a transaction's refresh spans when they are written
to by the transaction, as the intents written by the transaction act as
a form of pessimistic lock that obviate the need for the optimistic
refresh. Such an improvement would eliminate the need for this special
case, but until we generalize the mechanism to prune refresh spans based
on intent spans, we're forced to live with this.

See cockroachdb#59308 (comment)
for why the original fix, which attempted to manually refresh the range
merge transaction before it entered its critical phase, was not sufficient.

Release justification: needed for new functionality.
This commit adds support for a stream client that can connect to a
cluster and start replicating changes that happen on that stream.

Release note: None

Release justification: low-risk, high-benefit change
Previously, creating interleaved partitioned indexes panicked. This
commit disallows their creation to prevent panicking.

Fixes cockroachdb#60699

Release justification: This is a low risk change that prevents panics
when attempting to create interleaved partitioned tables.

Release note (bug fix): Creating interleaved partitioned indexes is now
disallowed. Previously, the database would crash when trying to create
one.
This commit removes redundant context argument from
`trailingMetaCallback` in favor of just using `ProcessorBase.Ctx` when
needed. This was already actually the case since the callback was always
called with `ProcessorBase.Ctx`. We also had the wrong sequences of
actions when collecting trailing metadata in inverted joiner, join
reader, and zigzag joiner, where we first called `InternalClose` and
then generated the metadata. That is wrong because `InternalClose`
replaces the context used throughout the operation of a processor with
the "original" context (the one passed into `StartInternal`). This is
now fixed.

Additionally, a few unnecessary method overrides are removed and
`ProcessorBase` now provides the default implementation of
`ConsumerClosed` (which simply calls `InternalClose`).

This commit also makes a slight adjustment to context management of the
change aggregator.

Release justification: low-risk cleanup.

Release note: None
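As a side note, the ordering fix described above can be sketched roughly as follows (the processor type and its metadata helper are hypothetical stand-ins for the joiners touched by the commit):

```go
package sketch

import (
	"github.com/cockroachdb/cockroach/pkg/sql/execinfra"
	"github.com/cockroachdb/cockroach/pkg/sql/execinfrapb"
)

// exampleProcessor is a placeholder processor embedding ProcessorBase.
type exampleProcessor struct {
	execinfra.ProcessorBase
}

// trailingMetaCallback generates the trailing metadata before closing, so
// that it still observes the context (and tracing span) installed by
// StartInternal; InternalClose restores the original context afterwards.
func (p *exampleProcessor) trailingMetaCallback() []execinfrapb.ProducerMetadata {
	meta := p.generateMeta() // hypothetical helper that reads p.Ctx
	p.InternalClose()
	return meta
}

func (p *exampleProcessor) generateMeta() []execinfrapb.ProducerMetadata {
	return nil // placeholder
}
```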
As of 1982047, this log line is formatted with Go's
templating system, which doesn't require that % be escaped.

Before:

```
I210303 09:56:17.459802 311 2@server/status/runtime.go:553 ⋮ [n1] 76
runtime stats: 106 MiB RSS, 261 goroutines (stacks: 5.0 MiB), 29
MiB/63 MiB Go alloc/total (heap fragmentation: 6.8 MiB, heap reserved:
11 MiB, heap released: 12 MiB), 7.6 MiB/13 MiB CGO alloc/total (0.6
CGO/sec), 1.8/1.2 %%(u/s)time, 0.0 %%gc (0x ), 17 KiB/22 KiB (r/w)net
```

After:

```
I210303 09:59:17.607021 59 2@server/status/runtime.go:553 ⋮ [n1] 71
runtime stats: 92 MiB RSS, 255 goroutines (stacks: 3.7 MiB), 27 MiB/48
MiB Go alloc/total (heap fragmentation: 4.8 MiB, heap reserved: 1.3
MiB, heap released: 27 MiB), 10 MiB/15 MiB CGO alloc/total (0.0
CGO/sec), 0.0/0.0 %(u/s)time, 0.0 %gc (0x), 30 KiB/34 KiB (r/w)net
```

Release justification: Negligible risk.
Release note: None
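The escaping difference comes down to Printf-style formatting treating `%` as a verb marker; a small standalone illustration (not the actual logging code):

```go
package main

import "fmt"

func main() {
	// With a format string, a literal percent sign must be doubled.
	fmt.Printf("1.8/1.2 %%(u/s)time\n") // prints: 1.8/1.2 %(u/s)time
	// A non-formatting call emits the string verbatim, no escaping needed.
	fmt.Print("1.8/1.2 %(u/s)time\n") // prints: 1.8/1.2 %(u/s)time
}
```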
Fix a goroutine leak caused by gRPC streams. It turns out that
receiving an error from an RPC's stream is not enough to clean up the
stream; you need to do something else, as described at
https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream

This patch starts canceling the stream's context whenever the stream is
discarded.

Release note: None
Release justification: Bug fix for new functionality.
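A minimal sketch of that pattern (the stream interface and open callback are illustrative placeholders, not the actual client code from this change); per the `grpc.ClientConn.NewStream` documentation, canceling the stream's context is what releases its resources:

```go
package sketch

import (
	"context"
	"io"
)

// streamClient stands in for a generated gRPC server-streaming client; Recv
// mirrors the usual generated signature.
type streamClient interface {
	Recv() (*msg, error)
}

type msg struct{}

func consumeStream(
	ctx context.Context, open func(context.Context) (streamClient, error),
) error {
	// Tie the stream to a cancelable context so it can always be torn down.
	ctx, cancel := context.WithCancel(ctx)
	// Canceling the context is what actually cleans up the stream; Recv
	// returning an error is not enough on its own.
	defer cancel()

	stream, err := open(ctx)
	if err != nil {
		return err
	}
	for {
		if _, err := stream.Recv(); err != nil {
			if err == io.EOF {
				return nil // clean end of stream
			}
			return err
		}
	}
}
```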
Previously, in order to propagate execution statistics we were creating
temporary tracing spans, setting the stats on them, and finishing the
spans right away. This allowed for using (to be more precise, abusing)
the existing infrastructure. The root of the problem is that in the
vectorized engine we do not start a per-operator span even if stats
collection is currently enabled, so we had to get around that limitation.

However, this way is not how tracing spans are intended to be used and
creates some performance slowdown in the hot path, so this commit
refactors the situation. Now we are ensuring that there is always
a tracing span available at the "root" components (either root
materializer or an outbox), so when root components are finishing the
vectorized stats collectors for their subtree of operators, there is
a span to record the stats into.

This required the following minor adjustments:
- in the materializer, we now delegate attachment of the stats to the
tracing span to the drainHelper (which does so on `ConsumerDone`). Note
that the drainHelper doesn't get the recording from the span and leaves
that to the materializer (this is needed in order to avoid collecting
duplicate trace data).
- in the outbox, we now start a "remote child span" (if there is a span
in the parent context) at the beginning of the `Run` method, and we attach
the stats in `sendMetadata`.

Release justification: low-risk update to existing functionality.

Release note: None
One could run the relevant benchmark with
`make bench PKG=./pkg/sql/tests BENCHES=BenchmarkKV//SQL`.

Release note: None
Release justification: low-risk update to new functionality.

Release note: None
Release justification: low-risk change and prerequisite for metrics
improvement.
Release note: None
Release justification: low-risk, high benefit change to existing
functionality
Release note (ops change): Added the `trace.spans_created.count` and
`trace.registry.size` metrics that account for the number of created
trace spans and tracked local root spans, respectively.
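A short sketch of how such counters can be consulted from a benchmark, reusing the `Tracer.Metrics().SpanCreated` accessor that appears in the benchmark diff later in this thread (the workload closure is a placeholder):

```go
package sketch

import (
	"testing"

	"github.com/cockroachdb/cockroach/pkg/util/tracing"
)

// reportSpansPerOp reports how many spans a workload creates per iteration.
func reportSpansPerOp(b *testing.B, tr *tracing.Tracer, workload func()) {
	before := tr.Metrics().SpanCreated.Count()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		workload()
	}
	b.StopTimer()
	after := tr.Metrics().SpanCreated.Count()
	b.ReportMetric(float64(after-before)/float64(b.N), "spans/op")
}
```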
```
$ ./cmp.sh
name                                 old time/op  new time/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16   267µs ± 9%   313µs ± 8%  +17.32%  (p=0.000 n=22+22)
```

Release justification:
Release note (<category, see below>): <what> <show> <why>
```
$ go test -run - -bench SpanStats -count 1 ./pkg/util/tracing
goos: linux
goarch: amd64
pkg: github.com/cockroachdb/cockroach/pkg/util/tracing
BenchmarkSpanStats/autoCollection=false,stats=0-16         	 280952	     4326 ns/op
BenchmarkSpanStats/autoCollection=false,stats=1-16         	 158040	     7901 ns/op

BenchmarkSpanStats/autoCollection=true,stats=0-16          	 471117	     2631 ns/op
BenchmarkSpanStats/autoCollection=true,stats=1-16          	 242348	     5919 ns/op
PASS
ok  	github.com/cockroachdb/cockroach/pkg/util/tracing	8.414s
```

The case `autoCollection=false,stats=1` should roughly reflect the SQL
usage in which stats are created, transported via a recording, ingested
into a parent, and then retrieved and unmarshaled. The case `false,stats=0`
should correspond to the non-sampled case. We are thus expecting roughly
an overhead of 4.3µs/op when not sampled, 8µs/op when sampled, i.e.
changing the sampling rate from 0 to 1 should add ~4µs/op.
However, the macro-level benchmark shows much more:

```
KV/Scan/SQL/rows=1/sample_rate=X-16   268µs ± 9%   314µs ± 7%  +17.03%  (p=0.000 n=21+21)
```

That's an added 46µs/op incurred by changing the sampling rate from
zero to one, far more than this microbenchmark can explain.

The autoCollection=true numbers avoid an extra unmarshal-remarshal cycle
of the recording. This is a possible optimization for SQL to investigate
in the case in which the flow lives wholly on the gateway.
Unfortunately, it is finicky to get right and causes lots of special
casing, as SQL currently relies on all trace data arriving at the
DistSQL RowReceiver via TrailingMeta (i.e. marshalled recordings).

Release justification:
Release note (<category, see below>): <what> <show> <why>
This shows that in this benchmark, we're creating one additional span: 7
without sampling, 8 with sampling.

This one additional trace span is unlikely to cost ~40us, but avoiding
it would establish more parity between the tests. Can we?
@tbg (Member, Author) commented Mar 4, 2021

@yuzefovich I added spans/op counts to the micro and SELECT benchmarks I'm running. This shows that in the ./cmp.sh benchmark, we're creating 8 spans when sampling, 7 when not sampling.

That one additional trace span is unlikely to cost ~40us, but avoiding
it would establish more parity between the tests. Can we do that, at least as a hack to see what the effect would be?

@tbg (Member, Author) commented Mar 4, 2021

I tracked down the extra span, and it's this one:

```go
if execinfra.ShouldCollectStats(s.ctx, s.flowCtx) {
	// We need to start a child span so that the only contention events
	// present in the recording would be because of this cFetcher.
	s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
}
```

I optimized it out 😎 but, as expected, it didn't make or break it:

```diff
diff --git a/pkg/sql/colfetcher/colbatch_scan.go b/pkg/sql/colfetcher/colbatch_scan.go
index cae59582b8..6a69ff7f96 100644
--- a/pkg/sql/colfetcher/colbatch_scan.go
+++ b/pkg/sql/colfetcher/colbatch_scan.go
@@ -92,7 +92,12 @@ func (s *ColBatchScan) Next(ctx context.Context) coldata.Batch {
                if execinfra.ShouldCollectStats(s.ctx, s.flowCtx) {
                        // We need to start a child span so that the only contention events
                        // present in the recording would be because of this cFetcher.
-                       s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
+                       sp := tracing.SpanFromContext(s.ctx)
+                       if sp == nil {
+                               panic("oops")
+                       }
+                       s.tracingSpan = sp
+                       // s.ctx, s.tracingSpan = execinfra.ProcessorSpan(s.ctx, "colbatchscan")
                }
        }
        bat, err := s.rf.NextBatch(s.ctx)
```

```
name                                 old time/op   new time/op   delta
KV/Scan/SQL/rows=1/sample_rate=X-16    266µs ± 7%    316µs ± 7%  +18.68%  (p=0.000 n=21+21)

name                                 old spans/op  new spans/op  delta
KV/Scan/SQL/rows=1/sample_rate=X-16     7.00 ± 0%     7.00 ± 0%     ~     (all equal)
```

@tbg (Member, Author) commented Mar 4, 2021

So after all of this, it seems to me that there's evidence that something in SQL is slower when stats collection is enabled, even when tracing just drops all payloads on the floor and is thus definitely not incurring overhead for them, and even when the number of trace spans created is identical.

@awoods187 (Contributor):

Nice experiment!

@yuzefovich (Member):

Thanks Tobi. I think the benchmark you wrote is a reasonable representation of how SQL propagates stats.

> And yet, there's a 10% performance hit for the setting. What is the expensive thing we're doing? Creating more spans?

These are the questions I would like to know the answer to as well :)

As you found, we are creating an extra span in colbatch_scan, and I don't think we can get around that (without it, I think we could confuse contention events from different table readers running on the same node; to be confirmed), yet it doesn't appear to be the culprit.

I'll keep looking for things to optimize in SQL land (I already found one, #61529, which is responsible for about a third of the performance hit). What puzzles me a bit is that I usually use profiles to guide code optimization, but this time I'm not able to do that. I tried increasing the CPU sampling rate (from the default 100Hz to 40000Hz), and the profiles still aren't very helpful. The only strategy that appears to work is simply commenting out parts that run once sampling is enabled and seeing whether that has any impact on the microbenchmark.
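For reference, raising the sampling rate can be requested along these lines (a sketch of the usual Go hack; `pprof.StartCPUProfile` will log a complaint that the rate is already set, but the higher rate is the one that takes effect):

```go
package main

import (
	"os"
	"runtime"
	"runtime/pprof"
)

func main() {
	f, err := os.Create("cpu.prof")
	if err != nil {
		panic(err)
	}
	defer f.Close()

	// Request the higher frequency before starting the profile; the later
	// internal SetCPUProfileRate(100) call is ignored (with a logged warning).
	runtime.SetCPUProfileRate(40000)
	if err := pprof.StartCPUProfile(f); err != nil {
		panic(err)
	}
	defer pprof.StopCPUProfile()

	// ... run the workload being profiled ...
}
```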

@tbg (Member, Author) commented Mar 5, 2021

I share your confusion. I was trying to at least retroactively find the flowSpecsToDiagram regression you just fixed (on this branch, which does not have the fix), but not a chance. I tried

```bash
#!/bin/bash

go test -c ./pkg/sql/tests
for rate in 0.0 1.0; do
  env GOGC=off ./tests.test -test.run - -test.bench "BenchmarkKV/Scan/SQL/rows=1/sample_rate=${rate}" -test.benchtime=50000x
done
```

with this mod

```diff
@@ -317,12 +320,17 @@ func BenchmarkKV(b *testing.B) {
                                                                        b.Fatal(err)
                                                                }
                                                                numSpansBefore := kv.(*kvSQL).tr.Metrics().SpanCreated.Count()
+                                                               f, err := os.Create(fmt.Sprintf("%.2f.prof", sampleRate))
+                                                               require.NoError(b, err)
+                                                               defer f.Close()
                                                                b.ResetTimer()
+                                                               pprof.StartCPUProfile(f)
                                                                for i := 0; i < b.N; i++ {
                                                                        if err := opFn(kv, rows, i); err != nil {
                                                                                b.Fatal(err)
                                                                        }
                                                                }
+                                                               pprof.StopCPUProfile()
                                                                b.StopTimer()
                                                                numSpansAfter := kv.(*kvSQL).tr.Metrics().SpanCreated.Count()
                                                                b.ReportMetric(float64(numSpansAfter-numSpansBefore)/float64(b.N), "spans/op")
```

in the hope that `go tool pprof -http :6060 -diff_base 0.00.prof 1.00.prof` would at least have that method show up, but nope. I did ultimately find it in the `top100 -cum` output at the very end of the 1.00 profile with focus=PlanAndRun:

```
     0.01s 0.061%  8.91%      0.58s  3.56%  github.com/cockroachdb/cockroach/pkg/sql.(*PlanningCtx).getDefaultSaveFlowsFunc.func1
         0     0%  8.91%      0.57s  3.50%  github.com/cockroachdb/cockroach/pkg/sql.(*PlanningCtx).flowSpecsToDiagram
```

PS @yuzefovich you may also want to rebase on top of #61359. There is a rando allocation in startSpanGeneric that this PR fixes, which currently accounts for the bulk of the time spent there:

```
.      150ms    379:	helper.crdbSpan.mu.structured.Reserve(maxStructuredEventsPerSpan)
```

tbg added a commit to tbg/cockroach that referenced this pull request Mar 5, 2021
Found as part of the investigation in cockroachdb#61328, where these happened to
jump out at me in the CPU profile's top10 for a simple read-only
workload.

Release justification: low-risk perf improvements
Release note: None
craig bot pushed a commit that referenced this pull request Mar 5, 2021
61535: base,kv: fix two error instantiations in the hot path r=stevendanna a=tbg

Found as part of the investigation in #61328, where these happened to
jump out at me in the CPU profile's top10 for a simple read-only
workload.

cc @cockroachlabs/kv

Release justification: low-risk perf improvements
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@irfansharif self-assigned this Mar 10, 2021
@tbg closed this Apr 30, 2021