
[dnm] tracing: investigate perf regression under sql.txn_stats.sample_rate=1 #61328

Closed
wants to merge 17 commits

Commits on Mar 4, 2021

  1. kv: permit merge transactions to refresh after subsumption

    Fixes cockroachdb#59308.
    
    This commit adds support for range merge transactions to refresh. This
    is necessary to allow the merge transaction to have its write timestamp
    be bumped and still commit without retrying. In such cases, the
    transaction must refresh its reads, including its original read on the
    RHS's local range descriptor. If we were to block this refresh on the
    frozen RHS range, the merge would deadlock.
    
    On the surface, it seems unsafe to permit Refresh requests on an already
    subsumed RHS range, because the refresh's effect on the timestamp cache
    will never make it to the LHS leaseholder. This risks the future joint
    range serving a write that invalidates the Refresh. However, in this
    specific situation, we can be sure that such a serializability violation
    will not occur because the Range merge also writes to (deletes) this
    key. This means that if the Range merge transaction commits, its intent
    on the key will be resolved to the timestamp of the refresh and no
    future write will ever be able to violate the refresh. Conversely, if the
    Range merge transaction does not commit, then the merge will fail and
    the update to the RHS's timestamp cache will not be lost (not that this
    particularly matters in cases of aborted transactions).
    
    The same line of reasoning as the one above has motivated us to explore
    removing keys from a transaction's refresh spans when they are written
    to by the transaction, as the intents written by the transaction act as
    a form of pessimistic lock that obviates the need for the optimistic
    refresh. Such an improvement would eliminate the need for this special
    case, but until we generalize the mechanism to prune refresh spans based
    on intent spans, we're forced to live with this.
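    
    To make the pruning idea concrete, here is a minimal sketch with simplified types (hypothetical; not the KV client's actual refresh-span tracking) of dropping read keys from the refresh set when the transaction has also written an intent on them:
    
    ```go
    // Hypothetical sketch: prune refresh keys that the transaction has also
    // written, since the intent already acts as a pessimistic lock on them.
    package main
    
    import "fmt"
    
    func pruneRefreshKeys(refreshKeys, intentKeys []string) []string {
        intents := make(map[string]struct{}, len(intentKeys))
        for _, k := range intentKeys {
            intents[k] = struct{}{}
        }
        var kept []string
        for _, k := range refreshKeys {
            if _, written := intents[k]; !written {
                kept = append(kept, k)
            }
        }
        return kept
    }
    
    func main() {
        // The merge txn both reads and deletes the RHS's local range
        // descriptor key (illustrative key names), so that key would not
        // need to be refreshed.
        refresh := []string{"/Local/RangeDescriptor/rhs", "/Meta2/rhs-end"}
        intents := []string{"/Local/RangeDescriptor/rhs"}
        fmt.Println(pruneRefreshKeys(refresh, intents)) // [/Meta2/rhs-end]
    }
    ```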
    
    See cockroachdb#59308 (comment)
    for why the original fix, which attempted to manually refresh the range
    merge transaction before it entered its critical phase, was not sufficient.
    
    Release justification: needed for new functionality.
    nvanbenschoten authored and tbg committed Mar 4, 2021 (7d22d00)
  2. streamclient: add client for sinkless CREATE REPLICATION STREAM

    This commit adds support for a stream client that can connect to a
    cluster and start replicating changes that happen on that stream.
    
    Release note: None
    
    Release justification: low-risk, high-benefit change
    pbardea authored and tbg committed Mar 4, 2021 (42cbdee)
  3. sql: disallow creation of interleaved partitioned indexes

    Previously, attempting to create an interleaved partitioned index caused
    a panic. This commit disallows their creation to prevent the panic.
    
    Fixes cockroachdb#60699
    
    Release justification: This is a low-risk change that prevents panics
    when attempting to create interleaved partitioned indexes.
    
    Release note (bug fix): Creating interleaved partitioned indexes is now
    disallowed. Previously, the database would crash when trying to create
    one.
    mgartner authored and tbg committed Mar 4, 2021 (1118904)
  4. execinfra: clean up ProcessorBase

    This commit removes the redundant context argument from
    `trailingMetaCallback` in favor of using `ProcessorBase.Ctx` when
    needed. This was effectively already the case, since the callback was
    always called with `ProcessorBase.Ctx`. We also had the wrong sequence
    of actions when collecting trailing metadata in the inverted joiner,
    join reader, and zigzag joiner: we first called `InternalClose` and
    then generated the metadata. That is wrong because `InternalClose`
    replaces the context used throughout the operation of a processor with
    the "original" context (the one passed into `StartInternal`). This is
    now fixed.
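    
    As a schematic illustration (hypothetical types, not the execinfra code), the ordering matters because the metadata must be generated while the processor's context still carries its span:
    
    ```go
    // Hypothetical sketch of the ordering bug: InternalClose restores the
    // original context, so metadata generated afterwards no longer sees the
    // processor's span.
    package main
    
    import (
        "context"
        "fmt"
    )
    
    type ctxKey struct{}
    
    type processor struct {
        origCtx context.Context // the context passed into StartInternal
        Ctx     context.Context // the context used while the processor runs
    }
    
    func (p *processor) StartInternal(ctx context.Context) {
        p.origCtx = ctx
        // Stand-in for deriving a context that carries a tracing span.
        p.Ctx = context.WithValue(ctx, ctxKey{}, "processor span")
    }
    
    func (p *processor) InternalClose() { p.Ctx = p.origCtx }
    
    func (p *processor) generateMeta() string {
        if v := p.Ctx.Value(ctxKey{}); v != nil {
            return fmt.Sprintf("metadata recorded in %q", v)
        }
        return "metadata recorded without a span"
    }
    
    func main() {
        p := &processor{}
        p.StartInternal(context.Background())
        p.InternalClose()             // wrong: close first ...
        fmt.Println(p.generateMeta()) // ... metadata loses the span
    
        p = &processor{}
        p.StartInternal(context.Background())
        fmt.Println(p.generateMeta()) // right: metadata first ...
        p.InternalClose()             // ... then close
    }
    ```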
    
    Additionally, a few unnecessary method overrides are removed, and
    `ProcessorBase` now provides the default implementation of
    `ConsumerClosed` (which simply calls `InternalClose`).
    
    This commit also makes a slight adjustment to context management of the
    change aggregator.
    
    Release justification: low-risk cleanup.
    
    Release note: None
    yuzefovich authored and tbg committed Mar 4, 2021 (ba35f1b)
  5. server: remove unnecessary % escape

    As of commit 1982047, this log line is formatted with Go's
    templating system, which does not require escaping `%`.
    
    Before:
    
    ```
    I210303 09:56:17.459802 311 2@server/status/runtime.go:553 ⋮ [n1] 76
    runtime stats: 106 MiB RSS, 261 goroutines (stacks: 5.0 MiB), 29
    MiB/63 MiB Go alloc/total (heap fragmentation: 6.8 MiB, heap reserved:
    11 MiB, heap released: 12 MiB), 7.6 MiB/13 MiB CGO alloc/total (0.6
    CGO/sec), 1.8/1.2 %%(u/s)time, 0.0 %%gc (0x ), 17 KiB/22 KiB (r/w)net
    ```
    
    After:
    
    ```
    I210303 09:59:17.607021 59 2@server/status/runtime.go:553 ⋮ [n1] 71
    runtime stats: 92 MiB RSS, 255 goroutines (stacks: 3.7 MiB), 27 MiB/48
    MiB Go alloc/total (heap fragmentation: 4.8 MiB, heap reserved: 1.3
    MiB, heap released: 27 MiB), 10 MiB/15 MiB CGO alloc/total (0.0
    CGO/sec), 0.0/0.0 %(u/s)time, 0.0 %gc (0x), 30 KiB/34 KiB (r/w)net
    ```
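    
    As a stand-alone illustration of why the escape is no longer needed (not the actual logging code): Printf-style verbs require a literal percent to be written as `%%`, while template-based output treats `%` as an ordinary character.
    
    ```go
    package main
    
    import (
        "fmt"
        "os"
        "text/template"
    )
    
    func main() {
        // Printf-style formatting: "%" must be escaped as "%%".
        fmt.Printf("1.8/1.2 %%(u/s)time, 0.0 %%gc\n")
    
        // Template-based formatting: "%" needs no escaping.
        t := template.Must(template.New("rt").Parse(
            "{{.U}}/{{.S}} %(u/s)time, {{.GC}} %gc\n"))
        _ = t.Execute(os.Stdout, struct{ U, S, GC float64 }{1.8, 1.2, 0.0})
    }
    ```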
    
    Release justification: Negligible risk.
    Release note: None
    stevendanna authored and tbg committed Mar 4, 2021 (f85ff2a)
  6. closedts/sidetransport: fix goroutine leak

    Fix a goroutine leak caused by gRPC streams. It turns out that
    receiving an error from an RPC's stream is not enough to clean up the
    stream; an additional step is required, as described at
    https://pkg.go.dev/google.golang.org/grpc#ClientConn.NewStream
    
    This patch starts canceling the stream's context whenever the stream is
    discarded.
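    
    A minimal sketch of the pattern (hypothetical service/method names, not the sidetransport code): give the stream its own cancelable context and cancel it whenever the stream is discarded, since an error from the stream alone does not release its resources.
    
    ```go
    package main
    
    import (
        "context"
        "log"
    
        "google.golang.org/grpc"
    )
    
    func runStream(conn *grpc.ClientConn) error {
        // Derive a cancelable context for the stream and cancel it when we
        // walk away from the stream, per the ClientConn.NewStream docs.
        ctx, cancel := context.WithCancel(context.Background())
        defer cancel()
    
        desc := &grpc.StreamDesc{StreamName: "PushUpdates", ClientStreams: true}
        // The method path below is hypothetical.
        stream, err := conn.NewStream(ctx, desc, "/example.SideTransport/PushUpdates")
        if err != nil {
            return err
        }
        if err := stream.SendMsg(&struct{}{}); err != nil {
            // Seeing this error is not enough to clean up the stream; the
            // deferred cancel() is what releases its resources.
            return err
        }
        return stream.CloseSend()
    }
    
    func main() {
        conn, err := grpc.Dial("localhost:26257", grpc.WithInsecure())
        if err != nil {
            log.Fatal(err)
        }
        defer conn.Close()
        _ = runStream(conn)
    }
    ```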
    
    Release note: None
    Release justification: Bug fix for new functionality.
    andreimatei authored and tbg committed Mar 4, 2021 (050241d)
  7. colflow: clean up vectorized stats propagation

    Previously, in order to propagate execution statistics we were creating
    temporary tracing spans, setting the stats on them, and finishing the
    spans right away. This let us use (or, more precisely, abuse) the
    existing infrastructure. The root of the problem is that the vectorized
    engine does not start a per-operator span even when stats collection is
    enabled, so we had to work around that limitation.
    
    However, this is not how tracing spans are intended to be used, and it
    introduces a performance slowdown in the hot path, so this commit
    refactors the approach. We now ensure that a tracing span is always
    available at the "root" components (either the root materializer or an
    outbox), so that when a root component finishes the vectorized stats
    collectors for its subtree of operators, there is a span to record the
    stats into.
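    
    A toy illustration of that idea (hypothetical types, not the colflow code): the root component guarantees a span exists and records the stats of its operator subtree into it when draining, instead of each operator creating and immediately finishing a short-lived span just to carry its stats.
    
    ```go
    package main
    
    import "fmt"
    
    type span struct{ recorded []string }
    
    func (s *span) record(stat string) { s.recorded = append(s.recorded, stat) }
    
    type statsCollector struct{ name string }
    
    func (c statsCollector) stats() string { return c.name + ": rows=1" }
    
    type rootComponent struct {
        sp         *span // always non-nil: set up when the root starts
        collectors []statsCollector
    }
    
    // drain flushes every collector's stats into the root's span, e.g. on
    // ConsumerDone (materializer) or at the end of Run (outbox).
    func (r *rootComponent) drain() {
        for _, c := range r.collectors {
            r.sp.record(c.stats())
        }
    }
    
    func main() {
        root := &rootComponent{
            sp:         &span{},
            collectors: []statsCollector{{"colBatchScan"}, {"hashJoiner"}},
        }
        root.drain()
        fmt.Println(root.sp.recorded)
    }
    ```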
    
    This required the following minor adjustments:
    - in the materializer, we now delegate attachment of the stats to the
    tracing span to the drainHelper (which does so on `ConsumerDone`). Note
    that the drainHelper doesn't get the recording from the span and leaves
    that to the materializer (this is needed in order to avoid collecting
    duplicate trace data).
    - in the outbox, we now start a "remote child span" (if there is a span
    in the parent context) at the beginning of the `Run` method, and we
    attach the stats in `sendMetadata`.
    
    Release justification: low-risk update to existing functionality.
    
    Release note: None
    yuzefovich authored and tbg committed Mar 4, 2021 (ec7d1d6)
  8. [DNM] adjust KV benchmark to introduce sample rate

    One could run the relevant benchmark with
    `make bench PKG=./pkg/sql/tests BENCHES=BenchmarkKV//SQL`.
    
    Release note: None
    yuzefovich authored and tbg committed Mar 4, 2021 (d94e463)
  9. WIP remove NodeLevelStats

    yuzefovich authored and tbg committed Mar 4, 2021 (7d843f3)
  10. WIP pool ComponentStats objects

    yuzefovich authored and tbg committed Mar 4, 2021 (f273b30)
  11. e31d115
  12. sql: use stmt's span for exec stats propagation

    Release justification: low-risk update to new functionality.
    
    Release note: None
    yuzefovich authored and tbg committed Mar 4, 2021 (95970ec)
  13. metric: break dependency on 'log' package

    Release justification: low-risk change and prerequisite for metrics
    improvement.
    Release note: None
    tbg committed Mar 4, 2021 (bf5d4a4)
  14. tracing: add metrics

    Release justification: low-risk, high benefit change to existing
    functionality
    Release note (ops change): Added the `trace.spans_created.count` and
    `trace.registry.size` metrics that account for the number of created
    trace spans and tracked local root spans, respectively.
    tbg committed Mar 4, 2021 (8b7fd22)
  15. Run benchmark without GC

    ```
    $ ./cmp.sh
    name                                 old time/op  new time/op  delta
    KV/Scan/SQL/rows=1/sample_rate=X-16   267µs ± 9%   313µs ± 8%  +17.32%  (p=0.000 n=22+22)
    ```
    
    Release justification:
    Release note (<category, see below>): <what> <show> <why>
    tbg committed Mar 4, 2021 (4953761)
  16. tracing: add synthetic SpanStats benchmark

    ```
    $ go test -run - -bench SpanStats -count 1 ./pkg/util/tracing
    goos: linux
    goarch: amd64
    pkg: github.com/cockroachdb/cockroach/pkg/util/tracing
    BenchmarkSpanStats/autoCollection=false,stats=0-16         	 280952	     4326 ns/op
    BenchmarkSpanStats/autoCollection=false,stats=1-16         	 158040	     7901 ns/op
    
    BenchmarkSpanStats/autoCollection=true,stats=0-16          	 471117	     2631 ns/op
    BenchmarkSpanStats/autoCollection=true,stats=1-16          	 242348	     5919 ns/op
    PASS
    ok  	github.com/cockroachdb/cockroach/pkg/util/tracing	8.414s
    ```
    
    The case `autoCollection=false,stats=1` should roughly reflect the SQL
    usage in which stats are created, transported via a recording, ingested
    into a parent, and then retrieved and unmarshaled. The case `false,stats=0`
    should correspond to the non-sampled case. We are thus expecting roughly
    an overhead of 4.3µs/op when not sampled, 8µs/op when sampled, i.e.
    changing the sampling rate from 0 to 1 should add ~4µs/op.
    However, the macro-level benchmark shows much more:
    
    ```
    KV/Scan/SQL/rows=1/sample_rate=X-16   268µs ± 9%   314µs ± 7%  +17.03%  (p=0.000 n=21+21)
    ```
    
    That's an increase of roughly 46µs incurred by changing the sampling
    rate from zero to one, far more than this microbenchmark can explain.
    
    The autoCollection=true numbers avoid an extra unmarshal-remarshal cycle
    of the recording. This is a possible optimization for SQL to investigate
    in the case in which the flow lives wholly on the gateway.
    Unfortunately, it is finicky to get right and causes lots of special
    casing, as SQL currently relies on all trace data arriving at the
    DistSQL RowReceiver via TrailingMeta (i.e. marshalled recordings).
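    
    For orientation, a self-contained sketch of the sub-benchmark matrix shape used above; the inner workload (a gob encode/decode round trip) is a hypothetical stand-in for marshaling and unmarshaling the stats, not the real SpanStats benchmark.
    
    ```go
    package sketch
    
    import (
        "bytes"
        "encoding/gob"
        "fmt"
        "testing"
    )
    
    type fakeStats struct{ RowsRead, BytesRead int64 }
    
    func BenchmarkSpanStatsSketch(b *testing.B) {
        for _, autoCollection := range []bool{false, true} {
            for _, stats := range []int{0, 1} {
                b.Run(fmt.Sprintf("autoCollection=%t,stats=%d", autoCollection, stats), func(b *testing.B) {
                    for i := 0; i < b.N; i++ {
                        if stats == 0 {
                            continue
                        }
                        // Marshal the stats once; without auto-collection,
                        // pay for an extra unmarshal-remarshal cycle of the
                        // recording.
                        rounds := 1
                        if !autoCollection {
                            rounds = 2
                        }
                        for r := 0; r < rounds; r++ {
                            var buf bytes.Buffer
                            _ = gob.NewEncoder(&buf).Encode(fakeStats{RowsRead: 1, BytesRead: 64})
                            var out fakeStats
                            _ = gob.NewDecoder(&buf).Decode(&out)
                        }
                    }
                })
            }
        }
    }
    ```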
    
    Release justification:
    Release note (<category, see below>): <what> <show> <why>
    tbg committed Mar 4, 2021 (68eceb0)
  17. add spans/op measurement to BenchmarkKV

    This shows that in this benchmark, we're creating one additional span: 7
    without sampling, 8 with sampling.
    
    This one additional trace span is unlikely to cost ~40µs, but avoiding
    it would establish more parity between the tests. Can we?
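    
    A minimal sketch, using the standard `testing` API, of how a spans/op figure can be attached to a benchmark; `countSpans` and `runWorkload` are hypothetical stand-ins, not the actual BenchmarkKV code.
    
    ```go
    package bench
    
    import "testing"
    
    // Hypothetical stand-ins for observing span creation (e.g. via the
    // trace.spans_created.count metric added earlier) and for the workload.
    func countSpans() int64 { return 0 }
    func runWorkload()      {}
    
    func BenchmarkKVWithSpans(b *testing.B) {
        before := countSpans()
        for i := 0; i < b.N; i++ {
            runWorkload()
        }
        created := countSpans() - before
        // Reported alongside ns/op in the benchmark output.
        b.ReportMetric(float64(created)/float64(b.N), "spans/op")
    }
    ```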
    tbg committed Mar 4, 2021 (a0e2cf8)