
sql: evaluate sql.txn_stats.sample_rate default #59379

Closed
asubiotto opened this issue Jan 25, 2021 · 35 comments · Fixed by #61760
Comments

@asubiotto
Contributor

asubiotto commented Jan 25, 2021

#59132 adds the sql.txn_stats.sample_rate cluster setting. The default is 0, which means execution stats are never sampled. However, we will soon change the DB Console page to show these sampled execution stats, which means the setting has to be nonzero to surface interesting information.

This issue tracks quantifying the performance impact of turning the setting all the way up to always sample, and determining whether it is realistic to remove the setting altogether. If not, we may want a low but nonzero default sample rate that still provides interesting data.
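For intuition, the sampling decision itself is just a coin flip per execution against the configured rate; a minimal, illustrative sketch in Go (not the actual CockroachDB code path, names are made up):

package main

import (
    "fmt"
    "math/rand"
)

// shouldSample returns true with probability sampleRate: 0 never samples
// (the current default), 1 samples every execution.
func shouldSample(sampleRate float64) bool {
    return rand.Float64() < sampleRate
}

func main() {
    const rate = 0.1 // a hypothetical rate, for illustration only
    sampled := 0
    for i := 0; i < 100000; i++ {
        if shouldSample(rate) {
            sampled++
        }
    }
    fmt.Printf("sampled %d of 100000 executions (~%.0f%%)\n", sampled, rate*100)
}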

@asubiotto asubiotto added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 25, 2021
@asubiotto asubiotto self-assigned this Feb 1, 2021
@yuzefovich
Member

Taking this over.

@yuzefovich yuzefovich assigned yuzefovich and unassigned asubiotto Feb 24, 2021
@yuzefovich yuzefovich changed the title sql: evaluate sql.statement_stats.sample_rate default sql: evaluate sql.txn_stats.sample_rate default Feb 24, 2021
@yuzefovich
Member

Initial benchmarks suggest that using a sample rate of 0.5 or 1 incurs a significant performance hit (on the order of 20%) on the KV95 workload with a 1 min ramp and 5 min duration on a 3-node roachprod cluster:

  • GCE:
    • sample_rate=0: 8584.3
    • sample_rate=0.5: 7777.8, -9.4%
    • sample_rate=1: 7120.9, -17.0%
  • AWS:
    • sample_rate=0: 10485.8
    • sample_rate=0.5: 8934.1, -14.8%
    • sample_rate=1: 8237.3, -21.4%

An initial look at the profile suggests that crdbSpan.getRecording is to blame (the profile covers 30s):
[screenshot: CPU profile]

@yuzefovich
Member

Looks like there are two main time sinks there:

  • assigning an entry to tracingpb.RecordedSpan.Tags map:
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/util/tracing.(*crdbSpan).getRecordingLocked.func1 in /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/crdbspan.go
         0      560ms (flat, cum)  0.98% of Total
         .          .    278:		rs.Finished = true
         .          .    279:	}
         .          .    280:
         .          .    281:	addTag := func(k, v string) {
         .          .    282:		if rs.Tags == nil {
         .       50ms    283:			rs.Tags = make(map[string]string)
         .          .    284:		}
         .      510ms    285:		rs.Tags[k] = v
         .          .    286:	}
  • calling tag.ValueStr()
    It is possible that this comment is relevant:
ROUTINE ======================== github.com/cockroachdb/redact.StringWithoutMarkers in /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/redact/markers_print.go
      40ms      1.97s (flat, cum)  3.44% of Total
         .          .     91:// Note: if this function is ever found to be a performance
         .          .     92:// bottleneck, one can consider using an alternate implementation of
         .          .     93:// Sprint() which similarly calls the SafeFormat() methods but does
         .          .     94:// not introduce markers and instead writes to a string buffer
         .          .     95:// directly.
      10ms       10ms     96:func StringWithoutMarkers(f SafeFormatter) string {
      30ms      1.96s     97:	return Sprint(f).StripMarkers()
         .          .     98:}

that is being called by

         .          .     36:// String returns the node ID, or "?" if it is unset.
         .          .     37:func (n *NodeIDContainer) String() string {
      10ms      1.98s     38:	return redact.StringWithoutMarkers(n)
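One generic way to take this kind of repeated stringification off the hot path is to render the value once and cache it, since a node ID effectively never changes after startup. A hedged sketch of that pattern (illustrative only, not the actual NodeIDContainer implementation):

package main

import (
    "fmt"
    "sync/atomic"
)

// cachedNodeID formats the ID to a string once and reuses it on subsequent
// calls, avoiding repeated formatting work on a hot path. Illustrative
// pattern only; not CockroachDB's NodeIDContainer.
type cachedNodeID struct {
    id  int32
    str atomic.Value // string
}

func (c *cachedNodeID) String() string {
    if s, ok := c.str.Load().(string); ok {
        return s // fast path: already rendered
    }
    s := fmt.Sprintf("n%d", atomic.LoadInt32(&c.id))
    c.str.Store(s)
    return s
}

func main() {
    n := &cachedNodeID{id: 3}
    fmt.Println(n.String(), n.String()) // formatted once, reused after
}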

@tbg
Member

tbg commented Feb 25, 2021

  1. you're not accidentally starting verbose spans, right? Just to double check
  2. I don't see why any tags would need to be added to the span. Tags are slow! I see code like this

if span != nil && span.IsVerbose() {
m.statsCollectionEnabled = true
span.SetTag(execinfrapb.FlowIDTagKey, m.flowCtx.ID.String())
span.SetTag(execinfrapb.StreamIDTagKey, m.streamID)
}

which makes it seem as though tags are somehow used to transport stats, which is not a good idea in the hot path. I think this is legacy code, but see if things like that are the problem.

  3. do you see a similar slowdown in one of the in-process tests? It will be much more pleasant to iterate in a local setup.

tbg added a commit to tbg/cockroach that referenced this issue Feb 25, 2021
This avoids some of the obviously problematic expensive stringifications
observed in cockroachdb#59379.

Release justification: low-risk performance optimization that unblocks sql stats.
Release note: None
@tbg
Member

tbg commented Feb 25, 2021

I couldn't really see the same kind of drop on a local three node roachprod cluster running stock kv95, but I do see that #61116 removes the stringification of tags completely (as intended). Try with that PR and see if that helps.

@tbg
Member

tbg commented Feb 25, 2021

Resulting flame graph (bit confusing thanks to recursion, but you get the idea)

[screenshot: flame graph]

@tbg
Member

tbg commented Feb 25, 2021

This seems unnecessary - the spanstats put tags in the span:

[screenshot: profile]

@tbg
Member

tbg commented Feb 25, 2021

^-- the CPU slice here isn't maybe all that big altogether, but the allocs I think are substantial:

[screenshot: allocation profile]

tbg added a commit to tbg/cockroach that referenced this issue Feb 25, 2021
@asubiotto
Contributor Author

Yeah, we've moved away from using tags in the spans (in favor of ComponentIDs in the ComponentStats proto) so we can probably remove that behavior.
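To make the contrast concrete, here is a hedged, self-contained sketch of the two transport styles: string tags rendered eagerly on the hot path versus a single typed payload that is decoded only when the recording is inspected. The span and stats types below are simplified stand-ins, not the real tracing or execinfrapb APIs.

package main

import "fmt"

// Simplified stand-ins; not the real tracing span or ComponentStats proto.
type componentStats struct {
    FlowID   string
    StreamID int32
    RowsRead int64
}

type span struct {
    tags       map[string]string
    structured []interface{}
}

// Old style: every stat becomes a string tag, paying for formatting and a map
// insert on the hot path even if nobody ever reads the recording.
func (s *span) setTag(k, v string) {
    if s.tags == nil {
        s.tags = make(map[string]string)
    }
    s.tags[k] = v
}

// New style: attach one typed object; it is decoded only when stats are
// actually extracted from the recording.
func (s *span) recordStructured(payload interface{}) {
    s.structured = append(s.structured, payload)
}

func main() {
    sp := &span{}
    stats := &componentStats{FlowID: "f1", StreamID: 7, RowsRead: 1024}

    sp.setTag("cockroach.flowid", stats.FlowID) // tag-based transport
    sp.setTag("cockroach.streamid", fmt.Sprint(stats.StreamID))
    sp.recordStructured(stats) // structured transport

    fmt.Println(len(sp.tags), "tags,", len(sp.structured), "structured payloads")
}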

@yuzefovich
Member

yuzefovich commented Feb 26, 2021

Thanks for looking into this Tobi!

Replying to your questions from #59379 (comment).

  1. you're not accidentally starting verbose spans, right? Just to double check

I don't think so; we only use verbose spans when tracing is enabled for the whole session or with EXPLAIN ANALYZE. I hope I'm not reading the code wrong here. I confirmed manually on my laptop that SetVerbose is not being called when we start sampling.

  2. I don't see why any tags would need to be added to the span. Tags are slow! I see code like this

Ack. The code you linked gets executed only if verbose tracing is enabled. I did find a few things (#61158) where we would be setting tags in non-verbose tracing. I reran my benchmarks with both #61116 and #61158 cherry-picked, and I see an improvement, but the hit with tracing "always on" is still very high - on the order of 10-15%.

  3. do you see a similar slowdown in one of the in-process tests? It will be much more pleasant to iterate in a local setup.

I haven't tried any of the local benchmarks yet, but I'll post here if I find something useful.

craig bot pushed a commit that referenced this issue Feb 26, 2021
61041: multiregionccl: gate multi-region database configuration behind a license r=ajstorm a=otan

Resolves #59668 

Release justification: low risk changes to new functionality

Release note (enterprise change): Multi-region database creations are
permitted as long as the cluster has a CockroachDB subscription.

61158: tracing: do not set tags when setting stats r=yuzefovich a=yuzefovich

**colflow: do not set redundant tag on the tracing span**

This is unnecessary given that we're setting the componentID below that
includes the flowID.

Release justification: low-risk update to new functionality.

Release note: None

**tracing: do not set tags when setting stats**

We no longer need to set tags on the tracing span in order to propagate
stats since we now propagate that info as a whole object (at the moment,
twice - as a Structured payload and as DeprecatedStats protobuf; the
latter will be removed after 21.1 branch is cut).

Addresses: #59379.

Release justification: low-risk update to new functionality.

Release note: None

Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Feb 26, 2021
61116: tracing: elide tags from recording if not verbose r=tbg a=tbg

This avoids some of the obviously problematic expensive stringifications
observed in #59379.

Release justification: low-risk performance optimization that unblocks sql stats.
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@yuzefovich
Member

The numbers I got yesterday, with both #61158 and #61116 cherry-picked and a 3 min ramp and 15 min duration:

  • GCE:
    • sample_rate=0: 9335.7
    • sample_rate=0.05: 9235.3, -1.1%
    • sample_rate=0.1: 9211.2, -1.3%
    • sample_rate=0.15: 9078.7, -2.8%
  • AWS:
    • sample_rate=0: 10636.2
    • sample_rate=0.05: 10488.1, -1.4%
    • sample_rate=0.1: 10481.1, -1.5%
    • sample_rate=0.15: 10347.0, -2.7%

@asubiotto
Contributor Author

Nice! That looks promising cc @awoods187 @jordanlewis. What's the bottleneck now?

@awoods187
Contributor

awoods187 commented Mar 1, 2021

This is interesting. Previously, I had to push the sample rate up to 0.5 to see meaningful network and memory values when running TPC-C with 10 warehouses via roachprod local.

I'm concerned that too low a sampling value will leave these columns largely empty, while too high a value will cause significant performance degradation. Even a value of 0.15 impacts the cluster by 2.7%, something we haven't been willing to accept in the past for optimizer or other execution changes.

@asubiotto
Contributor Author

asubiotto commented Mar 1, 2021

Note that this is with the sampling rate fully turned on. Oops, my bad.

@yuzefovich
Member

I implemented all of the hacky ideas mentioned so far in https://github.com/yuzefovich/cockroach/commits/sample-rate; however, this didn't change the microbenchmark numbers or the KV95 numbers on a 3-node cluster.

When looking at the profiles of a 3-node cluster with sample_rate=1, I see some time spent in getRecording and ExtractStatsFromSpans. However, when diving into the details I don't see much we can chop off - to me it seems like propagating the execution stats through traces has inherent overhead, which shows up as the performance hit. On a 30-second profile I see this:

ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/util/tracing.(*crdbSpan).getRecording in /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/crdbspan.go
      60ms      1.88s (flat, cum)  3.39% of Total
...
         .      190ms    167:	result := make(Recording, 0, 1+len(s.mu.recording.children)+len(s.mu.recording.remoteSpans))
         .          .    168:	// Shallow-copy the children so we can process them without the lock.
      10ms       10ms    169:	children := s.mu.recording.children
         .      620ms    170:	result = append(result, s.getRecordingLocked(wantTags))
      10ms       50ms    171:	result = append(result, s.mu.recording.remoteSpans...)
         .       70ms    172:	s.mu.Unlock()
         .          .    173:
         .          .    174:	for _, child := range children {
      20ms      680ms    175:		result = append(result, child.getRecording(everyoneIsV211, wantTags)...)
         .          .    176:	}
         .          .    177:
         .          .    178:	// Sort the spans by StartTime, except the first Span (the root of this
         .          .    179:	// recording) which stays in place.
         .       50ms    180:	toSort := sortPool.Get().(*Recording) // avoids allocations in sort.Sort
      10ms       10ms    181:	*toSort = result[1:]
      10ms      150ms    182:	sort.Sort(toSort)
         .          .    183:	*toSort = nil
         .       30ms    184:	sortPool.Put(toSort)
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/util/tracing.(*crdbSpan).getRecordingLocked in /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/crdbspan.go
      40ms      620ms (flat, cum)  1.12% of Total
...
      20ms       30ms    299:	if numEvents := s.mu.structured.Len(); numEvents != 0 {
         .       10ms    300:		rs.InternalStructured = make([]*types.Any, 0, numEvents)
         .          .    301:		for i := 0; i < numEvents; i++ {
         .       20ms    302:			event := s.mu.structured.Get(i).(Structured)
         .      480ms    303:			item, err := types.MarshalAny(event)
         .          .    304:			if err != nil {
         .          .    305:				// An error here is an error from Marshal; these
         .          .    306:				// are unlikely to happen.
         .          .    307:				continue
         .          .    308:			}
         .          .    309:			rs.InternalStructured = append(rs.InternalStructured, item)
         .          .    310:		}
         .          .    311:	}
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/sql/execinfrapb.ExtractStatsFromSpans.func1 in /go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfrapb/component_stats.go
      20ms      460ms (flat, cum)  0.83% of Total
...
      20ms       20ms    345:		span.Structured(func(item *types.Any) {
         .       30ms    346:			if !types.Is(item, componentStats) {
         .          .    347:				return
         .          .    348:			}
         .       40ms    349:			stats := NewComponentStats()
         .      340ms    350:			if err := protoutil.Unmarshal(item.Value, stats); err != nil {
         .          .    351:				return
         .          .    352:			}

At this point, I don't know what else to try to reduce the performance hit. Curious to hear whether @tbg or @irfansharif have some suggestions.

I've been using this script (no judgement please, I'm a bash noob :D) to run the KV95 benchmarks if anyone is interested in running it themselves.
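As an aside, the NewComponentStats and Unmarshal lines in the profile above are allocation-heavy, and one possible mitigation (pooling the stats objects instead of allocating a fresh one per span, which gets prototyped later in this thread) looks roughly like the sketch below. This uses a hypothetical stats struct rather than the real execinfrapb proto:

package main

import (
    "fmt"
    "sync"
)

// componentStats is a stand-in for the real execinfrapb.ComponentStats proto;
// the pooling pattern is what matters here, not the exact fields.
type componentStats struct {
    RowsRead int64
    Bytes    int64
}

var statsPool = sync.Pool{
    New: func() interface{} { return &componentStats{} },
}

// newComponentStats hands out a (possibly recycled) stats object, avoiding a
// fresh allocation for every span we extract stats from.
func newComponentStats() *componentStats {
    return statsPool.Get().(*componentStats)
}

// releaseComponentStats resets the object and returns it to the pool once the
// caller is done aggregating it into query-level stats.
func releaseComponentStats(s *componentStats) {
    *s = componentStats{}
    statsPool.Put(s)
}

func main() {
    s := newComponentStats()
    s.RowsRead, s.Bytes = 100, 4096
    fmt.Println("rows:", s.RowsRead, "bytes:", s.Bytes)
    releaseComponentStats(s)
}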

@tbg
Member

tbg commented Mar 4, 2021

The meta question that I have is why we're blaming tracing right now. My experiments over in #61328 (comment) (which I'd like you to scrutinize/confirm! I might have messed something up) indicate that the overhead of stats collection related to marshalling, transporting, unmarshalling the extra payloads should be in the single-digit microsecond range, but I see more like 40 microseconds in the single-in-mem-server benchmark. Let's discuss over on #61328.

@asubiotto
Contributor Author

It's possible that something else is slowing this down. The VectorizedStatsCollector might be slowing things down when inserted into a flow. We could also experiment with pooling all operators in the flow to see what changes. Since it's hard to see anything from the profiles, I think the way forward here is to list the differences between a non-instrumented flow and an instrumented flow and narrow down the search surface area by, e.g., not allocating stats collectors at all, or making batchInfoCollector.Next a no-op.

@tbg
Member

tbg commented Mar 4, 2021

Note that ./cmp.sh (the macrobenchmark) intentionally disables garbage collection, so I don't think pooling should make a difference in at least that benchmark.

craig bot pushed a commit that referenced this issue Mar 4, 2021
61380: colflow: clean up vectorized stats propagation r=yuzefovich a=yuzefovich

Previously, in order to propagate execution statistics we were creating
temporary tracing spans, setting the stats on them, and finishing the
spans right away. This allowed for using (to be more precise, abusing)
the existing infrastructure. The root of the problem is that in the
vectorized engine we do not start per-operator span if stats collection
is enabled at the moment, so we had to get around that limitation.

However, this way is not how tracing spans are intended to be used and
creates some performance slowdown in the hot path, so this commit
refactors the situation. Now we are ensuring that there is always
a tracing span available at the "root" components (either root
materializer or an outbox), so when root components are finishing the
vectorized stats collectors for their subtree of operators, there is
a span to record the stats into.

This required the following minor adjustments:
- in the materializer, we now delegate attachment of the stats to the
tracing span to the drainHelper (which does so on `ConsumerDone`). Note
that the drainHelper doesn't get the recording from the span and leaves
that to the materializer (this is needed in order to avoid collecting
duplicate trace data).
- in the outbox, we now start a "remote child span" (if there is a span
in the parent context) in the beginning of `Run` method, and we attach
that stats in `sendMetadata`.

Addresses: #59379.
Fixes: #59555.

Release justification: low-risk update to existing functionality.

Release note: None

61412: sql: clean up planNodeToRowSource r=yuzefovich a=yuzefovich

This commit removes some redundant things that were kept during
ccc5a8a. Namely, `planNodeToRowSource`
doesn't need to track whether it was started or not now that `startExec`
is called in `Start`. This also allows us to remove the override of
`InternalClose` method.

Release justification: low-risk update to existing functionality.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Mar 5, 2021
61529: sql: save flow diagrams only when needed r=yuzefovich a=yuzefovich

Previously, whenever we needed to save flows information, we would
always generate flow diagrams. However, those are not used when we are
sampling statements and become unnecessary work. This commit updates the
default `saveFlows` function to only generate flow diagrams when needed
(when we're collecting a stmt bundle or when we're running EXPLAIN
ANALYZE stmt).

Addresses: #59379.

Release justification: low-risk update to new functionality.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@yuzefovich
Member

yuzefovich commented Mar 5, 2021

Quick update: with #61529 cherry-picked, the performance hit of sample_rate=1 went down noticeably, to -5.7% on GCE and -7.1% on AWS in a single 5 min run.

@yuzefovich
Member

In my experiments on df11d41 with Tobi's commits from #61328 cherry-picked (but without my WIP commits), I observe the following breakdown of the current 13% perf hit in the macro benchmark (KV/Scan; all numbers are approximate):

  • 1% hit for simply creating a new span (the one in instrumentationHelper.Setup) and finishing it, without getting the recording. No stats are collected.
  • 2-3% hit for getting the recording from that span when no stats have been attached to it. We're getting the trace data that is attached by other components (most likely below SQL, like dist sender).
  • 1.5% for saving flows metadata via the saveFlows function while there are still no stats attached. saveFlows also calls NewFlowsMetadata, which seems to be responsible for about half of that 1.5% hit.
  • 2.5% for calling GetQueryLevelStats on a trace with no stats payload. Note that recording query level stats into appStats and txnStats seems to be free.
  • 2% for planning vectorized stats collectors, but without calling getStats on them and, therefore, still not adding stats payload to the trace.
  • 1.5% for calling getStats (i.e. retrieving the execution stats) but still not attaching them to the trace.
  • 1.5% for attaching the stats to the trace (i.e. performing the full action for sampling).
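Taken together, those buckets sum to roughly 1 + 2.5 + 1.5 + 2.5 + 2 + 1.5 + 1.5 ≈ 12.5%, which lines up with the ~13% overall hit.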

@yuzefovich
Member

yuzefovich commented Mar 5, 2021

I again prototyped a few of the things already mentioned above to see their impact on an actual KV95 run here.

  • the first commit removes an extra span created in instrumentationHelper.Setup and uses the already present one. This change seems to reduce the hit by a quarter or so. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -0.8%
      • sample_rate=1: -4.1%
    • AWS:
      • sample_rate=0.5: -3.5%
      • sample_rate=1: -6.7%
  • the second commit trims the trace analyzer to remove currently unused bookkeeping. Surprisingly, this commit actually seems to have made things a bit worse. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -3.0%
      • sample_rate=1: -6.8%
    • AWS:
      • sample_rate=0.5: -4.7%
      • sample_rate=1: -6.2%
  • the third commit pools ComponentStats objects. Note that the second commit made it easier to implement this commit. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -3.2%
      • sample_rate=1: -6.9%
    • AWS:
      • sample_rate=0.5: -2.7%
      • sample_rate=1: -5.0%
  • the fourth commit removes the additional tracing span created by the colbatchscans. Note that in the current form that commit cannot be merged to master because it would cause confusion during the computation of cumulative contention time when there are wrapped rowexec processors in the vectorized flow. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -3.8%
      • sample_rate=1: -5.4%
    • AWS:
      • sample_rate=0.5: -0.0%
      • sample_rate=1: -2.4%

Based on these results, it looks like the most beneficial thing is avoiding creating spans if possible. Thus, I'll open up a PR to include the first commit and will kick off more extensive benchmarks with it.

@yuzefovich
Member

yuzefovich commented Mar 5, 2021

With #61532 I got the following (3 numbers indicate durations of 5 min, 10 min, and 15 min, respectively):

  • GCE:
    • sample_rate=0.05: +1.6%, -1.4%, +0.3%
    • sample_rate=0.1: +1.6%, -1.0%, +1.3%
    • sample_rate=0.25: -0.2%, -2.8%, -2.6%
    • sample_rate=0.5: -2.5%, -4.5%, -3.0%
    • sample_rate=1: -2.8%, -7.3%, -5.8%
  • AWS:
    • sample_rate=0.05: -2.2%, -0.7%, -0.8%
    • sample_rate=0.1: -0.1%, -0.6%, -1.0%
    • sample_rate=0.25: -1.1%, -1.8%, -2.4%
    • sample_rate=0.5: -3.3%, -3.6%, -3.3%
    • sample_rate=1: -7.4%, -6.8%, -6.3%

@yuzefovich
Member

Reran the tests on 90d6ba4 with #61532 (Tobi suggested that #61359 might improve things). It looks like, without other improvements, sample_rate=0.1 will be the maximum recommended default value. The numbers are for 5 min and 10 min durations:

  • GCE:
    • sample_rate=0.05: -1.8%, -0.4%
    • sample_rate=0.1: -2.0%, -1.1%
    • sample_rate=0.25: -3.1%, -2.0%
    • sample_rate=0.5: -5.1%, -2.7%
    • sample_rate=1: -8.4%, -5.6%
  • AWS:
    • sample_rate=0.05: -2.2%, -0.6%
    • sample_rate=0.1: -1.5%, -1.6%
    • sample_rate=0.25: -3.1%, -3.2%
    • sample_rate=0.5: -5.1%, -4.6%
    • sample_rate=1: -8.9%, -8.6%

@asubiotto
Contributor Author

Let's set it to 0.1. I'll try to circle back later with a better sampling approach (i.e. sample the first occurrence of a fingerprint) in #61678.
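For illustration, a hedged sketch of what "sample the first occurrence of a fingerprint" could look like (the actual design in #61678 may differ): always collect execution stats the first time a statement fingerprint is seen, then fall back to the probabilistic rate.

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// fingerprintSampler always samples the first execution of each statement
// fingerprint and thereafter samples at the configured rate. Illustrative
// sketch only.
type fingerprintSampler struct {
    mu   sync.Mutex
    seen map[string]struct{}
    rate float64
}

func newFingerprintSampler(rate float64) *fingerprintSampler {
    return &fingerprintSampler{seen: make(map[string]struct{}), rate: rate}
}

func (s *fingerprintSampler) shouldSample(fingerprint string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.seen[fingerprint]; !ok {
        s.seen[fingerprint] = struct{}{}
        return true // first occurrence: always collect execution stats
    }
    return rand.Float64() < s.rate
}

func main() {
    s := newFingerprintSampler(0.1)
    fmt.Println(s.shouldSample("SELECT k, v FROM kv WHERE k = $1")) // true
    fmt.Println(s.shouldSample("SELECT k, v FROM kv WHERE k = $1")) // ~10% true
}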

craig bot pushed a commit that referenced this issue Mar 9, 2021
61532: sql: use stmt's span for exec stats propagation r=yuzefovich a=yuzefovich

Previously, when sampling the statement, we would always create a new
tracing span. However, there is another span that we can use instead:
we always create a tracing span for each statement in
`connExecutor.execCmd`. That span is not used directly for anything and
is needed because the transactions expect that a span is present in
their context. This commit utilizes the present tracing span for the
sampling purposes which gives us a performance boost (some benchmarks
show that this eliminates about a quarter of the performance overhead
with "always on" sampling).

Addresses: #59379.

Release justification: low-risk update to new functionality.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@irfansharif
Contributor

I've yet to digest the experiments here and on #61328, but looking at TPC-C I'm seeing about 10% additional memory usage just from creating new spans (mostly from connExecutor.execCmd): #59424 (comment). I don't suppose we can do much about that one, can we? I haven't worked out how that affects CPU usage. We're certainly taking on added GC pressure because of it. In any case, the memory overhead seems a bit worrying.

@irfansharif
Contributor

irfansharif commented Mar 10, 2021

Hm, ok, just familiarized myself with the above. I guess that one is pretty necessary. The analysis here and on #61328 looks sound; y'all are way ahead of me. Just a sanity check (especially since the percentages above are from 5-15 min runs): we arrived at 0.1 with GC switched on, right?

@craig craig bot closed this as completed in 7a26119 Mar 10, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what cockroachdb#59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements is currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  -0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
craig bot pushed a commit that referenced this issue Apr 2, 2021
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 2, 2021