
sql: evaluate sql.txn_stats.sample_rate default #59379

Closed
asubiotto opened this issue Jan 25, 2021 · 35 comments · Fixed by #61760
Comments

@asubiotto
Contributor

asubiotto commented Jan 25, 2021

#59132 adds the sql.txn_stats.sample_rate cluster setting. The default is 0, which means execution stats are never sampled. However, we will soon change the DB Console page to show these sampled execution stats, which means the setting has to be nonzero to surface interesting information.

This issue tracks quantifying the performance impact of turning the setting all the way up to always sample, and determining whether it is realistic to remove the setting altogether. If not, we may want a low but nonzero default sample rate that still provides interesting data.
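For intuition, the sampling decision itself is just a coin flip per execution against the configured rate; a minimal, illustrative sketch in Go (not the actual CockroachDB code path, names are made up):

package main

import (
    "fmt"
    "math/rand"
)

// shouldSample returns true with probability sampleRate: 0 never samples
// (the current default), 1 samples every execution.
func shouldSample(sampleRate float64) bool {
    return rand.Float64() < sampleRate
}

func main() {
    const rate = 0.1 // a hypothetical rate, for illustration only
    sampled := 0
    for i := 0; i < 100000; i++ {
        if shouldSample(rate) {
            sampled++
        }
    }
    fmt.Printf("sampled %d of 100000 executions (~%.0f%%)\n", sampled, rate*100)
}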

@asubiotto asubiotto added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Jan 25, 2021
@asubiotto asubiotto self-assigned this Feb 1, 2021
@yuzefovich
Member

Taking this over.

@yuzefovich yuzefovich assigned yuzefovich and unassigned asubiotto Feb 24, 2021
@yuzefovich yuzefovich changed the title sql: evaluate sql.statement_stats.sample_rate default sql: evaluate sql.txn_stats.sample_rate default Feb 24, 2021
@yuzefovich
Member

Initial benchmarks suggest that using a sample rate of 0.5 or 1 incurs a significant performance hit (on the order of 20%) on the KV95 workload with a 1 min ramp and 5 min duration on a 3-node roachprod cluster:

  • GCE:
    • sample_rate=0: 8584.3
    • sample_rate=0.5: 7777.8, -9.4%
    • sample_rate=1: 7120.9, -17.0%
  • AWS:
    • sample_rate=0: 10485.8
    • sample_rate=0.5: 8934.1, -14.8%
    • sample_rate=1: 8237.3, -21.4%

An initial look at the profile suggests that crdbSpan.getRecording is to blame (the profile covers 30s):
[screenshot: CPU profile]

@yuzefovich
Member

Looks like there are two main time sinks there:

  • assigning an entry to tracingpb.RecordedSpan.Tags map:
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/util/tracing.(*crdbSpan).getRecordingLocked.func1 in /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/crdbspan.go
         0      560ms (flat, cum)  0.98% of Total
         .          .    278:		rs.Finished = true
         .          .    279:	}
         .          .    280:
         .          .    281:	addTag := func(k, v string) {
         .          .    282:		if rs.Tags == nil {
         .       50ms    283:			rs.Tags = make(map[string]string)
         .          .    284:		}
         .      510ms    285:		rs.Tags[k] = v
         .          .    286:	}
  • calling tag.ValueStr()
    It is possible that this comment is relevant:
ROUTINE ======================== github.com/cockroachdb/redact.StringWithoutMarkers in /go/src/github.com/cockroachdb/cockroach/vendor/github.com/cockroachdb/redact/markers_print.go
      40ms      1.97s (flat, cum)  3.44% of Total
         .          .     91:// Note: if this function is ever found to be a performance
         .          .     92:// bottleneck, one can consider using an alternate implementation of
         .          .     93:// Sprint() which similarly calls the SafeFormat() methods but does
         .          .     94:// not introduce markers and instead writes to a string buffer
         .          .     95:// directly.
      10ms       10ms     96:func StringWithoutMarkers(f SafeFormatter) string {
      30ms      1.96s     97:	return Sprint(f).StripMarkers()
         .          .     98:}

that is being called by

         .          .     36:// String returns the node ID, or "?" if it is unset.
         .          .     37:func (n *NodeIDContainer) String() string {
      10ms      1.98s     38:	return redact.StringWithoutMarkers(n)
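One generic way to take this kind of repeated stringification off the hot path is to render the value once and cache it, since a node ID effectively never changes after startup. A hedged sketch of that pattern (illustrative only, not the actual NodeIDContainer implementation):

package main

import (
    "fmt"
    "sync/atomic"
)

// cachedNodeID formats the ID to a string once and reuses it on subsequent
// calls, avoiding repeated formatting work on a hot path. Illustrative
// pattern only; not CockroachDB's NodeIDContainer.
type cachedNodeID struct {
    id  int32
    str atomic.Value // string
}

func (c *cachedNodeID) String() string {
    if s, ok := c.str.Load().(string); ok {
        return s // fast path: already rendered
    }
    s := fmt.Sprintf("n%d", atomic.LoadInt32(&c.id))
    c.str.Store(s)
    return s
}

func main() {
    n := &cachedNodeID{id: 3}
    fmt.Println(n.String(), n.String()) // formatted once, reused after
}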

@tbg
Member

tbg commented Feb 25, 2021

  1. you're not accidentally starting verbose spans, right? Just to double check
  2. I don't see why any tags would need to be added to the span. Tags are slow! I see code like this

if span != nil && span.IsVerbose() {
m.statsCollectionEnabled = true
span.SetTag(execinfrapb.FlowIDTagKey, m.flowCtx.ID.String())
span.SetTag(execinfrapb.StreamIDTagKey, m.streamID)
}

which makes it seem as though tags are somehow used to transport stats, which is not a good idea in the hot path. I think this is legacy code, but see if things like that are the problem.

  3. do you see a similar slowdown in one of the in-process tests? It will be much more pleasant to iterate in a local setup.

tbg added a commit to tbg/cockroach that referenced this issue Feb 25, 2021
This avoids some of the obviously problematic expensive stringifications
observed in cockroachdb#59379.

Release justification: low-risk performance optimization that unblocks sql stats.
Release note: None
@tbg
Member

tbg commented Feb 25, 2021

I couldn't really see the same kind of drop on a local three node roachprod cluster running stock kv95, but I do see that #61116 removes the stringification of tags completely (as intended). Try with that PR and see if that helps.

@tbg
Member

tbg commented Feb 25, 2021

Resulting flame graph (bit confusing thanks to recursion, but you get the idea)

[screenshot: flame graph]

@tbg
Member

tbg commented Feb 25, 2021

This seems unnecessary - the spanstats put tags in the span:

[screenshot: profile]

@tbg
Member

tbg commented Feb 25, 2021

^-- the CPU slice here isn't maybe all that big altogether, but the allocs I think are substantial:

[screenshot: allocation profile]

tbg added a commit to tbg/cockroach that referenced this issue Feb 25, 2021
@asubiotto
Contributor Author

Yeah, we've moved away from using tags in the spans (in favor of ComponentIDs in the ComponentStats proto) so we can probably remove that behavior.
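To make the contrast concrete, here is a hedged, self-contained sketch of the two transport styles: string tags rendered eagerly on the hot path versus a single typed payload that is decoded only when the recording is inspected. The span and stats types below are simplified stand-ins, not the real tracing or execinfrapb APIs.

package main

import "fmt"

// Simplified stand-ins; not the real tracing span or ComponentStats proto.
type componentStats struct {
    FlowID   string
    StreamID int32
    RowsRead int64
}

type span struct {
    tags       map[string]string
    structured []interface{}
}

// Old style: every stat becomes a string tag, paying for formatting and a map
// insert on the hot path even if nobody ever reads the recording.
func (s *span) setTag(k, v string) {
    if s.tags == nil {
        s.tags = make(map[string]string)
    }
    s.tags[k] = v
}

// New style: attach one typed object; it is decoded only when stats are
// actually extracted from the recording.
func (s *span) recordStructured(payload interface{}) {
    s.structured = append(s.structured, payload)
}

func main() {
    sp := &span{}
    stats := &componentStats{FlowID: "f1", StreamID: 7, RowsRead: 1024}

    sp.setTag("cockroach.flowid", stats.FlowID) // tag-based transport
    sp.setTag("cockroach.streamid", fmt.Sprint(stats.StreamID))
    sp.recordStructured(stats) // structured transport

    fmt.Println(len(sp.tags), "tags,", len(sp.structured), "structured payloads")
}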

@yuzefovich
Member

yuzefovich commented Feb 26, 2021

Thanks for looking into this Tobi!

Replying to your questions from #59379 (comment).

  1. you're not accidentally starting verbose spans, right? Just to double check

I don't think so; we only use verbose spans when tracing is enabled for the whole session or with EXPLAIN ANALYZE. I hope I'm not reading the code wrong here. I confirmed manually on my laptop that SetVerbose is not being called when we start sampling.

  2. I don't see why any tags would need to be added to the span. Tags are slow! I see code like this

Ack. The code you linked gets executed only if verbose tracing is enabled. I did find a few things (#61158) where we would be setting tags in non-verbose tracing. I reran my benchmarks with both #61116 and #61158 cherry-picked, and I see an improvement, but the hit with tracing "always on" is still very high - on the order of 10-15%.

  3. do you see a similar slowdown in one of the in-process tests? It will be much more pleasant to iterate in a local setup.

I haven't tried any of the local benchmarks yet, but I'll post here if I find something useful.

craig bot pushed a commit that referenced this issue Feb 26, 2021
61041: multiregionccl: gate multi-region database configuration behind a license r=ajstorm a=otan

Resolves #59668 

Release justification: low risk changes to new functionality

Release note (enterprise change): Multi-region database creations are
permitted as long as the cluster has a CockroachDB subscription.

61158: tracing: do not set tags when setting stats r=yuzefovich a=yuzefovich

**colflow: do not set redundant tag on the tracing span**

This is unnecessary given that we're setting the componentID below that
includes the flowID.

Release justification: low-risk update to new functionality.

Release note: None

**tracing: do not set tags when setting stats**

We no longer need to set tags on the tracing span in order to propagate
stats since we now propagate that info as a whole object (at the moment,
twice - as a Structured payload and as DeprecatedStats protobuf; the
latter will be removed after 21.1 branch is cut).

Addresses: #59379.

Release justification: low-risk update to new functionality.

Release note: None

Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Feb 26, 2021
61116: tracing: elide tags from recording if not verbose r=tbg a=tbg

This avoids some of the obviously problematic expensive stringifications
observed in #59379.

Release justification: low-risk performance optimization that unblocks sql stats.
Release note: None


Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
@yuzefovich
Member

The numbers I got yesterday, with both #61158 and #61116 cherry-picked and a 3 min ramp and 15 min duration:

  • GCE:
    • sample_rate=0: 9335.7
    • sample_rate=0.05: 9235.3, -1.1%
    • sample_rate=0.1: 9211.2, -1.3%
    • sample_rate=0.15: 9078.7, -2.8%
  • AWS:
    • sample_rate=0: 10636.2
    • sample_rate=0.05: 10488.1, -1.4%
    • sample_rate=0.1: 10481.1, -1.5%
    • sample_rate=0.15: 10347.0, -2.7%

@asubiotto
Contributor Author

Nice! That looks promising cc @awoods187 @jordanlewis. What's the bottleneck now?

@awoods187
Contributor

awoods187 commented Mar 1, 2021

This is interesting. Previously, I had to push the sample rate up to 0.5 to see meaningful network and memory values when running TPC-C with 10 warehouses via roachprod local.

I'm concerned that too low a sampling value will leave these columns largely empty, while too high a value will cause significant performance degradation. Even a value of 0.15 impacts the cluster by 2.7%, something we haven't been willing to accept in the past for optimizer or other execution changes.

@asubiotto
Contributor Author

asubiotto commented Mar 1, 2021

Note that this is with the sampling rate fully turned on. Oops, my bad.

@yuzefovich
Member

I implemented all of the hacky ideas mentioned so far in https://github.com/yuzefovich/cockroach/commits/sample-rate; however, this didn't change the microbenchmark numbers or the KV95 numbers on a 3-node cluster.

When looking at the profiles of a 3-node cluster with sample_rate=1, I see some time spent in getRecording and ExtractStatsFromSpans. However, when diving into the details I don't see much we can chop off - to me it seems like propagating the execution stats through traces has inherent overhead, which shows up as the performance hit. On a 30-second profile I see this:

ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/util/tracing.(*crdbSpan).getRecording in /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/crdbspan.go
      60ms      1.88s (flat, cum)  3.39% of Total
...
         .      190ms    167:	result := make(Recording, 0, 1+len(s.mu.recording.children)+len(s.mu.recording.remoteSpans))
         .          .    168:	// Shallow-copy the children so we can process them without the lock.
      10ms       10ms    169:	children := s.mu.recording.children
         .      620ms    170:	result = append(result, s.getRecordingLocked(wantTags))
      10ms       50ms    171:	result = append(result, s.mu.recording.remoteSpans...)
         .       70ms    172:	s.mu.Unlock()
         .          .    173:
         .          .    174:	for _, child := range children {
      20ms      680ms    175:		result = append(result, child.getRecording(everyoneIsV211, wantTags)...)
         .          .    176:	}
         .          .    177:
         .          .    178:	// Sort the spans by StartTime, except the first Span (the root of this
         .          .    179:	// recording) which stays in place.
         .       50ms    180:	toSort := sortPool.Get().(*Recording) // avoids allocations in sort.Sort
      10ms       10ms    181:	*toSort = result[1:]
      10ms      150ms    182:	sort.Sort(toSort)
         .          .    183:	*toSort = nil
         .       30ms    184:	sortPool.Put(toSort)
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/util/tracing.(*crdbSpan).getRecordingLocked in /go/src/github.com/cockroachdb/cockroach/pkg/util/tracing/crdbspan.go
      40ms      620ms (flat, cum)  1.12% of Total
...
      20ms       30ms    299:	if numEvents := s.mu.structured.Len(); numEvents != 0 {
         .       10ms    300:		rs.InternalStructured = make([]*types.Any, 0, numEvents)
         .          .    301:		for i := 0; i < numEvents; i++ {
         .       20ms    302:			event := s.mu.structured.Get(i).(Structured)
         .      480ms    303:			item, err := types.MarshalAny(event)
         .          .    304:			if err != nil {
         .          .    305:				// An error here is an error from Marshal; these
         .          .    306:				// are unlikely to happen.
         .          .    307:				continue
         .          .    308:			}
         .          .    309:			rs.InternalStructured = append(rs.InternalStructured, item)
         .          .    310:		}
         .          .    311:	}
ROUTINE ======================== github.com/cockroachdb/cockroach/pkg/sql/execinfrapb.ExtractStatsFromSpans.func1 in /go/src/github.com/cockroachdb/cockroach/pkg/sql/execinfrapb/component_stats.go
      20ms      460ms (flat, cum)  0.83% of Total
...
      20ms       20ms    345:		span.Structured(func(item *types.Any) {
         .       30ms    346:			if !types.Is(item, componentStats) {
         .          .    347:				return
         .          .    348:			}
         .       40ms    349:			stats := NewComponentStats()
         .      340ms    350:			if err := protoutil.Unmarshal(item.Value, stats); err != nil {
         .          .    351:				return
         .          .    352:			}

At this point, I don't know what else to try to reduce the performance hit. Curious to hear whether @tbg or @irfansharif have some suggestions.

I've been using this script (no judgement please, I'm a bash noob :D) to run the KV95 benchmarks if anyone is interested in running it themselves.
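As an aside, the NewComponentStats and Unmarshal lines in the profile above are allocation-heavy, and one possible mitigation (pooling the stats objects instead of allocating a fresh one per span, which gets prototyped later in this thread) looks roughly like the sketch below. This uses a hypothetical stats struct rather than the real execinfrapb proto:

package main

import (
    "fmt"
    "sync"
)

// componentStats is a stand-in for the real execinfrapb.ComponentStats proto;
// the pooling pattern is what matters here, not the exact fields.
type componentStats struct {
    RowsRead int64
    Bytes    int64
}

var statsPool = sync.Pool{
    New: func() interface{} { return &componentStats{} },
}

// newComponentStats hands out a (possibly recycled) stats object, avoiding a
// fresh allocation for every span we extract stats from.
func newComponentStats() *componentStats {
    return statsPool.Get().(*componentStats)
}

// releaseComponentStats resets the object and returns it to the pool once the
// caller is done aggregating it into query-level stats.
func releaseComponentStats(s *componentStats) {
    *s = componentStats{}
    statsPool.Put(s)
}

func main() {
    s := newComponentStats()
    s.RowsRead, s.Bytes = 100, 4096
    fmt.Println("rows:", s.RowsRead, "bytes:", s.Bytes)
    releaseComponentStats(s)
}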

@tbg
Member

tbg commented Mar 4, 2021

The meta question that I have is why we're blaming tracing right now. My experiments over in #61328 (comment) (which I'd like you to scrutinize/confirm! I might have messed something up) indicate that the overhead of stats collection related to marshalling, transporting, unmarshalling the extra payloads should be in the single-digit microsecond range, but I see more like 40 microseconds in the single-in-mem-server benchmark. Let's discuss over on #61328.

@asubiotto
Contributor Author

It's possible that something else is slowing this down. The VectorizedStatsCollector might be slowing things down when inserted into a flow. We could also experiment with pooling all operators in the flow to see what changes. Since it's hard to see anything from the profiles, I think the way forward here is to list the differences between a non-instrumented flow and an instrumented flow and narrow down the search surface area by, e.g., not allocating stats collectors at all, or making batchInfoCollector.Next a no-op.

@tbg
Member

tbg commented Mar 4, 2021

Note that ./cmp.sh (the macrobenchmark) intentionally disables garbage collection, so I don't think pooling should make a difference in at least that benchmark.

craig bot pushed a commit that referenced this issue Mar 4, 2021
61380: colflow: clean up vectorized stats propagation r=yuzefovich a=yuzefovich

Previously, in order to propagate execution statistics we were creating
temporary tracing spans, setting the stats on them, and finishing the
spans right away. This allowed for using (to be more precise, abusing)
the existing infrastructure. The root of the problem is that in the
vectorized engine we do not start per-operator span if stats collection
is enabled at the moment, so we had to get around that limitation.

However, this way is not how tracing spans are intended to be used and
creates some performance slowdown in the hot path, so this commit
refactors the situation. Now we are ensuring that there is always
a tracing span available at the "root" components (either root
materializer or an outbox), so when root components are finishing the
vectorized stats collectors for their subtree of operators, there is
a span to record the stats into.

This required the following minor adjustments:
- in the materializer, we now delegate attachment of the stats to the
tracing span to the drainHelper (which does so on `ConsumerDone`). Note
that the drainHelper doesn't get the recording from the span and leaves
that to the materializer (this is needed in order to avoid collecting
duplicate trace data).
- in the outbox, we now start a "remote child span" (if there is a span
in the parent context) in the beginning of `Run` method, and we attach
that stats in `sendMetadata`.

Addresses: #59379.
Fixes: #59555.

Release justification: low-risk update to existing functionality.

Release note: None

61412: sql: clean up planNodeToRowSource r=yuzefovich a=yuzefovich

This commit removes some redundant things that were kept during
ccc5a8a. Namely, `planNodeToRowSource`
doesn't need to track whether it was started or not now that `startExec`
is called in `Start`. This also allows us to remove the override of
`InternalClose` method.

Release justification: low-risk update to existing functionality.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
craig bot pushed a commit that referenced this issue Mar 5, 2021
61529: sql: save flow diagrams only when needed r=yuzefovich a=yuzefovich

Previously, whenever we needed to save flows information, we would
always generate flow diagrams. However, those are not used when we are
sampling statements and become unnecessary work. This commit updates the
default `saveFlows` function to only generate flow diagrams when needed
(when we're collecting a stmt bundle or when we're running EXPLAIN
ANALYZE stmt).

Addresses: #59379.

Release justification: low-risk update to new functionality.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@yuzefovich
Member

yuzefovich commented Mar 5, 2021

Quick update: with #61529 cherry-picked, the performance hit of sample_rate=1 went down noticeably, to -5.7% on GCE and -7.1% on AWS in a single 5 min run.

@yuzefovich
Member

In my experiments on df11d41 with Tobi's commits from #61328 cherry-picked (but without my WIP commits), I observe the following breakdown of the current 13% perf hit in the macro benchmark (KV/Scan; all numbers are approximate):

  • 1% hit for simply creating a new span (the one in instrumentationHelper.Setup) and finishing it, without getting the recording. No stats are collected.
  • 2-3% hit for getting the recording from that span when no stats have been attached to it. We're getting the trace data that is attached by other components (most likely below SQL, like dist sender).
  • 1.5% for saving flows metadata via the saveFlows function while there are still no stats attached. saveFlows also calls NewFlowsMetadata, which seems to be responsible for about half of that 1.5% hit.
  • 2.5% for calling GetQueryLevelStats on a trace with no stats payload. Note that recording query level stats into appStats and txnStats seems to be free.
  • 2% for planning vectorized stats collectors, but without calling getStats on them and, therefore, still not adding stats payload to the trace.
  • 1.5% for calling getStats (i.e. retrieving the execution stats) but still not attaching them to the trace.
  • 1.5% for attaching the stats to the trace (i.e. performing the full action for sampling).
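Taken together, those buckets sum to roughly 1 + 2.5 + 1.5 + 2.5 + 2 + 1.5 + 1.5 ≈ 12.5%, which lines up with the ~13% overall hit.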

@yuzefovich
Member

yuzefovich commented Mar 5, 2021

I again prototyped a few of the things already mentioned above to see their impact on an actual KV95 run here.

  • the first commit removes an extra span created in instrumentationHelper.Setup and uses the already present one. This change seems to reduce the hit by a quarter or so. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -0.8%
      • sample_rate=1: -4.1%
    • AWS:
      • sample_rate=0.5: -3.5%
      • sample_rate=1: -6.7%
  • the second commit trims the trace analyzer to remove currently unused bookkeeping. Surprisingly, this commit actually seems to have made things a bit worse. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -3.0%
      • sample_rate=1: -6.8%
    • AWS:
      • sample_rate=0.5: -4.7%
      • sample_rate=1: -6.2%
  • the third commit pools ComponentStats objects. Note that the second commit made it easier to implement this commit. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -3.2%
      • sample_rate=1: -6.9%
    • AWS:
      • sample_rate=0.5: -2.7%
      • sample_rate=1: -5.0%
  • the fourth commit removes the additional tracing span created by the colbatchscans. Note that in the current form that commit cannot be merged to master because it would cause confusion during the computation of cumulative contention time when there are wrapped rowexec processors in the vectorized flow. The numbers look as follows:
    • GCE:
      • sample_rate=0.5: -3.8%
      • sample_rate=1: -5.4%
    • AWS:
      • sample_rate=0.5: -0.0%
      • sample_rate=1: -2.4%

Based on these results, it looks like the most beneficial thing is avoiding creating spans if possible. Thus, I'll open up a PR to include the first commit and will kick off more extensive benchmarks with it.

@yuzefovich
Member

yuzefovich commented Mar 5, 2021

With #61532 I got the following (3 numbers indicate durations of 5 min, 10 min, and 15 min, respectively):

  • GCE:
    • sample_rate=0.05: +1.6%, -1.4%, +0.3%
    • sample_rate=0.1: +1.6%, -1.0%, +1.3%
    • sample_rate=0.25: -0.2%, -2.8%, -2.6%
    • sample_rate=0.5: -2.5%, -4.5%, -3.0%
    • sample_rate=1: -2.8%, -7.3%, -5.8%
  • AWS:
    • sample_rate=0.05: -2.2%, -0.7%, -0.8%
    • sample_rate=0.1: -0.1%, -0.6%, -1.0%
    • sample_rate=0.25: -1.1%, -1.8%, -2.4%
    • sample_rate=0.5: -3.3%, -3.6%, -3.3%
    • sample_rate=1: -7.4%, -6.8%, -6.3%

@yuzefovich
Member

Reran the tests on 90d6ba4 with #61532 (Tobi suggested that #61359 might improve things). It looks like, without other improvements, sample_rate=0.1 will be the maximum recommended default value. The numbers are for 5 min and 10 min durations:

  • GCE:
    • sample_rate=0.05: -1.8%, -0.4%
    • sample_rate=0.1: -2.0%, -1.1%
    • sample_rate=0.25: -3.1%, -2.0%
    • sample_rate=0.5: -5.1%, -2.7%
    • sample_rate=1: -8.4%, -5.6%
  • AWS:
    • sample_rate=0.05: -2.2%, -0.6%
    • sample_rate=0.1: -1.5%, -1.6%
    • sample_rate=0.25: -3.1%, -3.2%
    • sample_rate=0.5: -5.1%, -4.6%
    • sample_rate=1: -8.9%, -8.6%

@asubiotto
Contributor Author

Let's set it to 0.1. I'll try to circle back later with a better sampling approach (i.e. sample the first occurrence of a fingerprint) in #61678.
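For illustration, a hedged sketch of what "sample the first occurrence of a fingerprint" could look like (the actual design in #61678 may differ): always collect execution stats the first time a statement fingerprint is seen, then fall back to the probabilistic rate.

package main

import (
    "fmt"
    "math/rand"
    "sync"
)

// fingerprintSampler always samples the first execution of each statement
// fingerprint and thereafter samples at the configured rate. Illustrative
// sketch only.
type fingerprintSampler struct {
    mu   sync.Mutex
    seen map[string]struct{}
    rate float64
}

func newFingerprintSampler(rate float64) *fingerprintSampler {
    return &fingerprintSampler{seen: make(map[string]struct{}), rate: rate}
}

func (s *fingerprintSampler) shouldSample(fingerprint string) bool {
    s.mu.Lock()
    defer s.mu.Unlock()
    if _, ok := s.seen[fingerprint]; !ok {
        s.seen[fingerprint] = struct{}{}
        return true // first occurrence: always collect execution stats
    }
    return rand.Float64() < s.rate
}

func main() {
    s := newFingerprintSampler(0.1)
    fmt.Println(s.shouldSample("SELECT k, v FROM kv WHERE k = $1")) // true
    fmt.Println(s.shouldSample("SELECT k, v FROM kv WHERE k = $1")) // ~10% true
}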

craig bot pushed a commit that referenced this issue Mar 9, 2021
61532: sql: use stmt's span for exec stats propagation r=yuzefovich a=yuzefovich

Previously, when sampling the statement, we would always create a new
tracing span. However, there is another span that we can use instead:
we always create a tracing span for each statement in
`connExecutor.execCmd`. That span is not used directly for anything and
is needed because the transactions expect that a span is present in
their context. This commit utilizes the present tracing span for the
sampling purposes which gives us a performance boost (some benchmarks
show that this eliminates about a quarter of the performance overhead
with "always on" sampling).

Addresses: #59379.

Release justification: low-risk update to new functionality.

Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
@irfansharif
Contributor

I've yet to digest the experiments here and on #61328, but looking at TPC-C I'm seeing about 10% additional memory usage just from creating new spans (mostly from connExecutor.execCmd): #59424 (comment). I don't suppose we can do much about that one, can we? I haven't worked out how that affects CPU usage. We're certainly taking on added GC pressure because of it. In any case, the memory overhead seems a bit worrying.

@irfansharif
Contributor

irfansharif commented Mar 10, 2021

Hm, ok, just familiarized myself with the above. I guess that one is pretty necessary. The analysis here and on #61328 looks sound; y'all are way ahead of me. Just a sanity check (especially since the percentages above are from 5-15 min runs): we arrived at 0.1 with GC switched on, right?

@craig craig bot closed this as completed in 7a26119 Mar 10, 2021
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 1, 2021
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was
back when we were creating real tracing spans for all statements, and
for sampled statements, we were propagating additional stats payloads.
Consequently what cockroachdb#59379 ended up measuring (and finding the overhead
acceptable) for was the performance hit we would incur for propagating
stats payloads for statements already using real tracing spans.

Since then, the landscape has changed. Notably we introduced cockroachdb#61777,
which made it so that we were only using real tracing spans for sampled
statements. This was done after performance analysis in cockroachdb#59424 showed
that the use of real tracing spans in all statements resulted in
tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by only
considering the stats payload overhead. What we want now is to also
consider the overhead of using real tracing spans for sampled
statements, vs. not. Doing this analysis gives us a very different
picture for what the default rate should be.

---

To find out what the overhead for sampled statements is currently, we
experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple
benchmark that does little more than one-off statements, so should give
us a concise picture of the sampling overhead. We ran six experiments
in total (each corresponding to a pair of read+write rows), done in
groups of three (each group corresponding to a table below). Each
run in turn is comprised of 10 iterations of kv95, and what's varied
between each run is the default sampling rate. We pin a sampling rate of
0.0 as the baseline that effectively switches off sampling entirely (and
tracing), and measure the throughput degradation as we vary the sampling
rate.

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #1   | 69817.90         | 69406.37
    0.01 / read  / #1   | 69300.35  -0.74% | 68717.23  -0.99%
    0.10 / read  / #1   | 67743.35  -2.97% | 67601.81  -2.60%
    0.00 / write / #1   |  3672.55         |  3653.63
    0.01 / write / #1   |  3647.65  -0.68% |  3615.90  -1.03%
    0.10 / write / #1   |  3567.20  -2.87% |  3558.90  -2.59%

                          ops/sec            ops/sec
    --------------------|------------------|------------------
    rate   op      grp  | median    diff   | mean      diff
    --------------------|------------------|------------------
    0.00 / read  / #2   | 69440.80         | 68893.24
    0.01 / read  / #2   | 69481.55  +0.06% | 69463.13  +0.82% (probably in the noise margin)
    0.10 / read  / #2   | 67841.80  -2.30% | 66992.55  -2.76%
    0.00 / write / #2   |  3652.45         |  3625.24
    0.01 / write / #2   |  3657.55  -0.14% |  3654.34  +0.80%
    0.10 / write / #2   |  3570.75  -2.24% |  3526.04  -2.74%

The results above suggest that the current default rate of 10% is too
high, and a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is
extremely unfortunate. We have ideas for how that can be improved
(prototyped in cockroachdb#62227), but they're much too invasive to backport to
21.1. It's unfortunate that we only discovered the overhead this late in
the development cycle. It was due to two major reasons:
- cockroachdb#59992 landed late in the cycle, and enabled tracing for realsies (by
  propagating real tracing spans across rpc boundaries). We had done
  sanity checking for the tracing overhead before this point, but failed
  to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been
  persistently failing for a myriad of other reasons, so we didn't learn
  until too late that tracing was the latest offender. tpccbench also
  doesn't deal with VM overload well (something cockroachdb#62361 hopes to
  address), and after tracing was enabled for realsies, this was the
  dominant failure mode. This resulted in perf data not making its way
  to roachperf, which further hid possible indicators we had a major
  regression on our hands. We also didn't have a healthy process looking
  at roachperf on a continual basis, something we're looking to rectify
  going forward. We would've picked up on this regression had we been
  closely monitoring the kv95 charts.

Release note: None
craig bot pushed a commit that referenced this issue Apr 2, 2021
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
irfansharif added a commit to irfansharif/cockroach that referenced this issue Apr 2, 2021