sql: evaluate sql.txn_stats.sample_rate default #59379
Taking this over.
Looks like there are two main time usages there: … that is being called by
cockroach/pkg/sql/flowinfra/outbox.go Lines 221 to 225 in 5fd21c2
which makes it seem as though tags are somehow used to transport stats, which is not a good idea in the hot path. I think this is legacy code, but see if things like that are the problem.
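For illustration, here is a minimal, self-contained Go sketch (toy types, not CockroachDB's actual tracing package) of why carrying stats as span tags is costly on the hot path: every tag forces an eager stringification per operation, whereas recording the stats as a structured payload defers all formatting to whoever later consumes the trace.

```go
package main

import "fmt"

// execStats is a stand-in for a per-operator statistics payload;
// the field names are illustrative, not the actual protobuf.
type execStats struct {
	RowsRead   int64
	BytesRead  int64
	MaxMemUsed int64
}

// span is a toy tracing span, not the real util/tracing.Span.
type span struct {
	tags       map[string]string
	structured []interface{}
}

// recordStatsAsTags mimics the legacy approach: every stat is
// stringified eagerly, on the hot path, even if nobody ever reads it.
func recordStatsAsTags(sp *span, s execStats) {
	sp.tags["rows.read"] = fmt.Sprintf("%d", s.RowsRead)
	sp.tags["bytes.read"] = fmt.Sprintf("%d", s.BytesRead)
	sp.tags["mem.max"] = fmt.Sprintf("%d", s.MaxMemUsed)
}

// recordStatsStructured mimics the replacement: the payload is attached
// as-is and only rendered if and when the recording is consumed.
func recordStatsStructured(sp *span, s execStats) {
	sp.structured = append(sp.structured, s)
}

func main() {
	sp := &span{tags: map[string]string{}}
	st := execStats{RowsRead: 128, BytesRead: 4096, MaxMemUsed: 1 << 20}
	recordStatsAsTags(sp, st)     // three allocations plus formatting per call
	recordStatsStructured(sp, st) // one append, no formatting
	fmt.Println(sp.tags, sp.structured)
}
```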
This avoids some of the obviously problematic expensive stringifications observed in cockroachdb#59379. Release justification: low-risk performance optimization that unblocks sql stats. Release note: None
I couldn't really see the same kind of drop on a local three-node roachprod cluster running stock …
Yeah, we've moved away from using tags in the spans (in favor of …)
Thanks for looking into this Tobi! Replying to your questions from #59379 (comment).
I don't think so; we only use verbose spans when tracing is enabled for the whole session or with EXPLAIN ANALYZE. I hope I'm not reading the code wrong here. Confirmed manually on my laptop that …
Ack. The code you linked gets executed only if verbose tracing is enabled. I did find a few things (#61158) where we would be setting tags in non-verbose tracing. Rerunning my benchmarks with both #61116 and #61158 cherry-picked, I see an improvement, but the hit with tracing "always on" is still very high - on the order of …
I haven't tried any of the local benchmarks yet, but I'll post here if I find something useful.
61041: multiregionccl: gate multi-region database configuration behind a license r=ajstorm a=otan

Resolves #59668

Release justification: low risk changes to new functionality
Release note (enterprise change): Multi-region database creations are permitted as long as the cluster has a CockroachDB subscription.

61158: tracing: do not set tags when setting stats r=yuzefovich a=yuzefovich

**colflow: do not set redundant tag on the tracing span**

This is unnecessary given that we're setting the componentID below that includes the flowID.

Release justification: low-risk update to new functionality.
Release note: None

**tracing: do not set tags when setting stats**

We no longer need to set tags on the tracing span in order to propagate stats since we now propagate that info as a whole object (at the moment, twice - as a Structured payload and as DeprecatedStats protobuf; the latter will be removed after 21.1 branch is cut).

Addresses: #59379.

Release justification: low-risk update to new functionality.
Release note: None

Co-authored-by: Oliver Tan <otan@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
61116: tracing: elide tags from recording if not verbose r=tbg a=tbg

This avoids some of the obviously problematic expensive stringifications observed in #59379.

Release justification: low-risk performance optimization that unblocks sql stats.
Release note: None

Co-authored-by: Tobias Grieger <tobias.b.grieger@gmail.com>
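A hedged sketch of the elision idea in #61116, using toy types rather than the real tracing ones: tags are only stringified into the recording when the span is verbose, so non-verbose spans skip that work entirely.

```go
package main

import "fmt"

// recordedSpan is a toy stand-in for a span recording entry, not the
// real tracingpb.RecordedSpan.
type recordedSpan struct {
	Operation string
	Tags      map[string]string
}

// toyTag mirrors a lazily stringified tag value.
type toyTag struct {
	key   string
	value interface{}
}

// getRecording only pays for tag stringification when verbose is set,
// which is the essence of the #61116 optimization.
func getRecording(op string, verbose bool, tags []toyTag) recordedSpan {
	rs := recordedSpan{Operation: op}
	if !verbose {
		return rs // tags elided: no map allocation, no fmt.Sprint calls
	}
	rs.Tags = make(map[string]string, len(tags))
	for _, t := range tags {
		rs.Tags[t.key] = fmt.Sprint(t.value)
	}
	return rs
}

func main() {
	tags := []toyTag{{"node", 1}, {"flowid", "abc"}}
	fmt.Println(getRecording("outbox", false, tags)) // cheap, non-verbose path
	fmt.Println(getRecording("outbox", true, tags))  // verbose path
}
```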
The numbers I got yesterday with both #61158 and #61116 cherry-picked with 3 min ramp and 15 min duration:
Nice! That looks promising cc @awoods187 @jordanlewis. What's the bottleneck now?
This is interesting. Previously, when running TPC-C with 10 warehouses via roachprod local, I had to push the rate up to 0.5 to see meaningful values for network and memory. I'm concerned that too low a sampling value will result in these columns being largely empty, while too high a value will cause significant performance degradation. Even a value of 0.15 impacts the cluster by 2.7%, something we haven't been willing to accept in the past for optimizer or other execution changes.
I implemented all of the hacky ideas mentioned so far in https://github.com/yuzefovich/cockroach/commits/sample-rate; however, this didn't change the microbenchmark numbers nor the KV95 numbers on a 3-node cluster. When looking at the profiles of the 3-node cluster with …
At this point, I don't know what else to try to reduce the performance hit. Curious to hear whether @tbg or @irfansharif have some suggestions. I've been using this script (no judgement please, I'm a bash noob :D) to run the KV95 benchmarks if anyone is interested in running it themselves.
The meta question that I have is why we're blaming tracing right now. My experiments over in #61328 (comment) (which I'd like you to scrutinize/confirm! I might have messed something up) indicate that the overhead of stats collection related to marshalling, transporting, and unmarshalling the extra payloads should be in the single-digit microsecond range, but I see more like 40 microseconds in the single-in-mem-server benchmark. Let's discuss over on #61328.
It's possible that something else is slowing this down. The …
Note that …
61380: colflow: clean up vectorized stats propagation r=yuzefovich a=yuzefovich

Previously, in order to propagate execution statistics we were creating temporary tracing spans, setting the stats on them, and finishing the spans right away. This allowed for using (to be more precise, abusing) the existing infrastructure. The root of the problem is that in the vectorized engine we do not start a per-operator span if stats collection is enabled at the moment, so we had to get around that limitation.

However, this is not how tracing spans are intended to be used, and it creates some performance slowdown in the hot path, so this commit refactors the situation. Now we ensure that there is always a tracing span available at the "root" components (either the root materializer or an outbox), so when the root components finish the vectorized stats collectors for their subtree of operators, there is a span to record the stats into. This required the following minor adjustments:

- in the materializer, we now delegate attachment of the stats to the tracing span to the drainHelper (which does so on `ConsumerDone`). Note that the drainHelper doesn't get the recording from the span and leaves that to the materializer (this is needed in order to avoid collecting duplicate trace data).
- in the outbox, we now start a "remote child span" (if there is a span in the parent context) at the beginning of the `Run` method, and we attach the stats in `sendMetadata`.

Addresses: #59379.
Fixes: #59555.

Release justification: low-risk update to existing functionality.
Release note: None

61412: sql: clean up planNodeToRowSource r=yuzefovich a=yuzefovich

This commit removes some redundant things that were kept during ccc5a8a. Namely, `planNodeToRowSource` doesn't need to track whether it was started or not now that `startExec` is called in `Start`. This also allows us to remove the override of the `InternalClose` method.

Release justification: low-risk update to existing functionality.
Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
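To make the refactor above concrete, here is a minimal sketch (toy types, not the actual colflow/tracing code) of the idea: the root component owns a single long-lived span, and the stats collectors for its subtree record into that span instead of opening and immediately finishing throwaway spans per operator.

```go
package main

import "fmt"

// componentStats is a toy stand-in for execinfrapb.ComponentStats.
type componentStats struct {
	Component string
	Rows      int64
}

// span is a toy stand-in for a tracing span owned by the root component
// (materializer or outbox).
type span struct {
	op         string
	structured []componentStats
}

func (s *span) RecordStructured(cs componentStats) {
	s.structured = append(s.structured, cs)
}

// statsCollector holds the stats gathered for a subtree of operators and
// hands them to whichever span the root component already has open.
type statsCollector struct {
	children []componentStats
}

func (c *statsCollector) attachTo(sp *span) {
	for _, cs := range c.children {
		sp.RecordStructured(cs)
	}
}

func main() {
	// The root component's span is created once, up front (e.g. at the
	// start of the outbox's Run), and reused for all of its operators.
	root := &span{op: "outbox"}
	col := &statsCollector{children: []componentStats{
		{Component: "colbatchscan", Rows: 100},
		{Component: "ordered-agg", Rows: 10},
	}}
	col.attachTo(root)
	fmt.Println(root.structured)
}
```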
61529: sql: save flow diagrams only when needed r=yuzefovich a=yuzefovich

Previously, whenever we needed to save flows information, we would always generate flow diagrams. However, those are not used when we are sampling statements and become unnecessary work. This commit updates the default `saveFlows` function to only generate flow diagrams when needed (when we're collecting a stmt bundle or when we're running EXPLAIN ANALYZE).

Addresses: #59379.

Release justification: low-risk update to new functionality.
Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
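A small sketch of the pattern described in #61529, with hypothetical names (`makeSaveFlows`, `buildDiagram`) standing in for the real callback: the expensive diagram generation only runs when its result will actually be consumed.

```go
package main

import "fmt"

// flowsMetadata and diagram are illustrative placeholders for the real
// flow information and its rendered diagram.
type flowsMetadata struct{ numFlows int }
type diagram string

// buildDiagram is a stand-in for the expensive diagram generation.
func buildDiagram(m flowsMetadata) diagram {
	return diagram(fmt.Sprintf("diagram-for-%d-flows", m.numFlows))
}

// makeSaveFlows returns a callback that only pays for diagram generation
// when the result will be used (stmt bundle or EXPLAIN ANALYZE).
func makeSaveFlows(collectBundle, explainAnalyze bool) func(flowsMetadata) (diagram, bool) {
	return func(m flowsMetadata) (diagram, bool) {
		if !collectBundle && !explainAnalyze {
			return "", false // merely sampled statements skip the expensive work
		}
		return buildDiagram(m), true
	}
}

func main() {
	save := makeSaveFlows(false /* collectBundle */, false /* explainAnalyze */)
	d, ok := save(flowsMetadata{numFlows: 3})
	fmt.Println(d, ok) // "" false: no diagram built for a sampled-only stmt
}
```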
Quick update: with #61529 cherry-picked, the performance hit of …
In my experiments on df11d41 with Tobi's commits from #61328 cherry-picked (but without my WIP commits), I observe the following explanation for the current 13% perf hit in the macro benchmark (KV/Scan, all numbers are approximate):
I again prototyped a few of the things mentioned before to see their impact on an actual KV95 run here.
Based on these results, it looks like the most beneficial thing is avoiding creating spans if possible. Thus, I'll open up a PR to include the first commit and will kick off more extensive benchmarks with it.
With #61532 I got the following (3 numbers indicate durations of 5 min, 10 min, and 15 min, respectively):
Reran the tests on 90d6ba4 with #61532 (Tobi suggested that #61359 might improve things). Looks like without other improvements …
Let's set it to 0.1. I'll try to circle back later with a better sampling approach (i.e. sample the first occurrence of a fingerprint) in #61678. |
61532: sql: use stmt's span for exec stats propagation r=yuzefovich a=yuzefovich

Previously, when sampling the statement, we would always create a new tracing span. However, there is another span that we can use instead: we always create a tracing span for each statement in `connExecutor.execCmd`. That span is not used directly for anything and is needed because the transactions expect that a span is present in their context. This commit utilizes the present tracing span for the sampling purposes, which gives us a performance boost (some benchmarks show that this eliminates about a quarter of the performance overhead with "always on" sampling).

Addresses: #59379.

Release justification: low-risk update to new functionality.
Release note: None

Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
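A minimal sketch of the reuse idea in #61532, using a toy span type and a hypothetical context key rather than the real tracing API: if the statement's context already carries a span, sampling records into it instead of creating a new one.

```go
package main

import (
	"context"
	"fmt"
)

// span is a toy stand-in for the tracing span already opened per
// statement in connExecutor.execCmd.
type span struct{ op string }

type spanKey struct{}

// spanFromContext returns the statement's span if one is already there.
func spanFromContext(ctx context.Context) *span {
	sp, _ := ctx.Value(spanKey{}).(*span)
	return sp
}

// spanForSampling reuses the statement's existing span when possible and
// only creates a fresh one as a fallback - the gist of #61532.
func spanForSampling(ctx context.Context) (_ *span, created bool) {
	if sp := spanFromContext(ctx); sp != nil {
		return sp, false
	}
	return &span{op: "sampled-stmt"}, true
}

func main() {
	ctx := context.WithValue(context.Background(), spanKey{}, &span{op: "exec stmt"})
	sp, created := spanForSampling(ctx)
	fmt.Println(sp.op, created) // reuses "exec stmt", created == false
}
```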
I've yet to digest the experiments here and on #61328, but looking at TPC-C I'm seeing about 10% additional memory usage by just us creating new spans (mostly from …)
Hm, ok, just familiarized myself with the above. I guess that one is pretty necessary. The analysis here and on #61328 looks sound; y'all are way ahead of me. Just a sanity check (especially since the %s above are from 5-15m runs): we arrived at 0.1 with GC switched on, right?
We arrived at the previous default rate of 10% back in cockroachdb#59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently, what cockroachdb#59379 ended up measuring (and finding acceptable) was the performance hit we would incur for propagating stats payloads for statements that were already using real tracing spans.

Since then, the landscape has changed. Notably, we introduced cockroachdb#61777, which made it so that we only use real tracing spans for sampled statements. This was done after performance analysis in cockroachdb#59424 showed that the use of real tracing spans for all statements resulted in tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by considering only the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture of what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so it should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between runs is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling (and tracing) entirely, and measure the throughput degradation as we vary the sampling rate.

                     | ops/sec           | ops/sec
 rate / op / grp     | median     diff   | mean       diff
---------------------|-------------------|------------------
 0.00 / read  / #1   | 69817.90          | 69406.37
 0.01 / read  / #1   | 69300.35   -0.74% | 68717.23   -0.99%
 0.10 / read  / #1   | 67743.35   -2.97% | 67601.81   -2.60%
 0.00 / write / #1   |  3672.55          |  3653.63
 0.01 / write / #1   |  3647.65   -0.68% |  3615.90   -1.03%
 0.10 / write / #1   |  3567.20   -2.87% |  3558.90   -2.59%

                     | ops/sec           | ops/sec
 rate / op / grp     | median     diff   | mean       diff
---------------------|-------------------|------------------
 0.00 / read  / #2   | 69440.80          | 68893.24
 0.01 / read  / #2   | 69481.55   +0.06% | 69463.13   +0.82%  (probably in the noise margin)
 0.10 / read  / #2   | 67841.80   -2.30% | 66992.55   -2.76%
 0.00 / write / #2   |  3652.45          |  3625.24
 0.01 / write / #2   |  3657.55   -0.14% |  3654.34   +0.80%
 0.10 / write / #2   |  3570.75   -2.24% |  3526.04   -2.74%

The results above suggest that the current default rate of 10% is too high and that a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in cockroachdb#62227), but they're much too invasive to backport to 21.1. It's unfortunate that we only discovered the overhead this late in the development cycle. That was due to two major reasons:

- cockroachdb#59992 landed late in the cycle and enabled tracing for realsies (by propagating real tracing spans across RPC boundaries). We had done sanity checking of the tracing overhead before this point, but failed to realize that cockroachdb#59992 would merit re-analysis.
- The test that alerted us to the degradation (tpccbench) had been persistently failing for a myriad of other reasons, so we didn't learn until too late that tracing was the latest offender. tpccbench also doesn't deal with VM overload well (something cockroachdb#62361 hopes to address), and after tracing was enabled for realsies, this was the dominant failure mode. This resulted in perf data not making its way to roachperf, which further hid possible indicators that we had a major regression on our hands. We also didn't have a healthy process for looking at roachperf on a continual basis, something we're looking to rectify going forward. We would've picked up on this regression had we been closely monitoring the kv95 charts.

Release note: None
62998: sql: lower default sampling rate to 1% r=irfansharif a=irfansharif

We arrived at the previous default rate of 10% back in #59379. This was back when we were creating real tracing spans for all statements, and for sampled statements, we were propagating additional stats payloads. Consequently, what #59379 ended up measuring (and finding acceptable) was the performance hit we would incur for propagating stats payloads for statements that were already using real tracing spans.

Since then, the landscape has changed. Notably, we introduced #61777, which made it so that we only use real tracing spans for sampled statements. This was done after performance analysis in #59424 showed that the use of real tracing spans for all statements resulted in tremendous overhead, for no real benefit.

What this now leaves us with is a sampling rate that was tuned by considering only the stats payload overhead. What we want now is to also consider the overhead of using real tracing spans for sampled statements, vs. not. Doing this analysis gives us a very different picture of what the default rate should be.

---

To find out what the overhead for sampled statements currently is, we experimented with kv95/enc=false/nodes=1/cpu=32. It's a simple benchmark that does little more than one-off statements, so it should give us a concise picture of the sampling overhead. We ran six experiments in total (each corresponding to a pair of read+write rows), done in groups of three (each group corresponding to a table below). Each run in turn is comprised of 10 iterations of kv95, and what's varied between runs is the default sampling rate. We pin a sampling rate of 0.0 as the baseline that effectively switches off sampling (and tracing) entirely, and measure the throughput degradation as we vary the sampling rate.

                     | ops/sec           | ops/sec
 rate / op / grp     | median     diff   | mean       diff
---------------------|-------------------|------------------
 0.00 / read  / #1   | 69817.90          | 69406.37
 0.01 / read  / #1   | 69300.35   -0.74% | 68717.23   -0.99%
 0.10 / read  / #1   | 67743.35   -2.97% | 67601.81   -2.60%
 0.00 / write / #1   |  3672.55          |  3653.63
 0.01 / write / #1   |  3647.65   -0.68% |  3615.90   -1.03%
 0.10 / write / #1   |  3567.20   -2.87% |  3558.90   -2.59%

                     | ops/sec           | ops/sec
 rate / op / grp     | median     diff   | mean       diff
---------------------|-------------------|------------------
 0.00 / read  / #2   | 69440.80          | 68893.24
 0.01 / read  / #2   | 69481.55   +0.06% | 69463.13   +0.82%  (probably in the noise margin)
 0.10 / read  / #2   | 67841.80   -2.30% | 66992.55   -2.76%
 0.00 / write / #2   |  3652.45          |  3625.24
 0.01 / write / #2   |  3657.55   -0.14% |  3654.34   +0.80%
 0.10 / write / #2   |  3570.75   -2.24% |  3526.04   -2.74%

The results above suggest that the current default rate of 10% is too high and that a 1% rate is much more acceptable.

---

The fact that the cost of sampling is largely dominated by tracing is extremely unfortunate. We have ideas for how that can be improved (prototyped in #62227), but they're much too invasive to backport to 21.1.

Release note: None

Co-authored-by: irfan sharif <irfanmahmoudsharif@gmail.com>
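For reference, the diff columns in the tables above are plain percent changes against the 0.00 baseline; a small sketch reproducing two of the read numbers from group #1:

```go
package main

import "fmt"

// pctDiff returns the percent change of v relative to the baseline,
// matching the "median diff" column above.
func pctDiff(baseline, v float64) float64 {
	return (v - baseline) / baseline * 100
}

func main() {
	// Numbers from the read rows of group #1 above.
	baseline := 69817.90
	fmt.Printf("rate 0.01: %+.2f%%\n", pctDiff(baseline, 69300.35)) // about -0.74%
	fmt.Printf("rate 0.10: %+.2f%%\n", pctDiff(baseline, 67743.35)) // about -2.97%
}
```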
#59132 adds the `sql.txn_stats.sample_rate` cluster setting. The default setting is 0, which means never sample stats. However, we're soon going to change the DB Console page to show these sampled execution stats, which means that this setting has to be nonzero to show interesting information. This issue tracks quantifying the performance effect of turning this setting up to always sample and seeing if it is realistic to remove this setting at all. If not, we might want to have a low but nonzero sample rate to provide some interesting data.
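For context, the setting acts as a per-statement sampling probability. A minimal sketch of such a decision, with an illustrative function name rather than the actual connExecutor code:

```go
package main

import (
	"fmt"
	"math/rand"
)

// shouldSampleExecStats makes the per-statement decision: with
// sampleRate = 0 stats are never collected, with 1 they always are.
func shouldSampleExecStats(rng *rand.Rand, sampleRate float64) bool {
	return sampleRate > 0 && rng.Float64() < sampleRate
}

func main() {
	rng := rand.New(rand.NewSource(42))
	const rate = 0.1 // e.g. sql.txn_stats.sample_rate = 0.1
	sampled := 0
	for i := 0; i < 100000; i++ {
		if shouldSampleExecStats(rng, rate) {
			sampled++
		}
	}
	fmt.Printf("sampled %d of 100000 statements (expected ~%.0f%%)\n", sampled, rate*100)
}
```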