-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132
Conversation
The first two commits are separate PRs by our KV friends. |
I like starting this at zero and then, with testing, picking the right sampling metric for default for 21.1. cc @Annebirzin we will need to think about what the UI should look like if people turn this off (it means we won't have a number of inputs that power columns/other values). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also like the current sampling approach with starting at 0 sampling probability. Left a few nit comments too.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @RaduBerinde)
pkg/roachpb/app_stats.proto, line 95 at r4 (raw file):
optional NumericStat max_mem_usage = 18 [(gogoproto.nullable) = false]; // StatCollectionCount keeps track of how many times execution stats were
nit: tabs instead of spaces.
pkg/sql/instrumentation.go, line 44 at r4 (raw file):
func(f float64) error { if f >= 0 && f <= 1 { return nil
super nit: I think usually (conventionally?) the "error-path" is inside of the if
and the "normal path" is outside.
pkg/sql/instrumentation.go, line 227 at r4 (raw file):
stmtStats.mu.Lock() stmtStats.mu.data.StatCollectionCount++ // Record trace-related statistics. A count of 1 is passed given that this
nit: "A count of 1 ..." part seems no longer relevant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @yuzefovich)
pkg/sql/instrumentation.go, line 44 at r4 (raw file):
Previously, yuzefovich (Yahor Yuzefovich) wrote…
super nit: I think usually (conventionally?) the "error-path" is inside of the
if
and the "normal path" is outside.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)
pkg/sql/instrumentation.go, line 165 at r7 (raw file):
ih.savePlanForStats = appStats.shouldSaveLogicalPlanDescription(fingerprint, implicitTxn) ih.collectStats = collectStmtStatsSampleRate.Get(&cfg.Settings.SV) > rand.Float64()
rand.Float64()
goes through a global mutex; this is unnecessary synchronization between different queries. We should create a separate rand source in each executor. And we should skip the call altogether if the setting is 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also added an integration test. There's a bug in max memory stat propagation that I'm debugging and will fix in another PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)
pkg/sql/instrumentation.go, line 165 at r7 (raw file):
Previously, RaduBerinde wrote…
rand.Float64()
goes through a global mutex; this is unnecessary synchronization between different queries. We should create a separate rand source in each executor. And we should skip the call altogether if the setting is 0.
Done.
Both PRs that this depends on have now been merged. |
Ready for another look. |
@asubiotto @awoods187 I explored some designs for when sampling is set at 0 (figma designs here) Would this just apply to the 'rows/bytes' and 'contention' column data? Also let me know any thoughts on the wording |
I like those designs! |
Nice! It would only apply to contention, max memory, and network. |
Thanks @asubiotto I updated designs to reflect the contention, max memory and network columns. Also, since those columns are next to each other, I added a second option where we just show 'sampling off' text in the center column and gray out all 3 cells so there isn't so much repetition. (the other two columns would just be empty) |
Extracted ui changes into a separate, third, commit. @dhartunian, I would be grateful if you could review the third commit (it's pretty small). Friendly ping @RaduBerinde on the first two commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
modulo a couple of comments. Sorry for the delay
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @dhartunian, and @yuzefovich)
pkg/roachpb/app_stats.proto, line 92 at r11 (raw file):
optional NumericStat bytes_sent_over_network = 17 [(gogoproto.nullable) = false]; // MaxMemUsage collects the maximum memory usage that occurred on a node.
I assume these statistics cover a certain time range? Can you add some explanation about that to the overall message comment?
pkg/sql/instrumentation.go, line 66 at r10 (raw file):
collectBundle bool // collectStast is set when we are collecting stats for a statement.
[nit] collectStats typo (+ I'd make this collectExecStats).
pkg/sql/instrumentation.go, line 290 at r10 (raw file):
} // ShouldCollectStats returns true if we should collect statement execution
[nit] I'd make this ShouldCollectExecStats
. No need to do the same in other places which are under execinfra scope.
pkg/sql/instrumentation.go, line 182 at r11 (raw file):
// If we need to collect stats, create a non-verbose child span. Stats // will be added as structured metadata and processed in Finish. newCtx, ih.sp = tracing.EnsureChildSpan(ctx, cfg.AmbientCtx.Tracer, "traced statement")
I think origCtx needs to be set in this case.. By the way, why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose? (if so, can we change the existing invocation)?
Previously, each processor would make the decision of whether stats should be collected by checking whether recording was enabled on a span obtained through the context. This commit centralizes that decision by moving it into the conn executor's instrumentation helper. Processors now observe a CollectStats bool set on the FlowCtx in order to determine whether they should collect stats or not. Release note: None
This commit adds sql.statement_stats.sample_rate, which defines a probability that a statement will be sampled. This commit also adds MaxMemoryUsage to the StatementStatistics DB Console protobuf. Release note: None
Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! No problem
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian and @yuzefovich)
pkg/roachpb/app_stats.proto, line 92 at r11 (raw file):
Previously, RaduBerinde wrote…
I assume these statistics cover a certain time range? Can you add some explanation about that to the overall message comment?
Done.
pkg/sql/instrumentation.go, line 290 at r10 (raw file):
Previously, RaduBerinde wrote…
[nit] I'd make this
ShouldCollectExecStats
. No need to do the same in other places which are under execinfra scope.
Done. Also updated the StatementStatistics field similarly since it made it clearer.
pkg/sql/instrumentation.go, line 182 at r11 (raw file):
Done.
why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose?
Because we want to keep this as cheap as possible. I think verbose tracing traces a bunch of stuff that we do want to keep in bundles (e.g. the full trace) but are not interested in in this case since we're only interested in stats.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 8 files at r13, 3 of 3 files at r14.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)
TFTRs! bors r=RaduBerinde,dhartunian |
Build succeeded: |
pkg/sql/instrumentation.go, line 182 at r11 (raw file): Previously, asubiotto (Alfonso Subiotto Marqués) wrote…
Makes sense, thanks! |
59299: execstats: add cumulative contention time to the TraceAnalyzer r=RaduBerinde a=asubiotto Depends on #59132 Closes #56797 This commit adds the summing of each processor's contention time to the TraceAnalyzer. This stat is returned as a node and query level stat. Additionally, this commit adds a ContentionTime field to the StatementStatistics protobuf used in the DB Console. Release note: None Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
Depends on #58897
Depends on #59103
This PR puts the "always-on" into always-on EXPLAIN ANALYZE. Take a look at separate commits for details. What actually goes on is that we're taking the slightly safer route of introducing a cluster setting which defines a sample rate for execution stats. These execution stats are collected and propagated using background tracing, which is cheaper than verbose tracing. This allows us to power the new DB Console statement stats views. Currently we still need user input in order to turn up this sample rate.
The sample rate is 0 by default in this PR for safety reasons. I'd like to discuss the default value of this cluster setting or whether we need it at all separately before the 21.1 release, but this gives us a nice escape hatch if for whatever reason stats collection results in poor performance.
Closes #54556