sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132

asubiotto · 2021-01-19T11:02:46Z

Depends on #58897
Depends on #59103

This PR puts the "always-on" into always-on EXPLAIN ANALYZE. Take a look at separate commits for details. What actually goes on is that we're taking the slightly safer route of introducing a cluster setting which defines a sample rate for execution stats. These execution stats are collected and propagated using background tracing, which is cheaper than verbose tracing. This allows us to power the new DB Console statement stats views. Currently we still need user input in order to turn up this sample rate.

The sample rate is 0 by default in this PR for safety reasons. I'd like to discuss the default value of this cluster setting or whether we need it at all separately before the 21.1 release, but this gives us a nice escape hatch if for whatever reason stats collection results in poor performance.

Closes #54556

cockroach-teamcity · 2021-01-19T11:02:57Z

This change is

asubiotto · 2021-01-19T11:03:52Z

The first two commits are separate PRs by our KV friends.

awoods187 · 2021-01-19T14:16:37Z

I like starting this at zero and then, with testing, picking the right sampling metric for default for 21.1. cc @Annebirzin we will need to think about what the UI should look like if people turn this off (it means we won't have a number of inputs that power columns/other values).

yuzefovich

I also like the current sampling approach with starting at 0 sampling probability. Left a few nit comments too.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @RaduBerinde)

pkg/roachpb/app_stats.proto, line 95 at r4 (raw file):

  optional NumericStat max_mem_usage = 18 [(gogoproto.nullable) = false];

	// StatCollectionCount keeps track of how many times execution stats were

nit: tabs instead of spaces.

pkg/sql/instrumentation.go, line 44 at r4 (raw file):

	func(f float64) error {
		if f >= 0 && f <= 1 {
			return nil

super nit: I think usually (conventionally?) the "error-path" is inside of the if and the "normal path" is outside.

pkg/sql/instrumentation.go, line 227 at r4 (raw file):

			stmtStats.mu.Lock()
			stmtStats.mu.data.StatCollectionCount++
			// Record trace-related statistics. A count of 1 is passed given that this

nit: "A count of 1 ..." part seems no longer relevant.

asubiotto

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @yuzefovich)

pkg/sql/instrumentation.go, line 44 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

super nit: I think usually (conventionally?) the "error-path" is inside of the if and the "normal path" is outside.

Done.

RaduBerinde

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)

pkg/sql/instrumentation.go, line 165 at r7 (raw file):

	ih.savePlanForStats = appStats.shouldSaveLogicalPlanDescription(fingerprint, implicitTxn)

	ih.collectStats = collectStmtStatsSampleRate.Get(&cfg.Settings.SV) > rand.Float64()

rand.Float64() goes through a global mutex; this is unnecessary synchronization between different queries. We should create a separate rand source in each executor. And we should skip the call altogether if the setting is 0.

asubiotto

Also added an integration test. There's a bug in max memory stat propagation that I'm debugging and will fix in another PR.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)

pkg/sql/instrumentation.go, line 165 at r7 (raw file):

Previously, RaduBerinde wrote…

rand.Float64() goes through a global mutex; this is unnecessary synchronization between different queries. We should create a separate rand source in each executor. And we should skip the call altogether if the setting is 0.

Done.

asubiotto · 2021-01-22T10:55:05Z

Both PRs that this depends on have now been merged.

asubiotto · 2021-01-25T14:13:50Z

Ready for another look.

Annebirzin · 2021-01-25T20:20:14Z

@asubiotto @awoods187 I explored some designs for when sampling is set at 0 (figma designs here)

Would this just apply to the 'rows/bytes' and 'contention' column data?

Also let me know any thoughts on the wording

awoods187 · 2021-01-25T23:57:56Z

I like those designs!

asubiotto · 2021-01-26T09:53:24Z

Nice! It would only apply to contention, max memory, and network.

Annebirzin · 2021-01-26T16:21:00Z

Thanks @asubiotto I updated designs to reflect the contention, max memory and network columns.

Also, since those columns are next to each other, I added a second option where we just show 'sampling off' text in the center column and gray out all 3 cells so there isn't so much repetition. (the other two columns would just be empty)

asubiotto · 2021-01-27T10:00:03Z

Extracted ui changes into a separate, third, commit. @dhartunian, I would be grateful if you could review the third commit (it's pretty small). Friendly ping @RaduBerinde on the first two commits.

RaduBerinde

modulo a couple of comments. Sorry for the delay

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @dhartunian, and @yuzefovich)

pkg/roachpb/app_stats.proto, line 92 at r11 (raw file):

  optional NumericStat bytes_sent_over_network = 17 [(gogoproto.nullable) = false];

  // MaxMemUsage collects the maximum memory usage that occurred on a node.

I assume these statistics cover a certain time range? Can you add some explanation about that to the overall message comment?

pkg/sql/instrumentation.go, line 66 at r10 (raw file):

	collectBundle bool

	// collectStast is set when we are collecting stats for a statement.

[nit] collectStats typo (+ I'd make this collectExecStats).

pkg/sql/instrumentation.go, line 290 at r10 (raw file):

}

// ShouldCollectStats returns true if we should collect statement execution

[nit] I'd make this ShouldCollectExecStats. No need to do the same in other places which are under execinfra scope.

pkg/sql/instrumentation.go, line 182 at r11 (raw file):

			// If we need to collect stats, create a non-verbose child span. Stats
			// will be added as structured metadata and processed in Finish.
			newCtx, ih.sp = tracing.EnsureChildSpan(ctx, cfg.AmbientCtx.Tracer, "traced statement")

I think origCtx needs to be set in this case.. By the way, why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose? (if so, can we change the existing invocation)?

Previously, each processor would make the decision of whether stats should be collected by checking whether recording was enabled on a span obtained through the context. This commit centralizes that decision by moving it into the conn executor's instrumentation helper. Processors now observe a CollectStats bool set on the FlowCtx in order to determine whether they should collect stats or not. Release note: None

This commit adds sql.statement_stats.sample_rate, which defines a probability that a statement will be sampled. This commit also adds MaxMemoryUsage to the StatementStatistics DB Console protobuf. Release note: None

Release note: None

asubiotto

TFTR! No problem

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian and @yuzefovich)

pkg/roachpb/app_stats.proto, line 92 at r11 (raw file):

Previously, RaduBerinde wrote…

I assume these statistics cover a certain time range? Can you add some explanation about that to the overall message comment?

Done.

pkg/sql/instrumentation.go, line 290 at r10 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd make this ShouldCollectExecStats. No need to do the same in other places which are under execinfra scope.

Done. Also updated the StatementStatistics field similarly since it made it clearer.

pkg/sql/instrumentation.go, line 182 at r11 (raw file):
Done.

why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose?

Because we want to keep this as cheap as possible. I think verbose tracing traces a bunch of stuff that we do want to keep in bundles (e.g. the full trace) but are not interested in in this case since we're only interested in stats.

dhartunian

ui changes

Reviewed 2 of 8 files at r13, 3 of 3 files at r14.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

asubiotto · 2021-01-28T15:22:39Z

TFTRs!

bors r=RaduBerinde,dhartunian

craig · 2021-01-28T16:15:21Z

Build succeeded:

GitHub CI (Cockroach)

RaduBerinde · 2021-01-28T16:53:45Z

pkg/sql/instrumentation.go, line 182 at r11 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Done.

why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose?

Because we want to keep this as cheap as possible. I think verbose tracing traces a bunch of stuff that we do want to keep in bundles (e.g. the full trace) but are not interested in in this case since we're only interested in stats.

Makes sense, thanks!

59299: execstats: add cumulative contention time to the TraceAnalyzer r=RaduBerinde a=asubiotto Depends on #59132 Closes #56797 This commit adds the summing of each processor's contention time to the TraceAnalyzer. This stat is returned as a node and query level stat. Additionally, this commit adds a ContentionTime field to the StatementStatistics protobuf used in the DB Console. Release note: None Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>

asubiotto requested review from RaduBerinde and a team January 19, 2021 11:02

asubiotto mentioned this pull request Jan 19, 2021

sql: always on EXPLAIN ANALYZE #54556

Closed

3 tasks

yuzefovich reviewed Jan 19, 2021

View reviewed changes

asubiotto added the do-not-merge bors won't merge a PR with this label. label Jan 20, 2021

asubiotto commented Jan 20, 2021

View reviewed changes

RaduBerinde reviewed Jan 20, 2021

View reviewed changes

asubiotto commented Jan 21, 2021

View reviewed changes

asubiotto removed the do-not-merge bors won't merge a PR with this label. label Jan 22, 2021

asubiotto mentioned this pull request Jan 22, 2021

execstats: add cumulative contention time to the TraceAnalyzer #59299

Merged

asubiotto requested a review from a team January 25, 2021 14:11

asubiotto mentioned this pull request Jan 25, 2021

sql: evaluate sql.txn_stats.sample_rate default #59379

Closed

dhartunian mentioned this pull request Jan 25, 2021

cluster-ui: update statements page cockroachdb/ui#178

Merged

4 tasks

asubiotto requested a review from dhartunian January 27, 2021 09:58

RaduBerinde approved these changes Jan 27, 2021

View reviewed changes

asubiotto added 3 commits January 28, 2021 12:39

sql: add cluster setting to sample execution stats

32f93ec

This commit adds sql.statement_stats.sample_rate, which defines a probability that a statement will be sampled. This commit also adds MaxMemoryUsage to the StatementStatistics DB Console protobuf. Release note: None

ui: update UI to handle newly-added protobuf fields

2615ca9

Release note: None

asubiotto commented Jan 28, 2021

View reviewed changes

dhartunian requested a review from yuzefovich January 28, 2021 15:21

dhartunian approved these changes Jan 28, 2021

View reviewed changes

craig bot merged commit 7d27d9a into cockroachdb:master Jan 28, 2021

asubiotto deleted the psdb branch February 1, 2021 15:49

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132

sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132

asubiotto commented Jan 19, 2021

cockroach-teamcity commented Jan 19, 2021

asubiotto commented Jan 19, 2021

awoods187 commented Jan 19, 2021

yuzefovich left a comment

asubiotto left a comment

RaduBerinde left a comment

asubiotto left a comment

asubiotto commented Jan 22, 2021

asubiotto commented Jan 25, 2021

Annebirzin commented Jan 25, 2021 •

edited

Loading

awoods187 commented Jan 25, 2021

asubiotto commented Jan 26, 2021

Annebirzin commented Jan 26, 2021

asubiotto commented Jan 27, 2021

RaduBerinde left a comment

asubiotto left a comment

dhartunian left a comment

asubiotto commented Jan 28, 2021

craig bot commented Jan 28, 2021

RaduBerinde commented Jan 28, 2021

sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132

sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132

Conversation

asubiotto commented Jan 19, 2021

cockroach-teamcity commented Jan 19, 2021

asubiotto commented Jan 19, 2021

awoods187 commented Jan 19, 2021

yuzefovich left a comment

Choose a reason for hiding this comment

asubiotto left a comment

Choose a reason for hiding this comment

RaduBerinde left a comment

Choose a reason for hiding this comment

asubiotto left a comment

Choose a reason for hiding this comment

asubiotto commented Jan 22, 2021

asubiotto commented Jan 25, 2021

Annebirzin commented Jan 25, 2021 • edited Loading

awoods187 commented Jan 25, 2021

asubiotto commented Jan 26, 2021

Annebirzin commented Jan 26, 2021

asubiotto commented Jan 27, 2021

RaduBerinde left a comment

Choose a reason for hiding this comment

asubiotto left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

asubiotto commented Jan 28, 2021

craig bot commented Jan 28, 2021

RaduBerinde commented Jan 28, 2021

Annebirzin commented Jan 25, 2021 •

edited

Loading