Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: introduce sql.statement_stats.sample_rate to sample execution stats #59132

Merged
merged 3 commits into from
Jan 28, 2021
Merged

Conversation

asubiotto
Copy link
Contributor

Depends on #58897
Depends on #59103

This PR puts the "always-on" into always-on EXPLAIN ANALYZE. Take a look at separate commits for details. What actually goes on is that we're taking the slightly safer route of introducing a cluster setting which defines a sample rate for execution stats. These execution stats are collected and propagated using background tracing, which is cheaper than verbose tracing. This allows us to power the new DB Console statement stats views. Currently we still need user input in order to turn up this sample rate.

The sample rate is 0 by default in this PR for safety reasons. I'd like to discuss the default value of this cluster setting or whether we need it at all separately before the 21.1 release, but this gives us a nice escape hatch if for whatever reason stats collection results in poor performance.

Closes #54556

@asubiotto asubiotto requested review from RaduBerinde and a team January 19, 2021 11:02
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@asubiotto
Copy link
Contributor Author

The first two commits are separate PRs by our KV friends.

@asubiotto asubiotto mentioned this pull request Jan 19, 2021
3 tasks
@awoods187
Copy link
Contributor

I like starting this at zero and then, with testing, picking the right sampling metric for default for 21.1. cc @Annebirzin we will need to think about what the UI should look like if people turn this off (it means we won't have a number of inputs that power columns/other values).

Copy link
Member

@yuzefovich yuzefovich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also like the current sampling approach with starting at 0 sampling probability. Left a few nit comments too.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @RaduBerinde)


pkg/roachpb/app_stats.proto, line 95 at r4 (raw file):

  optional NumericStat max_mem_usage = 18 [(gogoproto.nullable) = false];

	// StatCollectionCount keeps track of how many times execution stats were

nit: tabs instead of spaces.


pkg/sql/instrumentation.go, line 44 at r4 (raw file):

	func(f float64) error {
		if f >= 0 && f <= 1 {
			return nil

super nit: I think usually (conventionally?) the "error-path" is inside of the if and the "normal path" is outside.


pkg/sql/instrumentation.go, line 227 at r4 (raw file):

			stmtStats.mu.Lock()
			stmtStats.mu.data.StatCollectionCount++
			// Record trace-related statistics. A count of 1 is passed given that this

nit: "A count of 1 ..." part seems no longer relevant.

@asubiotto asubiotto added the do-not-merge bors won't merge a PR with this label. label Jan 20, 2021
Copy link
Contributor Author

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @RaduBerinde and @yuzefovich)


pkg/sql/instrumentation.go, line 44 at r4 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

super nit: I think usually (conventionally?) the "error-path" is inside of the if and the "normal path" is outside.

Done.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @asubiotto and @yuzefovich)


pkg/sql/instrumentation.go, line 165 at r7 (raw file):

	ih.savePlanForStats = appStats.shouldSaveLogicalPlanDescription(fingerprint, implicitTxn)

	ih.collectStats = collectStmtStatsSampleRate.Get(&cfg.Settings.SV) > rand.Float64()

rand.Float64() goes through a global mutex; this is unnecessary synchronization between different queries. We should create a separate rand source in each executor. And we should skip the call altogether if the setting is 0.

Copy link
Contributor Author

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also added an integration test. There's a bug in max memory stat propagation that I'm debugging and will fix in another PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @yuzefovich)


pkg/sql/instrumentation.go, line 165 at r7 (raw file):

Previously, RaduBerinde wrote…

rand.Float64() goes through a global mutex; this is unnecessary synchronization between different queries. We should create a separate rand source in each executor. And we should skip the call altogether if the setting is 0.

Done.

@asubiotto asubiotto removed the do-not-merge bors won't merge a PR with this label. label Jan 22, 2021
@asubiotto
Copy link
Contributor Author

Both PRs that this depends on have now been merged.

@asubiotto
Copy link
Contributor Author

Ready for another look.

@Annebirzin
Copy link

Annebirzin commented Jan 25, 2021

@asubiotto @awoods187 I explored some designs for when sampling is set at 0 (figma designs here)

Would this just apply to the 'rows/bytes' and 'contention' column data?

Also let me know any thoughts on the wording

@awoods187
Copy link
Contributor

I like those designs!

@asubiotto
Copy link
Contributor Author

Nice! It would only apply to contention, max memory, and network.

@Annebirzin
Copy link

Thanks @asubiotto I updated designs to reflect the contention, max memory and network columns.

Also, since those columns are next to each other, I added a second option where we just show 'sampling off' text in the center column and gray out all 3 cells so there isn't so much repetition. (the other two columns would just be empty)

Screen Shot 2021-01-26 at 11 17 12 AM
Screen Shot 2021-01-26 at 11 17 02 AM

@asubiotto
Copy link
Contributor Author

Extracted ui changes into a separate, third, commit. @dhartunian, I would be grateful if you could review the third commit (it's pretty small). Friendly ping @RaduBerinde on the first two commits.

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: modulo a couple of comments. Sorry for the delay

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @asubiotto, @dhartunian, and @yuzefovich)


pkg/roachpb/app_stats.proto, line 92 at r11 (raw file):

  optional NumericStat bytes_sent_over_network = 17 [(gogoproto.nullable) = false];

  // MaxMemUsage collects the maximum memory usage that occurred on a node.

I assume these statistics cover a certain time range? Can you add some explanation about that to the overall message comment?


pkg/sql/instrumentation.go, line 66 at r10 (raw file):

	collectBundle bool

	// collectStast is set when we are collecting stats for a statement.

[nit] collectStats typo (+ I'd make this collectExecStats).


pkg/sql/instrumentation.go, line 290 at r10 (raw file):

}

// ShouldCollectStats returns true if we should collect statement execution

[nit] I'd make this ShouldCollectExecStats. No need to do the same in other places which are under execinfra scope.


pkg/sql/instrumentation.go, line 182 at r11 (raw file):

			// If we need to collect stats, create a non-verbose child span. Stats
			// will be added as structured metadata and processed in Finish.
			newCtx, ih.sp = tracing.EnsureChildSpan(ctx, cfg.AmbientCtx.Tracer, "traced statement")

I think origCtx needs to be set in this case.. By the way, why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose? (if so, can we change the existing invocation)?

Previously, each processor would make the decision of whether stats should be
collected by checking whether recording was enabled on a span obtained through
the context. This commit centralizes that decision by moving it into the conn
executor's instrumentation helper. Processors now observe a CollectStats bool
set on the FlowCtx in order to determine whether they should collect stats or
not.

Release note: None
This commit adds sql.statement_stats.sample_rate, which defines a probability
that a statement will be sampled.

This commit also adds MaxMemoryUsage to the StatementStatistics DB Console
protobuf.

Release note: None
Copy link
Contributor Author

@asubiotto asubiotto left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR! No problem

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @dhartunian and @yuzefovich)


pkg/roachpb/app_stats.proto, line 92 at r11 (raw file):

Previously, RaduBerinde wrote…

I assume these statistics cover a certain time range? Can you add some explanation about that to the overall message comment?

Done.


pkg/sql/instrumentation.go, line 290 at r10 (raw file):

Previously, RaduBerinde wrote…

[nit] I'd make this ShouldCollectExecStats. No need to do the same in other places which are under execinfra scope.

Done. Also updated the StatementStatistics field similarly since it made it clearer.


pkg/sql/instrumentation.go, line 182 at r11 (raw file):
Done.

why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose?

Because we want to keep this as cheap as possible. I think verbose tracing traces a bunch of stuff that we do want to keep in bundles (e.g. the full trace) but are not interested in in this case since we're only interested in stats.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ui changes :lgtm:

Reviewed 2 of 8 files at r13, 3 of 3 files at r14.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale) (waiting on @yuzefovich)

@asubiotto
Copy link
Contributor Author

TFTRs!

bors r=RaduBerinde,dhartunian

@craig
Copy link
Contributor

craig bot commented Jan 28, 2021

Build succeeded:

@craig craig bot merged commit 7d27d9a into cockroachdb:master Jan 28, 2021
@RaduBerinde
Copy link
Member


pkg/sql/instrumentation.go, line 182 at r11 (raw file):

Previously, asubiotto (Alfonso Subiotto Marqués) wrote…

Done.

why is there a difference between this and what we're doing below (StartVerboseTrace)? We no longer want to SetVerbose?

Because we want to keep this as cheap as possible. I think verbose tracing traces a bunch of stuff that we do want to keep in bundles (e.g. the full trace) but are not interested in in this case since we're only interested in stats.

Makes sense, thanks!

craig bot pushed a commit that referenced this pull request Feb 1, 2021
59299: execstats: add cumulative contention time to the TraceAnalyzer r=RaduBerinde a=asubiotto

Depends on #59132 
Closes #56797 

This commit adds the summing of each processor's contention time to the
TraceAnalyzer. This stat is returned as a node and query level stat.

Additionally, this commit adds a ContentionTime field to the
StatementStatistics protobuf used in the DB Console.

Release note: None

Co-authored-by: Alfonso Subiotto Marques <alfonso@cockroachlabs.com>
@asubiotto asubiotto deleted the psdb branch February 1, 2021 15:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: always on EXPLAIN ANALYZE
7 participants