
insights: identify statements with high retry counts #86415

Merged 1 commit into cockroachdb:master on Aug 19, 2022
Conversation

@matthewtodd (Contributor) commented on Aug 18, 2022

Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and
replaces it with a "problems" field.[^1][^2] We then mark slow statements
retried more than 5 times as having a "HighRetryCount" problem.[^3] Slow
statements without any specifically identified problems will have
"Unknown" in their list of problems. We will add support for further
problem detection in future commits.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality

Release note (ops change): The new
`sql.insights.high_retry_count.threshold` cluster setting may be used to
configure how many times a slow statement (as identified by the
execution insights system) must have been retried to be marked as having
a high retry count.

Footnotes

[^1]: Every execution insight we offer happens because a statement was
  slow, so the "SlowExecution" concern was the only one we'd ever have.

[^2]: The protocol buffer changes are safe because we haven't deployed
  this code anywhere yet.

[^3]: This threshold may be configured via the
  `sql.insights.high_retry_count.threshold` cluster setting.
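To make the mechanism concrete, here is a minimal sketch of the problem-marking logic described above. All names are hypothetical (the real types live under pkg/sql/sqlstats/insights); it only illustrates the threshold check and the "Unknown" fallback:

    package insights

    // Problem is a sketch of the enum introduced in this change; the value
    // names mirror the proto discussion below but are otherwise illustrative.
    type Problem int

    const (
        Problem_Unknown Problem = iota
        Problem_HighRetryCount
    )

    // problemsFor is a hypothetical helper: given a statement already judged
    // slow, it returns the problems to record. retryThreshold stands in for
    // the sql.insights.high_retry_count.threshold cluster setting.
    func problemsFor(retries, retryThreshold int64) []Problem {
        var problems []Problem
        if retries > retryThreshold {
            problems = append(problems, Problem_HighRetryCount)
        }
        // Slow statements with no specifically identified problem get "Unknown".
        if len(problems) == 0 {
            problems = append(problems, Problem_Unknown)
        }
        return problems
    }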

@matthewtodd requested review from a team on August 18, 2022 19:43
@cockroach-teamcity (Member) commented:

This change is Reviewable

@matthewtodd (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/sqlstats/insights/insights.proto line 25 at r2 (raw file):

  HighWaitTime = 3;
  HighRetryCount = 4;
  FailedExecution = 5;

I'm working through documentation for these values now.

@maryliag (Contributor) left a comment

Reviewed 13 of 13 files at r1, 1 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

	result := false
	for _, d := range a.detectors {
		result = d.isSlow(statement) || result

Can you add a check: if the value is already true, break out of the loop, so you don't need to continue checking?


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

	settings.TenantWritable,
	"sql.insights.high_retry_count.threshold",
	"slow statements retried more than this number of times will be marked as having a high retry count",

Should you add something like "... marked as an insight with high ..."? I just want to clarify that this will be marked in the insights, in case users want to check.

insights: identify statements with high retry counts

Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and
replaces it with a "problems" field.[^1][^2] We then mark slow statements
retried more than 10 times as having a "HighRetryCount" problem.[^3] Slow
statements without any specifically identified problems will have
"Unknown" in their list of problems. We will add support for further
problem detection in future commits.

[^1]: Every execution insight we offer happens because a statement was
  slow, so the "SlowExecution" concern was the only one we'd ever have.

[^2]: The protocol buffer changes are safe because we haven't deployed
  this code anywhere yet.

[^3]: This threshold may be configured via the
  `sql.insights.high_retry_count.threshold` cluster setting.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality

Release note (ops change): The new
`sql.insights.high_retry_count.threshold` cluster setting may be used to
configure how many times a slow statement (as identified by the
execution insights system) must have been retried to be marked as having
a high retry count.
@matthewtodd (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Can you add a check: if the value is already true, break out of the loop, so you don't need to continue checking?

That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
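For readers following along, a minimal sketch of the pattern in question; the type and method names here are illustrative, not the exact ones from detector.go:

    package insights

    type detector interface {
        isSlow(statement string) bool
    }

    type registry struct {
        detectors []detector
    }

    func (r *registry) isSlow(statement string) bool {
        result := false
        for _, d := range r.detectors {
            // Deliberately no early break: a detector's isSlow may also
            // update internal state (the anomaly detector builds up its
            // sense of "normal"), so every detector sees every statement.
            result = d.isSlow(statement) || result
        }
        return result
    }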


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Should you add something like "... marked as an insight with high ..."? I just want to clarify that this will be marked in the insights, in case users want to check.

Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.

I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
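For context, the resulting registration would look roughly like the following sketch. The variable name is illustrative; the description is the wording quoted above, and the default of 10 matches the threshold named in the commit message:

    package insights

    import "github.com/cockroachdb/cockroach/pkg/settings"

    // Illustrative registration of the cluster setting under discussion.
    var HighRetryCountThreshold = settings.RegisterIntSetting(
        settings.TenantWritable,
        "sql.insights.high_retry_count.threshold",
        "the number of retries a slow statement must have undergone for its "+
            "high retry count to be highlighted as a potential problem",
        10,
    )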

@maryliag (Contributor) left a comment

Just one question, otherwise :lgtm:
Great to see there's room left for all the other problems we can just add in 😄

Reviewed 5 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.

ahhh I see what you mean now!


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.

I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?

I like this approach! Thank you for updating!


pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):

  // This statement was slow because of being retried multiple times, again due
  // to contention. The "high" threshold may be configured by the

is a retry always because of contention?

@matthewtodd (Contributor, Author) left a comment

TFTR, @maryliag!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

is a retry always because of contention?

AFAIK? I threw these together quickly from the Problems table in the doc.

@matthewtodd (Contributor, Author) commented:
bors r+

@craig (bot) commented on Aug 19, 2022:

Build failed (retrying...):

@craig (bot) commented on Aug 19, 2022:

Build failed:

@matthewtodd (Contributor, Author) commented:
bors r+

@craig (bot) commented on Aug 19, 2022:

Build succeeded:

@craig (bot) merged commit 8eabfb7 into cockroachdb:master on Aug 19, 2022
@matthewtodd deleted the insights-high-retry-count branch on August 19, 2022 15:22