
insights: identify statements with high retry counts #86415

Merged 1 commit into cockroachdb:master on Aug 19, 2022
Conversation

@matthewtodd (Contributor) commented on Aug 18, 2022

Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and
replaces it with a "problems" field.[^1][^2] We then mark slow statements
retried more than 5 times as having a "HighRetryCount" problem.[^3] Slow
statements without any specifically identified problems will have
"Unknown" in their list of problems. We will add support for further
problem detection in future commits.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality

Release note (ops change): The new
`sql.insights.high_retry_count.threshold` cluster setting may be used to
configure how many times a slow statement (as identified by the
execution insights system) must have been retried to be marked as having
a high retry count.

Footnotes

[^1]: Every execution insight we offer happens because a statement was
  slow, so the "SlowExecution" concern was the only one we'd ever have.

[^2]: The protocol buffer changes are safe because we haven't deployed
  this code anywhere yet.

[^3]: This threshold may be configured via the
  `sql.insights.high_retry_count.threshold` cluster setting.
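To make the mechanism concrete, here is a minimal sketch of the problem-marking logic described above. All names are hypothetical (the real types live under pkg/sql/sqlstats/insights); it only illustrates the threshold check and the "Unknown" fallback:

    package insights

    // Problem is a sketch of the enum introduced in this change; the value
    // names mirror the proto discussion below but are otherwise illustrative.
    type Problem int

    const (
        Problem_Unknown Problem = iota
        Problem_HighRetryCount
    )

    // problemsFor is a hypothetical helper: given a statement already judged
    // slow, it returns the problems to record. retryThreshold stands in for
    // the sql.insights.high_retry_count.threshold cluster setting.
    func problemsFor(retries, retryThreshold int64) []Problem {
        var problems []Problem
        if retries > retryThreshold {
            problems = append(problems, Problem_HighRetryCount)
        }
        // Slow statements with no specifically identified problem get "Unknown".
        if len(problems) == 0 {
            problems = append(problems, Problem_Unknown)
        }
        return problems
    }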

@matthewtodd requested review from a team on August 18, 2022 19:43
@cockroach-teamcity (Member) commented:

This change is Reviewable

@matthewtodd (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/sqlstats/insights/insights.proto line 25 at r2 (raw file):

  HighWaitTime = 3;
  HighRetryCount = 4;
  FailedExecution = 5;

I'm working through documentation for these values now.

@maryliag (Contributor) left a comment

Reviewed 13 of 13 files at r1, 1 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

	result := false
	for _, d := range a.detectors {
		result = d.isSlow(statement) || result

Can you add a check: if the value is already true, break out of the loop, so you don't need to continue checking?


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

	settings.TenantWritable,
	"sql.insights.high_retry_count.threshold",
	"slow statements retried more than this number of times will be marked as having a high retry count",

Should you add something like "... marked as an insight with high ..."? I just want to clarify that this will be marked in the insights, in case users want to check.

insights: identify statements with high retry counts

Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and
replaces it with a "problems" field.[^1][^2] We then mark slow statements
retried more than 10 times as having a "HighRetryCount" problem.[^3] Slow
statements without any specifically identified problems will have
"Unknown" in their list of problems. We will add support for further
problem detection in future commits.

[^1]: Every execution insight we offer happens because a statement was
  slow, so the "SlowExecution" concern was the only one we'd ever have.

[^2]: The protocol buffer changes are safe because we haven't deployed
  this code anywhere yet.

[^3]: This threshold may be configured via the
  `sql.insights.high_retry_count.threshold` cluster setting.

Release justification: Category 2: Bug fixes and low-risk updates to new
functionality

Release note (ops change): The new
`sql.insights.high_retry_count.threshold` cluster setting may be used to
configure how many times a slow statement (as identified by the
execution insights system) must have been retried to be marked as having
a high retry count.
@matthewtodd (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Can you add a check: if the value is already true, break out of the loop, so you don't need to continue checking?

That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
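For readers following along, a minimal sketch of the pattern in question; the type and method names here are illustrative, not the exact ones from detector.go:

    package insights

    type detector interface {
        isSlow(statement string) bool
    }

    type registry struct {
        detectors []detector
    }

    func (r *registry) isSlow(statement string) bool {
        result := false
        for _, d := range r.detectors {
            // Deliberately no early break: a detector's isSlow may also
            // update internal state (the anomaly detector builds up its
            // sense of "normal"), so every detector sees every statement.
            result = d.isSlow(statement) || result
        }
        return result
    }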


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

Should you add something like "... marked as an insight with high ..."? I just want to clarify that this will be marked in the insights, in case users want to check.

Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.

I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
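For context, the resulting registration would look roughly like the following sketch. The variable name is illustrative; the description is the wording quoted above, and the default of 10 matches the threshold named in the commit message:

    package insights

    import "github.com/cockroachdb/cockroach/pkg/settings"

    // Illustrative registration of the cluster setting under discussion.
    var HighRetryCountThreshold = settings.RegisterIntSetting(
        settings.TenantWritable,
        "sql.insights.high_retry_count.threshold",
        "the number of retries a slow statement must have undergone for its "+
            "high retry count to be highlighted as a potential problem",
        10,
    )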

@maryliag (Contributor) left a comment

Just one question, otherwise :lgtm:
Great to see there's room left for all the other problems we can just add in 😄

Reviewed 5 of 6 files at r4, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/detector.go line 48 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.

ahhh I see what you mean now!


pkg/sql/sqlstats/insights/insights.go line 86 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.

I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?

I like this approach! Thank you for updating!


pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):

  // This statement was slow because of being retried multiple times, again due
  // to contention. The "high" threshold may be configured by the

is a retry always because of contention?

@matthewtodd (Contributor, Author) left a comment

TFTR, @maryliag!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)


pkg/sql/sqlstats/insights/insights.proto line 36 at r4 (raw file):

Previously, maryliag (Marylia Gutierrez) wrote…

is a retry always because of contention?

AFAIK? I threw these together quickly from the Problems table in the doc.

@matthewtodd (Contributor, Author) commented:
bors r+

@craig (bot) commented on Aug 19, 2022:

Build failed (retrying...):

@craig (bot) commented on Aug 19, 2022:

Build failed:

@matthewtodd (Contributor, Author) commented:
bors r+

@craig (bot) commented on Aug 19, 2022:

Build succeeded:

@craig (bot) merged commit 8eabfb7 into cockroachdb:master on Aug 19, 2022
@matthewtodd deleted the insights-high-retry-count branch on August 19, 2022 15:22