insights: identify statements with high retry counts #86415
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/sqlstats/insights/insights.proto
line 25 at r2 (raw file):
HighWaitTime = 3;
HighRetryCount = 4;
FailedExecution = 5;
I'm working through documentation for these values now.
Reviewed 13 of 13 files at r1, 1 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)
pkg/sql/sqlstats/insights/detector.go
line 48 at r1 (raw file):
result := false
for _, d := range a.detectors {
	result = d.isSlow(statement) || result
Can you add a check: if the value is already true, break out of the for loop so you don't need to continue checking?
pkg/sql/sqlstats/insights/insights.go
line 86 at r1 (raw file):
settings.TenantWritable,
"sql.insights.high_retry_count.threshold",
"slow statements retried more than this number of times will be marked as having a high retry count",
Should you add something like "... marked as an insight with high ..."? Just want to clarify that this will be marked on the insights, in case they want to check.
Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and replaces it with a "problems" field.[^1][^2] We then mark slow statements retried more than 10 times as having a "HighRetryCount" problem.[^3] Slow statements without any specifically identified problems will have "Unknown" in their list of problems. We will add support for further problem detection in future commits.

[^1]: Every execution insight we offer happens because a statement was slow, so the "SlowExecution" concern was the only one we'd ever have.
[^2]: The protocol buffer changes are safe because we haven't deployed this code anywhere yet.
[^3]: This threshold may be configured via the `sql.insights.high_retry_count.threshold` cluster setting.

Release justification: Category 2: Bug fixes and low-risk updates to new functionality

Release note (ops change): The new `sql.insights.high_retry_count.threshold` cluster setting may be used to configure how many times a slow statement (as identified by the execution insights system) must have been retried to be marked as having a high retry count.
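For illustration only, here is a minimal Go sketch of the "problems" idea described above; the `Problem` constants, `Statement` struct, and `problemsFor` helper are hypothetical names chosen for this sketch, not the actual insights package API.

```go
package insights

// Hypothetical sketch of classifying a slow statement's problems.
// The Problem values and Statement type are assumptions for illustration.
type Problem int

const (
	ProblemUnknown Problem = iota
	ProblemHighRetryCount
)

type Statement struct {
	Retries int64 // number of times the statement was automatically retried
}

// problemsFor returns the problems for a statement already known to be slow.
func problemsFor(stmt *Statement, retryThreshold int64) []Problem {
	var problems []Problem
	if stmt.Retries > retryThreshold {
		problems = append(problems, ProblemHighRetryCount)
	}
	if len(problems) == 0 {
		// Slow statements with no specifically identified problem
		// are reported with "Unknown".
		problems = append(problems, ProblemUnknown)
	}
	return problems
}
```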
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @maryliag)
pkg/sql/sqlstats/insights/detector.go
line 48 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
Can you add a check: if the value is already true, break out of the for loop so you don't need to continue checking?
That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
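To make the trade-off concrete, here is a small Go sketch of the pattern under discussion (the names are illustrative, not the real contents of detector.go): because `result` sits on the right-hand side of the `||`, `d.isSlow` is evaluated for every detector even after one of them has already returned true.

```go
package insights

// Sketch only: these names are illustrative, not the actual insights API.
type statement struct {
	retries          int64
	latencyInSeconds float64
}

type detector interface {
	isSlow(s *statement) bool
}

type compositeDetector struct {
	detectors []detector
}

func (a compositeDetector) isSlow(s *statement) bool {
	result := false
	for _, d := range a.detectors {
		// Deliberately no early break: every detector observes every
		// statement, so stateful detectors (like the anomaly detector)
		// keep building up their sense of what "normal" looks like.
		result = d.isSlow(s) || result
	}
	return result
}
```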
pkg/sql/sqlstats/insights/insights.go
line 86 at r1 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
Should you add something like "... marked as an insight with high ..."? Just want to clarify that this will be marked on the insights, in case they want to check.
Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones.
I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
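For reference, a rough sketch of how the registration might read with that updated description; the exact `settings` API signature and the default value of 10 are assumptions based on the snippets quoted in this review, not the actual file:

```go
package insights

import "github.com/cockroachdb/cockroach/pkg/settings"

// Sketch: the signature and default value here are assumptions, not verified
// against pkg/sql/sqlstats/insights/insights.go.
var HighRetryCountThreshold = settings.RegisterIntSetting(
	settings.TenantWritable,
	"sql.insights.high_retry_count.threshold",
	"the number of retries a slow statement must have undergone for its "+
		"high retry count to be highlighted as a potential problem",
	10,
)
```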
Just one question, otherwise LGTM.
Great to see the room for all other problems we can just add in 😄
Reviewed 5 of 6 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)
pkg/sql/sqlstats/insights/detector.go
line 48 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
That's what the comment is about -- we don't short-circuit the for loop so that all the detectors have the chance to observe the statement. This is important for the (fancy) anomaly detector, so that it gets to build up its sense of normal.
ahhh I see what you mean now!
pkg/sql/sqlstats/insights/insights.go
line 86 at r1 (raw file):
Previously, matthewtodd (Matthew Todd) wrote…
Yeah, I'm playing with how to use the names. Our insight is that this statement was slow because of being retried a bunch. I nodded toward that with the phrase "slow statements" here, leaning on the insights namespace in the cluster setting to provide the context. But I think we can do more to make this clear, especially calling out that we won't be identifying all highly-retried statements, just the slow ones. I've just updated to "the number of retries a slow statement must have undergone for its high retry count to be highlighted as a potential problem," leaning on the word "problem" from the insights proto / table. WDYT?
I like this approach! Thank you for updating!
pkg/sql/sqlstats/insights/insights.proto
line 36 at r4 (raw file):
// This statement was slow because of being retried multiple times, again due
// to contention. The "high" threshold may be configured by the
Is a retry always because of contention?
TFTR, @maryliag!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @maryliag and @matthewtodd)
pkg/sql/sqlstats/insights/insights.proto
line 36 at r4 (raw file):
Previously, maryliag (Marylia Gutierrez) wrote…
Is a retry always because of contention?
AFAIK? I threw these together quickly from the Problems table in the doc.
bors r+
Build failed (retrying...):
Build failed:
bors r+
Build succeeded:
Addresses #85827.

This change undoes the previous "concerns" concept from #85345 and replaces it with a "problems" field.[^1][^2] We then mark slow statements retried more than 5 times as having a "HighRetryCount" problem.[^3] Slow statements without any specifically identified problems will have "Unknown" in their list of problems. We will add support for further problem detection in future commits.

Release justification: Category 2: Bug fixes and low-risk updates to new functionality

Release note (ops change): The new `sql.insights.high_retry_count.threshold` cluster setting may be used to configure how many times a slow statement (as identified by the execution insights system) must have been retried to be marked as having a high retry count.

[^1]: Every execution insight we offer happens because a statement was slow, so the "SlowExecution" concern was the only one we'd ever have.
[^2]: The protocol buffer changes are safe because we haven't deployed this code anywhere yet.
[^3]: This threshold may be configured via the `sql.insights.high_retry_count.threshold` cluster setting.