sqlstats: record idle latency for statements #91098

matthewtodd · 2022-11-01T21:53:41Z

In an attempt to identify application callsites holding transactions open while doing adjacent work, we measure a per-statement fingerprint "idle latency" and surface it in the sqlstats tables.

Developers may then search for statements with higher-than-usual idle latencies to find slow spots in their applications.

Release note (sql change): A new NumericStat, idleLat, was introduced to the statistics column of crdb_internal.statement_statistics, reporting the time spent waiting for the client to send the statement while holding a transaction open. Developers may use this stat to identify opportunities for restructuring their apps (perhaps moving unrelated work out of the transaction or splitting it into smaller transactions) to reduce contention.

cockroach-teamcity · 2022-11-01T21:53:49Z

This change is

rafiss

the approach looks good to me! i had a few nits, and also i think we should expose this in the transaction stats in the DB internals as well.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter and @matthewtodd)

-- commits line 13 at r1:
nit: it should be Release note (sql change) and include something about how a user can access this new data

pkg/roachpb/app_stats.go line 122 at r1 (raw file):

// Add combines other into TransactionStatistics.
func (t *TransactionStatistics) Add(other *TransactionStatistics) {

shouldn't this new metric be in TransactionStatistics instead of (or maybe in addition to) StatementStatistics?

pkg/roachpb/app_stats.proto line 65 at r1 (raw file):

  // ClientLat is the time spent waiting for the client to send the statement.
  optional NumericStat client_lat = 28 [(gogoproto.nullable) = false];

nit: i think "client latency" might be a bit confusing to understand. to me, it sounds like "total latency as experienced by the client."

what about a name like "idle latency" or "statement wait latency"

also, here and in the release note, do we need some clarification that this really is only meaningful for explicit transactions?

pkg/sql/executor_statement_metrics.go line 127 at r1 (raw file):

	// Collect the statistics.
	clientLatRaw := phaseTimes.GetClientLatency(ex.statsCollector.PreviousPhaseTimes())

this step will compute the "client latency" for the current statement. i couldn't find the place where we compute the running total and show the total "client latency" for the transaction. is that what the frontend aggregateNumericStats function does?

pkg/sql/sessionphase/session_phase.go line 215 at r1 (raw file):

// GetClientLatency deduces the rough amount of time spent waiting
// for the client while the transaction is open.

nit: can the comment specify what the behavior is for implicit transactions? it should just return 0 right?

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter and @matthewtodd)

pkg/sql/sessionphase/session_phase.go line 226 at r1 (raw file):

	}

	// In general, we have been waiting for the client since the end

I'm curious how accurate will this be if the previous statement returned a lot of data (i.e. the query resulted in millions of rows that were communicated to the client)?

matthewtodd

Thanks, Rafi! Agreed re: transaction stats. I'll be adding those in a second PR.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter and @rafiss)

-- commits line 13 at r1:

Previously, rafiss (Rafi Shamim) wrote…

nit: it should be Release note (sql change) and include something about how a user can access this new data

Oops, thanks, I forgot crdb_internal.statement_statistics was one of the documented ones.

pkg/roachpb/app_stats.go line 122 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

shouldn't this new metric be in TransactionStatistics instead of (or maybe in addition to) StatementStatistics?

It will be, coming in a second PR. Per @bobvawter, the per-statement timings were the most interesting for his approach.

pkg/roachpb/app_stats.proto line 65 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: i think "client latency" might be a bit confusing to understand. to me, it sounds like "total latency as experienced by the client."

what about a name like "idle latency" or "statement wait latency"

also, here and in the release note, do we need some clarification that this really is only meaningful for explicit transactions?

That's good, I do like "idle latency" better. And will clarify the explicit transaction bit.

pkg/sql/executor_statement_metrics.go line 127 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

this step will compute the "client latency" for the current statement. i couldn't find the place where we compute the running total and show the total "client latency" for the transaction. is that what the frontend aggregateNumericStats function does?

The transaction-level aggregation will come in a second PR.

pkg/sql/sessionphase/session_phase.go line 215 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

nit: can the comment specify what the behavior is for implicit transactions? it should just return 0 right?

Good idea, will change.

j82w

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @matthewtodd, and @rafiss)

pkg/sql/sessionphase/session_phase.go line 228 at r3 (raw file):

	// In general, we have been waiting for the client since the end
	// of the previous execution.
	waitingSince := previous.times[PlannerEndExecStmt]

Is the IdleLatency alway the time before the statement? If so how does it track the latency between the last statement and the commit?

j82w · 2022-11-02T19:52:37Z

pkg/sql/sqlstats/sslocal/sql_stats_test.go line 1218 at r3 (raw file):

		},
		{
			name: "multiple transactions",

How about adding a test scenario with 'select pg_sleep(1)' to verify it doesn't include the execution time of the previous statement?

matthewtodd

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @rafiss, and @yuzefovich)

pkg/sql/sessionphase/session_phase.go line 226 at r1 (raw file):

Previously, yuzefovich (Yahor Yuzefovich) wrote…

I'm curious how accurate will this be if the previous statement returned a lot of data (i.e. the query resulted in millions of rows that were communicated to the client)?

That's a good question! Let me look back at SessionQueryServiced to see (a) if it might give us better accuracy and (b) if it's available when we need it.

I had also spent some time looking into Jordan's suggestion of measuring the wait time for pgwire messages, but the mapping from those up to actual SQL executions felt hard to get (and keep?) right. That may be worth revisiting if we're not able to get satisfying accuracy this way.

matthewtodd

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @j82w, @rafiss, and @yuzefovich)

pkg/sql/sessionphase/session_phase.go line 228 at r3 (raw file):

Previously, j82w wrote…

Is the IdleLatency alway the time before the statement? If so how does it track the latency between the last statement and the commit?

Yes, it is. We don't yet track the latency between the last statement and the commit; that'll be for the PR that introduces the transaction-level stat.

pkg/sql/sqlstats/sslocal/sql_stats_test.go line 1218 at r3 (raw file):

Previously, j82w wrote…

How about adding a test scenario with 'select pg_sleep(1)' to verify it doesn't include the execution time of the previous statement?

Good idea, will do.

matthewtodd

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @j82w, @rafiss, and @yuzefovich)

pkg/sql/sessionphase/session_phase.go line 226 at r1 (raw file):

Previously, matthewtodd (Matthew Todd) wrote…

That's a good question! Let me look back at SessionQueryServiced to see (a) if it might give us better accuracy and (b) if it's available when we need it.

I had also spent some time looking into Jordan's suggestion of measuring the wait time for pgwire messages, but the mapping from those up to actual SQL executions felt hard to get (and keep?) right. That may be worth revisiting if we're not able to get satisfying accuracy this way.

SessionQueryServiced does happen slightly later, so I've switched to using it.

And I've gone down a bit of a rabbit hole looking for how this'll behave in the context of large result sets:

As far as I can tell, SessionQueryServiced won't be set until all of the results have been pushed down into to the network stack, at which point the "idle" clock will start ticking for the next query. (@yuzefovich, could you check me here?)

If that assessment is correct, then this idle clock will effectively penalize the client (i.e. mark its next query as having a large idle latency) for the essential overhead in handling these large responses.

...and I think we're okay with that? Since the intent of this feature is to identify unnecessarily long-running transactions that could be candidates for breaking apart, I think we do want these large responses to stand out.

I did find the results_buffer_size session var, hoping to use it to somehow expose some synchronous interaction with the client (wondering if it would behave like the defaultRowFetchSize connection parameter to the pg jdbc driver), but it only appears to affect flushing frequency.

rafiss

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @j82w, and @yuzefovich)

pkg/sql/sessionphase/session_phase.go line 226 at r1 (raw file):

If that assessment is correct, then this idle clock will effectively penalize the client (i.e. mark its next query as having a large idle latency) for the essential overhead in handling these large responses.

...and I think we're okay with that? Since the intent of this feature is to identify unnecessarily long-running transactions that could be candidates for breaking apart, I think we do want these large responses to stand out.

I also agree that this behavior sounds good and it's useful to track as part of this metric

yuzefovich

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @bobvawter, @j82w, @matthewtodd, and @rafiss)

pkg/sql/sessionphase/session_phase.go line 226 at r1 (raw file):

Previously, rafiss (Rafi Shamim) wrote…

If that assessment is correct, then this idle clock will effectively penalize the client (i.e. mark its next query as having a large idle latency) for the essential overhead in handling these large responses.

...and I think we're okay with that? Since the intent of this feature is to identify unnecessarily long-running transactions that could be candidates for breaking apart, I think we do want these large responses to stand out.

I also agree that this behavior sounds good and it's useful to track as part of this metric

Sounds good to me too, thanks for checking!

matthewtodd · 2022-11-07T20:46:12Z

Thanks, everyone! The failing test here is #91420, to be fixed by #91421. I'll rebase and ping you all for the when that lands.

jordanlewis · 2022-11-07T20:56:14Z

If that assessment is correct, then this idle clock will effectively penalize the client (i.e. mark its next query as having a large idle latency) for the essential overhead in handling these large responses.

I'm not sure that this is a good idea. I think it could lead to confusing outcomes - people look at statements in isolation, and they might see a large idle team and won't know what to make of it.

A related question - are we measuring the delay in between two statements within a transaction with this PR? And, how would that delay show up in the UI?

I think it might be good to do some basic testing of different scenarios so we can figure out how this stat maps to common user behaviors.

cc @kevin-v-ngo for any thoughts as well.

jordanlewis · 2022-11-07T20:56:43Z

(My comment isn't blocking for merging the PR)

Part of #86667 In an attempt to identify application callsites holding transactions open while doing adjacent work, we measure a per-statement fingerprint "idle latency" and surface it in the sqlstats tables. Developers may then search for statements with higher-than-usual idle latencies to find slow spots in their applications. Release note (sql change): A new NumericStat, `idleLat`, was introduced to the statistics column of `crdb_internal.statement_statistics`, reporting the time spent waiting for the client to send the statement while holding a transaction open. Developers may use this stat to identify opportunities for restructuring their apps (perhaps moving unrelated work out of the transaction or splitting it into smaller transactions) to reduce contention.

bobvawter · 2022-11-08T14:33:43Z

A related question - are we measuring the delay in between two statements within a transaction with this PR? And, how would that delay show up in the UI?

The questions that I want to answer for customers are:

Are transaction times long because there's an excessive amount of "think time" in the workload?
Which statement(s) in the transaction have the biggest think-time gaps?
If possible, identify cases where draining a large result set back to the client takes a non-trivial amount of time.

The goal of the above is to let us make specific recommendations to customers about where the biggest wins in breaking up transactions would be or where a workload isn't keeping up with CRDB.

rafiss · 2022-11-08T21:14:37Z

are we measuring the delay in between two statements within a transaction with this PR?

According to https://reviewable.io/reviews/cockroachdb/cockroach/91098#-NFsqt6g4024WqV5V8YZ:-NFsugzZ3DPIu3oIwiVY:b-msldg1 and https://reviewable.io/reviews/cockroachdb/cockroach/91098#-NFtsABhEKUu6lKhY34V:-NFtuq2mDqfaxrQvTAkC:b-ujc2q9, that is coming in a future commit

rafiss

lgtm!

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @bobvawter, @j82w, and @matthewtodd)

matthewtodd · 2022-11-09T15:25:11Z

Thanks for the reviews, all!

Before merging, I've discovered these measurements are accurate when running queries in psql but zero-ish in cockroach sql. 🤔 I'm suspecting some optimizations in pgx (which we use in cockroach sql) may account for the difference. I'm going to wire in a pgx variant of the tests to investigate.

If that assessment is correct, then this idle clock will effectively penalize the client (i.e. mark its next query as having a large idle latency) for the essential overhead in handling these large responses.

I'm not sure that this is a good idea. I think it could lead to confusing outcomes - people look at statements in isolation, and they might see a large idle team and won't know what to make of it.

I do agree about confusing outcomes here, though as far as I can tell, we're starting & stopping the idle clock at the latest & earliest times we can (is that true?), and I think we're still providing a useful signal of something that has a real impact on contention: "go look at what happens before this statement." How does that sound?

rafiss · 2022-11-09T16:46:43Z

accurate when running queries in psql but zero-ish in cockroach sql

I would guess that this is because psql only executes the query you type in, where as cockroach sql runs additional things such as SHOW DATABASE, and SHOW LAST QUERY STATISTICS.

matthewtodd · 2022-11-09T18:24:39Z

I'm going to wire in a pgx variant of the tests to investigate.

I did do this locally, and all the tests passed, as expected. So it's not pgx.

accurate when running queries in psql but zero-ish in cockroach sql

I would guess that this is because psql only executes the query you type in, where as cockroach sql runs additional things such as SHOW DATABASE, and SHOW LAST QUERY STATISTICS.

Thanks, @rafiss, that was it! 🎉

Poking around, I can now get accurate idle latencies from cockroach sql sessions by running \set check_syntax = false in those sessions. I think we're good to merge here!

bors r+

craig · 2022-11-09T23:45:45Z

This PR was included in a batch that was canceled, it will be automatically retried

craig · 2022-11-10T00:27:32Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2022-11-10T00:55:17Z

Build failed (retrying...):

Bazel Essential CI (Cockroach)

craig · 2022-11-10T02:00:36Z

Build failed:

Bazel Essential CI (Cockroach)

matthewtodd · 2022-11-10T12:30:36Z

bors r+

craig · 2022-11-10T13:28:45Z

Build succeeded:

Bazel Essential CI (Cockroach)

92281: sqlstats: add idle_lat to node_statement_statistics r=matthewtodd a=matthewtodd Part of #86667. For parity with our existing statement-level work in #91098. (No release note here since `node_statement_statistics` is not one of the ["Use in production" `crdb_internal` tables](https://www.cockroachlabs.com/docs/stable/crdb-internal.html#tables).) Release note: None 92303: sql: fix flaky role logic test r=ZhouXing19 a=rafiss fixes #92164 The role IDs are not important to this test, and were causing flakiness since the backing sequence may generate IDs with gaps. Release note: None Co-authored-by: Matthew Todd <todd@cockroachlabs.com> Co-authored-by: Rafi Shamim <rafi@cockroachlabs.com>

92632: opt: fix rare incorrect results due to sort between paired joins r=DrewKimball a=DrewKimball Previously, it was possible for paired joins to produce incorrect results in the case when an ordering was required of their output, and a sort was added between the paired joins to enforce the ordering. This patch prevents a sort from being added to the output of the first join in a set of paired joins. This is necessary because the continuation column that is used to indicate false positives matched by the first join relies on the ordering being maintained between the joins. Fixes #89603 Release note: None 92669: roachtest/cdc: export stats for initial scan test to roachperf r=jayshrivastava a=jayshrivastava This change updates the cdc/initial_scan_only test to produce a `stats.json` artifact to be consumed by roachprod. This file contains stats for p99 foreground latency, changefeed throughput, and CPU usage. Release note: None Epic: None <img width="940" alt="image" src="https://user-images.githubusercontent.com/18633281/204564990-740e86e2-5c43-4d45-a715-4932428a5851.png"> 92693: dev: add rewritable paths for ccl execbuilder tests r=rharding6373 a=rharding6373 There are some ccl tests that use test files in `/pkg/sql/opt/exec/execbuilder`. This commit adds this as a rewritable path so that we can use the `--rewrite` flag with `dev`. Release note: None Epic: None 92695: sqlstats: record idle latency for transactions r=matthewtodd a=matthewtodd Part of #86667 Follows #91098 Release note (sql change): A new NumericStat, idleLat, was introduced to the statistics column of crdb_internal.transaction_statistics, reporting the time spent waiting for the client to send statements while holding a transaction open. 92760: streamclient: replace usage of deprecated ioutil.ReadFile function r=stevendanna a=andyyang890 This patch fixes a lint error resulting from a usage of the deprecated ioutil.ReadFile function. Fixes #92761 Release note: None 92763: jobsprotectedtsccl: unskip TestJobsProtectedTimestamp r=ajwerner a=ajwerner It was fixed by #92692. Fixes #91865. Release note: None Co-authored-by: Drew Kimball <drewk@cockroachlabs.com> Co-authored-by: Jayant Shrivastava <jayants@cockroachlabs.com> Co-authored-by: rharding6373 <rharding6373@users.noreply.github.com> Co-authored-by: Matthew Todd <todd@cockroachlabs.com> Co-authored-by: Andy Yang <yang@cockroachlabs.com> Co-authored-by: Andrew Werner <awerner32@gmail.com>

matthewtodd marked this pull request as ready for review November 2, 2022 14:02

matthewtodd requested a review from a team November 2, 2022 14:02

matthewtodd requested a review from a team as a code owner November 2, 2022 14:02

matthewtodd requested review from rafiss and bobvawter November 2, 2022 14:02

rafiss requested changes Nov 2, 2022

View reviewed changes

yuzefovich reviewed Nov 2, 2022

View reviewed changes

matthewtodd commented Nov 2, 2022

View reviewed changes

matthewtodd changed the title ~~sqlstats: record client latency for statements~~ sqlstats: record idle latency for statements Nov 2, 2022

j82w reviewed Nov 2, 2022

View reviewed changes

matthewtodd commented Nov 2, 2022

View reviewed changes

matthewtodd commented Nov 7, 2022

View reviewed changes

rafiss reviewed Nov 7, 2022

View reviewed changes

yuzefovich reviewed Nov 7, 2022

View reviewed changes

rafiss approved these changes Nov 8, 2022

View reviewed changes

craig bot merged commit 53e75b0 into cockroachdb:master Nov 10, 2022

cockroach-teamcity mentioned this pull request Nov 10, 2022

PR #91098 - sqlstats: record idle latency for statements cockroachdb/docs#15593

Open

ericharmeling mentioned this pull request Nov 11, 2022

sql: skip flakey TestSQLStatsIdleLatencies #91712

Merged

matthewtodd deleted the inter-statement-latency branch November 15, 2022 15:14

matthewtodd mentioned this pull request Nov 21, 2022

sqlstats: add idle_lat to node_statement_statistics #92281

Merged

matthewtodd mentioned this pull request Nov 29, 2022

sqlstats: record idle latency for transactions #92695

Merged

kevin-v-ngo mentioned this pull request Apr 26, 2023

23.1 Acceptance testing: Provide a clear tool tip to describe Idle time in the console #102361

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

sqlstats: record idle latency for statements #91098

sqlstats: record idle latency for statements #91098

matthewtodd commented Nov 1, 2022 •

edited

Loading

cockroach-teamcity commented Nov 1, 2022

rafiss left a comment

yuzefovich left a comment

matthewtodd left a comment

j82w left a comment

j82w commented Nov 2, 2022

matthewtodd left a comment

matthewtodd left a comment

matthewtodd left a comment

rafiss left a comment

yuzefovich left a comment

matthewtodd commented Nov 7, 2022 •

edited by petermattis

Loading

jordanlewis commented Nov 7, 2022

jordanlewis commented Nov 7, 2022

bobvawter commented Nov 8, 2022

rafiss commented Nov 8, 2022

rafiss left a comment

matthewtodd commented Nov 9, 2022

rafiss commented Nov 9, 2022

matthewtodd commented Nov 9, 2022

craig bot commented Nov 9, 2022

craig bot commented Nov 10, 2022

craig bot commented Nov 10, 2022

craig bot commented Nov 10, 2022

matthewtodd commented Nov 10, 2022

craig bot commented Nov 10, 2022

sqlstats: record idle latency for statements #91098

sqlstats: record idle latency for statements #91098

Conversation

matthewtodd commented Nov 1, 2022 • edited Loading

cockroach-teamcity commented Nov 1, 2022

rafiss left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

matthewtodd left a comment

Choose a reason for hiding this comment

j82w left a comment

Choose a reason for hiding this comment

j82w commented Nov 2, 2022

matthewtodd left a comment

Choose a reason for hiding this comment

matthewtodd left a comment

Choose a reason for hiding this comment

matthewtodd left a comment

Choose a reason for hiding this comment

rafiss left a comment

Choose a reason for hiding this comment

yuzefovich left a comment

Choose a reason for hiding this comment

matthewtodd commented Nov 7, 2022 • edited by petermattis Loading

jordanlewis commented Nov 7, 2022

jordanlewis commented Nov 7, 2022

bobvawter commented Nov 8, 2022

rafiss commented Nov 8, 2022

rafiss left a comment

Choose a reason for hiding this comment

matthewtodd commented Nov 9, 2022

rafiss commented Nov 9, 2022

matthewtodd commented Nov 9, 2022

craig bot commented Nov 9, 2022

craig bot commented Nov 10, 2022

craig bot commented Nov 10, 2022

craig bot commented Nov 10, 2022

matthewtodd commented Nov 10, 2022

craig bot commented Nov 10, 2022

matthewtodd commented Nov 1, 2022 •

edited

Loading

matthewtodd commented Nov 7, 2022 •

edited by petermattis

Loading