
server: add cpu profiler #95623

Merged (1 commit) on Mar 2, 2023
Conversation

@Santamaura Santamaura (Contributor) commented Jan 20, 2023

This PR adds a cpu profiler to the server package.
The following cluster settings have been added to configure
the cpu profiler:

  • server.cpu_profile.cpu_usage_combined_threshold is the baseline
    value above which cpu profiles should be taken
  • server.cpu_profile.interval is the interval at which the high-water
    mark resets to the cpu_usage_combined_threshold value
  • server.cpu_profile.duration is how long a cpu profile is taken for
  • server.cpu_profile.enabled is the on/off switch for the
    cpu profiler

Fixes: #75799

Release note: None
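
For readers skimming the settings above, here is a minimal, self-contained Go sketch of the high-water-mark behavior they describe. It is illustrative only and not the code in this PR: the type, field names, and default values are invented; only the setting semantics come from the description.

	// Illustrative sketch only (not the actual implementation): a profiler that
	// fires when combined CPU usage exceeds a high-water mark, where the mark
	// resets to the configured threshold once per interval, and each profile
	// runs for a fixed duration while the feature is enabled.
	package main

	import (
		"fmt"
		"time"
	)

	type cpuProfilerSketch struct {
		enabled       bool          // server.cpu_profile.enabled
		threshold     float64       // server.cpu_profile.cpu_usage_combined_threshold
		interval      time.Duration // server.cpu_profile.interval
		duration      time.Duration // server.cpu_profile.duration
		highWaterMark float64       // highest usage seen since the last reset
		lastReset     time.Time
	}

	// maybeTakeProfile reports whether a profile should be taken for the given
	// combined CPU usage sample (a value in [0,1]).
	func (p *cpuProfilerSketch) maybeTakeProfile(now time.Time, usage float64) bool {
		if !p.enabled {
			return false
		}
		// Reset the high-water mark back to the baseline threshold once per interval.
		if now.Sub(p.lastReset) >= p.interval {
			p.highWaterMark = p.threshold
			p.lastReset = now
		}
		// Only profile when usage exceeds everything seen since the last reset.
		if usage <= p.highWaterMark {
			return false
		}
		p.highWaterMark = usage
		return true
	}

	func main() {
		p := &cpuProfilerSketch{enabled: true, threshold: 0.8, interval: time.Minute, duration: 10 * time.Second}
		p.highWaterMark = p.threshold
		p.lastReset = time.Now()
		fmt.Println(p.maybeTakeProfile(time.Now(), 0.85)) // true: above the 0.8 baseline
		fmt.Println(p.maybeTakeProfile(time.Now(), 0.82)) // false: below the new high-water mark (0.85)
		fmt.Printf("each profile would run for %s\n", p.duration)
	}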

@cockroach-teamcity (Member)

This change is Reviewable

@Santamaura Santamaura force-pushed the cpu-profiler branch 7 times, most recently from 457da7a to 41da6e9 on January 23, 2023 22:23
@Santamaura Santamaura marked this pull request as ready for review January 23, 2023 23:13
@Santamaura Santamaura requested review from a team as code owners January 23, 2023 23:13
@Santamaura Santamaura requested a review from a team January 23, 2023 23:13
@Santamaura Santamaura requested review from a team as code owners January 23, 2023 23:13
@Santamaura Santamaura requested review from herkolategan, smg260, tbg, koorosh and zachlite and removed request for a team January 23, 2023 23:13
@tbg tbg (Member) left a comment

Exciting! Haven't reviewed in detail yet but left a general comment about finding a better set of signals to trigger the collection on.

@@ -188,6 +197,9 @@ func startSampleEnvironment(
 	if queryProfiler != nil {
 		queryProfiler.MaybeDumpQueries(ctx, cfg.sessionRegistry, cfg.st)
 	}
+	if cpuProfiler != nil {
+		cpuProfiler.MaybeTakeProfile(ctx, cfg.runtime.HostDiskReadCount.Value(), cfg.runtime.HostDiskWriteCount.Value(), cfg.runtime.Uptime.Value())

Disk read and write counts don't seem like the appropriate metrics to trigger a CPU profile here. We usually care about CPU profiles when the CPU is busy! The combined normalized CPU usage in [0,1] is cfg.runtime.CPUCombinedPercentNorm.Value(). I think (but you should verify) that we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense.

But taking a step back, we have admission control, too. If admission control limits CPU usage, we still want to know what consumes CPU, but admission control might limit CPU effectively enough to stay below threshold. I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization. I think this will need a little bit of plumbing and maybe some advice by the likes of @irfansharif and @sumeerbhola on how we should go about it.

@Santamaura Santamaura (Contributor, Author) left a comment


pkg/server/env_sampler.go line 201 at r1 (raw file):

cfg.runtime.CPUCombinedPercentNorm.Value()

Thanks for clarifying which metric we should use; I've switched to using it.

we recommend users don't run with CPU >= 70% utilized, so maybe a default threshold of 0.8 makes sense

The docs say crdb should not persistently use >80% CPU, so 0.8 may indeed be the ideal default.

I think it would make sense to take CPU profiles when CPU admission control is active, too, regardless of CPU utilization

I'm not too familiar with admission control. Would we simply want to know whether admission control is enabled (and, if it is, override the default cpu usage threshold setting with some lower value), or use an admission control metric for another comparison?

@irfansharif irfansharif (Contributor) left a comment


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote… (quoted above)

For "regular" work (i.e. not backups, changefeeds, etc.), AC will try to keep CPU use as high as possible. If the incoming work on a given node is all high-priority and consumes >90% of CPU, it'll keep it at that threshold. If AC is intentionally reducing the CPU consumption of "elastic" work (backups, changefeeds, etc.), it's still trying to keep that usage as high as possible: the amount of "regular" work being done WRT CPU% on those nodes is unaffected, just the elastic CPU% is lowered. So I think we can ignore AC for this PR.

Only briefly looking at MaybeTakeProfile, we're taking CPU profiles once-per-minute when CPU % > 80%? Have you also considered what it would take to capture a profile when there's a noticeable "spike"? Feel free to defer this to a future PR, perhaps filing a dedicated issue for it, but in at least some of the incidents we've tied to #75799 we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.

@Santamaura Santamaura (Contributor, Author) left a comment


pkg/server/env_sampler.go line 201 at r1 (raw file):

we were also interested in these "spikes" where we didn't truly exceed 80%, but the spikiness was not understood by us. Perhaps we can devise some simple "spike detection" heuristics for it.

Hmm, what if the CpuProfiler stored the last CPUCombinedPercentNorm.Value() regardless of whether a profile is taken, and we then calculated the delta between the current and previous values? We'd check this delta against a new setting (e.g. server.cpu_profile.cpu_usage_delta) and, if it exceeds the setting, take a profile. What do you think: would this approach handle interesting cpu spikes? If so, what do you think the default delta for the setting should be?
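
To make the proposal concrete, a small self-contained sketch of the delta check being described follows. The cpu_usage_delta setting name comes from this comment; the type, default, and sample values are made up for illustration.

	package main

	import "fmt"

	// deltaDetector sketches the proposed check: take a profile when CPU usage
	// jumps by more than a configured delta since the previous sample, even if
	// it stays below the absolute threshold.
	type deltaDetector struct {
		delta    float64 // hypothetical server.cpu_profile.cpu_usage_delta value
		prev     float64 // last observed combined CPU usage (0..1)
		havePrev bool
	}

	func (d *deltaDetector) shouldProfile(usage float64) bool {
		jumped := d.havePrev && usage-d.prev > d.delta
		d.prev, d.havePrev = usage, true
		return jumped
	}

	func main() {
		d := &deltaDetector{delta: 0.2}
		for _, u := range []float64{0.30, 0.35, 0.60, 0.62} {
			fmt.Printf("usage=%.2f profile=%v\n", u, d.shouldProfile(u))
		}
		// Only the 0.35 -> 0.60 jump (0.25 > 0.20) triggers a profile.
	}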

@irfansharif irfansharif (Contributor) left a comment


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, Santamaura (Alex Santamaura) wrote… (quoted above)

This "spike" detection is worth separating out into its own PR so it doesn't stall this one; it's a very useful idea. As for your question, it's hard to answer without some trial and error. I found https://stackoverflow.com/questions/22583391/peak-signal-detection-in-realtime-timeseries-data/22640362#22640362 which conveniently comes with a very simple Go implementation that we could perhaps consider instead. It takes a moving mean of the last N measurements of CPU %, and triggers if a new sample is some standard deviations away from the moving mean, which sounds closer to what I think we want. It would work for various baseline CPU %s, even when nodes are running hot (~65%) or cold (~10%). What do you think? Want to file an issue and track separately? Quick experiments would tell us how effective it is.
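
For reference, a simplified, self-contained sketch of that moving-mean idea is below. The linked answer's full algorithm also smooths the window with an "influence" parameter; the window size and the k multiplier here are arbitrary and would need the trial and error mentioned above.

	package main

	import (
		"fmt"
		"math"
	)

	// spikeDetector keeps a window of recent CPU samples and flags a new sample
	// that sits more than k standard deviations above the moving mean.
	type spikeDetector struct {
		window []float64
		size   int     // number of samples in the moving window
		k      float64 // how many standard deviations count as a spike
	}

	// observe records a sample and reports whether it looks like a spike.
	func (s *spikeDetector) observe(sample float64) bool {
		spike := false
		if len(s.window) == s.size {
			mean, sd := meanStddev(s.window)
			spike = sample > mean+s.k*sd
			s.window = s.window[1:]
		}
		s.window = append(s.window, sample)
		return spike
	}

	func meanStddev(xs []float64) (mean, sd float64) {
		for _, x := range xs {
			mean += x
		}
		mean /= float64(len(xs))
		var variance float64
		for _, x := range xs {
			variance += (x - mean) * (x - mean)
		}
		return mean, math.Sqrt(variance / float64(len(xs)))
	}

	func main() {
		d := &spikeDetector{size: 5, k: 3}
		for _, x := range []float64{0.10, 0.11, 0.09, 0.10, 0.12, 0.11, 0.35} {
			if d.observe(x) {
				fmt.Printf("spike at %.2f\n", x)
			}
		}
		// The jump to 0.35 is flagged even though it is nowhere near 80%.
	}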

@irfansharif irfansharif (Contributor) left a comment


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote… (quoted above)

The Go implementation: https://github.com/MicahParks/peakdetect/blob/master/peakdetect.go.

@sumeerbhola sumeerbhola (Collaborator) left a comment


pkg/server/env_sampler.go line 201 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote… (quoted above)

CPU utilization is fine as one of the signals, but it is coarse grained (wrt time) and misses behavior at tiny time scales that is observed by admission control to do throttling. We have seen production situations where CPU utilization is low, but CPU admission control is active (sometimes coupled with a spiky runnable goroutine count, and sometimes not).
If we need to make a decision periodically, I'd suggest also calling into admission control to see if it recommends doing anything based on things it observed since the last such call. We can evolve the admission control implementation for this method (that would return a bool) over time -- for starters it could be whether the KV slots were exhausted for more than N milliseconds (say 100 milliseconds) during that 1min (if that was the interval).

This is orthogonal to the spike detection discussion above, in my mind. If there was a one time short spike that isn't caught by this periodic sampling, we may not care about it. If it is periodic, say a 5s spike every 10min and we keep missing it because of 1min sampling, that would be very unfortunate. By spike I don't mean a delta change that is large, but something that returns true from the decision function on whether to take a profile or not (we can of course ignore this true if we have recently taken a profile). We could just poll more frequently to capture these.

And of course there is the spike that is a delta change where two consecutive polls are returning true to take a profile, and the latter has experienced a delta change, and we've already taken a cpu profile recently, so we are going to miss something by not taking the profile now. I don't know how important this is.
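
A rough sketch of what such a periodic decision hook could look like, purely for illustration: the interface, names, and slot-exhaustion accounting below are hypothetical and do not reflect admission control's actual API.

	package main

	import (
		"fmt"
		"time"
	)

	// profileAdvisor is a hypothetical hook: admission control periodically
	// reports whether anything it observed since the last call warrants a CPU
	// profile, regardless of overall CPU utilization.
	type profileAdvisor interface {
		ShouldProfile() bool
	}

	// slotExhaustionAdvisor recommends a profile if KV slot exhaustion exceeded
	// a budget (e.g. 100ms) during the last sampling interval (e.g. 1min).
	type slotExhaustionAdvisor struct {
		exhaustedSinceLastCall time.Duration // would be fed by admission control
		budget                 time.Duration
	}

	func (a *slotExhaustionAdvisor) ShouldProfile() bool {
		exhausted := a.exhaustedSinceLastCall
		a.exhaustedSinceLastCall = 0 // reset accounting for the next interval
		return exhausted > a.budget
	}

	func main() {
		var adv profileAdvisor = &slotExhaustionAdvisor{
			exhaustedSinceLastCall: 150 * time.Millisecond,
			budget:                 100 * time.Millisecond,
		}
		fmt.Println(adv.ShouldProfile()) // true: exhaustion exceeded the budget
		fmt.Println(adv.ShouldProfile()) // false: accounting was reset
	}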

@Santamaura (Contributor, Author)

What do you think? Want to file an issue and track separately?

Yes makes sense, I've created the issue. Thanks for the links as well.

@tbg (Member)

tbg commented Feb 27, 2023

Filed #97699

@abarganier abarganier (Contributor) left a comment

Try initing kv with --splits 1000 and run the workload for a while to let it balance out. Then increase the concurrency until the system is at 85%.

I'd like to see this as well if you can make it happen!

Nonetheless, thanks for running these thorough tests @Santamaura - much appreciated. Given the latency increase I agree that there's a non-negligible amount of overhead here, but with the cluster setting default set to false I think it's okay to proceed. We can continue optimizing as we go, but as it stands this is a solid platform to work with!

bad allocations that come with the profiler labels

I'll be curious to see the difference when we disable labels, but we can dive into that with #97699 (thanks for filing @tbg)

Otherwise, I'm satisfied with these tests and I'm happy with the server.cpu_profile.enabled=false setting by default.

As one final thing, can we add a panic handler to MaybeTakeProfile?

:lgtm: otherwise! Of course, let's be sure to also address Tobi's final comments.



pkg/server/profiler/cpuprofiler.go line 42 at r76 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Default this to MaxInt64 while we figure out the perf impact.

I'm fine with that, but let's make sure to document that if you enable with server.cpu_profile.enabled=true, you will also need to configure server.cpu_profile.cpu_usage_combined_threshold to a realistic value as well.


pkg/server/profiler/cpuprofiler.go line 110 at r76 (raw file):

// MaybeTakeProfile takes a cpu profile if cpu usage is high enough.
func (cp *CPUProfiler) MaybeTakeProfile(ctx context.Context, currentCpuUsage int64) {
	if !cpuProfileEnabled.Get(&cp.st.SV) {

nit: can we add a panic handler here just in case at the top of the function? IMO these profilers should never be the cause of a process crashing (We did the same for activequeryprofiler)

	defer func() {
		if p := recover(); p != nil {
			logcrash.ReportPanic(ctx, &cp.st.SV, p, 1)
		}
	}()

This might just be me being extra paranoid, but I figure it can't hurt 😅

@Santamaura Santamaura (Contributor, Author) left a comment


pkg/cli/start.go line 1438 at r24 (raw file):

Previously, tbg (Tobias Grieger) wrote…

My intention was that we'd remove this - if we want periodic CPU profiles taken every 7s, then we can use the cluster setting to set the CPU usage threshold to 1%, set the reset interval to 7s, and then we have it.

The single loop driving the CPU profiler is here:

if cpuProfiler != nil {
	cpuProfiler.MaybeTakeProfile(ctx, int64(cfg.runtime.CPUCombinedPercentNorm.Value()*100))
}

Removed.


pkg/server/profiler/cpuprofiler.go line 42 at r76 (raw file):

Previously, abarganier (Alex Barganier) wrote… (quoted above)

Done. I've expanded the server.cpu_profile.enabled description to mention this.


pkg/server/profiler/cpuprofiler.go line 110 at r76 (raw file):

Previously, abarganier (Alex Barganier) wrote… (quoted above)

Added.

@abarganier abarganier (Contributor) left a comment

:lgtm: once CI passes - nice work @Santamaura!


@Santamaura Santamaura (Contributor, Author) left a comment

Try initing kv with --splits 1000 and run the workload for a while to let it balance out. Then increase the concurrency until the system is at 85%.

I tried this a couple of days ago, but after bumping concurrency a number of times I was only able to get cpu to ~75%, and at some point the admin ui became unresponsive. Regardless, I set the cpu profiler to take profiles at 75%, and the increase in the numbers on the workload and graphs seemed roughly the same as in the previous test.


@Santamaura (Contributor, Author)

TFTRs!
bors r+

@craig (Contributor)

craig bot commented Mar 2, 2023

Build failed (retrying...):

@craig (Contributor)

craig bot commented Mar 2, 2023

Build succeeded:

@craig craig bot merged commit a44c3c2 into cockroachdb:master Mar 2, 2023
@blathers-crl

blathers-crl bot commented Mar 2, 2023

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from cf97b4f to blathers/backport-release-22.2-95623: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cucaroach (Contributor)

Late to the show, but this is awesome! How was this not release-note worthy, I wonder? Are we concerned about its usage?

@tbg (Member)

tbg commented Jun 16, 2023

@rmloveland looks like we dropped the ball on the release note here, would you be able to add this to the right backlog somewhere so that we mention it for 23.2 at least?

@rmloveland (Collaborator)

rmloveland commented Jun 16, 2023 via email


Successfully merging this pull request may close these issues.

server: add cpuprofile_dumper
8 participants