Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: enable continuous CPU profiler #118850

Merged

Conversation

aadityasondhi
Copy link
Collaborator

@aadityasondhi aadityasondhi commented Feb 6, 2024

This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min.

The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere.

Informs #97699.

Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.

@aadityasondhi aadityasondhi requested review from kvoli, sumeerbhola and a team February 6, 2024 20:54
@aadityasondhi aadityasondhi requested a review from a team as a code owner February 6, 2024 20:54
@aadityasondhi aadityasondhi requested review from abarganier and removed request for a team February 6, 2024 20:54
Copy link

blathers-crl bot commented Feb 6, 2024

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Contributor

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: - I think it's about time we turned this on.

I think there was some original desire for benchmarking, but I feel fairly confident that this is acceptable given the 20 minute interval. I also recall @Santamaura doing a good amount of benchmarking originally and not being able to find any notable performance hits. FWIW, the CRL telemetry cluster has had this feature enabled with a 5 minute interval and 80% threshold for many months now and hasn't experienced issues.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)

Copy link
Collaborator

@kvoli kvoli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained (waiting on @sumeerbhola)

@sumeerbhola
Copy link
Collaborator

Am I correct in understanding that the CPU metric will be read and the threshold will be checked (in startSampleEnvironment) every 10s? So short spikes in CPU usage can also trigger a profile (which is desirable)?

Can we reduce the threshold to 65%? We have seen examples of high runnable goroutines when the CPU utilization doesn't look very high (because the utilization is smoothing over some interval). For example:
Screenshot 2024-02-07 at 2 09 41 PM
Screenshot 2024-02-07 at 2 09 55 PM

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @aadityasondhi)

@kvoli
Copy link
Collaborator

kvoli commented Feb 7, 2024

Discussed offline with @aadityasondhi, we should discuss the cleanup story here. Do we need to also introduce a mechanism which removes profiles after X days or rotates profiles so that there is a ceiling?

Probably worth looking into how memory profiles profiles are handled.

Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTR all!

@sumeerbhola Yes that is correct, startSampleEnvironment is where we read the metric and it is read at the sampling interval which is defaulted to 10s. Lowered to 65%.

@kvoli If I am reading the code in startSampleEnvironment correctly, it seems that both the heap and cpu profiler use the same underlying pattern neither do any sort of clean up. A new file seems to be generated each time the profile is taken. I think it is safe to punt this for now and revisit the gc story if it becomes a problem, thoughts?

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @aadityasondhi)

Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correction to above, we do have a GC policy in all of our dump stores. For CPU profiles specifically, we use the cluster setting server.cpu_profile.total_dump_size_limit, which is set to 128MiB as the default.

Reviewable status: :shipit: complete! 3 of 0 LGTMs obtained (waiting on @aadityasondhi)

This patch enables the CPU profile with a threshold of 75% and a max
frequency of 20min.

The motivation for this change is that recently, I have noticed a few
cases during investigations where CPU utilization spikes but we lack
profiles after the fact. This hinders our ability to dig deeper into the
source of high CPU usage. Having profiles can help inform us of future
AC integrations that we may need, or other performance improvements we
can do elsewhere.

Informs cockroachdb#97699.

Release note (ops change): CRDB will now automatically generate CPU
profiles if there is an increase in CPU utilization. This can help
investigate possible issues after the fact.
@aadityasondhi
Copy link
Collaborator Author

bors r+

@craig craig bot merged commit 3671865 into cockroachdb:master Feb 8, 2024
7 of 9 checks passed
@craig
Copy link
Contributor

craig bot commented Feb 8, 2024

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants