server: enable continuous CPU profiler #118850
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
672ff8c to 7196e2c
- I think it's about time we turned this on.
I think there was some original desire for benchmarking, but I feel fairly confident that this is acceptable given the 20-minute interval. I also recall @Santamaura doing a good amount of benchmarking originally and not being able to find any notable performance hits. FWIW, the CRL telemetry cluster has had this feature enabled with a 5-minute interval and 80% threshold for many months now and hasn't experienced issues.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @kvoli and @sumeerbhola)
Reviewed 1 of 1 files at r1, all commit messages.
Reviewable status: complete! 2 of 0 LGTMs obtained (waiting on @sumeerbhola)
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @aadityasondhi)
Discussed offline with @aadityasondhi: we should discuss the cleanup story here. Do we also need to introduce a mechanism that removes profiles after X days, or rotates profiles so that there is a ceiling? Probably worth looking into how memory profiles are handled.
TFTR all!
@sumeerbhola Yes, that is correct: startSampleEnvironment is where we read the metric, and it is read at the sampling interval, which defaults to 10s. Lowered to 65%.
@kvoli If I am reading the code in startSampleEnvironment correctly, both the heap and CPU profilers use the same underlying pattern, and neither does any sort of cleanup. A new file seems to be generated each time a profile is taken. I think it is safe to punt on this for now and revisit the GC story if it becomes a problem. Thoughts?
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @aadityasondhi)
Correction to the above: we do have a GC policy in all of our dump stores. For CPU profiles specifically, we use the cluster setting server.cpu_profile.total_dump_size_limit, which defaults to 128MiB.
Reviewable status: complete! 3 of 0 LGTMs obtained (waiting on @aadityasondhi)
This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min.
The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere.
Informs cockroachdb#97699.
Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.
7196e2c to 9036430
bors r+
Build succeeded: