-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: enable CPU profiler by default - benchmarking tests #97699
Comments
Related issue to roll out on cloud: https://cockroachlabs.atlassian.net/browse/CC-24928 For the record, it's been enabled for some time on |
FYI it looks like CC rollout is in progress, see ticket above. |
@Santamaura do you know when this will get merged on master roachtests? A failure came up in #106140 which would have benefited from auto CPU profiles. |
I don't. It's not in the obs inf current milestone though I can try to find out if we are staging it for the next milestone. |
`restore/tpce/400GB/aws/nodes=8/cpus=8` recently failed with high CPU usage on one node cockroachdb#106140. The high CPU usage was attributed within replication, however we were unable to determine where exactly. Enable the automatic CPU profiler on `restore/*` roachtests, which will collect profiles once normalized CPU utilization exceeds 70%. Note this is supposed to be a temporary addition, which will be subsumed by cockroachdb#97699. Informs: cockroachdb#106140 Release note: None
This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min. The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere. Informs cockroachdb#97699. Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.
This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min. The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere. Informs cockroachdb#97699. Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.
…118970 #118975 118293: roachtest: add multi-region mixed-version test r=srosenberg,herkolategan a=renatolabs This commit adds a `multi-region/mixed-version` test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC worload (with `tolerate-errors`), along with a short TPCC workload that runs in mixed-version state (without `tolerate-errors`). With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well. Fixes: #114803 Release note: None **roachtest: change TPCC functions to take a logger instance** This makes it easier to organize logs for a complex test by allowing the caller to inject the logger instance that should be used in a particular call to `runTPCC`. More immediately, we make use of this change in the recently introduced `multi-region/mixed-version` test, and pass the step's logger to those functions. 118827: roachpb: stop reading `Lease.DeprecatedStartStasis` r=erikgrinaker a=erikgrinaker This will allow removing the field in a later version. We still have to populate the field for backwards compatibility with 23.2. Epic: none Release note: None 118850: server: enable continuous CPU profiler r=aadityasondhi a=aadityasondhi This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min. The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere. Informs #97699. Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact. 118865: technotes: SQL statistics r=maryliag a=maryliag Tech notes on SQL Statistics. Part Of [CRDB-35839](https://cockroachlabs.atlassian.net/browse/CRDB-35839) Images of how the diagrams generated using mermaid will be rendered: <img width="814" alt="Screenshot 2024-02-06 at 4 10 05 PM" src="https://github.com/cockroachdb/cockroach/assets/1017486/9dfbd2b1-4454-41f1-8840-dc39d41059e5"> <img width="285" alt="Screenshot 2024-02-06 at 4 42 39 PM" src="https://github.com/cockroachdb/cockroach/assets/1017486/bf570d55-1f5d-454b-b693-fbcf5e082a39"> Release note: None 118905: pgwire: increase timeout and add Error() call r=rafiss a=rickystewart This test seems to deadlock sporadically under remote execution. I believe the connection is timing out, which due to the way this test was written causes it to deadlock. I'm increasing the timeout to attempt to increase reliability, and separately, we also add a call to `Error()` to mark the test as failed should the connection fail. Closes #118741. Epic: None Release note: None 118914: roachtest: print issue number after test failure r=srosenberg a=renatolabs This commit updates the GitHub issue poster so that information about the issue is returned when an issue is created or a comment added. The roachtest test runner uses this information in the TeamCity output so that we can easily see the issue corresponding to a test failure directly in the TC overview page. Epic: none Release note: None 118928: backupccl: skip TestDataDriven_multiregion under duress r=rickystewart a=msbutler Fixes #118567 Release note: none 118960: cli: update docs url for sql shell r=lunevalex a=lunevalex The SQL shell help function redirects the user to use-the-built-in-sql-client.html this page no longer exists. Instead the SQL shell should point to cockroach-sql.html. Epic: None Release note (cli change): Change the SQL shell help URL to point to cockroach-sql.html. 118970: explain: fix overflow when printing estimated row count r=yuzefovich a=yuzefovich Previously, when the optimizer estimated a very large row count (which is float64), we would first cast it to uint64 (which would work ok in case the count is large - we'd get `max int64 + 1`), and then we would cast it to int64 which can result in a negative number. This is now fixed by adding a check before casting to int64 to cap the value at max int64. Epic: None Release note: None 118975: changefeedccl: skip flaky tests for pulsar r=wenyihu6 a=wenyihu6 This patch skips flaky tests for pulsar sinks. We will add them back when pulsar is fully supported. Release note: none Fixes: #118938, #118937, #118936, #118935 Co-authored-by: Renato Costa <renato@cockroachlabs.com> Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com> Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com> Co-authored-by: maryliag <marylia@cockroachlabs.com> Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com> Co-authored-by: Michael Butler <butler@cockroachlabs.com> Co-authored-by: Alex Lunev <alexl@cockroachlabs.com> Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com> Co-authored-by: Wenyi Hu <wenyi@cockroachlabs.com>
This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min. The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere. Informs cockroachdb#97699. Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.
@aadityasondhi can this issue be closed, since #118850 was merged? |
Thanks for the callout. Closing this. |
This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min. The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere. Informs cockroachdb#97699. Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.
In #95623, we're introducing automatic collection of CPU profiles based on CPU usage thresholds.
There are concerns with the overhead of CPU profiles, so this is disabled by default.
We should quantify the overhead, maybe bring it down, and enable by default.
Jira issue: CRDB-24831
gz#17311
The text was updated successfully, but these errors were encountered: