Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: enable CPU profiler by default - benchmarking tests #97699

Closed
tbg opened this issue Feb 27, 2023 · 6 comments
Closed

server: enable CPU profiler by default - benchmarking tests #97699

tbg opened this issue Feb 27, 2023 · 6 comments
Labels
A-kv-observability A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@tbg
Copy link
Member

tbg commented Feb 27, 2023

In #95623, we're introducing automatic collection of CPU profiles based on CPU usage thresholds.

There are concerns with the overhead of CPU profiles, so this is disabled by default.

We should quantify the overhead, maybe bring it down, and enable by default.

Jira issue: CRDB-24831

gz#17311

@tbg tbg added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-observability labels Feb 27, 2023
@tbg
Copy link
Member Author

tbg commented Jun 16, 2023

Related issue to roll out on cloud: https://cockroachlabs.atlassian.net/browse/CC-24928
It's likely wise to roll out on cloud first (if we can do that quickly enough) and then flip the bool for 23.2.

For the record, it's been enabled for some time on telemetry but this cluster isn't latency sensitive.

@exalate-issue-sync exalate-issue-sync bot changed the title server: enable CPU profiler by default server: enable CPU profiler by default - benchmarking tests Jul 6, 2023
@tbg
Copy link
Member Author

tbg commented Jul 19, 2023

FYI it looks like CC rollout is in progress, see ticket above.

@kvoli
Copy link
Collaborator

kvoli commented Jul 27, 2023

@Santamaura do you know when this will get merged on master roachtests? A failure came up in #106140 which would have benefited from auto CPU profiles.

@Santamaura
Copy link
Contributor

I don't. It's not in the obs inf current milestone though I can try to find out if we are staging it for the next milestone.

kvoli added a commit to kvoli/cockroach that referenced this issue Aug 3, 2023
`restore/tpce/400GB/aws/nodes=8/cpus=8` recently failed with high CPU
usage on one node cockroachdb#106140. The high CPU usage was attributed within
replication, however we were unable to determine where exactly.

Enable the automatic CPU profiler on `restore/*` roachtests, which will
collect profiles once normalized CPU utilization exceeds 70%.

Note this is supposed to be a temporary addition, which will be subsumed
by cockroachdb#97699.

Informs: cockroachdb#106140

Release note: None
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Feb 6, 2024
This patch enables the CPU profile with a threshold of 75% and a max
frequency of 20min.

The motivation for this change is that recently, I have noticed a few
cases during investigations where CPU utilization spikes but we lack
profiles after the fact. This hinders our ability to dig deeper into the
source of high CPU usage. Having profiles can help inform us of future
AC integrations that we may need, or other performance improvements we
can do elsewhere.

Informs cockroachdb#97699.

Release note (ops change): CRDB will now automatically generate CPU
profiles if there is an increase in CPU utilization. This can help
investigate possible issues after the fact.
aadityasondhi added a commit to aadityasondhi/cockroach that referenced this issue Feb 8, 2024
This patch enables the CPU profile with a threshold of 75% and a max
frequency of 20min.

The motivation for this change is that recently, I have noticed a few
cases during investigations where CPU utilization spikes but we lack
profiles after the fact. This hinders our ability to dig deeper into the
source of high CPU usage. Having profiles can help inform us of future
AC integrations that we may need, or other performance improvements we
can do elsewhere.

Informs cockroachdb#97699.

Release note (ops change): CRDB will now automatically generate CPU
profiles if there is an increase in CPU utilization. This can help
investigate possible issues after the fact.
craig bot pushed a commit that referenced this issue Feb 8, 2024
…118970 #118975

118293: roachtest: add multi-region mixed-version test r=srosenberg,herkolategan a=renatolabs

This commit adds a `multi-region/mixed-version` test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC worload (with `tolerate-errors`), along with a short TPCC workload that runs in mixed-version state (without `tolerate-errors`).

With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well.

Fixes: #114803

Release note: None

**roachtest: change TPCC functions to take a logger instance**
This makes it easier to organize logs for a complex test by allowing
the caller to inject the logger instance that should be used in a
particular call to `runTPCC`. More immediately, we make use of this
change in the recently introduced `multi-region/mixed-version` test,
and pass the step's logger to those functions.

118827: roachpb: stop reading `Lease.DeprecatedStartStasis` r=erikgrinaker a=erikgrinaker

This will allow removing the field in a later version. We still have to populate the field for backwards compatibility with 23.2.

Epic: none
Release note: None

118850: server: enable continuous CPU profiler r=aadityasondhi a=aadityasondhi

This patch enables the CPU profile with a threshold of 75% and a max frequency of 20min.

The motivation for this change is that recently, I have noticed a few cases during investigations where CPU utilization spikes but we lack profiles after the fact. This hinders our ability to dig deeper into the source of high CPU usage. Having profiles can help inform us of future AC integrations that we may need, or other performance improvements we can do elsewhere.

Informs #97699.

Release note (ops change): CRDB will now automatically generate CPU profiles if there is an increase in CPU utilization. This can help investigate possible issues after the fact.

118865: technotes: SQL statistics r=maryliag a=maryliag

Tech notes on SQL Statistics.

Part Of [CRDB-35839](https://cockroachlabs.atlassian.net/browse/CRDB-35839)

Images of how the diagrams generated using mermaid will be rendered:

<img width="814" alt="Screenshot 2024-02-06 at 4 10 05 PM" src="https://github.com/cockroachdb/cockroach/assets/1017486/9dfbd2b1-4454-41f1-8840-dc39d41059e5">

<img width="285" alt="Screenshot 2024-02-06 at 4 42 39 PM" src="https://github.com/cockroachdb/cockroach/assets/1017486/bf570d55-1f5d-454b-b693-fbcf5e082a39">



Release note: None

118905: pgwire: increase timeout and add Error() call r=rafiss a=rickystewart

This test seems to deadlock sporadically under remote execution. I believe the connection is timing out, which due to the way this test was written causes it to deadlock. I'm increasing the timeout to attempt to increase reliability, and separately, we also add a call to `Error()` to mark the test as failed should the connection fail.

Closes #118741.

Epic: None
Release note: None

118914: roachtest: print issue number after test failure r=srosenberg a=renatolabs

This commit updates the GitHub issue poster so that information about
the issue is returned when an issue is created or a comment
added. The roachtest test runner uses this information in the TeamCity
output so that we can easily see the issue corresponding to a test
failure directly in the TC overview page.

Epic: none

Release note: None

118928: backupccl: skip TestDataDriven_multiregion under duress r=rickystewart a=msbutler

Fixes #118567

Release note: none

118960: cli: update docs url for sql shell r=lunevalex a=lunevalex

The SQL shell help function redirects the user to
use-the-built-in-sql-client.html this page no longer exists. Instead the SQL shell should point to cockroach-sql.html.

Epic: None

Release note (cli change): Change the SQL shell help URL to point to cockroach-sql.html.

118970: explain: fix overflow when printing estimated row count r=yuzefovich a=yuzefovich

Previously, when the optimizer estimated a very large row count (which is float64), we would first cast it to uint64 (which would work ok in case the count is large - we'd get `max int64 + 1`), and then we would cast it to int64 which can result in a negative number. This is now fixed by adding a check before casting to int64 to cap the value at max int64.

Epic: None

Release note: None

118975: changefeedccl: skip flaky tests for pulsar r=wenyihu6 a=wenyihu6

This patch skips flaky tests for pulsar sinks. We will add them back when pulsar
is fully supported.

Release note: none
Fixes: #118938, #118937, #118936, #118935

Co-authored-by: Renato Costa <renato@cockroachlabs.com>
Co-authored-by: Erik Grinaker <grinaker@cockroachlabs.com>
Co-authored-by: Aaditya Sondhi <20070511+aadityasondhi@users.noreply.github.com>
Co-authored-by: maryliag <marylia@cockroachlabs.com>
Co-authored-by: Ricky Stewart <ricky@cockroachlabs.com>
Co-authored-by: Michael Butler <butler@cockroachlabs.com>
Co-authored-by: Alex Lunev <alexl@cockroachlabs.com>
Co-authored-by: Yahor Yuzefovich <yahor@cockroachlabs.com>
Co-authored-by: Wenyi Hu <wenyi@cockroachlabs.com>
wenyihu6 pushed a commit to wenyihu6/cockroach that referenced this issue Feb 21, 2024
This patch enables the CPU profile with a threshold of 75% and a max
frequency of 20min.

The motivation for this change is that recently, I have noticed a few
cases during investigations where CPU utilization spikes but we lack
profiles after the fact. This hinders our ability to dig deeper into the
source of high CPU usage. Having profiles can help inform us of future
AC integrations that we may need, or other performance improvements we
can do elsewhere.

Informs cockroachdb#97699.

Release note (ops change): CRDB will now automatically generate CPU
profiles if there is an increase in CPU utilization. This can help
investigate possible issues after the fact.
@rafiss
Copy link
Collaborator

rafiss commented Apr 29, 2024

@aadityasondhi can this issue be closed, since #118850 was merged?

@aadityasondhi
Copy link
Collaborator

Thanks for the callout. Closing this.

dhartunian pushed a commit to dhartunian/cockroach that referenced this issue Apr 29, 2024
This patch enables the CPU profile with a threshold of 75% and a max
frequency of 20min.

The motivation for this change is that recently, I have noticed a few
cases during investigations where CPU utilization spikes but we lack
profiles after the fact. This hinders our ability to dig deeper into the
source of high CPU usage. Having profiles can help inform us of future
AC integrations that we may need, or other performance improvements we
can do elsewhere.

Informs cockroachdb#97699.

Release note (ops change): CRDB will now automatically generate CPU
profiles if there is an increase in CPU utilization. This can help
investigate possible issues after the fact.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-observability A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability
Projects
None yet
Development

No branches or pull requests

5 participants