roachtest: add multi-region mixed-version test #118293

renatolabs · 2024-01-24T20:33:28Z

This commit adds a multi-region/mixed-version test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC workload (with tolerate-errors), along with a short TPCC workload that runs in mixed-version state (without tolerate-errors).

With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well.

Fixes: #114803

Release note: None

roachtest: change TPCC functions to take a logger instance
This makes it easier to organize logs for a complex test by allowing
the caller to inject the logger instance that should be used in a
particular call to runTPCC. More immediately, we make use of this
change in the recently introduced multi-region/mixed-version test,
and pass the step's logger to those functions.

renatolabs · 2024-01-26T21:38:43Z

I'm trying to convince myself that this test could have caught issues like #113908, but I'm not there yet.

General information about the current test:

- 4 regions, 10 nodes per region
- Very little work done in the cluster: CPU utilization is almost constantly less than 10% across all nodes.
- Basically no data: ~3GB.

While basic, this does seem to mirror the setup in https://github.com/cockroachlabs/support/issues/2685#issuecomment-1788125904 (except for node count -- see below).

The upgrade is given 30 minutes to complete. If I comment out AlwaysUseLatestPredecessors and run the test with a plan that involves upgrading from 22.2.6 to 23.1.11, the upgrade times out. But by the time it times out, the cluster version is at 22.2-upgrading-to-23.1-step-090. So it's not done yet, but it's relatively close to finishing.

If I run the test as-is, the upgrade being performed (at the time of writing) will always be 22.2.17 to 23.1.14 (which does include #113996). In this case, the upgrade takes 21 minutes to finish.

@nvanbenschoten @fqazi @kvoli: do these numbers look reasonable to you? The original issue seemed to indicate that the upgrade was "stuck", whereas I feel like the 23.1.11 run of this test could have finished if I gave it 10 more minutes. Could this be a matter of cluster size? I could/will try running this on an 80-node cluster but thought I'd get your perspective on this for now.

FWIW, I'm intentionally giving very little work to this cluster because we plan to introduce a sqlancer workload in this test very soon, and that will definitely increase utilization.

renatolabs · 2024-02-01T15:54:02Z

FWIW, yes, the distinction between 23.1.14 and older versions is much easier to see with 80 nodes.

renatolabs · 2024-02-01T19:30:31Z

@cockroachdb/test-eng ready for review!

srosenberg · 2024-02-05T21:03:07Z

pkg/cmd/roachtest/tests/tpcc.go

+			// test already failed, or that the test finished (and therefore
+			// this function should not return an error).
+			if ctx.Err() != nil && opts.Duration == 0 {
+				return nil


I couldn't immediately find examples of tests running the workload indefinitely (i.e., Duration == 0).

Yep, the background tpcc workload introduced in this PR should be the only one (hence this logic needed to be added).

srosenberg · 2024-02-05T22:17:29Z

pkg/cmd/roachtest/tests/mixed_version_multi_region.go

+	const (
+		nodesPerRegion                  = 20
+		backgroundWarehousesPerRegion   = 30
+		mixedVersionWarehousesPerRegion = 50


These seem fairly low for a region that has 20 nodes. I'm curious how these were chosen?

They were chosen "arbitrarily" 😄. I didn't want the cluster to be overloaded as we plan to add sqlancer (which will take most of the cluster's capacity, I'm assuming). That said, we can increase these numbers for now and lower them in the future as well. I'll play around with them until I reach a workload that will reach ~50% CPU utilization.

Makes sense, thanks. During the cloud report, we seemed to have found a way to linearly scale CPU utilization via tpcc-nowait by increasing/decreasing connections and warehouses per vCPU. Unfortunately, I wasn't able to reproduce it with 24.1.0. In any case, there are other ways to scale CPU utilization, albeit artificial, e.g., stress-ng.
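
For a sense of scale, the arithmetic behind "fairly low" can be worked through in a short sketch. The three per-region constants come from the diff above; `numRegions = 4` is an assumption taken from the PR description, and adding the two workloads' warehouses together is only a rough illustration of total load.

```go
package main

import "fmt"

const (
	numRegions                      = 4 // assumption: from the PR description
	nodesPerRegion                  = 20
	backgroundWarehousesPerRegion   = 30
	mixedVersionWarehousesPerRegion = 50
)

// totals returns cluster-wide node and warehouse counts derived from
// the per-region constants.
func totals() (nodes, backgroundWarehouses, mixedVersionWarehouses int) {
	return numRegions * nodesPerRegion,
		numRegions * backgroundWarehousesPerRegion,
		numRegions * mixedVersionWarehousesPerRegion
}

func main() {
	nodes, bg, mv := totals()
	fmt.Printf("nodes=%d background=%d mixed-version=%d\n", nodes, bg, mv)
	fmt.Printf("warehouses per node (both workloads): %.1f\n",
		float64(bg+mv)/float64(nodes))
}
```

Under these assumptions the cluster has 80 nodes serving 320 warehouses in total, roughly four per node, which is consistent with the low CPU utilization reported earlier in the thread.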

srosenberg · 2024-02-05T22:24:47Z

pkg/cmd/roachtest/tests/mixed_version_multi_region.go

+				mixedversion.AlwaysUseLatestPredecessors,
+			)
+
+			backgroundTPCCOpts := tpccOptions{


It might be worth mentioning explicitly that the background workload has no expiration (i.e., Duration is elided). It may not be immediately obvious given all the other config. options.

Good point, added a comment about this.

srosenberg

Nice work!

renatolabs · 2024-02-08T15:07:07Z

TFTR!

bors r=srosenberg

renatolabs · 2024-02-08T15:14:39Z

FWIW, I left this test running overnight and it failed with an ambiguous error while in mixed version 23.2 and master. While the workload should arguably handle that, the cluster had plenty of unused capacity at that point and I could see many suspicious error messages in the logs that would be worth looking into anyway.

Merging the test as-is so that any failures that do happen in the future get some eyes.

craig · 2024-02-08T16:24:40Z

Build failed:

Bazel Essential CI (Cockroach)

renatolabs · 2024-02-08T16:31:27Z

bors retry

craig · 2024-02-08T17:15:35Z

Build failed:

Bazel Essential CI (Cockroach)

This commit adds a `multi-region/mixed-version` test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC workload (with `tolerate-errors`), along with a short TPCC workload that runs in mixed-version state (without `tolerate-errors`).

With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well.

Fixes: cockroachdb#114803

Release note: None

Signed-off-by: Renato Costa <renato@cockroachlabs.com>

This makes it easier to organize logs for a complex test by allowing the caller to inject the logger instance that should be used in a particular call to `runTPCC`. More immediately, we make use of this change in the recently introduced `multi-region/mixed-version` test, and pass the step's logger to those functions.

Epic: none

Release note: None

renatolabs · 2024-02-08T18:51:18Z

bors retry

rickystewart · 2024-02-08T20:01:12Z

bors r-

~~Build failures seem related to this PR~~

~~Please check the build logs before asking bors to retry to at least determine that it seems to be a flake. Bors has not made any progress today due to this.~~

Comment was hasty. The most recent push was fine in fact, I just didn't notice that.

craig · 2024-02-08T20:01:16Z

Canceled.

renatolabs · 2024-02-08T20:07:13Z

I thought I had fixed it in my latest force-push. Am I missing something?

renatolabs · 2024-02-08T20:14:04Z

bors retry

renatolabs · 2024-02-08T20:23:50Z

That didn't seem to do it, bors didn't move to Pending.

bors r=srosenberg,herkolategan

craig · 2024-02-08T22:56:27Z

Build succeeded:

Bazel Essential CI (Cockroach)

renatolabs force-pushed the rc/mixed-version-mr-test branch 3 times, most recently from 0cd9429 to 3be31ab Compare January 25, 2024 18:08

renatolabs marked this pull request as ready for review January 25, 2024 18:12

renatolabs requested a review from a team as a code owner January 25, 2024 18:12

renatolabs requested review from herkolategan and DarrylWong and removed request for a team January 25, 2024 18:12

renatolabs force-pushed the rc/mixed-version-mr-test branch from 3be31ab to cc7a051 Compare January 25, 2024 21:19

srosenberg self-requested a review January 26, 2024 14:06

renatolabs force-pushed the rc/mixed-version-mr-test branch from cc7a051 to eedb59b Compare January 26, 2024 21:20

renatolabs force-pushed the rc/mixed-version-mr-test branch from eedb59b to 10e56ec Compare January 31, 2024 15:12

renatolabs force-pushed the rc/mixed-version-mr-test branch from 10e56ec to 9ad908f Compare February 1, 2024 19:30

herkolategan approved these changes Feb 2, 2024

renatolabs force-pushed the rc/mixed-version-mr-test branch from 9ad908f to 9eb0fe3 Compare February 2, 2024 19:58

srosenberg reviewed Feb 5, 2024

renatolabs force-pushed the rc/mixed-version-mr-test branch 3 times, most recently from 1fa1df0 to 11dba39 Compare February 7, 2024 14:55

srosenberg self-requested a review February 8, 2024 02:30

srosenberg approved these changes Feb 8, 2024

renatolabs force-pushed the rc/mixed-version-mr-test branch from 11dba39 to ebc95d9 Compare February 8, 2024 18:08

renatolabs added 2 commits February 8, 2024 13:13

renatolabs force-pushed the rc/mixed-version-mr-test branch from ebc95d9 to 480791e Compare February 8, 2024 18:13

craig bot merged commit 3671865 into cockroachdb:master Feb 8, 2024
8 of 9 checks passed

renatolabs deleted the rc/mixed-version-mr-test branch February 9, 2024 14:03
