Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: add multi-region mixed-version test #118293

Merged
merged 2 commits into from
Feb 8, 2024

Conversation

renatolabs
Copy link
Contributor

@renatolabs renatolabs commented Jan 24, 2024

This commit adds a multi-region/mixed-version test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC worload (with tolerate-errors), along with a short TPCC workload that runs in mixed-version state (without tolerate-errors).

With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well.

Fixes: #114803

Release note: None

roachtest: change TPCC functions to take a logger instance
This makes it easier to organize logs for a complex test by allowing
the caller to inject the logger instance that should be used in a
particular call to runTPCC. More immediately, we make use of this
change in the recently introduced multi-region/mixed-version test,
and pass the step's logger to those functions.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@renatolabs renatolabs force-pushed the rc/mixed-version-mr-test branch 3 times, most recently from 0cd9429 to 3be31ab Compare January 25, 2024 18:08
@renatolabs renatolabs marked this pull request as ready for review January 25, 2024 18:12
@renatolabs renatolabs requested a review from a team as a code owner January 25, 2024 18:12
@renatolabs renatolabs requested review from herkolategan and DarrylWong and removed request for a team January 25, 2024 18:12
@srosenberg srosenberg self-requested a review January 26, 2024 14:06
@renatolabs
Copy link
Contributor Author

I'm trying to convince myself that this test could have caught issues like #113908, but I'm not there yet.

General information about the current test:

  • 4 regions, 10 nodes per region
  • Very little work done in the cluster: CPU utilization is almost constantly less than 10% across all nodes.
  • Basically no data: ~3GB.

While basic, this does seem to mirror the setup in https://github.com/cockroachlabs/support/issues/2685#issuecomment-1788125904 (except for node count -- see below).


The upgrade is given 30 minutes to complete. If I comment out AlwaysUseLatestPredecessors and run the test with a plan that involves upgrading from 22.2.6 to 23.1.11, the upgrade times out. But by the time it times out, the cluster version is at 22.2-upgrading-to-23.1-step-090. So it's not done yet, but it's relatively close to finishing.

If I run the test as-is, the upgrade being performed (at the time of writing) will always be 22.2.17 to 23.1.14 (which does include #113996). In this case, the upgrade takes 21 minutes to finish.


@nvanbenschoten @fqazi @kvoli: do these numbers look reasonable to you? The original issue seemed to indicate that the upgrade was "stuck", whereas I feel like the 23.1.11 run of this test could have finished if I gave it 10 more minutes. Could this be a matter of cluster size? I could/will try running this on an 80-node cluster but thought I'd get your perspective on this for now.

FWIW, I'm intentionally giving very little work to this cluster because we plan to introduce a sqlancer workload in this test very soon, and that will definitely increase utilization.

@renatolabs
Copy link
Contributor Author

renatolabs commented Feb 1, 2024

FWIW, yes, the distinction between 23.1.14 and older versions is much easier to see with 80 nodes.

@renatolabs
Copy link
Contributor Author

@cockroachdb/test-eng ready for review!

// test already failed, or that the test finished (and therefore
// this function should not return an error).
if ctx.Err() != nil && opts.Duration == 0 {
return nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't immediately find examples of tests running the workload indefinitely (i.e., Duration == 0).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep, the background tpcc workload introduced in this PR should be the only one (hence this logic needed to be added).

const (
nodesPerRegion = 20
backgroundWarehousesPerRegion = 30
mixedVersionWarehousesPerRegion = 50
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem fairly low for a region that has 20 nodes. I'm curious how these were chosen?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They were chosen "arbitrarily" 😄. I didn't want the cluster to be overloaded as we plan to add sqlancer (which will take most of the cluster's capacity, I'm assuming). That said, we can increased these numbers for now and lower them in the future as well. I'll play around with them until I reach a workload that will reach ~50% CPU utilization.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks. During the cloud report, we seemed to have found a way to linearly scale cpu utilization via tpcc-nowait by increasing/decreasing connections and warehouses per vCPUs. Unfortunately, I wasn't able to reproduce it with 24.1.0. In any case, there are other ways to scale cpu utilization, albeit artificial, e.g., stress-ng.

mixedversion.AlwaysUseLatestPredecessors,
)

backgroundTPCCOpts := tpccOptions{
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth mentioning explicitly that the background workload has no expiration (i.e., Duration is elided). It may not be immediately obvious given all the other config. options.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, added a comment about this.

@renatolabs renatolabs force-pushed the rc/mixed-version-mr-test branch 3 times, most recently from 1fa1df0 to 11dba39 Compare February 7, 2024 14:55
@srosenberg srosenberg self-requested a review February 8, 2024 02:30
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work!

@renatolabs
Copy link
Contributor Author

TFTR!

bors r=srosenberg

@renatolabs
Copy link
Contributor Author

FWIW, I left this test running over night and it failed with an ambiguous error while in mixed version 23.2 and master. While workload should arguably handle that, the cluster had plenty of unused capacity at that point and I could see many suspicious error messages in the logs that would be worth looking into anyway.

Merging the test as-is so that any failures that do happen in the future get some eyes.

@craig
Copy link
Contributor

craig bot commented Feb 8, 2024

Build failed:

@renatolabs
Copy link
Contributor Author

bors retry

@craig
Copy link
Contributor

craig bot commented Feb 8, 2024

Build failed:

This commit adds a `multi-region/mixed-version` test that creates a
larger-sized cluster (4 regions, 10 nodes per region) and runs a
constant background TPCC worload (with `tolerate-errors`), along with
a short TPCC workload that runs in mixed-version state (without
`tolerate-errors`).

With this test, we exercise our ability to perform several cluster
upgrades in MR clusters. In the future, we plan to extend this test to
include other kinds of randomized testing as well.

Fixes: cockroachdb#114803

Release note: None

Signed-off-by: Renato Costa <renato@cockroachlabs.com>
This makes it easier to organize logs for a complex test by allowing
the caller to inject the logger instance that should be used in a
particular call to `runTPCC`. More immediately, we make use of this
change in the recently introduced `multi-region/mixed-version` test,
and pass the step's logger to those functions.

Epic: none

Release note: None
@renatolabs
Copy link
Contributor Author

bors retry

@rickystewart
Copy link
Collaborator

rickystewart commented Feb 8, 2024

bors r-

Build failures seem related to this PR

Please check the build logs before asking bors to retry to at least determine that it seems to be a flake. Bors has not made any progress today due to this.

Comment was hasty. The most recent push was fine in fact, I just didn't notice that.

@craig
Copy link
Contributor

craig bot commented Feb 8, 2024

Canceled.

@renatolabs
Copy link
Contributor Author

I thought I had fixed it in my latest force-push. Am I missing something?

@renatolabs
Copy link
Contributor Author

bors retry

@renatolabs
Copy link
Contributor Author

That didn't seem to do it, bors didn't move to Pending.

bors r=srosenberg,herkolategan

@craig craig bot merged commit 3671865 into cockroachdb:master Feb 8, 2024
8 of 9 checks passed
@craig
Copy link
Contributor

craig bot commented Feb 8, 2024

Build succeeded:

@renatolabs renatolabs deleted the rc/mixed-version-mr-test branch February 9, 2024 14:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: add large multi-region cluster upgrade testing
5 participants