-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: add multi-region mixed-version test #118293
Conversation
0cd9429
to
3be31ab
Compare
3be31ab
to
cc7a051
Compare
cc7a051
to
eedb59b
Compare
I'm trying to convince myself that this test could have caught issues like #113908, but I'm not there yet. General information about the current test:
While basic, this does seem to mirror the setup in https://github.com/cockroachlabs/support/issues/2685#issuecomment-1788125904 (except for node count -- see below). The upgrade is given 30 minutes to complete. If I comment out If I run the test as-is, the upgrade being performed (at the time of writing) will always be 22.2.17 to 23.1.14 (which does include #113996). In this case, the upgrade takes 21 minutes to finish. @nvanbenschoten @fqazi @kvoli: do these numbers look reasonable to you? The original issue seemed to indicate that the upgrade was "stuck", whereas I feel like the 23.1.11 run of this test could have finished if I gave it 10 more minutes. Could this be a matter of cluster size? I could/will try running this on an 80-node cluster but thought I'd get your perspective on this for now. FWIW, I'm intentionally giving very little work to this cluster because we plan to introduce a |
eedb59b
to
10e56ec
Compare
FWIW, yes, the distinction between 23.1.14 and older versions is much easier to see with 80 nodes. |
10e56ec
to
9ad908f
Compare
@cockroachdb/test-eng ready for review! |
9ad908f
to
9eb0fe3
Compare
// test already failed, or that the test finished (and therefore | ||
// this function should not return an error). | ||
if ctx.Err() != nil && opts.Duration == 0 { | ||
return nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't immediately find examples of tests running the workload indefinitely (i.e., Duration == 0
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, the background tpcc workload introduced in this PR should be the only one (hence this logic needed to be added).
const ( | ||
nodesPerRegion = 20 | ||
backgroundWarehousesPerRegion = 30 | ||
mixedVersionWarehousesPerRegion = 50 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem fairly low for a region that has 20 nodes. I'm curious how these were chosen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They were chosen "arbitrarily" 😄. I didn't want the cluster to be overloaded as we plan to add sqlancer (which will take most of the cluster's capacity, I'm assuming). That said, we can increased these numbers for now and lower them in the future as well. I'll play around with them until I reach a workload that will reach ~50% CPU utilization.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, thanks. During the cloud report, we seemed to have found a way to linearly scale cpu utilization via tpcc-nowait
by increasing/decreasing connections and warehouses per vCPUs. Unfortunately, I wasn't able to reproduce it with 24.1.0. In any case, there are other ways to scale cpu utilization, albeit artificial, e.g., stress-ng
.
mixedversion.AlwaysUseLatestPredecessors, | ||
) | ||
|
||
backgroundTPCCOpts := tpccOptions{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth mentioning explicitly that the background workload has no expiration (i.e., Duration
is elided). It may not be immediately obvious given all the other config. options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, added a comment about this.
1fa1df0
to
11dba39
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work!
TFTR! bors r=srosenberg |
FWIW, I left this test running over night and it failed with an ambiguous error while in mixed version 23.2 and master. While workload should arguably handle that, the cluster had plenty of unused capacity at that point and I could see many suspicious error messages in the logs that would be worth looking into anyway. Merging the test as-is so that any failures that do happen in the future get some eyes. |
Build failed: |
bors retry |
Build failed: |
11dba39
to
ebc95d9
Compare
This commit adds a `multi-region/mixed-version` test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC worload (with `tolerate-errors`), along with a short TPCC workload that runs in mixed-version state (without `tolerate-errors`). With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well. Fixes: cockroachdb#114803 Release note: None Signed-off-by: Renato Costa <renato@cockroachlabs.com>
This makes it easier to organize logs for a complex test by allowing the caller to inject the logger instance that should be used in a particular call to `runTPCC`. More immediately, we make use of this change in the recently introduced `multi-region/mixed-version` test, and pass the step's logger to those functions. Epic: none Release note: None
ebc95d9
to
480791e
Compare
bors retry |
bors r-
Comment was hasty. The most recent push was fine in fact, I just didn't notice that. |
Canceled. |
I thought I had fixed it in my latest force-push. Am I missing something? |
bors retry |
That didn't seem to do it, bors didn't move to Pending. bors r=srosenberg,herkolategan |
Build succeeded: |
This commit adds a
multi-region/mixed-version
test that creates a larger-sized cluster (4 regions, 10 nodes per region) and runs a constant background TPCC worload (withtolerate-errors
), along with a short TPCC workload that runs in mixed-version state (withouttolerate-errors
).With this test, we exercise our ability to perform several cluster upgrades in MR clusters. In the future, we plan to extend this test to include other kinds of randomized testing as well.
Fixes: #114803
Release note: None
roachtest: change TPCC functions to take a logger instance
This makes it easier to organize logs for a complex test by allowing
the caller to inject the logger instance that should be used in a
particular call to
runTPCC
. More immediately, we make use of thischange in the recently introduced
multi-region/mixed-version
test,and pass the step's logger to those functions.