
roachtest: add roachtest.Operation and run-operation command #118364

Closed
wants to merge 1 commit

Conversation

itsbilal
Member

@itsbilal itsbilal commented Jan 26, 2024

This change adds the ability to write roachtest Operations that, unlike a full-on roachtest.Test, do not require an ephemeral cluster to be spun up or down. These operations are expected to have as few logical side effects as possible and can be run on a cluster with running workloads. Running an Operation using roachtest run-operation also guarantees that the Cockroach/Workload binaries on that node will not be swapped with local ones, and that the cluster won't be wiped unintentionally at the end or in case of error.

This change also adds add-index and add-column as two example operations that operate in SQL land and demonstrate the purpose of an Operation.

Epic: none.

Release note: None.
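As a sketch of the concept for readers outside the thread (a self-contained model, not the actual roachtest API; all names here are illustrative), an operation pairs a run step with a cleanup step and never provisions or wipes a cluster:

```go
package main

import "fmt"

// operation is a simplified, hypothetical model of roachtest.Operation:
// a unit of work that runs against an already-provisioned cluster and
// undoes its own side effects, instead of wiping the cluster afterwards.
type operation struct {
	name    string
	run     func(log *[]string)
	cleanup func(log *[]string) // optional
}

// runOperation executes the operation followed by its cleanup, loosely
// mimicking what `roachtest run-operation` does for a single op.
func runOperation(op operation) []string {
	var log []string
	op.run(&log)
	if op.cleanup != nil {
		op.cleanup(&log)
	}
	return log
}

func main() {
	// Modeled after the add-index example operation from the PR.
	addIndex := operation{
		name:    "add-index",
		run:     func(l *[]string) { *l = append(*l, "CREATE INDEX idx ON t (col)") },
		cleanup: func(l *[]string) { *l = append(*l, "DROP INDEX idx") },
	}
	for _, step := range runOperation(addIndex) {
		fmt.Println(step)
	}
}
```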

@itsbilal itsbilal requested a review from a team as a code owner January 26, 2024 20:18
@itsbilal itsbilal requested review from DarrylWong and removed request for a team January 26, 2024 20:18

blathers-crl bot commented Jan 26, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Member

This change is Reviewable

@itsbilal
Member Author

@herkolategan / whoever from test-eng reviews this, happy to first take pointers on whether the general approach here is sound. Over the next bit I'll be cleaning up some of this code so there might be some rough edges if you do a full review right now, but I definitely want to first know if y'all are on board with this approach. Also, it works!

@renatolabs
Contributor

I'm missing a bit of context, I believe, being outside DRT: what is the envisioned way these operations would be called? Directly by name? A random sample? Right now they run concurrently -- what if they are incompatible?

If we are just not handling these situations for the time being, that's fine too, but it's not clear what's in and out of scope.

I haven't looked at the code in detail, but enough to get the general picture. A couple of initial/high level comments:

  • I like the notion of "operation", and it's something we have discussed in various forms and in different times/contexts too. I also like that this has a "cleanup" counterpart.

  • It would be a very nice feature if operations could be trivially called from within a roachtest. Added bonus if there's an easy way to pick a random operation as well.

  • Do we need operation to be a completely different concept in roachtest? From my understanding so far, an operation is a "test" that runs on a cluster that has been previously provisioned and where cockroach is running. Everything else seems mostly the same. Given the very large amount of code being duplicated in the current implementation, I'm very worried about maintenance cost.

Have you thought about or explored an implementation where an operation is a regular "test", appropriately marked maybe with something like TestSpec.Operation, and adapting the existing test runner to skip the cluster provisioning part when running operations?

@itsbilal
Member Author

Thanks for the high-level review Renato, that's exactly what I needed!

It would be a very nice feature if operations could be trivially called from within a roachtest. Added bonus if there's an easy way to pick a random operation as well.

Yes, those are all great points - I do want Operation to be abstract enough to be easily callable from a roachtest to reduce code duplication down the line. I foresee that as one of three ways to call an Operation; the second is roachtest run-operation (implemented here), and the third is the yet-to-be-built workload-manager or drt-run, which will randomly run multiple operations on an existing cluster that has long-running workloads on it. I expect run-operation to mostly be used just to test out an operation. The workload-manager pathway is the one that DRT will use most extensively.

Do we need operation to be a completely different concept in roachtest? From my understanding so far, an operation is a "test" that runs on a cluster that has been previously provisioned and where cockroach is running. Everything else seems mostly the same. Given the very large amount of code being duplicated in the current implementation, I'm very worried about maintenance cost.

This is a good point, and my thought process on it was that Operations will (over time) diverge greatly from roachtests, in their spec and in the ways they're called. The test-runner stuff is the bulk of duplicated code at the moment and I'm not even sure if it'll get much use outside of the aforementioned run-operation command. Some examples of ways in which Operation specs will diverge from Test specs:

  1. Operations could be passed things like the cockroach start commands to use, DBs already initialized (e.g. if they do something specific to a DB, like dropping and reimporting TPCC), etc.
  2. Operations could specify dependencies - from needing a node with a live cockroach process running on it, all the way to needing a DB populated with data.
  3. Operations can be interleaved or run concurrently on the same cluster - and by interleaved I mean that we can add a column, add an index, stall a node, and then run their cleanups in some order, resulting in a greater combinatorial explosion of state.

Does this make sense? Let me know if you want me to simplify the code and reduce duplication; y'all are code owners of this area so I'm happy to follow your lead on this. I can explore having Operation specs that remain separate but reusing more of the test runner code than we currently do, to try to reduce code duplication there.

@renatolabs
Contributor

the third being called through the yet-to-be-built workload-manager or drt-run

I expect run-operation to mostly be used just to test out an operation

This brings back the first question I asked (what is the envisioned way these operations would be called?); it's still not clear to me, and the sentences above made me more confused. I initially thought runOperations would have that logic, as it runs a set of operations currently. But it seems there would be another component also scheduling ops.

If there's notes about the intended design or vision for how every piece would work, I think that would really help understanding the bigger picture.

Some examples of ways in which Operation specs will diverge from Test specs:

It's fine for operations to be different from regular tests in those ways, but IMO that does not justify an implementation where we introduce so much duplication of non-trivial code (~1,000 lines of code that is mostly the same). As it is, it's hard to tell (unless you're very familiar with roachtest) what's needed for an operation to run, vs. what was just copied from the regular test runner. By skimming the code, I can see lots of logic that is unnecessary in the context of operations. I don't see a sustainable way to keep these implementations "synchronized".

If this implementation was merged mostly as-is, what are we supposed to do as we make improvements to the test runner or test interfaces? A: try to make the equivalent change in the operations counterpart? or B: leave the operations counterpart alone as "it's mostly used to test an operation"? I don't think either of these options would be acceptable: A involves a lot of manual work from the owners (I assume Test Eng would eventually inherit ownership), whereas B would lead to frustration as teams would notice that tests and operations look the same, but don't behave the same.

My point boils down to a wish that the "lines added" part of this diff were all dealing with what makes an operation different from a test (how it can take parameters, how we need to evaluate dependencies, etc), and not incidental complexity of duplicating large interfaces.

@itsbilal
Member Author

itsbilal commented Feb 7, 2024

This brings back the first question I asked (what is the envisioned way these operations would be called?); it's still not clear to me, and the sentences above made me more confused. I initially thought runOperations would have that logic, as it runs a set of operations currently. But it seems there would be another component also scheduling ops.

That's a good point and something that is not obvious given it was really only discussed within the DRT team; that's an oversight on my part and I should have clarified it here. The idea is that run-operation is just to test out an operation like I said, and most scheduling/running of operations will happen in a different binary (drt-run likely) that will also run long-running workloads in the same place. OperationSpec should already be designed with that latter use-case in mind, and you make a good point about the bulk of testRunner changes being orthogonal to it and are really only relevant for run-operation.

I've significantly slimmed this PR and we now convert (in a best-effort way) OperationSpec to a TestSpec for use with the test runner. I expect the drt-run command to not rely on this conversion and to directly schedule Operations, and any Operation-specific complexity that isn't related to Tests will end up staying in OperationSpec and will only be used by drt-run. My main worry continues to be that at some point this conversion will become too cumbersome and/or result in very different behaviour when an operation is run through the test runner vs. through drt-run, but that's a worry I can defer for now to get this going and to get more operations on the way. One good thing about this new approach is that running operations as sub-tests within a roachtest is a little easier.

This should be in line with your expectations and should be much easier for TestEng to own. My understanding is that, as far as ownership goes, the long-term DRT team will own drt-run, individual operations will be owned by the relevant team writing these operations, and Test Eng will likely own just the run-operation command.

@itsbilal
Member Author

@renatolabs / test-eng: friendly ping for a review, thanks! no rush but would be good to get other teams to start adding operations soon so the OperationSpec is probably the most important thing to look at.

Contributor

@renatolabs renatolabs left a comment


Thanks for iterating! Most comments are about duplicating logic and code that doesn't make sense in the context of operations.

I still have some general, high level questions (which I list below), but they are optional / for consideration. I don't have all the context.

  • We're not addressing, in this PR, the point I had mentioned about calling operations from roachtests. I don't understand the purpose of the Operation type. All it does at the moment is provide a map[string]string that is arguably not necessary, and needlessly gets in the way of reuse.

  • The semantics around concurrent operation execution are not clear to me. If an operation calls randomTable, is it guaranteed that I won't run into TOCTOU issues (e.g., what if I schedule a drop-table operation concurrently)? Or is that not something we will do?

  • I don't understand why the operation itself needs to indicate its cleanup wait time. I think this would be more fruitfully handled by a "scheduler" sort of component.

  • The semantics of "dependencies" are also not clear to me. There are hints of its purpose (RequiresDatabaseSchema), but that's very high level and both the meaning and implementation of such a dependency are not obvious to me.

    skipWipe bool
    skipStop bool
    skipWipe bool
    tolerateRegistrationErrors bool
Contributor


Should this be skipRegistration? We would never want to register a cluster when running an operation, right?

Member Author


register adds it to the cluster registry which we'd still want so our test runner code can keep going along no problem. We just don't want to stop/wipe the cluster in case of errors there.

pkg/cmd/roachtest/cluster.go (resolved)
pkg/cmd/roachtest/main.go (resolved)
    runSkipped bool,
    selectProbability float64,
    print bool,
) ([]registry.OperationSpec, error) {
Contributor


I was reading about how these operations are going to be called (from the DRT workload tool), and I see that the YAML file references them by name. In that case, do we need all of these features (regex matching, skipping logic, selection probability, etc.)? If anything, I'd imagine it's more confusing for the YAML file to say `- name: network/iptables`, and then suddenly that same entry would mean multiple operations because someone added a network/iptables_fast operation.

Can we simplify all of this by requiring exact matching on operation name? It would also get rid of a bunch of duplicated code.

Member Author


I think it'd be useful to be able to run all operations with a specific prefix - eg. the different kinds of disk-stalleds. Maintaining the regex ability is useful. Plus the script on the drt cluster is already using a wildcard. And drt-run will also likely involve the use of regex, seeing as we probably don't want to update the yaml file every time we add a new operation (but we still want the ability to exclude certain operations, which a regex allows us to do).
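For readers, the prefix/regex matching being defended here boils down to something like this self-contained sketch (the real roachtest filter code is more involved):

```go
package main

import (
	"fmt"
	"regexp"
)

// filterOps returns the operation names matching the given pattern,
// allowing prefix-style selection such as "all disk-stall operations".
func filterOps(names []string, pattern string) ([]string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return nil, err
	}
	var out []string
	for _, n := range names {
		if re.MatchString(n) {
			out = append(out, n)
		}
	}
	return out, nil
}

func main() {
	ops := []string{"add-index", "add-column", "disk-stall/read", "disk-stall/write"}
	matched, err := filterOps(ops, "^disk-stall/")
	if err != nil {
		panic(err)
	}
	fmt.Println(matched) // [disk-stall/read disk-stall/write]
}
```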

Contributor


My understanding is that drt-run would be the main (ideally only?) component to schedule operations. Will it shell out to roachtest run-operations or do you plan to link it in some other way?

In any case, if you pass a wildcard, or regex, or a sample, then drt-run itself has very little control on the ordering / scheduling of the operations.

seeing as we probably don't want to update the yaml file every time we add a new operation

This is solvable by having drt-run query the list of operations dynamically.

I won't die on this hill, but I'm struggling to find a situation where we'll want to run operations in this way, especially considering that we are planning to develop a tool to be responsible for scheduling operations. This is especially true for things like the sample feature (i.e., it's fine for this feature to exist, but in my view it's a responsibility of drt-run). It's a lot easier to have something reproducible if we have just one scheduler component to reason about.

pkg/cmd/roachtest/registry/operation_spec.go (resolved)
pkg/cmd/roachtest/test_registry.go (resolved)
pkg/cmd/roachtest/roachtestflags/flags.go (resolved)
pkg/cmd/roachtest/run.go (resolved)
pkg/cmd/roachtest/run.go (resolved)
    github := newGithubIssues(r.config.disableIssue, c, vmCreateOpts)

    // handleClusterCreationFailure can be called when the `err` given
    // occurred for reasons related to creating or setting up a
    // cluster for a test.
    handleClusterCreationFailure := func(err error) {
        t.Error(errClusterProvisioningFailed(err))
        // Don't post issues for operations.
Contributor


Do we need this? Don't we skip cluster creation calls for operations?

Member Author


We could still get an error from attachClusters that'll be set to clusterCreateErr.

Collaborator

@herkolategan herkolategan left a comment


TIL the TOCTOU acronym.

Thanks for the continued effort to make this a clean addition to roachtest! I left a few comments in addition to Renato's review. Still a few areas of duplication that may benefit from some refactoring, such as runOperations vs. runTests. It has logic which if ever updated in one would need to be transferred to the other.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @itsbilal, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/cluster.go line 982 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Should this be skipRegistration? We would never want to register a cluster when running an operation, right?

Also curious about this, since destroying clusters (which uses the cluster registry) is removed from runOperations. But there is also `clusterRegistry.markClusterAsSaved`, which prevents the cluster from being destroyed. Seems like the registry is only meant for keeping track of what to delete.


pkg/cmd/roachtest/cluster.go line 1019 at r6 (raw file):

	}

	if err := r.registerCluster(c); err != nil && !opt.tolerateRegistrationErrors {

Curious what the thought process behind tolerateRegistrationErrors is? It seems like registerCluster only throws an error on a programming mistake.


pkg/cmd/roachtest/test_impl.go line 162 at r6 (raw file):

	}
	if t.spec.Operation {
		t.l.Printf("Benchmark test, running with passed-in Cockroach")

Is this print correct? It seems like a copy of the benchmark output.


pkg/cmd/roachtest/test_registry.go line 70 at r6 (raw file):

}

// Add adds a test to the registry.

Nit: s/Add/AddOperation


pkg/cmd/roachtest/test_registry.go line 91 at r6 (raw file):

const testNameRE = "^[a-zA-Z0-9-_=/,]+$"

// prepareSpec validates a spec and does minor massaging of its fields.

Nit: s/prepareSpec/prepareOpSpec


pkg/cmd/roachtest/test_registry.go line 189 at r6 (raw file):

// AllOperations returns all the operation specs, sorted by name.
func (r testRegistryImpl) AllOperations() []registry.OperationSpec {

Nit: Getting a warning around mixed value / pointer receivers. Not introduced by this PR though. I know it's mostly for consistency, clarity, easier use and has no functional impact.


pkg/cmd/roachtest/registry/operation_spec.go line 27 at r6 (raw file):

const (
	OperationRequiresNodes OperationDependency = iota

Noted it's not used currently, but won't we possibly require multiple dependencies; would 1 << iota make more sense?
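The 1 << iota suggestion would make OperationDependency a bitmask, so a single operation can declare several dependencies at once. A standalone sketch (constant names beyond those mentioned in the thread are hypothetical):

```go
package main

import "fmt"

type OperationDependency uint32

// Bit flags rather than plain iota values, so dependencies can be combined.
const (
	OperationRequiresNodes OperationDependency = 1 << iota
	OperationRequiresSqlConnection
	OperationRequiresDatabaseSchema
	OperationRequiresPopulatedDatabase // hypothetical further flag
)

// Has reports whether all bits in dep are set on d.
func (d OperationDependency) Has(dep OperationDependency) bool {
	return d&dep == dep
}

func main() {
	deps := OperationRequiresSqlConnection | OperationRequiresDatabaseSchema
	fmt.Println(deps.Has(OperationRequiresDatabaseSchema)) // true
	fmt.Println(deps.Has(OperationRequiresNodes))          // false
}
```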

Member

@srosenberg srosenberg left a comment


There is still quite a lot of duplication in an attempt to make an operation behave like a test. But, is it really a test? Does an operation require anything more than Run? (Things like owner and registry may be common to both; however, that's a very small intersection.) I wonder if we could try a different approach. Instead of making an operation behave like a test, just provide an interface to run arbitrary code in an existing cluster. Then, create a (dummy) roachtest that's going to choose "operations" and run them sequentially (or concurrently).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @itsbilal, and @renatolabs)

Member Author

@itsbilal itsbilal left a comment


Thanks for the very detailed reviews! The key goal of this PR is to get a close-to-finished OperationSpec out, even if other pieces are iterated on over the next few weeks.

We're not addressing, in this PR, the point I had mentioned about calling operations from roachtests. I don't understand the purpose of the Operation type. All it does at the moment is provide a map[string]string that is arguably not necessary, and needlessly gets in the way of reuse.

The Operation interface can absolutely be simpler - and my WIP drt-run PR makes that change, as the implementation of that interface there is way, way lighter. I've added a TODO here to shrink that interface instead of just encapsulating test.Test. The latter works for now because we're converting operations into tests for this PR anyway (but drt-run isn't doing that).

The semantics around concurrent operation execution are not clear to me. If an operation calls randomTable, is it guaranteed that I won't run into TOCTOU issues (e.g., what if I schedule a drop-table operation concurrently)? Or is that not something we will do?

I've improved the comment at CanRunConcurrently. It's easier to solve later in the context of drt-run, which will be doing the scheduling, while this command just runs a linear sequence of operations. But the idea is that a drop-table would lock out all other operations for its duration/cleanup, as it isn't safe to run concurrently. We can do something smarter by siloing operations in DBs/tables, but that adds complexity to the interface for likely little additional win, and we can defer that improvement for later.
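The "lock out all other operations" idea maps naturally onto a reader/writer lock. This is a standalone illustration of the scheduling behavior under that assumption, not drt-run's actual implementation:

```go
package main

import (
	"fmt"
	"sync"
)

// opScheduler gates operations: concurrency-safe ops share a read lock,
// while an op like drop-table (not safe to run concurrently) takes the
// write lock and excludes everything else for its duration.
type opScheduler struct {
	mu sync.RWMutex
}

func (s *opScheduler) run(concurrentSafe bool, fn func()) {
	if concurrentSafe {
		s.mu.RLock()
		defer s.mu.RUnlock()
	} else {
		s.mu.Lock()
		defer s.mu.Unlock()
	}
	fn()
}

func main() {
	var (
		s    opScheduler
		wg   sync.WaitGroup
		mu   sync.Mutex
		done []string
	)
	ops := []struct {
		name string
		safe bool
	}{{"add-index", true}, {"add-column", true}, {"drop-table", false}}
	for _, op := range ops {
		op := op
		wg.Add(1)
		go func() {
			defer wg.Done()
			s.run(op.safe, func() {
				mu.Lock()
				done = append(done, op.name)
				mu.Unlock()
			})
		}()
	}
	wg.Wait()
	fmt.Println(len(done), "operations completed") // 3 operations completed
}
```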

I don't understand why the operation itself needs to indicate its cleanup wait time. I think this would be more fruitfully handled by a "scheduler" sort of component.

It's a bit of a minimum, and the scheduler can still exceed it. I can remove it if it's excessive in this context.

The semantics of "dependencies" are also not clear to me. There are hints of its purpose (RequiresDatabaseSchema), but that's very high level and both the meaning and implementation of such a dependency are not obvious to me.

The idea is to just be able to specify when an operation is eligible to be run, and hook it up into filter.Filter. Replicating something like the schemachange workload in operations will only require a schema, and not a populated db. Replicating disk-stall only needs raw nodes in theory; not even a running crdb process. I can remove it as it's unused but that goes against the goal of giving operation authors the ability to semantically specify what the operation needs, and I think we'll regret not having it later.

There is still quite a lot of duplication in an attempt to make an operation behave like a test. But, is it really a test? Does an operation require anything more than Run?

From a spec-perspective, operations will, if anything, need more parameters than tests as they're not self-contained and have dependencies and possibly side-effects. The duplication looks bad at first but outside of runOperations which is pretty minimal, everything else is standalone and is only going to deviate more beyond this point.

Then, create a (dummy) roachtest that's going to choose "operations" and run them sequentially (or concurrently).

I think that's challenging as it opens us up to issues of unintentionally triggering roachprod cluster creation/destruction/github issue-posting/etc, as we learned the hard way today 😂

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @renatolabs)


pkg/cmd/roachtest/cluster.go line 1019 at r6 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

curious what the thought process behind tolerateRegistrationErrors is? Seems like registerCluster will throw an error on programmer logic.

registerCluster complains if we register the same cluster twice (e.g. if we're running more than one operation), so we want to be able to specify that registration errors are fine as long as attachment succeeded and our cluster isn't nil.


pkg/cmd/roachtest/main.go line 188 at r5 (raw file):

Previously, renatolabs (Renato Costa) wrote…

"ephemeral" here is confusing, as it could imply "created just for the operation" (that's the interpretation for roachtest). IMO, it would be clearer to just say "on existing clusters".

Done.


pkg/cmd/roachtest/test_impl.go line 162 at r6 (raw file):

Previously, herkolategan (Herko Lategan) wrote…

Is this print correct, seems like a copy of the benchmark output?

Fixed.

@herkolategan herkolategan self-requested a review February 27, 2024 16:58

    if s.Skip == "" || runSkipped {
        notSkipped = append(notSkipped, s)
    } else {
        if print && roachtestflags.TeamCity {
Contributor


The --- SKIP and ##teamcity logic here and a few times below doesn't apply to operations, right?

pkg/cmd/roachtest/registry/operation_spec.go (resolved)
// eligible clusters to run.
//
// TODO(bilal): Unused.
Dependency OperationDependency
Contributor


But is the current API consumable by teams at the moment? When would I choose OperationRequiresSqlConnection over OperationRequiresDatabaseSchema? What if I need both? Does one imply the other? What does it mean to "require a database schema"?

My point is that the dependencies are not defined precisely enough so I wonder what the value is of including them as-is right now. Even if teams do set it, we'd need to go back and update the specs anyway when we actually implement any kind of dependency enforcement.

If we do want to have an API, then I'd suggest having at least well defined semantics for what the dependencies are, so that we can come back and implement them later.

pkg/cmd/roachtest/registry/operation_spec.go (resolved)
pkg/cmd/roachtest/registry/operation_spec.go (resolved)
pkg/cmd/roachtest/registry/operation_spec.go (resolved)
@@ -157,6 +158,13 @@ func (t *testImpl) Cockroach() string {
        t.l.Printf("Benchmark test, running with standard cockroach")
        return t.StandardCockroach()
    }
    if t.spec.Operation {
Contributor


Hrmm, there is a bit of a mismatch here. For context, t.Cockroach() in roachtest runs translates to "path to cockroach binary in the test runner" (i.e., not in the cluster). This used to be needed because tests used to have to upload cockroach before starting the cluster, a repetitive step that has since been removed. We should have removed the Cockroach() function from the test interface.

From what I understand, your purpose is for operations to do something like c.Run("%s workload run", t.Cockroach()). If we were to do that, then it would 1) be confusing if you knew the previous meaning; and 2) also make it harder to call an operation from roachtest.

Any chance operations can assume a ./cockroach path for now? My understanding is that we will want to be able to pass parameters to operations at some point, and the path to the binary on the cluster could be a parameter that we pass.

Member Author

@itsbilal itsbilal left a comment


TFTR! I've split the non-running parts of this into #119796 to ease review, and also addressed some of the comments below (the ones that apply to those parts) there. I'll rebase this on that change soon.

@srosenberg
Member

The changes to test_runner (and other roachtest infrastructure) introduced by this PR are a bit too sprawling for our comfort :) We'd prefer a cleaner approach, one that ideally doesn't require (non-trivial) changes to the test_runner. I suggest we backtrack a little and try to nail down the simplest API for implementing and executing operations.

itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 6, 2024
This change introduces a new type in the roachtest registry for
Operations; smaller subunits of tests that work strictly on existing
clusters and are meant to have zero logical side-effects to a running
cluster (eg. should not delete data in an irrecoverable way), but can
induce chaos events like crashing nodes. This new spec type,
OperationSpec, specifies one operation and is registered with
a registry.Registry. It can be filtered with a registry.TestFilter
similar to TestSpecs.

Split from cockroachdb#118364.

Epic: none

Release note: None
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 6, 2024
@itsbilal
Member Author

itsbilal commented Mar 6, 2024

Thanks for the reviews everyone! I had a meeting with Renato today and based on our conversation + comments in this PR, I've greatly simplified this and split it into 3 PRs, #119796, #120023 and #120024, each of which builds on top of the one before. I'll close this PR for now as it's superseded by those in combination. Thanks a ton again; I know this has been a lot of effort for everyone involved and I really appreciate it all.

@itsbilal itsbilal closed this Mar 6, 2024
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 11, 2024
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 11, 2024
itsbilal added a commit to itsbilal/cockroach that referenced this pull request Mar 11, 2024
craig bot pushed a commit that referenced this pull request Mar 11, 2024
119796: cmd/roachtest: add registry.OperationSpec r=herkolategan a=itsbilal

This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. It can be filtered with a registry.TestFilter similar to TestSpecs.

Split from #118364.

Epic: none

Release note: None

Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
kvoli pushed a commit to kvoli/cockroach that referenced this pull request Mar 11, 2024