-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: add roachtest.Operation and run-operation command #118364
Conversation
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
@herkolategan / whoever from test-eng reviews this, happy to first take pointers on whether the general approach here is sound. Over the next bit I'll be cleaning up some of this code so there might be some rough edges if you do a full review right now, but I definitely want to first know if y'all are on board with this approach. Also, it works! |
I'm missing a bit of context, I believe, being outside DRT: what is the envisioned way these operations would be called? Directly by name? A random sample? Right now they run concurrently -- what if they are incompatible? If we are just not handling these situations for the time being, that's fine too, but it's not clear what's in and out of scope. I haven't looked at the code in detail, but enough to get the general picture. A couple of initial/high level comments:
Have you thought about or explored an implementation where an operation is a regular "test", appropriately marked maybe with something like |
Thanks for the high-level review Renato, that's exactly what I needed!
Yes, those are all great points - I do want
This is a good point, and my thought process on it was that Operations will (over time) diverge greatly from roachtests, in their spec and in the ways they're called. The test-runner stuff is the bulk of duplicated code at the moment and I'm not even sure if it'll get much use outside of the aforementioned
Does this make sense? Let me know if you want me to simplify the code and reduce duplication; y'all are code owners of this area so I'm happy to follow your lead on this. I can explore having Operation specs that remain separate but reusing more of the test runner code than we currently do, to try to reduce code duplication there. |
This brings back the first question I asked (what is the envisioned way these operations would be called?); it's still not clear to me, and the sentences above made me more confused. I initially thought If there's notes about the intended design or vision for how every piece would work, I think that would really help understanding the bigger picture.
It's fine for operations to be different from regular tests in those ways, but IMO that does not justify an implementation where we introduce so much duplication of non-trivial code (~1,000 lines of code that is mostly the same). As it is, it's hard to tell (unless you're very familiar with roachtest) what's needed for an operation to run, vs. what was just copied from the regular test runner. By skimming the code, I can see lots of logic that is unnecessary in the context of operations. I don't see a sustainable way to keep these implementations "synchronized". If this implementation was merged mostly as-is, what are we supposed to do as we make improvements to the test runner or test interfaces? A: try to make the equivalent change in the operations counterpart? or B: leave the operations counterpart alone as "it's mostly used to test an operation"? I don't think either of these options would be acceptable: My point boils down to a wish that the "lines added" part of this diff were all dealing with what makes an operation different from a test (how it can take parameters, how we need to evaluate dependencies, etc), and not incidental complexity of duplicating large interfaces. |
That's a good point and something that is not obvious given it was really only discussed within the DRT team; that's an oversight on my part and I should have clarified it here. The idea is that I've significantly slimmed this PR and we now convert (in a best-effort way) This should be in line with your expectations and should be much easier for TestEng to own. My understanding is that, as far as ownership goes, the long-term DRT team will own |
c643909
to
3d24e84
Compare
@renatolabs / test-eng: friendly ping for a review, thanks! no rush but would be good to get other teams to start adding operations soon so the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for iterating! Most comments are about duplicating logic and code that doesn't make sense in the context of operations.
I still have some general, high level questions (which I list below), but they are optional / for consideration. I don't have all the context.
-
We're not addressing, in this PR, the point I had mentioned about calling operations from roachtests. I don't understand the purpose of the
Operation
type. All it does at the moment is provide a map[string]string that is arguably not necessary, and needlessly gets in the way of reuse. -
The semantics around concurrent operation execution are not clear to me. If an operation calls
randomTable
, is it guaranteed that I won't run into TOCTOU issues (e.g., what if I schedule adrop-table
operation concurrently)? Or is that not something we will do? -
I don't understand why the operation itself needs to indicate its cleanup wait time. I think this would be more fruitfully handled by a "scheduler" sort of component.
-
The semantics of "dependencies" are also not clear to me. There are hints of its purpose (
RequiresDatabaseSchema
), but that's very high level and both the meaning and implementation of such a dependency are not obvious to me.
skipWipe bool | ||
skipStop bool | ||
skipWipe bool | ||
tolerateRegistrationErrors bool |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be skipRegistration
? We would never want to register a cluster when running an operation, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
register
adds it to the cluster registry which we'd still want so our test runner code can keep going along no problem. We just don't want to stop/wipe the cluster in case of errors there.
runSkipped bool, | ||
selectProbability float64, | ||
print bool, | ||
) ([]registry.OperationSpec, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was reading about how these operations are going to be called (from the DRT workload tool), and I see that the YAML file references them by name. In that case, do we need all of these features (regex matching, skipping logic, selection probability, etc). If anything, I'd imagine it's more confusing for the YAML file to say - name: network/iptables
, and then suddenly that same file would mean multiple operations because someone added a network/iptables_fast
operation.
Can we simplify all of this by requiring exact matching on operation name? It would also get rid of a bunch of duplicated code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it'd be useful to be able to run all operations with a specific prefix - eg. the different kinds of disk-stalled
s. Maintaining the regex ability is useful. Plus the script on the drt cluster is already using a wildcard. And drt-run will also likely involve the use of regex, seeing as we probably don't want to update the yaml file every time we add a new operation (but we still want the ability to exclude certain operations, which a regex allows us to do).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that drt-run
would be the main (ideally only?) component to schedule operations. Will it shell out to roachtest run-operations
or do you plan to link it in some other way?
In any case, if you pass a wildcard, or regex, or a sample, then drt-run
itself has very little control on the ordering / scheduling of the operations.
seeing as we probably don't want to update the yaml file every time we add a new operation
This is solvable by having drt-run
query the list of operations dynamically.
I won't die on this hill, but I'm struggling to find a situation where we'll want to run operations in this way, especially considering that we are planning to develop a tool to be responsible for scheduling operations. This is especially true for things like the sample
feature (i.e., it's fine for this feature to exist, but in my view it's a responsibility of drt-run). It's a lot easier to have something reproducible if we have just one scheduler component to reason about.
github := newGithubIssues(r.config.disableIssue, c, vmCreateOpts) | ||
|
||
// handleClusterCreationFailure can be called when the `err` given | ||
// occurred for reasons related to creating or setting up a | ||
// cluster for a test. | ||
handleClusterCreationFailure := func(err error) { | ||
t.Error(errClusterProvisioningFailed(err)) | ||
// Don't post issues for operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this? Don't we skip cluster creation calls for operations?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could still get an error from attachClusters
that'll be set to clusterCreateErr
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TIL the TOCTOU acronym.
Thanks for the continued effort to make this a clean addition to roachtest
! I left a few comments in addition to Renato's review. Still a few areas of duplication that may benefit from some refactoring, such as runOperations
vs. runTests
. It has logic which if ever updated in one would need to be transferred to the other.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @itsbilal, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/cluster.go
line 982 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Should this be
skipRegistration
? We would never want to register a cluster when running an operation, right?
Also curious about this, since destroy clusters (which uses the cluster registry) is removed from runOperations. But there is also
clusterRegistry.markClusterAsSaved` - which prevents the cluster from being destroyed. Seems like the registry is only meant for keeping track of what to delete.
pkg/cmd/roachtest/cluster.go
line 1019 at r6 (raw file):
} if err := r.registerCluster(c); err != nil && !opt.tolerateRegistrationErrors {
curious what the thought process behind tolerateRegistrationErrors
is? Seems like registerCluster
will throw an error on programmer logic.
pkg/cmd/roachtest/test_impl.go
line 162 at r6 (raw file):
} if t.spec.Operation { t.l.Printf("Benchmark test, running with passed-in Cockroach")
Is this print correct, seems like a copy of the benchmark output?
pkg/cmd/roachtest/test_registry.go
line 70 at r6 (raw file):
} // Add adds a test to the registry.
Nit: s/Add/AddOperation
pkg/cmd/roachtest/test_registry.go
line 91 at r6 (raw file):
const testNameRE = "^[a-zA-Z0-9-_=/,]+$" // prepareSpec validates a spec and does minor massaging of its fields.
Nit: s/prepareSpec/prepareOpSpec
pkg/cmd/roachtest/test_registry.go
line 189 at r6 (raw file):
// AllOperations returns all the operation specs, sorted by name. func (r testRegistryImpl) AllOperations() []registry.OperationSpec {
Nit: Getting a warning around mixed value / pointer receivers. Not introduced by this PR though. I know it's mostly for consistency, clarity, easier use and has no functional impact.
pkg/cmd/roachtest/registry/operation_spec.go
line 27 at r6 (raw file):
const ( OperationRequiresNodes OperationDependency = iota
Noted it's not used currently, but won't we possibly require multiple dependencies; would 1 << iota
make more sense?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is still quite a lot of duplication in an attempt to make an operation behave like a test. But, is it really a test? Does an operation require anything more than Run
? (Things like owner and registry may be common to both; however, that's a very small intersection.) I wonder if we could try a different approach. Instead of making an operation behave like a test, just provide an interface to run arbitrary code in an existing cluster. Then, create a (dummy) roachtest that's going to choose "operations" and run them sequentially (or concurrently).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @itsbilal, and @renatolabs)
This change adds the ability to write roachtest Operations that, unlike a full-on roachtest.Test, do not require an ephemeral cluster to be spun up or down. These operations are expected to have as few logical side effects as possible and can be run on a cluster with running workloads. Running an Operation using `roachtest run-operation` also guarantees that the Cockroach/Workload binaries on that node will not be swapped with local ones, and that the cluster won't be wiped unintentionally at the end or in case of error. This change also adds add-index and add-column as two example operations that operate in SQL land and demonstrate the purpose of an Operation. Release note: None. Epic: none
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the very detailed reviews! The key goal of this PR is to get a close-to-finished OperationSpec
out, even if other pieces are iterated on over the next few weeks.
We're not addressing, in this PR, the point I had mentioned about calling operations from roachtests. I don't understand the purpose of the Operation type. All it does at the moment is provide a map[string]string that is arguably not necessary, and needlessly gets in the way of reuse.
The Operation
interface can absolutely be simpler- and my WIP drt-run PR makes that change as the implementation of that interface there is way, way lighter. I've added a TODO here to shrink that interface instead of just encapsulating test.Test. The latter works for now because we're converting operations into tests for this PR anyway (but drt-run
isn't doing that).
The semantics around concurrent operation execution are not clear to me. If an operation calls randomTable, is it guaranteed that I won't run into TOCTOU issues (e.g., what if I schedule a drop-table operation concurrently)? Or is that not something we will do?
I've improved the comment at CanRunConcurrently. It's easier to solve later in the context of drt-run which will be doing the scheduling, while this command just runs a linear sequence of operations. But the idea is that a drop-table would lock out all other operations for its duration/cleanup as it isn't safe to run concurrently. We can do something smarter by silo-ing operations in dbs/tables, but that adds complexity to the interface for likely not much additional wins and we can defer that improvement. for later.
I don't understand why the operation itself needs to indicate its cleanup wait time. I think this would be more fruitfully handled by a "scheduler" sort of component.
It's a bit of a minimum, and the scheduler can still exceed it. I can remove it if it's excessive in this context.
The semantics of "dependencies" are also not clear to me. There are hints of its purpose (RequiresDatabaseSchema), but that's very high level and both the meaning and implementation of such a dependency are not obvious to me.
The idea is to just be able to specify when an operation is eligible to be run, and hook it up into filter.Filter
. Replicating something like the schemachange workload in operations will only require a schema, and not a populated db. Replicating disk-stall
only needs raw nodes in theory; not even a running crdb process. I can remove it as it's unused but that goes against the goal of giving operation authors the ability to semantically specify what the operation needs, and I think we'll regret not having it later.
There is still quite a lot of duplication in an attempt to make an operation behave like a test. But, is it really a test? Does an operation require anything more than
Run
?
From a spec-perspective, operations will, if anything, need more parameters than tests as they're not self-contained and have dependencies and possibly side-effects. The duplication looks bad at first but outside of runOperations
which is pretty minimal, everything else is standalone and is only going to deviate more beyond this point.
Then, create a (dummy) roachtest that's going to choose "operations" and run them sequentially (or concurrently).
I think that's challenging as it opens us up to issues of unintentionally triggering roachprod cluster creation/destruction/github issue-posting/etc, as we learned the hard way today 😂
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @DarrylWong, @herkolategan, and @renatolabs)
pkg/cmd/roachtest/cluster.go
line 1019 at r6 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
curious what the thought process behind
tolerateRegistrationErrors
is? Seems likeregisterCluster
will throw an error on programmer logic.
registerCluster
complains if we register the same cluster twice (eg. if we're running more than one operation), so we wanna be able to specify that registration errors are fine as long as attachment succeeded and our cluster isn't nil.
pkg/cmd/roachtest/main.go
line 188 at r5 (raw file):
Previously, renatolabs (Renato Costa) wrote…
"ephemeral" here is confusing, as it could imply "created just for the operation" (that's the interpretation for roachtest). IMO, it would be clearer to just say "on existing clusters".
Done.
pkg/cmd/roachtest/test_impl.go
line 162 at r6 (raw file):
Previously, herkolategan (Herko Lategan) wrote…
Is this print correct, seems like a copy of the benchmark output?
Fixed.
runSkipped bool, | ||
selectProbability float64, | ||
print bool, | ||
) ([]registry.OperationSpec, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My understanding is that drt-run
would be the main (ideally only?) component to schedule operations. Will it shell out to roachtest run-operations
or do you plan to link it in some other way?
In any case, if you pass a wildcard, or regex, or a sample, then drt-run
itself has very little control on the ordering / scheduling of the operations.
seeing as we probably don't want to update the yaml file every time we add a new operation
This is solvable by having drt-run
query the list of operations dynamically.
I won't die on this hill, but I'm struggling to find a situation where we'll want to run operations in this way, especially considering that we are planning to develop a tool to be responsible for scheduling operations. This is especially true for things like the sample
feature (i.e., it's fine for this feature to exist, but in my view it's a responsibility of drt-run). It's a lot easier to have something reproducible if we have just one scheduler component to reason about.
if s.Skip == "" || runSkipped { | ||
notSkipped = append(notSkipped, s) | ||
} else { | ||
if print && roachtestflags.TeamCity { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The --- SKIP
and ##teamcity
logic here and a few times below doesn't apply to operations, right?
// eligible clusters to run. | ||
// | ||
// TODO(bilal): Unused. | ||
Dependency OperationDependency |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But is the current API consumable by teams at the moment? When would I choose OperationRequiresSqlConnection
over OperationRequiresDatabaseSchema
? What if I need both? Does one imply the other? What does it mean to "require a database schema"?
My point is that the dependencies are not defined precisely enough so I wonder what the value is of including them as-is right now. Even if teams do set it, we'd need to go back and update the specs anyway when we actually implement any kind of dependency enforcement.
If we do want to have an API, then I'd suggest having at least well defined semantics for what the dependencies are, so that we can come back and implement them later.
@@ -157,6 +158,13 @@ func (t *testImpl) Cockroach() string { | |||
t.l.Printf("Benchmark test, running with standard cockroach") | |||
return t.StandardCockroach() | |||
} | |||
if t.spec.Operation { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hrmm, there is a bit of a mismatch here. For context, t.Cockroach()
in roachtest runs translates to "path to cockroach binary in the test runner" (i.e., not in the cluster). This used to be needed because tests used to have to upload cockroach before starting the cluster, a repetitive step that has since been removed. We should have removed the Cockroach()
function from the test interface.
From what I understand, your purpose is for operations to do something like c.Run("%s workload run", t.Cockroach())
. If we were to do that, then it would 1) be confusing if you knew the previous meaning; and 2) also make it harder to call an operation from roachtest.
Any chance operations can assume a ./cockroach
path for now? My understanding is that we will want to be able to pass parameters to operations at some point, and the path to the binary on the cluster could be a parameter that we pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TFTR! I've split the non-running parts of this into #119796 to ease review, and also addressed some of the comments below (the ones that apply to those parts) there. I'll rebase this on that change soon.
The sprawling changes to |
This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. It can be filtered with a registry.TestFilter similar to TestSpecs. Split from cockroachdb#118364. Epic: none Release note: None
This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. Split from cockroachdb#118364. Epic: none Release note: None
Thanks for the reviews everyone! I had a meeting with Renato today and based on our conversation + comments in this PR, I've greatly simplified this and split it into 3 PRs, #119796, #120023 and #120024, each of which builds on top of the one before. I'll close this PR for now as it's superseded by those in combination. Thanks a ton again; I know this has been a lot of effort for everyone involved and I really appreciate it all. |
This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. Split from cockroachdb#118364. Epic: none Release note: None
This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. Split from cockroachdb#118364. Epic: none Release note: None
This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. Split from cockroachdb#118364. Epic: none Release note: None
119796: cmd/roachtest: add registry.OperationSpec r=herkolategan a=itsbilal This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. It can be filtered with a registry.TestFilter similar to TestSpecs. Split from #118364. Epic: none Release note: None Co-authored-by: Bilal Akhtar <bilal@cockroachlabs.com>
This change introduces a new type in the roachtest registry for Operations; smaller subunits of tests that work strictly on existing clusters and are meant to have zero logical side-effects to a running cluster (eg. should not delete data in an irrecoverable way), but can induce chaos events like crashing nodes. This new spec type, OperationSpec, specifies one operation and is registered with a registry.Registry. Split from cockroachdb#118364. Epic: none Release note: None
This change adds the ability to write roachtest Operations that, unlike a full-on roachtest.Test, do not require an ephemeral cluster to be spun up or down. These operations are expected to have as few logical side effects as possible and can be run on a cluster with running workloads. Running an Operation using
roachtest run-operation
also guarantees that the Cockroach/Workload binaries on that node will not be swapped with local ones, and that the cluster won't be wiped unintentionally at the end or in case of error.This change also adds add-index and add-column as two example operations that operate in SQL land and demonstrate the purpose of an Operation.
Epic: none.
Release note: None.