Roachtest redirect SSH flakes to test-eng #88492
We can also build on top of #88556 for this, I have a DM going with Miral.
@tbg Yep. Marking as draft for now.
This is rebased on top of #88556 which introduces structured errors. Detecting the SSH flake via substring still works and keeps this PR simple. This might be a good place to introduce some retry logic for running the SSH commands.
A chance of a false positive (via substring) is probably very low, but structured errors should make it cleaner and pave the way for more cases. I am thinking of a switch statement on something like
There is a separate issue to deal with SSH retries: #73542
I recently spotted another issue... seeing this error message quite often during cluster teardown,
The above error message originates in [1]. The problem is that the actual error message is swallowed [2] because [1] https://github.com/cockroachdb/cockroach/blob/master/pkg/roachprod/install/cluster_synced.go#L2034 |
structured errors should make it cleaner and pave the way for more cases
I agree with this, I think structured errors open the door for us to do things based on the error types. I think that's cleaner and also wouldn't break if someone changed how errors.SSH is formatted.
As general food for thought for everyone: IMO, errors like cluster creation and SSH problems are better reported in a more concise way other than a GitHub issue. We just want to be notified about them, but there's little (nothing?) we can do about them other than acknowledge and close the issue when it's opened.
One idea is to post something on Slack (in the existing test-eng-ops) channel with a summary of a roachtest nightly run, including how many tests were skipped due to cluster creation errors, SSH flakes, etc.
That's definitely the goal, and my thoughts were to introduce that when we switch to So we can keep this PR tight in scope and address just the SSH flakes, or expand it to more generically handle different error types. The PR does at least currently constrain the logic to determine the owning team in
👍 This was getting a bit gnarly to implement in my test branch
I agree that some errors are better reported via other mediums such as Slack, but having the same error reported in a shared GitHub issue (or similar) provides a better system of record. Ultimately, where an issue is posted should be agnostic to the caller, and the posting code could drive which channels are notified via some configuration.
The second commit switches to using the errors marker API to mark an SSH flake where it is detected in roachprod. This allows us to look for a marked error and act on it accordingly. Summary of changes:
Reviewed 4 of 4 files at r2, 7 of 7 files at r3, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/github.go
line 137 at r1 (raw file):
Previously, smg260 (Miral Gadani) wrote…
It probably makes sense to keep SSH flakes together, instead of having a new one created for each incidence.
We can prepend the test name to the message body like cluster creation does?
I agree that this isn't excessively pretty, but will it maybe get the job done just fine? For example, the cluster_creation messages mention the test name sufficiently prominently:
Not sure if this is the case here too (since there won't be a .Skip()) but the message also contains the test name right at the top: #90045 (comment)
We can also go through the refactor and add a title override that we can use instead of the test name. I'm just not sure it's worth the time.
pkg/cmd/roachtest/github.go
line 40 at r2 (raw file):
clusterCreationErr issueCategory = iota
sshErr
otherErr
tiny nit: since otherErr is the catch-all, it should be represented by the zero value.
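A small sketch of what the nit asks for, assuming the constants are simply reordered: with the catch-all declared first, an `issueCategory` that is never explicitly set falls into `otherErr`.

```go
package main

import "fmt"

type issueCategory int

const (
	otherErr issueCategory = iota // zero value: the catch-all
	clusterCreationErr
	sshErr
)

func main() {
	var c issueCategory // never assigned, so it holds the zero value
	fmt.Println(c == otherErr)
	fmt.Println(c == clusterCreationErr)
}
```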
pkg/cmd/roachtest/github.go
line 162 at r3 (raw file):
// TODO: perhaps remove category completely and move this to
// issuePoster?
The poster has multiple callers and so I think this should continue to live in roachtest.
pkg/cmd/roachtest/test_impl.go
line 45 at r3 (raw file):
// errors are all the errors passed to a single invokation of
// `addFailure`
errors []error
Add some color here, why are we stashing these errors? It's because they might be structured and we want to preserve that (which fmt.Sprintf(...) wouldn't do). Maybe given an example that t.Fatalf("foo %s %s %s", "hello", err1, err2) would lead to []error{err1,err2}.
pkg/cmd/roachtest/test_impl.go
line 280 at r3 (raw file):
// from a test's closure. The test runner itself should never call this.
func (t *testImpl) Fatal(args ...interface{}) {
	t.Fatalf("%v", args...)
This will print t.Fatal("hi") as ["hi"] which isn't what we want. I think you want something like
t.Fatalf(strings.Join(strings.Repeat(len(args), " %v"))[1:], args...)
pkg/cmd/roachtest/test_impl.go
line 287 at r3 (raw file):
errs := []error{errors.NewWithDepthf(1, format, args...)}
errs = append(errs, collectErrors(args...)...)
t.addFailure(newFailure(errs...))
The symmetry in newFailure is a little misleading then since the first error is special, how about
type failure struct {
	squashedErr error
	wrapped []error
}
or something like that?
pkg/cmd/roachtest/test_impl.go
line 298 at r3 (raw file):
func (t *testImpl) Error(args ...interface{}) {
	t.Errorf("%v", args...)
ditto
pkg/roachprod/errors/errors.go
line 123 at r3 (raw file):
if exitErr, ok := asExitError(err); ok {
	if exitErr.ExitCode() == 255 {
		return SSH{errors.Mark(err, ErrSSH255)}
Does this work? Does SSH{} implement the necessary Unwrap() method?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @renatolabs, @srosenberg, and @tbg)
pkg/cmd/roachtest/github.go
line 105 at r1 (raw file):
Previously, smg260 (Miral Gadani) wrote…
that's a good catch - it would have been breaking on master too but we only override for cluster creation there
Done
pkg/cmd/roachtest/github.go
line 162 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The poster has multiple callers and so I think this should continue to live in roachtest.
Yep sorry - unclear TODO. Was a thought to myself to remove the idea of a category, and instead depend completely on reference errors. issuePoster here refers to the createPostRequest function in roachtest/github.go
pkg/cmd/roachtest/test_impl.go
line 280 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This will print t.Fatal("hi") as ["hi"] which isn't what we want. I think you want something like
t.Fatalf(strings.Join(strings.Repeat(len(args), " %v"))[1:], args...)
Done. Thank you. Got lost in the variadic args being passed from func to func.
pkg/cmd/roachtest/test_impl.go
line 287 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The symmetry in newFailure is a little misleading then since the first error is special, how about
type failure struct {
	squashedErr error
	wrapped []error
}
or something like that?
Agreed. That is much better.
pkg/roachprod/errors/errors.go
line 123 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Does this work? Does SSH{} implement the necessary Unwrap() method?
Yep. In roachprod/errors.go
Code snippet:
func (e SSH) Unwrap() error {
	return e.Err
}
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs, @smg260, and @srosenberg)
pkg/cmd/roachtest/test_impl.go
line 48 at r6 (raw file):
// being captured in the squashedErr
type failure struct {
	// this is the single error created from variadic args passed to t.Error(f)/t.Fatal(f)
general comment, our style guide asks for complete sentences incl punctuation and capitalization (except for inline comments where it's the opposite), i.e.
// This is the single error created from variadic args passed to t.{Fatal,Error}{,f}.
I usually don't point out isolated incidents of this but it looks like it's happening throughout here, it will save you lots of nits down the road to get the muscle memory right. :-)
pkg/cmd/roachtest/test_impl.go
line 286 at r6 (raw file):
// from a test's closure. The test runner itself should never call this.
func (t *testImpl) Fatal(args ...interface{}) {
	t.Error(args...)
ditto
pkg/cmd/roachtest/test_impl.go
line 292 at r6 (raw file):
// Fatalf is like Fatal, but takes a format string.
func (t *testImpl) Fatalf(format string, args ...interface{}) {
	t.Errorf(format, args...)
You need to call addFailure because you'l
pkg/cmd/roachtest/test_impl.go
line 298 at r6 (raw file):
// FailNow implements the TestingT interface.
func (t *testImpl) FailNow() {
	t.addFailure("FailNow called", nil)
passing nil here doesn't seem right, FailNow called doesn't have a verb in it and besides a nil doesn't add any info.
pkg/cmd/roachtest/test_impl.go
line 400 at r6 (raw file):
// failureContainsError returns true if any of the errors in a given failure
// matches the reference error
func failureContainsError(refError error, f failure) bool {
nit: switch the order of the arguments, to reflect "failure contains error", not "error contains failure".
pkg/testutils/lint/passes/fmtsafe/functions.go
line 96 at r6 (raw file):
// Both of these signatures need to be included for the linter to not flag
// roachtest testImpl.addFailure since it is in the main package
// This could be a bug in nogo
Did you mean to leave this? If so, mind linking an issue?
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @renatolabs, @srosenberg, and @tbg)
pkg/cmd/roachtest/test_impl.go
line 48 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
general comment, our style guide asks for complete sentences incl punctuation and capitalization (except for inline comments where it's the opposite), i.e.
// This is the single error created from variadic args passed to t.{Fatal,Error}{,f}.
I usually don't point out isolated incidents of this but it looks like it's happening throughout here, it will save you lots of nits down the road to get the muscle memory right. :-)
👍 Thanks, missed the capitalisation. I don't think it's happened anywhere else though
pkg/cmd/roachtest/test_impl.go
line 286 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
ditto
What is missing here?
pkg/cmd/roachtest/test_impl.go
line 292 at r6 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You need to call addFailure because you'l
Cut off comment, but addFailure is being called via t.Errorf
Test name will be prepended to the issue body in the case of cluster creation AND SSH flakes. Other conversations resolved. Will merge after TC builds are green.
SSH flakes to the test-eng team. Also abstracts away how issue name and team are overridden based on an issue category, which can be ClusterCreation, SSH, or Other.
Resolves issue cockroachdb#82398
Release justification: test-only change
Release note: None

This commit utilises the cockroachdb/errors markers API to mark an SSH error with an exit code of 255. The test runner, when it is posting an issue to GitHub, can act according to which marker may be present in t.mu.failures[0]. In this case, we override the owning team to protect them from having to investigate what is likely a transient SSH issue.
A test now has the concept of a failure[] instead of error[]. Each failure contains the original `squashedErr` and an errors[], which are all the errors passed via t.Fatalf/Errorf. We can then preserve the relationship of multiple errors to any particular failure within a test, and match on any given reference error (like SSH flakes from above). This moves us away from substring matching on error messages.
The test name is prepended to both cluster creation and SSH errors to ease troubleshooting, as all those issues will be bucketed under either category respectively.
Release justification: test-only change
Release note: None

is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an IDE or shell.
Release justification: test-only change
Release note: None
bors r=tbg
Build failed (retrying...):
This PR was included in a batch that was canceled, it will be automatically retried
bors retry
Already running a review
Build succeeded:
See second commit note at the bottom
This PR inspects the failure output of a roachtest, and if it sees an SSH_PROBLEM, overrides the owning team to test-eng when reporting the github issue.
Currently errors are classified as an SSH error by roachprod if the exit code is 255 with an accompanying message prefixed with SSH_PROBLEM [1]. The errors are stringified and saved into t.mu.output|failureMsg. Thus in the test_runner at the call site of issue posting, we can check t.mu.output for SSH_PROBLEM and override the team and issue name accordingly.
Resolves: #82398
Release justification: test-only change
Release note: none
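The substring approach described above can be sketched roughly as follows. The function name `overrideForOutput`, the issue name "ssh_problem", and the sample team and test names are hypothetical. Only the `SSH_PROBLEM` prefix and the test-eng team come from the description.

```go
package main

import (
	"fmt"
	"strings"
)

// sshProblemMarker is the message prefix mentioned in the description.
const sshProblemMarker = "SSH_PROBLEM"

// overrideForOutput returns the team and issue name to use when posting.
// If the stringified test output mentions an SSH problem, both are overridden.
func overrideForOutput(output, team, issueName string) (string, string) {
	if strings.Contains(output, sshProblemMarker) {
		return "test-eng", "ssh_problem" // issue name is an assumed placeholder
	}
	return team, issueName
}

func main() {
	team, name := overrideForOutput("SSH_PROBLEM: exit status 255", "kv", "some/test")
	fmt.Println(team, name)

	team, name = overrideForOutput("assertion failed", "kv", "some/test")
	fmt.Println(team, name)
}
```

This keeps the change small, at the cost of depending on the exact wording of the error message, which is the weakness the later marker-based commit addresses.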