-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: report timeout failures accordingly #96743
Conversation
d4e9d18
to
4bcf740
Compare
Timeout failures are recorded at actual timeout, with subsequent failures secondary. `addFailure` accepts a depth parameter and no longer includes context cancellation, which is done separately. Epic: none Fixes: cockroachdb#91237 Release note: None
4bcf740
to
25045db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! A couple of comments/question:
-
Commit messages typically follow the format
packageName: description
. I think this also applies to merge commits, so I'd suggest updating the PR title here. -
Could you explain with a few more words in the commit message how this change fixes the linked issue? How is artifact collection related to the timeouts being reported as SSH flakes?
I'm also curious to learn how to reproduce these SSH flake errors on timeout: were you able to find a way to reliably produce this behavior?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)
pkg/cmd/roachtest/test_runner.go
line 1110 at r1 (raw file):
t.mu.cancel() } t.L().Printf("test timed out; check __stacks.log and CRDB logs for goroutine dumps")
Could we move this message back to the block where we set timedOut = true
? With this change, the message here will now be logged on teardown.log
, which I assume is much less frequently inspected than test.log
. Also, IMO this is crucial information on the reason for the failure that should be on test.log
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-
The commit message is currently
roachtest: report timeout failures accordingly
which I believe is OK. I can update the title accordingly though - not sure why its different -
I will add some more detail. There is an existing comment at the artifact collection code site which explains.
Though I could not repro the exact issue initially seen, I was able time out a test, in which the underlying ssh command exited with a 255
soon after. This resulted in an ssh flake being reported, even though the timeout occurred first.
e.g. A roachtest with a 20s timeout, simply running sleep 25; exit 255
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)
pkg/cmd/roachtest/test_runner.go
line 1110 at r1 (raw file):
Previously, renatolabs (Renato Costa) wrote…
Could we move this message back to the block where we set
timedOut = true
? With this change, the message here will now be logged onteardown.log
, which I assume is much less frequently inspected thantest.log
. Also, IMO this is crucial information on the reason for the failure that should be ontest.log
.
t.addFailure(0, "test timed out (%s)", timeout)
already logs that information to the test.log
, and is located where you mentioned.
e.g. test.log
21:16:28 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (test_runner.go:989).runTest: test timed out (20s)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for explaining!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)
tftr bors r=renatolabs,herkolategan |
Build succeeded: |
Previously, a timeout failure would be deferred until after artifacts were collected, which sometimes resulted in subsequent failures being attributed as the primary cause.
addFailure
accepts a depth parameter and no longer includes context cancellation, which is done separately.Epic: none
Fixes: #91237
Release note: None