Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: report timeout failures accordingly #96743

Merged
merged 1 commit into from
Feb 9, 2023

Conversation

smg260
Copy link
Contributor

@smg260 smg260 commented Feb 7, 2023

Previously, a timeout failure would be deferred until after artifacts were collected, which sometimes resulted in subsequent failures being attributed as the primary cause.

  • Timeout failures are now recorded at actual timeout, with subsequent failures secondary. Context cancellation occurs at the end of test teardown
  • addFailure accepts a depth parameter and no longer includes context cancellation, which is done separately.

Epic: none
Fixes: #91237
Release note: None

@cockroach-teamcity
Copy link
Member

This change is Reviewable

Timeout failures are recorded at actual timeout, with
subsequent failures secondary.

`addFailure` accepts a depth parameter and no longer
includes context cancellation, which is done separately.

Epic: none
Fixes: cockroachdb#91237
Release note: None
@smg260 smg260 marked this pull request as ready for review February 8, 2023 14:17
@smg260 smg260 requested a review from a team as a code owner February 8, 2023 14:17
@smg260 smg260 requested review from srosenberg and removed request for a team February 8, 2023 14:17
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! A couple of comments/question:

  • Commit messages typically follow the format packageName: description. I think this also applies to merge commits, so I'd suggest updating the PR title here.

  • Could you explain with a few more words in the commit message how this change fixes the linked issue? How is artifact collection related to the timeouts being reported as SSH flakes?

I'm also curious to learn how to reproduce these SSH flake errors on timeout: were you able to find a way to reliably produce this behavior?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @smg260, and @srosenberg)


pkg/cmd/roachtest/test_runner.go line 1110 at r1 (raw file):

			t.mu.cancel()
		}
		t.L().Printf("test timed out; check __stacks.log and CRDB logs for goroutine dumps")

Could we move this message back to the block where we set timedOut = true? With this change, the message here will now be logged on teardown.log, which I assume is much less frequently inspected than test.log. Also, IMO this is crucial information on the reason for the failure that should be on test.log.

Copy link
Contributor Author

@smg260 smg260 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The commit message is currently roachtest: report timeout failures accordingly which I believe is OK. I can update the title accordingly though - not sure why its different

  • I will add some more detail. There is an existing comment at the artifact collection code site which explains.

Though I could not repro the exact issue initially seen, I was able time out a test, in which the underlying ssh command exited with a 255 soon after. This resulted in an ssh flake being reported, even though the timeout occurred first.

e.g. A roachtest with a 20s timeout, simply running sleep 25; exit 255

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @herkolategan, @renatolabs, and @srosenberg)


pkg/cmd/roachtest/test_runner.go line 1110 at r1 (raw file):

Previously, renatolabs (Renato Costa) wrote…

Could we move this message back to the block where we set timedOut = true? With this change, the message here will now be logged on teardown.log, which I assume is much less frequently inspected than test.log. Also, IMO this is crucial information on the reason for the failure that should be on test.log.

t.addFailure(0, "test timed out (%s)", timeout) already logs that information to the test.log, and is located where you mentioned.

e.g. test.log
21:16:28 test_impl.go:344: test failure #1: full stack retained in failure_1.log: (test_runner.go:989).runTest: test timed out (20s)

@smg260 smg260 changed the title Roachtest timeouts roachtest: report timeout failures accordingly Feb 8, 2023
Copy link
Contributor

@renatolabs renatolabs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Thanks for explaining!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @herkolategan and @srosenberg)

Copy link
Collaborator

@herkolategan herkolategan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @renatolabs and @srosenberg)

@smg260
Copy link
Contributor Author

smg260 commented Feb 9, 2023

tftr

bors r=renatolabs,herkolategan

@craig
Copy link
Contributor

craig bot commented Feb 9, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

roachtest: don't report SSH flake if test times out
4 participants