roachtest: use structured errors #93328

smg260 · 2022-12-09T15:54:33Z

roachtest predates cockroachdb/errors and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts cockroachdb/errors and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

(test_test.go:129).func3: first error
(test_test.go:130).func3: second error

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
cockroachdb/errors already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have roachtest create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which roachtest is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

can assign the owner based on the type of failure. For example,
roachtest: opt some tests into aggressive stats checks #87106 wants consistency check failures to always go to the KV
team, regardless of which test's cluster was being checked.

Failures during an IMPORT/RESTORE (common across all tests) could be
routed to the Disaster Recovery team by default (assuming we provide a
wrapper around these operations that all tests use and which does this
wrapping).
Similarly, SSH failures can be special cased via a marking error
and can be directed away from the owning team, which can't do
anything about them anyway (roachtest: special-case SSH_PROBLEM as infra flakes #82398).
We can conceivably start grouping failure modes by "error signature".
That is, errors which have a "comparable" chain of errors (e.g. same
types, and within formatted errors the same format string). Issues
could then be reused only for compatible error signatures.

Release note: NoneBackport:

1/1 commits from "roachtest: use structured errors" (roachtest: use structured errors #88556)
3/3 commits from "Roachtest redirect SSH flakes to test-eng" (Roachtest redirect SSH flakes to test-eng #88492)
1/1 commits from "Revert fmtsafe.go output" (Revert fmtsafe.go output #90720)
1/1 commits from "Roachtest SSH Retries" (Roachtest SSH Retries #90451)
1/1 commits from "roachtest: Retry failed SCP attempts in the same way as SSH commands." (roachtest: Retry failed SCP attempts in the same way as SSH commands. #92082)

Please see individual PRs for details.

/cc @cockroachdb/release

`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption. The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error: ``` (test_test.go:129).func3: first error (test_test.go:130).func3: second error ``` The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way. Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down. In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations. Looking ahead, structured errors open a few avenues: - can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping). - Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398). - We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures. Release note: None

SSH flakes to the test-eng team. Also abstracts away how issue name and team is overriden based on an issue category, which can be ClusterCreation, SSH, or Other Resolves issue cockroachdb#82398 Release justification: test-only change Release note: None

cockroach-teamcity · 2022-12-09T15:54:56Z

This change is

This commit utilises cockroachdb/errors markers api to mark an SSH error with exit code of 255. The test runner, when it is posting an issue to github, can act according to which marker may be present in t.mu.failures[0]. In this case, we override the owning team to protect them from having to investigate what is likely a transient SSH issue. A test now has the concept of a failure[] instead of error[]. Each failure contains the original `squashedErr` and an errors[], which are all the errors passed via t.Failf/Errorf. We can then preserve the relationship of multiple errors to any particular failure within a test, and match on any given reference error (like SSH flakes from above) This moves us away from substring matching on error messages. The test name is prepended to both cluster creation and ssh errors to ease troubleshooting as all those issues will be bucketed under either category respectively. Release justification: test-only change Release note: None

is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an ide or shell. Release justification: test-only change Release note: None

TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none

…try when encountering an error of 255 (SSH). The retry could be exposed to callers fairly simply, but this PR does not introduce that. Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error). RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error, which existing code depends on. Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. The default retry handling occurs on `RunResultDetails.RemoteExitStatus == 255` Release note: None Epic: CRDB-21386

Any error returned is assumed to be retryable (in contrast to SSH where by default we look for an exit code of 255) Release note: none Epic: CRDB-21386

Epic: none Release note: None

srosenberg

Thanks for rebasing!

tbg and others added 2 commits December 9, 2022 10:29

Miral Gadani added 6 commits December 9, 2022 11:41

roachtest: Set localhost as the http listener when --local flag

3a38f3b

is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an ide or shell. Release justification: test-only change Release note: None

This change reverts an update to the fmtsafe.go linter output message.

87dda29

TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none

roachtest: Retry failed SCP attempts in the same way as SSH commands.

e08ff3b

Any error returned is assumed to be retryable (in contrast to SSH where by default we look for an exit code of 255) Release note: none Epic: CRDB-21386

roachtest: extraneous dep in bazel conf

990a545

Epic: none Release note: None

smg260 force-pushed the backport22.2-88556-88492-90720-90451-92082 branch from 99772b5 to 990a545 Compare December 9, 2022 16:42

smg260 marked this pull request as ready for review December 9, 2022 17:29

smg260 requested a review from a team as a code owner December 9, 2022 17:29

smg260 requested review from herkolategan and srosenberg and removed request for a team December 9, 2022 17:29

srosenberg approved these changes Dec 29, 2022

View reviewed changes

smg260 merged commit 428e9a9 into cockroachdb:release-22.2 Dec 29, 2022

smg260 mentioned this pull request Jan 17, 2023

roachtest: structured errors, retries #95393

Closed

smg260 deleted the backport22.2-88556-88492-90720-90451-92082 branch January 18, 2023 01:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

roachtest: use structured errors #93328

roachtest: use structured errors #93328

smg260 commented Dec 9, 2022

cockroach-teamcity commented Dec 9, 2022

srosenberg left a comment

roachtest: use structured errors #93328

roachtest: use structured errors #93328

Conversation

smg260 commented Dec 9, 2022

cockroach-teamcity commented Dec 9, 2022

srosenberg left a comment

Choose a reason for hiding this comment