Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: use structured errors #93328

Conversation

smg260
Copy link
Contributor

@smg260 smg260 commented Dec 9, 2022

roachtest predates cockroachdb/errors and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts cockroachdb/errors and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

(test_test.go:129).func3: first error
(test_test.go:130).func3: second error

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
cockroachdb/errors already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have roachtest create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which roachtest is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

  • can assign the owner based on the type of failure. For example,
    roachtest: opt some tests into aggressive stats checks #87106 wants consistency check failures to always go to the KV
    team, regardless of which test's cluster was being checked.

    Failures during an IMPORT/RESTORE (common across all tests) could be
    routed to the Disaster Recovery team by default (assuming we provide a
    wrapper around these operations that all tests use and which does this
    wrapping).

  • Similarly, SSH failures can be special cased via a marking error
    and can be directed away from the owning team, which can't do
    anything about them anyway (roachtest: special-case SSH_PROBLEM as infra flakes #82398).

  • We can conceivably start grouping failure modes by "error signature".
    That is, errors which have a "comparable" chain of errors (e.g. same
    types, and within formatted errors the same format string). Issues
    could then be reused only for compatible error signatures.

Release note: NoneBackport:

Please see individual PRs for details.

/cc @cockroachdb/release

tbg and others added 2 commits December 9, 2022 10:29
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't
capitalized many improvements. Today, roachtest errors are often noisy.

This commit adopts `cockroachdb/errors` and reaps some of the immediate
rewards, while making related improvements that would also have been
possible without the adoption.

The main user-facing change is that error messages in the output are now
a lot more concise; these would previously - sometimes - include the
entire stack trace. Now they contain only the topmost stack record of
the innermost error:

```
(test_test.go:129).func3: first error
(test_test.go:130).func3: second error
```

The full error continues to be logged, but we write it to files in
the artifacts, where they can be inspected just in case. This now
happens unconditionally for all errors, whereas the old code only
logged the stacks if the error was reported in a certain way.

Internally, the reorganization has also considerably simplified
roachtest. Stack frame offset tracking has been removed, since
`cockroachdb/errors` already handles it. Custom rendering code
was similarly significantly trimmed down.

In making this change, I opted to always have `roachtest` create
loggers backed by real files. This was previously elided in tests,
but this would have caused extra conditionals. It's better to
standardize on the way in which `roachtest` is always used in
actual invocations.

Looking ahead, structured errors open a few avenues:

- can assign the owner based on the type of failure. For example,
  cockroachdb#87106 wants consistency check failures to always go to the KV
  team, regardless of which test's cluster was being checked.

  Failures during an IMPORT/RESTORE (common across all tests) could be
  routed to the Disaster Recovery team by default (assuming we provide a
  wrapper around these operations that all tests use and which does this
  wrapping).
- Similarly, SSH failures can be special cased via a marking error
  and can be directed away from the owning team, which can't do
  anything about them anyway (cockroachdb#82398).
- We can conceivably start grouping failure modes by "error signature".
  That is, errors which have a "comparable" chain of errors (e.g. same
  types, and within formatted errors the same format string). Issues
  could then be reused only for compatible error signatures.

Release note: None
 SSH flakes to the test-eng team.

 Also abstracts away how issue name and team is overriden based on
 an issue category, which can be ClusterCreation, SSH, or Other

 Resolves issue cockroachdb#82398

 Release justification: test-only change
 Release note: None
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Miral Gadani added 6 commits December 9, 2022 11:41
This commit utilises cockroachdb/errors markers api to mark an SSH
error with exit code of 255. The test runner, when it is posting an
issue to github, can act according to which marker may be present
in t.mu.failures[0]. In this case, we override the owning team to
protect them from having to investigate what is likely a transient
SSH issue.

A test now has the concept of a failure[] instead of error[]. Each
failure contains the original `squashedErr` and an errors[],
which are all the errors passed via t.Failf/Errorf.

We can then preserve the relationship of multiple
errors to any particular failure within a test, and match on any given
reference error (like SSH flakes from above)

This moves us away from substring matching on error messages.

The test name is prepended to both cluster creation and ssh errors
to ease troubleshooting as all those issues will be bucketed
under either category respectively.

Release justification: test-only change
Release note: None
is used. This prevents OSX from prompting to allow incoming
network connections when running a roachtest from an ide or
 shell.

Release justification: test-only change
Release note: None
TestLint/TestRoachVet is a unit test which runs roachvet and greps various
messages from its output - the reverted message being one of them.

Release note: none.
Epic: none
…try when

encountering an error of 255 (SSH). The retry could be exposed to
callers fairly simply, but this PR does not introduce that.

Today there appears the concept of a "roachprod" and a "command error",
the former of which denotes an issue in roachprod handling code, the
latter representing an error executing a command over SSH. This PR
preserves this behaviour and introduces an updated function signature
from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error).

RunResultDetails has been expanded to capture information about the
execution of a command, such as stderr/our, exit code, error, attempt
number. Any roachprod error will result in an error being returned, and set
in RunResultDetails.Err. Any command error would be only set in
RunResultDetails.Err with the returned error being nil. This allows callers to
differentiate between a roachprod and a command error, which existing code
depends on.

Retry handling code can use a function's returned RunResultDetails to
determine whether a retry should take place. The default retry
handling occurs on `RunResultDetails.RemoteExitStatus == 255`

Release note: None
Epic: CRDB-21386
Any error returned is assumed to be retryable (in contrast to SSH where
by default we look for an exit code of 255)

Release note: none
Epic: CRDB-21386
Epic: none
Release note: None
@smg260 smg260 force-pushed the backport22.2-88556-88492-90720-90451-92082 branch from 99772b5 to 990a545 Compare December 9, 2022 16:42
@smg260 smg260 marked this pull request as ready for review December 9, 2022 17:29
@smg260 smg260 requested a review from a team as a code owner December 9, 2022 17:29
@smg260 smg260 requested review from herkolategan and srosenberg and removed request for a team December 9, 2022 17:29
Copy link
Member

@srosenberg srosenberg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for rebasing!

@smg260 smg260 merged commit 428e9a9 into cockroachdb:release-22.2 Dec 29, 2022
@smg260 smg260 deleted the backport22.2-88556-88492-90720-90451-92082 branch January 18, 2023 01:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants