Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

roachtest: use structured errors #93328

Commits on Dec 9, 2022

  1. roachtest: use structured errors

    `roachtest` predates `cockroachdb/errors` and as a result so far hasn't
    capitalized many improvements. Today, roachtest errors are often noisy.
    
    This commit adopts `cockroachdb/errors` and reaps some of the immediate
    rewards, while making related improvements that would also have been
    possible without the adoption.
    
    The main user-facing change is that error messages in the output are now
    a lot more concise; these would previously - sometimes - include the
    entire stack trace. Now they contain only the topmost stack record of
    the innermost error:
    
    ```
    (test_test.go:129).func3: first error
    (test_test.go:130).func3: second error
    ```
    
    The full error continues to be logged, but we write it to files in
    the artifacts, where they can be inspected just in case. This now
    happens unconditionally for all errors, whereas the old code only
    logged the stacks if the error was reported in a certain way.
    
    Internally, the reorganization has also considerably simplified
    roachtest. Stack frame offset tracking has been removed, since
    `cockroachdb/errors` already handles it. Custom rendering code
    was similarly significantly trimmed down.
    
    In making this change, I opted to always have `roachtest` create
    loggers backed by real files. This was previously elided in tests,
    but this would have caused extra conditionals. It's better to
    standardize on the way in which `roachtest` is always used in
    actual invocations.
    
    Looking ahead, structured errors open a few avenues:
    
    - can assign the owner based on the type of failure. For example,
      cockroachdb#87106 wants consistency check failures to always go to the KV
      team, regardless of which test's cluster was being checked.
    
      Failures during an IMPORT/RESTORE (common across all tests) could be
      routed to the Disaster Recovery team by default (assuming we provide a
      wrapper around these operations that all tests use and which does this
      wrapping).
    - Similarly, SSH failures can be special cased via a marking error
      and can be directed away from the owning team, which can't do
      anything about them anyway (cockroachdb#82398).
    - We can conceivably start grouping failure modes by "error signature".
      That is, errors which have a "comparable" chain of errors (e.g. same
      types, and within formatted errors the same format string). Issues
      could then be reused only for compatible error signatures.
    
    Release note: None
    tbg authored and Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    32fd127 View commit details
    Browse the repository at this point in the history
  2. roachtest: Reassign roachtest failures which may have been caused by

     SSH flakes to the test-eng team.
    
     Also abstracts away how issue name and team is overriden based on
     an issue category, which can be ClusterCreation, SSH, or Other
    
     Resolves issue cockroachdb#82398
    
     Release justification: test-only change
     Release note: None
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    eebbd80 View commit details
    Browse the repository at this point in the history
  3. roachtest: Utilise structured errors for detection of SSH flakes.

    This commit utilises cockroachdb/errors markers api to mark an SSH
    error with exit code of 255. The test runner, when it is posting an
    issue to github, can act according to which marker may be present
    in t.mu.failures[0]. In this case, we override the owning team to
    protect them from having to investigate what is likely a transient
    SSH issue.
    
    A test now has the concept of a failure[] instead of error[]. Each
    failure contains the original `squashedErr` and an errors[],
    which are all the errors passed via t.Failf/Errorf.
    
    We can then preserve the relationship of multiple
    errors to any particular failure within a test, and match on any given
    reference error (like SSH flakes from above)
    
    This moves us away from substring matching on error messages.
    
    The test name is prepended to both cluster creation and ssh errors
    to ease troubleshooting as all those issues will be bucketed
    under either category respectively.
    
    Release justification: test-only change
    Release note: None
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    2b14eea View commit details
    Browse the repository at this point in the history
  4. roachtest: Set localhost as the http listener when --local flag

    is used. This prevents OSX from prompting to allow incoming
    network connections when running a roachtest from an ide or
     shell.
    
    Release justification: test-only change
    Release note: None
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    3a38f3b View commit details
    Browse the repository at this point in the history
  5. This change reverts an update to the fmtsafe.go linter output message.

    TestLint/TestRoachVet is a unit test which runs roachvet and greps various
    messages from its output - the reverted message being one of them.
    
    Release note: none.
    Epic: none
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    87dda29 View commit details
    Browse the repository at this point in the history
  6. roachtest: This change introduces a transparent to caller, default re…

    …try when
    
    encountering an error of 255 (SSH). The retry could be exposed to
    callers fairly simply, but this PR does not introduce that.
    
    Today there appears the concept of a "roachprod" and a "command error",
    the former of which denotes an issue in roachprod handling code, the
    latter representing an error executing a command over SSH. This PR
    preserves this behaviour and introduces an updated function signature
    from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error).
    
    RunResultDetails has been expanded to capture information about the
    execution of a command, such as stderr/our, exit code, error, attempt
    number. Any roachprod error will result in an error being returned, and set
    in RunResultDetails.Err. Any command error would be only set in
    RunResultDetails.Err with the returned error being nil. This allows callers to
    differentiate between a roachprod and a command error, which existing code
    depends on.
    
    Retry handling code can use a function's returned RunResultDetails to
    determine whether a retry should take place. The default retry
    handling occurs on `RunResultDetails.RemoteExitStatus == 255`
    
    Release note: None
    Epic: CRDB-21386
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    0f7cc9a View commit details
    Browse the repository at this point in the history
  7. roachtest: Retry failed SCP attempts in the same way as SSH commands.

    Any error returned is assumed to be retryable (in contrast to SSH where
    by default we look for an exit code of 255)
    
    Release note: none
    Epic: CRDB-21386
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    e08ff3b View commit details
    Browse the repository at this point in the history
  8. roachtest: extraneous dep in bazel conf

    Epic: none
    Release note: None
    Miral Gadani committed Dec 9, 2022
    Configuration menu
    Copy the full SHA
    990a545 View commit details
    Browse the repository at this point in the history