-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: use structured errors #93328
roachtest: use structured errors #93328
Commits on Dec 9, 2022
-
roachtest: use structured errors
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption. The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error: ``` (test_test.go:129).func3: first error (test_test.go:130).func3: second error ``` The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way. Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down. In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations. Looking ahead, structured errors open a few avenues: - can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping). - Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398). - We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures. Release note: None
Configuration menu - View commit details
-
Copy full SHA for 32fd127 - Browse repository at this point
Copy the full SHA 32fd127View commit details -
roachtest: Reassign roachtest failures which may have been caused by
SSH flakes to the test-eng team. Also abstracts away how issue name and team is overriden based on an issue category, which can be ClusterCreation, SSH, or Other Resolves issue cockroachdb#82398 Release justification: test-only change Release note: None
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for eebbd80 - Browse repository at this point
Copy the full SHA eebbd80View commit details -
roachtest: Utilise structured errors for detection of SSH flakes.
This commit utilises cockroachdb/errors markers api to mark an SSH error with exit code of 255. The test runner, when it is posting an issue to github, can act according to which marker may be present in t.mu.failures[0]. In this case, we override the owning team to protect them from having to investigate what is likely a transient SSH issue. A test now has the concept of a failure[] instead of error[]. Each failure contains the original `squashedErr` and an errors[], which are all the errors passed via t.Failf/Errorf. We can then preserve the relationship of multiple errors to any particular failure within a test, and match on any given reference error (like SSH flakes from above) This moves us away from substring matching on error messages. The test name is prepended to both cluster creation and ssh errors to ease troubleshooting as all those issues will be bucketed under either category respectively. Release justification: test-only change Release note: None
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for 2b14eea - Browse repository at this point
Copy the full SHA 2b14eeaView commit details -
roachtest: Set localhost as the http listener when --local flag
is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an ide or shell. Release justification: test-only change Release note: None
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for 3a38f3b - Browse repository at this point
Copy the full SHA 3a38f3bView commit details -
This change reverts an update to the fmtsafe.go linter output message.
TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for 87dda29 - Browse repository at this point
Copy the full SHA 87dda29View commit details -
roachtest: This change introduces a transparent to caller, default re…
…try when encountering an error of 255 (SSH). The retry could be exposed to callers fairly simply, but this PR does not introduce that. Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error). RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error, which existing code depends on. Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. The default retry handling occurs on `RunResultDetails.RemoteExitStatus == 255` Release note: None Epic: CRDB-21386
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for 0f7cc9a - Browse repository at this point
Copy the full SHA 0f7cc9aView commit details -
roachtest: Retry failed SCP attempts in the same way as SSH commands.
Any error returned is assumed to be retryable (in contrast to SSH where by default we look for an exit code of 255) Release note: none Epic: CRDB-21386
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for e08ff3b - Browse repository at this point
Copy the full SHA e08ff3bView commit details -
roachtest: extraneous dep in bazel conf
Epic: none Release note: None
Miral Gadani committedDec 9, 2022 Configuration menu - View commit details
-
Copy full SHA for 990a545 - Browse repository at this point
Copy the full SHA 990a545View commit details