release-22.1: roachtest structured errors and ssh/scp retries #95448

smg260 · 2023-01-18T17:46:38Z

Backport:

1/1 commits from "roachtest: update TeamCity links to the new UI." (roachtest: update TeamCity links to the new UI. #81103)
1/1 commits from "roachtest: include some cluster configs in test failure reports" (roachtest: include some cluster configs in test failure reports #81845)
2/2 commits from "roachtest: refactor github issue posting and expose actual vm args for reporting" (roachtest: refactor github issue posting and expose actual vm args for reporting #87304)
1/1 commits from "roachprod: fix roachprod init." (roachprod: fix roachprod init. #88514)
1/1 commits from "roachprod: set default cluster settings when starting" (roachprod: set default cluster settings when starting #88716)
1/1 commits from "roachtest: use structured errors" (roachtest: use structured errors #88556)
1/1 commits from "roachprod: log Get failures in roachtests" (roachprod: log Get failures in roachtests #89850)
3/3 commits from "Roachtest redirect SSH flakes to test-eng" (Roachtest redirect SSH flakes to test-eng #88492)
1/1 commits from "Revert fmtsafe.go output" (Revert fmtsafe.go output #90720)
1/1 commits from "roachprod: improve error in ParallelE" (roachprod: improve error in ParallelE #91119)
1/1 commits from "Roachtest SSH Retries" (Roachtest SSH Retries #90451)
1/1 commits from "roachtest: Retry failed SCP attempts in the same way as SSH commands." (roachtest: Retry failed SCP attempts in the same way as SSH commands. #92082)
1/3 commits from "roachtest: misc fixes" (roachtest: misc fixes #94361)

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: test only change to make roachtests more resilient and easier to debug

In 2019, TeamCity released a new "experimental UI" that changed not only the look and feel of the app, but also the URLs where pages are located. Notably, information about a build used to be under the /viewLog.html page in the old UI, but was moved to /buildConfiguration/{buildTypeID}/{buildID} in the new UI. If the user enabled the new UI locally, TeamCity will automatically redirect requests from the old URLs to the new ones; however, the URL fragments are not preserved. This is quite bad for roachtest, since the links generated in failed test reports relies on the fragments in order to expand the relevant folders in a potentially long list of artifacts. This commit updates TeamCity links generated by roachtest when reporting failed tests to always use the new UI. It has the drawback of forcing the new UI on everyone, but the benefit of always expanding the correct folder. A new environment variable, `TC_BUILDTYPE_ID`, was added to TeamCity builds to support this use case; it contains the ID of the corresponding build type (also known as "build configuration"). Resolves: cockroachdb#80852. Release note: None.

This commit changes the 'Parameters' section of a test failure. Previously, it would only include the contents of the `TAGS` and `GOFLAGS` environment variables. It now adds cluster configuration values obtained from the `ClusterSpec`. These parameters are prefixed with `ROACHTEST` (e.g., `ROACHTEST_cloud`). For now, just the cloud name, number of CPUs and SSDs are included. In the future, we can expand this work to support other relevant cluster configuration values as needed to debug test failures. Resolves: cockroachdb#80799. Release note: None.

…orted. Some runtime values are probabilistically set or otherwise have defaults which are not evident at the time we report issues. This change returns the vm options associated with a created cluster, from which we can extract the final computed values for such fields including the type of filesystem and whether encryption is enabled. Release justification: test-only change Release note: none

…test. No previous coverage exists for posting issues to github. This change moves the relevant code into a separate source file with associated test. External functions dealing with actual issue posting to github, and loading the TEAMS.yaml are now injected to facilitate easier testing. Release justification: test-only change Release note: none

When a cluster is started with the `--skip-init` option, the caller can run `roachprod init` at any time to initialize the cluster. Unfortunately, the code used to initialize the cluster was duplicated: one copy existed in the `start` path, and another in the `init` path. Since the latter is used far less frequently, it had a bug that went unnoticed: it hardcoded the first node index as `0`, when node indices start at 1. This commit fixes the issue by updating the constant and sharing code between `init `and `start`. Fixes cockroachdb#88226. Release note: None

In cockroachdb#88514, the cluster start logic was refactored to reuse the same code across `init` and `start`, fixing a bug in the former. However, the refactoring overlooked the fact that we previously always set the default cluster settings when there's more than one node in the cluster. This fixes that by setting the default cluster settings in that case; one particularly important cluster setting is the license key, necessary for some roachtests. Fixes cockroachdb#88660. Fixes cockroachdb#88665. Fixes cockroachdb#88666. Fixes cockroachdb#88710. Release note: None

`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption. The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error: ``` (test_test.go:129).func3: first error (test_test.go:130).func3: second error ``` The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way. Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down. In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations. Looking ahead, structured errors open a few avenues: - can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping). - Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398). - We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures. Release note: None

`roachprod.Get` and `roachprod.Put` behaved differently when it comes to logging of errors: the former would not log errors found while running `scp` if there was a file backing the logger, while the latter would. However, there is no reason for this difference, and it was probably an oversight introduced in an old, really large PR (cockroachdb#74223). This commit makes behavior consistent in both calls and should allow us to see `Get` errors in logs, especially important in `roachtest` failures. Epic: None. Release note: None.

SSH flakes to the test-eng team. Also abstracts away how issue name and team is overriden based on an issue category, which can be ClusterCreation, SSH, or Other Resolves issue cockroachdb#82398 Release justification: test-only change Release note: None

This commit utilises cockroachdb/errors markers api to mark an SSH error with exit code of 255. The test runner, when it is posting an issue to github, can act according to which marker may be present in t.mu.failures[0]. In this case, we override the owning team to protect them from having to investigate what is likely a transient SSH issue. A test now has the concept of a failure[] instead of error[]. Each failure contains the original `squashedErr` and an errors[], which are all the errors passed via t.Failf/Errorf. We can then preserve the relationship of multiple errors to any particular failure within a test, and match on any given reference error (like SSH flakes from above) This moves us away from substring matching on error messages. The test name is prepended to both cluster creation and ssh errors to ease troubleshooting as all those issues will be bucketed under either category respectively. Release justification: test-only change Release note: None

is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an ide or shell. Release justification: test-only change Release note: None

blathers-crl · 2023-01-18T17:46:43Z

cockroach-teamcity · 2023-01-18T17:47:04Z

This change is

TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none

Prior to this commit, the error's stack trace did not link back to the caller of `ParallelE`. Now it does. Release note: None

…try when encountering an error of 255 (SSH). The retry could be exposed to callers fairly simply, but this PR does not introduce that. Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error). RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error, which existing code depends on. Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. The default retry handling occurs on `RunResultDetails.RemoteExitStatus == 255` Release note: None Epic: CRDB-21386

Any error returned is assumed to be retryable (in contrast to SSH where by default we look for an exit code of 255) Release note: none Epic: CRDB-21386

@1

Errors were reported like this: > (test_impl.go:291).Fatal: pq: invalid syntax @1 but that line is the impl of `Fatal`. We want to walk up one more frame. Epic: none Release note: None

srosenberg

Thanks for the backport! This will keep 22.1 ssh-related flakes under control.

smg260 · 2023-01-20T14:01:53Z

Nightly roachtest on this branch

renatolabs and others added 11 commits January 18, 2023 17:17

roachtest: Set localhost as the http listener when --local flag

e165d3c

is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an ide or shell. Release justification: test-only change Release note: None

smg260 force-pushed the backport22.1-81103-81845-87304-88514-88716-88556-89850-88492-91119-90451-92082-94361 branch from 9ef35bd to 1a01005 Compare January 18, 2023 17:50

smg260 changed the title ~~release-22.1: TODO~~ release-22.1: roachtest structured errors and ssh/scp retries Jan 18, 2023

smg260 force-pushed the backport22.1-81103-81845-87304-88514-88716-88556-89850-88492-91119-90451-92082-94361 branch from 1a01005 to b44b6be Compare January 18, 2023 19:08

Miral Gadani and others added 5 commits January 18, 2023 15:47

This change reverts an update to the fmtsafe.go linter output message.

d974dc9

TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none

roachprod: improve error in ParallelE

98f72f9

Prior to this commit, the error's stack trace did not link back to the caller of `ParallelE`. Now it does. Release note: None

roachtest: Retry failed SCP attempts in the same way as SSH commands.

ddbfcd3

Any error returned is assumed to be retryable (in contrast to SSH where by default we look for an exit code of 255) Release note: none Epic: CRDB-21386

roachtest: fix error depth

f7c1cce

Errors were reported like this: > (test_impl.go:291).Fatal: pq: invalid syntax @1 but that line is the impl of `Fatal`. We want to walk up one more frame. Epic: none Release note: None

smg260 force-pushed the backport22.1-81103-81845-87304-88514-88716-88556-89850-88492-91119-90451-92082-94361 branch from b44b6be to f7c1cce Compare January 18, 2023 20:47

smg260 marked this pull request as ready for review January 18, 2023 20:55

smg260 requested review from a team as code owners January 18, 2023 20:55

smg260 requested review from herkolategan, renatolabs and srosenberg and removed request for a team January 18, 2023 20:55

srosenberg approved these changes Jan 19, 2023

View reviewed changes

smg260 merged commit 59f807e into cockroachdb:release-22.1 Jan 20, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

release-22.1: roachtest structured errors and ssh/scp retries #95448

release-22.1: roachtest structured errors and ssh/scp retries #95448

smg260 commented Jan 18, 2023 •

edited

Loading

blathers-crl bot commented Jan 18, 2023

cockroach-teamcity commented Jan 18, 2023

srosenberg left a comment

smg260 commented Jan 20, 2023

release-22.1: roachtest structured errors and ssh/scp retries #95448

release-22.1: roachtest structured errors and ssh/scp retries #95448

Conversation

smg260 commented Jan 18, 2023 • edited Loading

blathers-crl bot commented Jan 18, 2023

cockroach-teamcity commented Jan 18, 2023

srosenberg left a comment

Choose a reason for hiding this comment

smg260 commented Jan 20, 2023

smg260 commented Jan 18, 2023 •

edited

Loading