-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-22.1: roachtest structured errors and ssh/scp retries #95448
release-22.1: roachtest structured errors and ssh/scp retries #95448
Conversation
In 2019, TeamCity released a new "experimental UI" that changed not only the look and feel of the app, but also the URLs where pages are located. Notably, information about a build used to be under the /viewLog.html page in the old UI, but was moved to /buildConfiguration/{buildTypeID}/{buildID} in the new UI. If the user enabled the new UI locally, TeamCity will automatically redirect requests from the old URLs to the new ones; however, the URL fragments are not preserved. This is quite bad for roachtest, since the links generated in failed test reports relies on the fragments in order to expand the relevant folders in a potentially long list of artifacts. This commit updates TeamCity links generated by roachtest when reporting failed tests to always use the new UI. It has the drawback of forcing the new UI on everyone, but the benefit of always expanding the correct folder. A new environment variable, `TC_BUILDTYPE_ID`, was added to TeamCity builds to support this use case; it contains the ID of the corresponding build type (also known as "build configuration"). Resolves: cockroachdb#80852. Release note: None.
This commit changes the 'Parameters' section of a test failure. Previously, it would only include the contents of the `TAGS` and `GOFLAGS` environment variables. It now adds cluster configuration values obtained from the `ClusterSpec`. These parameters are prefixed with `ROACHTEST` (e.g., `ROACHTEST_cloud`). For now, just the cloud name, number of CPUs and SSDs are included. In the future, we can expand this work to support other relevant cluster configuration values as needed to debug test failures. Resolves: cockroachdb#80799. Release note: None.
…orted. Some runtime values are probabilistically set or otherwise have defaults which are not evident at the time we report issues. This change returns the vm options associated with a created cluster, from which we can extract the final computed values for such fields including the type of filesystem and whether encryption is enabled. Release justification: test-only change Release note: none
…test. No previous coverage exists for posting issues to github. This change moves the relevant code into a separate source file with associated test. External functions dealing with actual issue posting to github, and loading the TEAMS.yaml are now injected to facilitate easier testing. Release justification: test-only change Release note: none
When a cluster is started with the `--skip-init` option, the caller can run `roachprod init` at any time to initialize the cluster. Unfortunately, the code used to initialize the cluster was duplicated: one copy existed in the `start` path, and another in the `init` path. Since the latter is used far less frequently, it had a bug that went unnoticed: it hardcoded the first node index as `0`, when node indices start at 1. This commit fixes the issue by updating the constant and sharing code between `init `and `start`. Fixes cockroachdb#88226. Release note: None
In cockroachdb#88514, the cluster start logic was refactored to reuse the same code across `init` and `start`, fixing a bug in the former. However, the refactoring overlooked the fact that we previously always set the default cluster settings when there's more than one node in the cluster. This fixes that by setting the default cluster settings in that case; one particularly important cluster setting is the license key, necessary for some roachtests. Fixes cockroachdb#88660. Fixes cockroachdb#88665. Fixes cockroachdb#88666. Fixes cockroachdb#88710. Release note: None
`roachtest` predates `cockroachdb/errors` and as a result so far hasn't capitalized many improvements. Today, roachtest errors are often noisy. This commit adopts `cockroachdb/errors` and reaps some of the immediate rewards, while making related improvements that would also have been possible without the adoption. The main user-facing change is that error messages in the output are now a lot more concise; these would previously - sometimes - include the entire stack trace. Now they contain only the topmost stack record of the innermost error: ``` (test_test.go:129).func3: first error (test_test.go:130).func3: second error ``` The full error continues to be logged, but we write it to files in the artifacts, where they can be inspected just in case. This now happens unconditionally for all errors, whereas the old code only logged the stacks if the error was reported in a certain way. Internally, the reorganization has also considerably simplified roachtest. Stack frame offset tracking has been removed, since `cockroachdb/errors` already handles it. Custom rendering code was similarly significantly trimmed down. In making this change, I opted to always have `roachtest` create loggers backed by real files. This was previously elided in tests, but this would have caused extra conditionals. It's better to standardize on the way in which `roachtest` is always used in actual invocations. Looking ahead, structured errors open a few avenues: - can assign the owner based on the type of failure. For example, cockroachdb#87106 wants consistency check failures to always go to the KV team, regardless of which test's cluster was being checked. Failures during an IMPORT/RESTORE (common across all tests) could be routed to the Disaster Recovery team by default (assuming we provide a wrapper around these operations that all tests use and which does this wrapping). - Similarly, SSH failures can be special cased via a marking error and can be directed away from the owning team, which can't do anything about them anyway (cockroachdb#82398). - We can conceivably start grouping failure modes by "error signature". That is, errors which have a "comparable" chain of errors (e.g. same types, and within formatted errors the same format string). Issues could then be reused only for compatible error signatures. Release note: None
`roachprod.Get` and `roachprod.Put` behaved differently when it comes to logging of errors: the former would not log errors found while running `scp` if there was a file backing the logger, while the latter would. However, there is no reason for this difference, and it was probably an oversight introduced in an old, really large PR (cockroachdb#74223). This commit makes behavior consistent in both calls and should allow us to see `Get` errors in logs, especially important in `roachtest` failures. Epic: None. Release note: None.
SSH flakes to the test-eng team. Also abstracts away how issue name and team is overriden based on an issue category, which can be ClusterCreation, SSH, or Other Resolves issue cockroachdb#82398 Release justification: test-only change Release note: None
This commit utilises cockroachdb/errors markers api to mark an SSH error with exit code of 255. The test runner, when it is posting an issue to github, can act according to which marker may be present in t.mu.failures[0]. In this case, we override the owning team to protect them from having to investigate what is likely a transient SSH issue. A test now has the concept of a failure[] instead of error[]. Each failure contains the original `squashedErr` and an errors[], which are all the errors passed via t.Failf/Errorf. We can then preserve the relationship of multiple errors to any particular failure within a test, and match on any given reference error (like SSH flakes from above) This moves us away from substring matching on error messages. The test name is prepended to both cluster creation and ssh errors to ease troubleshooting as all those issues will be bucketed under either category respectively. Release justification: test-only change Release note: None
is used. This prevents OSX from prompting to allow incoming network connections when running a roachtest from an ide or shell. Release justification: test-only change Release note: None
Thanks for opening a backport. Please check the backport criteria before merging:
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
Add a brief release justification to the body of your PR to justify this backport. Some other things to consider:
|
9ef35bd
to
1a01005
Compare
1a01005
to
b44b6be
Compare
TestLint/TestRoachVet is a unit test which runs roachvet and greps various messages from its output - the reverted message being one of them. Release note: none. Epic: none
Prior to this commit, the error's stack trace did not link back to the caller of `ParallelE`. Now it does. Release note: None
…try when encountering an error of 255 (SSH). The retry could be exposed to callers fairly simply, but this PR does not introduce that. Today there appears the concept of a "roachprod" and a "command error", the former of which denotes an issue in roachprod handling code, the latter representing an error executing a command over SSH. This PR preserves this behaviour and introduces an updated function signature from `f(i int) ([]byte, error) to f(i int) (*RunResultDetails, error). RunResultDetails has been expanded to capture information about the execution of a command, such as stderr/our, exit code, error, attempt number. Any roachprod error will result in an error being returned, and set in RunResultDetails.Err. Any command error would be only set in RunResultDetails.Err with the returned error being nil. This allows callers to differentiate between a roachprod and a command error, which existing code depends on. Retry handling code can use a function's returned RunResultDetails to determine whether a retry should take place. The default retry handling occurs on `RunResultDetails.RemoteExitStatus == 255` Release note: None Epic: CRDB-21386
Any error returned is assumed to be retryable (in contrast to SSH where by default we look for an exit code of 255) Release note: none Epic: CRDB-21386
Errors were reported like this: > (test_impl.go:291).Fatal: pq: invalid syntax @1 but that line is the impl of `Fatal`. We want to walk up one more frame. Epic: none Release note: None
b44b6be
to
f7c1cce
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the backport! This will keep 22.1 ssh-related flakes under control.
Nightly roachtest on this branch |
Backport:
roachprod init
." (roachprod: fixroachprod init
. #88514)Please see individual PRs for details.
/cc @cockroachdb/release
Release justification: test only change to make roachtests more resilient and easier to debug