util/log: make test report their stderr in the test output #51410

knz · 2020-07-14T10:35:01Z

Previously, a log.Scope() would capture logs from tests in the same
way as a running server; that is, by also capturing direct writes to
stderr (such as those performed by the Go runtime for uncaught panics,
or internal errors) to a separate cockroach-stderr.log.

This behavior, while desirable for running servers, is problematic for
unit testing: in that case, we prefer that uncaught panics and
possible other Go runtime errors are interleaved in the test output,
so as to automate the triage of test failures.

This patch improves the situation by carving an exception to the
default logging behavior for the purpose of log.Scope(). With this
change, direct writes to stderr by code ran under a test will go to
the test output, regardless of the logging configuration.

The few tests that are truly interested in checking that stderr writes
go to a log file can use the new constructor
log.ScopeWithoutShowLogsAndForcedStderrCapture().

For example, consider this pseudo-test:

func TestFoo(t *testing.T) {
	defer log.Scope(t).Close(t)
	t.Log("i'm t.Logging")
	fmt.Println("i'm on stdout")
	fmt.Fprintln(os.Stderr, "i'm on stderr")
	go func() {
		panic("i panicked somewhere else")
	}()
	time.Sleep(time.Second)
}

Before this patch, the output is like:

=== RUN   TestFoo
    TestFoo: test_log_scope.go:77: test logs captured to: /var/folders/yy/4q8rrssd27vdgbr59w9qbffr0000gn/T/logTestFoo927366034
    TestFoo: test_log_scope.go:58: use -show-logs to present logs inline
    TestFoo: stopper_test.go:743: i'm t.Logging
i'm on stdout
FAIL	github.com/cockroachdb/cockroach/pkg/util/stop	1.624s
FAIL

And the stderr writes + panic are going to a separate file.

With this patch, we get the following output instead:

=== RUN   TestFoo
    TestFoo: test_log_scope.go:88: test logs captured to: /tmp/tmp.v9jl56br/logTestFoo207407502
    TestFoo: test_log_scope.go:58: use -show-logs to present logs inline
    TestFoo: clog_test.go:721: i'm t.Logging
i'm on stdout
i'm on stderr
panic: i panicked somewhere else

goroutine 69 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.TestFoo.func1()
	/data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:725 +0x39
created by github.com/cockroachdb/cockroach/pkg/util/log.TestFoo
	/data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:724 +0x1db
FAIL	github.com/cockroachdb/cockroach/pkg/util/log	0.047s
FAIL

Alternatives that were considered:

have the deferred Close() method of the TestLogScope pull out the
contents of cockroach-stderr.log and manually duplicate them on the
test output.

This solution was rejected because when a go runtime error occurs on
a separate goroutine than the one where the TestLogScope is defined,
the deferred don't get a chance to run.
use a "tee writer" for the stderr stream. There are two problems
with this:
- the Go runtime writes to file descriptor 2 directly, so it's not
  sufficient to override os.Stderr with a tee writer.
- teeing writes to fd 2 would require spawning a goroutine that
  captures the writes and duplicates them, asynchronously.
  The problem here is that the go runtime promptly terminates
  the process after it completes writing to fd 2. An async
  tee goroutine would not get a chance to run and the fd 2
  output would get lost.

Release note: None

cockroach-teamcity · 2020-07-14T10:35:10Z

This change is

tbg

Thank you, that was fast.
For my indulgement, could you add the "pseudo test" as a separate commit so we can see it fail on CI? I'm mildly worried that TC will bury it even deeper now that it's not in a file. I would love to be proven wrong in which case we could simply proceed.

knz · 2020-07-14T12:11:50Z

Added the dummy test.

Previously, a `log.Scope()` would capture logs from tests in the same way as a running server; that is, by also capturing direct writes to stderr (such as those performed by the Go runtime for uncaught panics, or internal errors) to a separate `cockroach-stderr.log`. This behavior, while desirable for running servers, is problematic for unit testing: in that case, we prefer that uncaught panics and possible other Go runtime errors are interleaved in the test output, so as to automate the triage of test failures. This patch improves the situation by carving an exception to the default logging behavior for the purpose of `log.Scope()`. With this change, direct writes to stderr by code ran under a test will go to the test output, regardless of the logging configuration. The few tests that are truly interested in checking that stderr writes go to a log file can use the new constructor `log.ScopeWithoutShowLogsAndForcedStderrCapture()`. For example, consider this pseudo-test: ```go func TestFoo(t *testing.T) { defer log.Scope(t).Close(t) t.Log("i'm t.Logging") fmt.Println("i'm on stdout") fmt.Fprintln(os.Stderr, "i'm on stderr") go func() { panic("i panicked somewhere else") }() time.Sleep(time.Second) } ``` Before this patch, the output is like: ``` === RUN TestFoo TestFoo: test_log_scope.go:77: test logs captured to: /var/folders/yy/4q8rrssd27vdgbr59w9qbffr0000gn/T/logTestFoo927366034 TestFoo: test_log_scope.go:58: use -show-logs to present logs inline TestFoo: stopper_test.go:743: i'm t.Logging i'm on stdout FAIL github.com/cockroachdb/cockroach/pkg/util/stop 1.624s FAIL ``` And the stderr writes + panic are going to a separate file. With this patch, we get the following output instead: ``` === RUN TestFoo TestFoo: test_log_scope.go:88: test logs captured to: /tmp/tmp.v9jl56br/logTestFoo207407502 TestFoo: test_log_scope.go:58: use -show-logs to present logs inline TestFoo: clog_test.go:721: i'm t.Logging i'm on stdout i'm on stderr panic: i panicked somewhere else goroutine 69 [running]: github.com/cockroachdb/cockroach/pkg/util/log.TestFoo.func1() /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:725 +0x39 created by github.com/cockroachdb/cockroach/pkg/util/log.TestFoo /data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:724 +0x1db FAIL github.com/cockroachdb/cockroach/pkg/util/log 0.047s FAIL ``` Alternatives that were considered: - have the deferred `Close()` method of the TestLogScope pull out the contents of `cockroach-stderr.log` and manually duplicate them on the test output. This solution was rejected because when a go runtime error occurs on a separate goroutine than the one where the TestLogScope is defined, the deferred don't get a chance to run. - use a "tee writer" for the stderr stream. There are two problems with this: - the Go runtime writes to file descriptor 2 directly, so it's not sufficient to override `os.Stderr` with a tee writer. - teeing writes to fd 2 would require spawning a goroutine that captures the writes and duplicates them, asynchronously. The problem here is that the go runtime promptly terminates the process after it completes writing to fd 2. An async tee goroutine would not get a chance to run and the fd 2 output would get lost. Release note: None

knz · 2020-07-15T18:28:32Z

@tbg this PR is revealing 3 separate problems which you may want to address separately:

as you expected, the test fails but because the test process terminates early (async uncaught panic), it does not output the text "FAIL TestFoo". There's only a package fail.

That's expected, but what to do? We probably want to see what the issue reporter thinks about that and perhaps do something clever? At the very least, we may want to file an issue saying "please investigate this" with e.g. the name of the package that's failing.
however we don't even get that far. The first thing that teamcity-test.sh invokes is the teamcity-support.sh, which in turns runs a Maybe stress test target. That thing runs a stress check on the package that were changed ahead of running make test. Unfortunately, go test -stress is extremely unhelpful as it completely hides the details of failed tests. The result is that we see the Maybe stress test target fail ahead of the verbose test run and there's no indication whatsoever of where/why the test CI target failed, not even in the build log.

Example runs:

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Test/2088623

https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Test/2091961
the logs from the "Maybe stress test" targets are not even in artifacts!! 😭

My recommendation at this point is twofold:

amend teamcity-support.sh as follows:

modify the call to go test -stress (maybe inside github-pull-request-make? dunno) to send its output to a log file in artifacts then preserve the exit code if there's a failure.

However, do not stop teamcity-support.sh if that stage fails and instead let the caller script proceed.
after the remainder of the caller script, if there's no test failure in the main CI target but there was one in stress, force the exit code to be 1 and tell the build log about the fact that the CI target is failing due to stress.

consider not running "Maybe stress test" as a prerequisite to all the teamcity-xxx.sh scripts, and instead run it as a separate TC target.

knz · 2020-07-15T18:31:28Z

@tbg in light of my last comment, I'm also adding another non-panicy dummy test to demonstrate how the stderr output is reported in the build log.

Release note (<category, see below>): <what> <show> <why>

knz · 2020-07-15T19:41:06Z

Look at this beauty:

Given this works, we now have the following:

without this PR:
- regardless of the failure mode, the stderr/panic goes to a file. The user must manually navigate to artifacts.
with this PR:
- if the test fails, or panics synchronously, the stderr is in the build log and the TC UI is nice and practical.
- if there's an uncaught async panic, the stderr is invisible/buried at package-level and possibly lost. There's no trace of it in artifacts.

I'm mildly in favor of keeping (and merging the PR) if there's a way to make the test runner pick up the pieces when there's a package-level failure and tell us just a bit about where to look.

knz · 2020-07-15T19:46:38Z

There's another thing we can do btw:

keep the previous config, which sends stderr to files.
make TestLogScope.Close() fetch the stderr from file and output it to the test log, if it runs
make the test script detect whenever a package failed without reporting a failed test -- which is thus an uncaught async panic or some other go runtime error -- and only then have the test script pull out all the *-stderr.log files from artifacts corresponding to that directory and present a list of this in the build log, for further investigation.

What do you think?

knz · 2020-07-15T19:47:45Z

(I'm particularly anxious about the stderr not being sent to artifacts in the case where the go test runner is obnoxious and also hides the test output from us, which unfortunately happens)

knz · 2020-07-15T19:48:32Z

(It happens because go test -stress captures stderr in-RAM while the test is running. IF the test process is aborted due to an uncaught panic, this in-RAM stderr output is never printed and we lose everything.)

tbg · 2020-07-16T09:35:56Z

That's expected, but what to do? We probably want to see what the issue reporter thinks about that and perhaps do something clever? At the very least, we may want to file an issue saying "please investigate this" with e.g. the name of the package that's failing.

My take-away from test-infra rotation is that we should try to be less clever in general. We have a hodgepodge of cleverness that turns the CI pipeline into hard-to-maintain magic. I basically want things to be simple and smooth for "normal" test failures using only the TC built-in Go tooling, and for "abnormal" once (like off-goroutine panics) I want to make sure it's easy enough to find the failure cause via an artifact or the build log. Unfortunately, TeamCity isn't exactly... sane here when we use JSON output (as we do and I think continue to want to do) as it tries really hard to not show us that JSON (which is why we have the testfilter -mode=convert pipe, making a legible artifact).

Let's talk about maybe_stress separately. I agree it's not good. But one at a time. You can "just" comment it out in your intentionally failing commit.

make the test script detect whenever a package failed without reporting a failed test -- which is thus an uncaught async panic or some other go runtime error -- and only then have the test script pull out all the *-stderr.log files from artifacts corresponding to that directory and present a list of this in the build log, for further investigation.

See above - my preference is adherence to the principle of least magic. It's tempting to want to fill in the missing pieces but so far that has only been done through one-off initiatives and ownership then sticks to the author.

knz · 2020-07-16T09:58:44Z

So you and I agree we should not focus too much on the case of "off-goroutine panics".

In that case, why do you so much insist on having the stderr go to the test log directly? If the log.Scope's Close method is going to run in the common case, why not have that pull out the stderr output and report it?

tbg · 2020-07-16T10:22:51Z

I looked into maybe_stress. It does not seem that broken. On failure, it will output correct JSON output and it goes through run_json_test under the hood, meaning that it... should work about as well as a failure in the main make test invocation. I'm probably missing something.

tbg · 2020-07-16T10:25:04Z

So you and I agree we should not focus too much on the case of "off-goroutine panics".

Right, I don't want to build too much magic to make them extra-pleasant to work with, but they do happen and need to be actionable. Which I think would not be the case now, since the output will be in "some" artifact but it's hard to document which one. I think it is worthwhile to have one file that has everything as this is how one debugs "locally", and since the TC build log is so unpleasant to deal with this file needs to be an artifact.

At the same time I'm also averse to magic for the common case, hence I'm not keen on letting the log scope pull conditional tricks on the artifacts.

knz · 2020-07-31T11:26:38Z

This PR is worse when a test fails in the stress pre-req in teamcity-common.

In that case, the details of the test failure are hidden by Go, and they are also lost from the artifacts directory. That's strictly worse than the other PR #51499.

To make it better, we need to remove the stress prereq.

tbg · 2020-07-31T11:43:02Z

Ah, the stress problem. Remind me again - stress is merging stderr and stdout, so the stack trace now ends up on stdout only. And go test -json will basically pick that up, but it will be as output to some test, and that test will not be marked as failed (due to the off-goroutine panic), so we lose. (If it was an on-goroutine panic though, I think things should actually work as the test is marked as failed in that case?)

knz · 2020-07-31T12:09:34Z

Remind me again - stress is merging stderr and stdout, so the stack trace now ends up on stdout only. And go test -json will basically pick that up, but it will be as output to some test, and that test will not be marked as failed (due to the off-goroutine panic), so we lose.

Sounds possible. (I don't fully understand the problem - I just see that the stress prereq fails, we have zero evidence left, and the actual test target after that never runs)

f it was an on-goroutine panic though, I think things should actually work as the test is marked as failed in that case?

I'll run that now.

tbg · 2020-07-31T12:20:41Z

Next time you see that, go to the build log, switch to VERBOSE and you'll see the raw json output. The stack trace should be there if my understanding is correct - but it is attributed to a test that is not marked as failed, so TC will not surface it anywhere.

knz · 2020-08-01T09:09:45Z

closing in favor of #52200

52200: util/log: improvements, simplifications and bug fixes r=irfansharif a=knz Fixes #52026 Fixes #52176 Fixes #52128 Fixes #51332 Supersedes #51499, #51410 For context, I was working on #51990, #51499 and #51410 when I stumbled upon bug #52128 in the logging package. I decided to clean up the code and fix that bug, and I ended up removing a lot of complexity and add many missing comments. Example CI output: ![image](https://user-images.githubusercontent.com/642886/89288431-efaedc80-d655-11ea-97d9-b096b580d951.png) By removing the complexity, I removed the problem otherwise solved by #51499 & #51410 and I removed the source of the flake causing #52026 and #52176. Salient bits in this PR: - **util/log: reinforce the behavior of log file rotation** - "This patch "does the right thing" by opening the new file before switching over whatever redirection may have been set up from the old file to the new one." - **util/log: simplify TestLogScope+don't capture internal stderr writes** - "This complexity was thus removed and the doc clarified that TestLogScope is only valid under the default configuration, when there was no file output configured yet and no secondary loggers active yet." - "However we also have other reasons to dislike the stderr redirect in tests, so I decided to remove it here instead of conjuring new code." - **pgwire: un-skip TestauthenticationAndHBARules** Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

knz requested a review from tbg July 14, 2020 10:35

tbg reviewed Jul 14, 2020

View reviewed changes

knz force-pushed the 20200714-log branch from 3aeba67 to ebef786 Compare July 14, 2020 12:11

knz requested a review from a team as a code owner July 14, 2020 12:11

knz force-pushed the 20200714-log branch from ebef786 to 164c1ab Compare July 15, 2020 18:19

knz force-pushed the 20200714-log branch from 164c1ab to dec9d36 Compare July 15, 2020 18:31

log: dummy test

88815d7

Release note (<category, see below>): <what> <show> <why>

knz force-pushed the 20200714-log branch from dec9d36 to 88815d7 Compare July 15, 2020 19:31

knz mentioned this pull request Jul 16, 2020

util: improve log.Scope runtime output handling #51409

Closed

tbg mentioned this pull request Jul 31, 2020

util/log: make test report their stderr in the test output (v2) #51499

Closed

knz mentioned this pull request Jul 31, 2020

util/log: improvements, simplifications and bug fixes #52200

Merged

knz closed this Aug 1, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

util/log: make test report their stderr in the test output #51410

util/log: make test report their stderr in the test output #51410

knz commented Jul 14, 2020

cockroach-teamcity commented Jul 14, 2020

tbg left a comment

knz commented Jul 14, 2020

knz commented Jul 15, 2020 •

edited

Loading

knz commented Jul 15, 2020

knz commented Jul 15, 2020

knz commented Jul 15, 2020

knz commented Jul 15, 2020

knz commented Jul 15, 2020

tbg commented Jul 16, 2020

knz commented Jul 16, 2020

tbg commented Jul 16, 2020

tbg commented Jul 16, 2020

knz commented Jul 31, 2020

tbg commented Jul 31, 2020

knz commented Jul 31, 2020

tbg commented Jul 31, 2020

knz commented Aug 1, 2020

util/log: make test report their stderr in the test output #51410

util/log: make test report their stderr in the test output #51410

Conversation

knz commented Jul 14, 2020

cockroach-teamcity commented Jul 14, 2020

tbg left a comment

Choose a reason for hiding this comment

knz commented Jul 14, 2020

knz commented Jul 15, 2020 • edited Loading

knz commented Jul 15, 2020

knz commented Jul 15, 2020

knz commented Jul 15, 2020

knz commented Jul 15, 2020

knz commented Jul 15, 2020

tbg commented Jul 16, 2020

knz commented Jul 16, 2020

tbg commented Jul 16, 2020

tbg commented Jul 16, 2020

knz commented Jul 31, 2020

tbg commented Jul 31, 2020

knz commented Jul 31, 2020

tbg commented Jul 31, 2020

knz commented Aug 1, 2020

knz commented Jul 15, 2020 •

edited

Loading