Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/log: make test report their stderr in the test output #51410

Closed
wants to merge 2 commits into from

Conversation

knz
Copy link
Contributor

@knz knz commented Jul 14, 2020

Fixes #51409.

Previously, a log.Scope() would capture logs from tests in the same
way as a running server; that is, by also capturing direct writes to
stderr (such as those performed by the Go runtime for uncaught panics,
or internal errors) to a separate cockroach-stderr.log.

This behavior, while desirable for running servers, is problematic for
unit testing: in that case, we prefer that uncaught panics and
possible other Go runtime errors are interleaved in the test output,
so as to automate the triage of test failures.

This patch improves the situation by carving an exception to the
default logging behavior for the purpose of log.Scope(). With this
change, direct writes to stderr by code ran under a test will go to
the test output, regardless of the logging configuration.

The few tests that are truly interested in checking that stderr writes
go to a log file can use the new constructor
log.ScopeWithoutShowLogsAndForcedStderrCapture().

For example, consider this pseudo-test:

func TestFoo(t *testing.T) {
	defer log.Scope(t).Close(t)
	t.Log("i'm t.Logging")
	fmt.Println("i'm on stdout")
	fmt.Fprintln(os.Stderr, "i'm on stderr")
	go func() {
		panic("i panicked somewhere else")
	}()
	time.Sleep(time.Second)
}

Before this patch, the output is like:

=== RUN   TestFoo
    TestFoo: test_log_scope.go:77: test logs captured to: /var/folders/yy/4q8rrssd27vdgbr59w9qbffr0000gn/T/logTestFoo927366034
    TestFoo: test_log_scope.go:58: use -show-logs to present logs inline
    TestFoo: stopper_test.go:743: i'm t.Logging
i'm on stdout
FAIL	github.com/cockroachdb/cockroach/pkg/util/stop	1.624s
FAIL

And the stderr writes + panic are going to a separate file.

With this patch, we get the following output instead:

=== RUN   TestFoo
    TestFoo: test_log_scope.go:88: test logs captured to: /tmp/tmp.v9jl56br/logTestFoo207407502
    TestFoo: test_log_scope.go:58: use -show-logs to present logs inline
    TestFoo: clog_test.go:721: i'm t.Logging
i'm on stdout
i'm on stderr
panic: i panicked somewhere else

goroutine 69 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.TestFoo.func1()
	/data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:725 +0x39
created by github.com/cockroachdb/cockroach/pkg/util/log.TestFoo
	/data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:724 +0x1db
FAIL	github.com/cockroachdb/cockroach/pkg/util/log	0.047s
FAIL

Alternatives that were considered:

  • have the deferred Close() method of the TestLogScope pull out the
    contents of cockroach-stderr.log and manually duplicate them on the
    test output.

    This solution was rejected because when a go runtime error occurs on
    a separate goroutine than the one where the TestLogScope is defined,
    the deferred don't get a chance to run.

  • use a "tee writer" for the stderr stream. There are two problems
    with this:

    • the Go runtime writes to file descriptor 2 directly, so it's not
      sufficient to override os.Stderr with a tee writer.

    • teeing writes to fd 2 would require spawning a goroutine that
      captures the writes and duplicates them, asynchronously.
      The problem here is that the go runtime promptly terminates
      the process after it completes writing to fd 2. An async
      tee goroutine would not get a chance to run and the fd 2
      output would get lost.

Release note: None

@knz knz requested a review from tbg July 14, 2020 10:35
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, that was fast.
For my indulgement, could you add the "pseudo test" as a separate commit so we can see it fail on CI? I'm mildly worried that TC will bury it even deeper now that it's not in a file. I would love to be proven wrong in which case we could simply proceed.

@knz
Copy link
Contributor Author

knz commented Jul 14, 2020

Added the dummy test.

@knz knz requested a review from a team as a code owner July 14, 2020 12:11
Previously, a `log.Scope()` would capture logs from tests in the same
way as a running server; that is, by also capturing direct writes to
stderr (such as those performed by the Go runtime for uncaught panics,
or internal errors) to a separate `cockroach-stderr.log`.

This behavior, while desirable for running servers, is problematic for
unit testing: in that case, we prefer that uncaught panics and
possible other Go runtime errors are interleaved in the test output,
so as to automate the triage of test failures.

This patch improves the situation by carving an exception to the
default logging behavior for the purpose of `log.Scope()`. With this
change, direct writes to stderr by code ran under a test will go to
the test output, regardless of the logging configuration.

The few tests that are truly interested in checking that stderr writes
go to a log file can use the new constructor
`log.ScopeWithoutShowLogsAndForcedStderrCapture()`.

For example, consider this pseudo-test:

```go
func TestFoo(t *testing.T) {
	defer log.Scope(t).Close(t)
	t.Log("i'm t.Logging")
	fmt.Println("i'm on stdout")
	fmt.Fprintln(os.Stderr, "i'm on stderr")
	go func() {
		panic("i panicked somewhere else")
	}()
	time.Sleep(time.Second)
}
```

Before this patch, the output is like:

```
=== RUN   TestFoo
    TestFoo: test_log_scope.go:77: test logs captured to: /var/folders/yy/4q8rrssd27vdgbr59w9qbffr0000gn/T/logTestFoo927366034
    TestFoo: test_log_scope.go:58: use -show-logs to present logs inline
    TestFoo: stopper_test.go:743: i'm t.Logging
i'm on stdout
FAIL	github.com/cockroachdb/cockroach/pkg/util/stop	1.624s
FAIL
```

And the stderr writes + panic are going to a separate file.

With this patch, we get the following output instead:

```
=== RUN   TestFoo
    TestFoo: test_log_scope.go:88: test logs captured to: /tmp/tmp.v9jl56br/logTestFoo207407502
    TestFoo: test_log_scope.go:58: use -show-logs to present logs inline
    TestFoo: clog_test.go:721: i'm t.Logging
i'm on stdout
i'm on stderr
panic: i panicked somewhere else

goroutine 69 [running]:
github.com/cockroachdb/cockroach/pkg/util/log.TestFoo.func1()
	/data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:725 +0x39
created by github.com/cockroachdb/cockroach/pkg/util/log.TestFoo
	/data/home/kena/src/go/src/github.com/cockroachdb/cockroach/pkg/util/log/clog_test.go:724 +0x1db
FAIL	github.com/cockroachdb/cockroach/pkg/util/log	0.047s
FAIL
```

Alternatives that were considered:

- have the deferred `Close()` method of the TestLogScope pull out the
  contents of `cockroach-stderr.log` and manually duplicate them on the
  test output.

  This solution was rejected because when a go runtime error occurs on
  a separate goroutine than the one where the TestLogScope is defined,
  the deferred don't get a chance to run.

- use a "tee writer" for the stderr stream. There are two problems
  with this:

  - the Go runtime writes to file descriptor 2 directly, so it's not
    sufficient to override `os.Stderr` with a tee writer.

  - teeing writes to fd 2 would require spawning a goroutine that
    captures the writes and duplicates them, asynchronously.
    The problem here is that the go runtime promptly terminates
    the process after it completes writing to fd 2. An async
    tee goroutine would not get a chance to run and the fd 2
    output would get lost.

Release note: None
@knz
Copy link
Contributor Author

knz commented Jul 15, 2020

@tbg this PR is revealing 3 separate problems which you may want to address separately:

  • as you expected, the test fails but because the test process terminates early (async uncaught panic), it does not output the text "FAIL TestFoo". There's only a package fail.

    That's expected, but what to do? We probably want to see what the issue reporter thinks about that and perhaps do something clever? At the very least, we may want to file an issue saying "please investigate this" with e.g. the name of the package that's failing.

  • however we don't even get that far. The first thing that teamcity-test.sh invokes is the teamcity-support.sh, which in turns runs a Maybe stress test target. That thing runs a stress check on the package that were changed ahead of running make test. Unfortunately, go test -stress is extremely unhelpful as it completely hides the details of failed tests. The result is that we see the Maybe stress test target fail ahead of the verbose test run and there's no indication whatsoever of where/why the test CI target failed, not even in the build log.

    Example runs:

    https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Test/2088623

    https://teamcity.cockroachdb.com/buildConfiguration/Cockroach_UnitTests_Test/2091961

  • the logs from the "Maybe stress test" targets are not even in artifacts!! 😭

My recommendation at this point is twofold:

  1. amend teamcity-support.sh as follows:
  • modify the call to go test -stress (maybe inside github-pull-request-make? dunno) to send its output to a log file in artifacts then preserve the exit code if there's a failure.

    However, do not stop teamcity-support.sh if that stage fails and instead let the caller script proceed.

  • after the remainder of the caller script, if there's no test failure in the main CI target but there was one in stress, force the exit code to be 1 and tell the build log about the fact that the CI target is failing due to stress.

  1. consider not running "Maybe stress test" as a prerequisite to all the teamcity-xxx.sh scripts, and instead run it as a separate TC target.

@knz
Copy link
Contributor Author

knz commented Jul 15, 2020

@tbg in light of my last comment, I'm also adding another non-panicy dummy test to demonstrate how the stderr output is reported in the build log.

Release note (<category, see below>): <what> <show> <why>
@knz
Copy link
Contributor Author

knz commented Jul 15, 2020

Look at this beauty:

image

Given this works, we now have the following:

  • without this PR:

    • regardless of the failure mode, the stderr/panic goes to a file. The user must manually navigate to artifacts.
  • with this PR:

    • if the test fails, or panics synchronously, the stderr is in the build log and the TC UI is nice and practical.

    • if there's an uncaught async panic, the stderr is invisible/buried at package-level and possibly lost. There's no trace of it in artifacts.

I'm mildly in favor of keeping (and merging the PR) if there's a way to make the test runner pick up the pieces when there's a package-level failure and tell us just a bit about where to look.

@knz
Copy link
Contributor Author

knz commented Jul 15, 2020

There's another thing we can do btw:

  1. keep the previous config, which sends stderr to files.

  2. make TestLogScope.Close() fetch the stderr from file and output it to the test log, if it runs

  3. make the test script detect whenever a package failed without reporting a failed test -- which is thus an uncaught async panic or some other go runtime error -- and only then have the test script pull out all the *-stderr.log files from artifacts corresponding to that directory and present a list of this in the build log, for further investigation.

What do you think?

@knz
Copy link
Contributor Author

knz commented Jul 15, 2020

(I'm particularly anxious about the stderr not being sent to artifacts in the case where the go test runner is obnoxious and also hides the test output from us, which unfortunately happens)

@knz
Copy link
Contributor Author

knz commented Jul 15, 2020

(It happens because go test -stress captures stderr in-RAM while the test is running. IF the test process is aborted due to an uncaught panic, this in-RAM stderr output is never printed and we lose everything.)

@tbg
Copy link
Member

tbg commented Jul 16, 2020

That's expected, but what to do? We probably want to see what the issue reporter thinks about that and perhaps do something clever? At the very least, we may want to file an issue saying "please investigate this" with e.g. the name of the package that's failing.

My take-away from test-infra rotation is that we should try to be less clever in general. We have a hodgepodge of cleverness that turns the CI pipeline into hard-to-maintain magic. I basically want things to be simple and smooth for "normal" test failures using only the TC built-in Go tooling, and for "abnormal" once (like off-goroutine panics) I want to make sure it's easy enough to find the failure cause via an artifact or the build log. Unfortunately, TeamCity isn't exactly... sane here when we use JSON output (as we do and I think continue to want to do) as it tries really hard to not show us that JSON (which is why we have the testfilter -mode=convert pipe, making a legible artifact).

Let's talk about maybe_stress separately. I agree it's not good. But one at a time. You can "just" comment it out in your intentionally failing commit.

make the test script detect whenever a package failed without reporting a failed test -- which is thus an uncaught async panic or some other go runtime error -- and only then have the test script pull out all the *-stderr.log files from artifacts corresponding to that directory and present a list of this in the build log, for further investigation.

See above - my preference is adherence to the principle of least magic. It's tempting to want to fill in the missing pieces but so far that has only been done through one-off initiatives and ownership then sticks to the author.

@knz
Copy link
Contributor Author

knz commented Jul 16, 2020

So you and I agree we should not focus too much on the case of "off-goroutine panics".

In that case, why do you so much insist on having the stderr go to the test log directly? If the log.Scope's Close method is going to run in the common case, why not have that pull out the stderr output and report it?

@tbg
Copy link
Member

tbg commented Jul 16, 2020

I looked into maybe_stress. It does not seem that broken. On failure, it will output correct JSON output and it goes through run_json_test under the hood, meaning that it... should work about as well as a failure in the main make test invocation. I'm probably missing something.

@tbg
Copy link
Member

tbg commented Jul 16, 2020

So you and I agree we should not focus too much on the case of "off-goroutine panics".

Right, I don't want to build too much magic to make them extra-pleasant to work with, but they do happen and need to be actionable. Which I think would not be the case now, since the output will be in "some" artifact but it's hard to document which one. I think it is worthwhile to have one file that has everything as this is how one debugs "locally", and since the TC build log is so unpleasant to deal with this file needs to be an artifact.

At the same time I'm also averse to magic for the common case, hence I'm not keen on letting the log scope pull conditional tricks on the artifacts.

@knz
Copy link
Contributor Author

knz commented Jul 31, 2020

This PR is worse when a test fails in the stress pre-req in teamcity-common.

In that case, the details of the test failure are hidden by Go, and they are also lost from the artifacts directory. That's strictly worse than the other PR #51499.

To make it better, we need to remove the stress prereq.

@tbg
Copy link
Member

tbg commented Jul 31, 2020

Ah, the stress problem. Remind me again - stress is merging stderr and stdout, so the stack trace now ends up on stdout only. And go test -json will basically pick that up, but it will be as output to some test, and that test will not be marked as failed (due to the off-goroutine panic), so we lose. (If it was an on-goroutine panic though, I think things should actually work as the test is marked as failed in that case?)

@knz
Copy link
Contributor Author

knz commented Jul 31, 2020

Remind me again - stress is merging stderr and stdout, so the stack trace now ends up on stdout only. And go test -json will basically pick that up, but it will be as output to some test, and that test will not be marked as failed (due to the off-goroutine panic), so we lose.

Sounds possible. (I don't fully understand the problem - I just see that the stress prereq fails, we have zero evidence left, and the actual test target after that never runs)

f it was an on-goroutine panic though, I think things should actually work as the test is marked as failed in that case?

I'll run that now.

@tbg
Copy link
Member

tbg commented Jul 31, 2020

Next time you see that, go to the build log, switch to VERBOSE and you'll see the raw json output. The stack trace should be there if my understanding is correct - but it is attributed to a test that is not marked as failed, so TC will not surface it anywhere.

@knz
Copy link
Contributor Author

knz commented Aug 1, 2020

closing in favor of #52200

@knz knz closed this Aug 1, 2020
craig bot pushed a commit that referenced this pull request Aug 6, 2020
52200: util/log: improvements, simplifications and bug fixes r=irfansharif a=knz

Fixes #52026
Fixes #52176
Fixes #52128 
Fixes #51332 
Supersedes #51499, #51410 

For context, I was working on #51990, #51499 and #51410  when I stumbled upon bug #52128 in the logging package. I decided to clean up the code and fix that bug, and I ended up removing a lot of complexity and add many missing comments.

Example CI output:
![image](https://user-images.githubusercontent.com/642886/89288431-efaedc80-d655-11ea-97d9-b096b580d951.png)


By removing the complexity, I removed the problem otherwise solved by  #51499 & #51410 and  I removed the source of the flake causing #52026 and #52176.

Salient bits in this PR:

- **util/log: reinforce the behavior of log file rotation** - "This patch "does the right thing" by opening the new file before switching over whatever redirection may have been set up from the old file to the new one."
- **util/log: simplify TestLogScope+don't capture internal stderr writes** 
  - "This complexity was thus removed and the doc clarified that TestLogScope is only valid under the default configuration, when there was no file output configured yet and no secondary loggers active yet."
  - "However we also have other reasons to dislike the stderr redirect in tests, so I decided to remove it here instead of conjuring new code."
- **pgwire: un-skip TestauthenticationAndHBARules**


Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

util: improve log.Scope runtime output handling
3 participants