util/log: improvements, simplifications and bug fixes #52200

knz · 2020-07-31T19:15:51Z

Fixes #52026
Fixes #52176
Fixes #52128
Fixes #51332
Supersedes #51499, #51410

For context, I was working on #51990, #51499 and #51410 when I stumbled upon bug #52128 in the logging package. I decided to clean up the code and fix that bug, and I ended up removing a lot of complexity and add many missing comments.

Example CI output:

By removing the complexity, I removed the problem otherwise solved by #51499 & #51410 and I removed the source of the flake causing #52026 and #52176.

Salient bits in this PR:

util/log: reinforce the behavior of log file rotation - "This patch "does the right thing" by opening the new file before switching over whatever redirection may have been set up from the old file to the new one."
util/log: simplify TestLogScope+don't capture internal stderr writes
- "This complexity was thus removed and the doc clarified that TestLogScope is only valid under the default configuration, when there was no file output configured yet and no secondary loggers active yet."
- "However we also have other reasons to dislike the stderr redirect in tests, so I decided to remove it here instead of conjuring new code."
pgwire: un-skip TestauthenticationAndHBARules

cockroach-teamcity · 2020-07-31T19:16:03Z

This change is

knz · 2020-07-31T19:17:11Z

I recommend reviewing commit by commit

@andreimatei

52201: util/log: prevent ReportPanic() from swallowing panics in some cases r=andreimatei a=knz Accompanies #52200 (but is an independent change) Prior to this patch, if `ReportPanic()` was called and the panic object ultimately discarded (ie. caught) during tests that were not otherwise using `TestLogScope`, the panic object would be lost. I believe this was directly noticed by @andreimatei prior. I also believe this was causing panic objects to disappear quite a lot in fact, because the test runner *also* catches panics and fails to report them adequately in some cases (e.g. when the panic occurs during a `stress` run). To alleviate the situation, this patch removes some logic previously present in the code that was attempting (and failing) to remove duplicate panic prints. That logic was misdesigned to start with anyway (by me), because it was working under the assumption that `ReportPanic()` was *only* called for uncaught panics, and another function `RecoverAndReportNonfatalPanic()` just before that was blatantly violating that assumption. Removing the logic causes reportable panics to always be reported in logs and on stderr, regardless of whether it's caught or not. This provides more guarantees that the panic object will be seen, at the expense of having a duplicate print in some edge cases. Release note: None 52250: sql: use a string builder in decodeCopy r=mjibson a=mjibson For returning strings, a string builder is better than a bytes buffer. ``` name old time/op new time/op delta DecodeCopySimple-12 10.8ns ± 1% 10.6ns ± 1% -2.03% (p=0.024 n=5+5) DecodeCopyEscaped-12 249ns ± 1% 178ns ± 2% -28.71% (p=0.008 n=5+5) name old alloc/op new alloc/op delta DecodeCopySimple-12 0.00B 0.00B ~ (all equal) DecodeCopyEscaped-12 96.0B ± 0% 40.0B ± 0% -58.33% (p=0.008 n=5+5) name old allocs/op new allocs/op delta DecodeCopySimple-12 0.00 0.00 ~ (all equal) DecodeCopyEscaped-12 2.00 ± 0% 2.00 ± 0% ~ (all equal) ``` Release note: None 52254: roachtest: bump PredecessorVersion(20.2) to 20.1.4 r=RaduBerinde a=RaduBerinde Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net> Co-authored-by: Matt Jibson <matt.jibson@gmail.com> Co-authored-by: Radu Berinde <radu@cockroachlabs.com>

knz · 2020-08-04T13:34:59Z

it's green in CI 🎉

irfansharif

This LGTM, and it makes sense to me at a high level (thanks for breaking up the commits and the guiding comments, made it much easier than it could have been), but I'll admit that if you had intentionally added bugs in r4 and r6 I would not have caught them.

Reviewed 4 of 4 files at r1, 3 of 3 files at r2, 2 of 2 files at r3, 11 of 12 files at r4, 1 of 1 files at r5, 5 of 5 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)

pkg/cli/cli_test.go, line 445 at r4 (raw file):

	for _, cmd := range testData {
		// `demo` sets up a server and log file redirection, which asserts
		// that the logging subsystem has not been initialized yet.  Fake

[nit] double space after period.

pkg/util/log/clog.go, line 131 at r6 (raw file):

		// not only assign a different *os.File reference to
		// os.Stderr. This is because the Go runtime hardcodes stderr writes
		// as writes to file descriptor 2 and disregards the value of

Out of curiosity, why does it do that?

pkg/util/log/flags.go, line 157 at r4 (raw file):

var redactableLogsRequested bool

// TestingResetActive clears the active bit. This is for use int tests

s/int/in

pkg/util/log/flags.go, line 158 at r4 (raw file):

// TestingResetActive clears the active bit. This is for use int tests
// that call stderr redirection alongside other tests that use

s/call/use?

pkg/util/log/stderr_redirect.go, line 112 at r6 (raw file):

	defer l.mu.Unlock()
	if !l.mu.redirectInternalStderrWrites {
		const basemsg = "can't relinquish stderr writes - this logger is not owner%s"

s/not owner/not the owner?

pkg/util/log/test_log_scope.go, line 55 at r4 (raw file):

// like ScopeWithoutShowLogs().
//
// The the documentation of ScopeWithoutShowLogs() for API usage and

s/The/See

52221: cli: make fewer assumptions about the origin of `stderr` in tests r=irfansharif a=knz This further removes a dependency on the knowledge that the `log` package knows anything about stderr, in CLI tests. Found this while working on #52200. Release note: None Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>

... for clarity while reading the code. Release note: None

The functions that manage the status of file buffers are all in `sync_buffer.go` now. Release note: None

Prior to this patch, the file logging code would close the current log file before opening a new one. This was problematic, because if opening a new file would fail, there was nothing left over to work with. In particular, if the following conditions all held: - file logging was enabled - internal stderr writes (direct writes to stderr) were redirected to files - some error would occur asynchronously, or a panic initiated inside the Go runtime then the details of the error condition would be written to a now-closed file, and lost. This patch "does the right thing" by opening the new file before switching over whatever redirection may have been set up from the old file to the new one. This incidentally brings us one step closer to reducing the scope of the logger mutex, because with the newer logic it becomes safe(er) to let writes happen asynchronously with the file rotation logic. The commit also adds a few clarifying comments for the existing functions. Release note: None

Prior to this patch, TestLogScope attempted to be too clever for its own good: - it was trying to "do its work" by saving/restoring arbitrarily complex logging configurations. - it was also setting up redirection of stderr writes by the Go runtime (direct writes to fd 2 / os.Stderr, like what servers do). The first point above was complex because of the existence of secondary loggers. The code was also incorrect because it did not properly restore the state of secondary loggers, if they differed from the main logger. The code was also inadequate/surprising because it would take over file logging even if a surrounding test had logging redirected to files already. This is undesirable because, arguably, if testing code has set up file logging already (this is not the default) then the original files are the desired destination for log events. It's not clear that TestLogScope legitimately can take that over. This complexity was thus removed and the doc clarified that TestLogScope is only valid under the default configuration, when there was no file output configured yet and no secondary loggers active yet. The second point above (internal stderr redirect) was overly complex because the TestLogScope.Close() method was trying to "restore" whatever was there when the scope was open, using a stack-like state preservation structure. The problem here is that this code was invalid; the stderr redirection initialization in `SetupRedactionAndStderrRedirects()` really does not work well in recursive contexts. After removing the first point of complexity above (removing the stack-like behavior of TestLogScope), it could be theoretically possible to preserve the redirect of internal stderr writes; without the complexity of state preservation the implementation would likely be simpler. However we also have other reasons to dislike the stderr redirect in tests, so I decided to remove it here instead of conjuring new code. Finally, this patch removes the NoFileTestLogScope which was only good for the `demo` code. Instead of creating API complexity which many readers are bound to find and read about it, this patch transfers the complexity to the two unit tests that need it. Release note: None

Now that TestLogScope is not too clever any more, this test can work properly without error. Release note: None

Prior to this patch, `closeFileLocked()` would inconditionally reset the redirection of internal stderr writes back to the original stderr stream (`OrigStderr`). This was incorrect, because of the existence of secondary loggers. To understand why, consider a setup where client code wants internal stderr writes to be redirected to a logger B, while a non-stderr A logger is active. Then the following sequence can occur: - A creates a file. - B creates a file, and redirects internal stderr writes to it. At this point, the behavior is as expected: stderr writes are redirected. - now the code causes A.closeFileLocked() to be called. This causes the call on A to cancel the effects of the redirects on logger B - that is, unwanted "action at a distance". To correct this, this patch only restores the stderr redirect if the logger was also owning the redirection to start with. Furthermore, the entire idea of "redirecting stderr writes to a file" would be a little confusing if two or more loggers had the boolean set to do it. In that case, two loggers could non-deterministically compete to capture stderr when rotating their files alternatively. In order to clarify the API and prevent client code from accidentally relying on this poorly-defined behavior, the patch adds a new method `takeOverInternalStderr()` which enforces the invariant that at most one logger may take over stderr. Finally, for context, be it known that none of the irregularities fixed by this patch are currently impacting the remainder of the CockroachDB code base: `closeFileLocked()` is only called upon `TestLogScope.Close()`, and the previous patches in this sequence have ensured that no stderr is being redirected in the context of a TestLogScope. This patch thus only exists for the benefit of future uses of the API, or readers of the code. Release note: None

knz

I'll admit that if you had intentionally added bugs in r4 and r6 I would not have caught them.

There's an authority you have as reviewer: that to draw the line and say "I want to break this further up and/or I want more unit tests to demonstrate what is being done" before you approve the PR.

Do you want to exercise this authority in this case?

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)

pkg/cli/cli_test.go, line 445 at r4 (raw file):