Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/log: improvements, simplifications and bug fixes #52200

Merged
merged 6 commits into from
Aug 6, 2020

Conversation

knz
Copy link
Contributor

@knz knz commented Jul 31, 2020

Fixes #52026
Fixes #52176
Fixes #52128
Fixes #51332
Supersedes #51499, #51410

For context, I was working on #51990, #51499 and #51410 when I stumbled upon bug #52128 in the logging package. I decided to clean up the code and fix that bug, and I ended up removing a lot of complexity and add many missing comments.

Example CI output:
image

By removing the complexity, I removed the problem otherwise solved by #51499 & #51410 and I removed the source of the flake causing #52026 and #52176.

Salient bits in this PR:

  • util/log: reinforce the behavior of log file rotation - "This patch "does the right thing" by opening the new file before switching over whatever redirection may have been set up from the old file to the new one."
  • util/log: simplify TestLogScope+don't capture internal stderr writes
    • "This complexity was thus removed and the doc clarified that TestLogScope is only valid under the default configuration, when there was no file output configured yet and no secondary loggers active yet."
    • "However we also have other reasons to dislike the stderr redirect in tests, so I decided to remove it here instead of conjuring new code."
  • pgwire: un-skip TestauthenticationAndHBARules

@knz knz requested review from tbg and irfansharif July 31, 2020 19:15
@knz knz requested a review from a team as a code owner July 31, 2020 19:15
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Jul 31, 2020

I recommend reviewing commit by commit

craig bot pushed a commit that referenced this pull request Aug 3, 2020
52201: util/log: prevent ReportPanic() from swallowing panics in some cases r=andreimatei a=knz

Accompanies #52200 (but is an independent change)

Prior to this patch, if `ReportPanic()` was called and the panic
object ultimately discarded (ie. caught) during tests that were not otherwise using
`TestLogScope`, the panic object would be lost. I believe this was directly noticed by @andreimatei prior.

I also believe this was causing panic objects to disappear quite a lot in fact,
because the test runner *also* catches panics and fails to report
them adequately in some cases (e.g. when the panic occurs during a
`stress` run).

To alleviate the situation, this patch removes some logic previously
present in the code that was attempting (and failing) to remove
duplicate panic prints. That logic was misdesigned to start with anyway
(by me), because it was working under the assumption that
`ReportPanic()` was *only* called for uncaught panics, and another
function `RecoverAndReportNonfatalPanic()` just before that was
blatantly violating that assumption.

Removing the logic causes reportable panics to always be reported in
logs and on stderr, regardless of whether it's caught or not. This
provides more guarantees that the panic object will be seen, at the
expense of having a duplicate print in some edge cases.

Release note: None

52250: sql: use a string builder in decodeCopy r=mjibson a=mjibson

For returning strings, a string builder is better than a bytes buffer.

```
name                  old time/op    new time/op    delta
DecodeCopySimple-12     10.8ns ± 1%    10.6ns ± 1%   -2.03%  (p=0.024 n=5+5)
DecodeCopyEscaped-12     249ns ± 1%     178ns ± 2%  -28.71%  (p=0.008 n=5+5)

name                  old alloc/op   new alloc/op   delta
DecodeCopySimple-12      0.00B          0.00B          ~     (all equal)
DecodeCopyEscaped-12     96.0B ± 0%     40.0B ± 0%  -58.33%  (p=0.008 n=5+5)

name                  old allocs/op  new allocs/op  delta
DecodeCopySimple-12       0.00           0.00          ~     (all equal)
DecodeCopyEscaped-12      2.00 ± 0%      2.00 ± 0%     ~     (all equal)
```

Release note: None

52254: roachtest: bump PredecessorVersion(20.2) to 20.1.4 r=RaduBerinde a=RaduBerinde

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
Co-authored-by: Matt Jibson <matt.jibson@gmail.com>
Co-authored-by: Radu Berinde <radu@cockroachlabs.com>
@knz
Copy link
Contributor Author

knz commented Aug 4, 2020

it's green in CI 🎉

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, and it makes sense to me at a high level (thanks for breaking up the commits and the guiding comments, made it much easier than it could have been), but I'll admit that if you had intentionally added bugs in r4 and r6 I would not have caught them.

Reviewed 4 of 4 files at r1, 3 of 3 files at r2, 2 of 2 files at r3, 11 of 12 files at r4, 1 of 1 files at r5, 5 of 5 files at r6.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @knz and @tbg)


pkg/cli/cli_test.go, line 445 at r4 (raw file):

	for _, cmd := range testData {
		// `demo` sets up a server and log file redirection, which asserts
		// that the logging subsystem has not been initialized yet.  Fake

[nit] double space after period.


pkg/util/log/clog.go, line 131 at r6 (raw file):

		// not only assign a different *os.File reference to
		// os.Stderr. This is because the Go runtime hardcodes stderr writes
		// as writes to file descriptor 2 and disregards the value of

Out of curiosity, why does it do that?


pkg/util/log/flags.go, line 157 at r4 (raw file):

var redactableLogsRequested bool

// TestingResetActive clears the active bit. This is for use int tests

s/int/in


pkg/util/log/flags.go, line 158 at r4 (raw file):

// TestingResetActive clears the active bit. This is for use int tests
// that call stderr redirection alongside other tests that use

s/call/use?


pkg/util/log/stderr_redirect.go, line 112 at r6 (raw file):

	defer l.mu.Unlock()
	if !l.mu.redirectInternalStderrWrites {
		const basemsg = "can't relinquish stderr writes - this logger is not owner%s"

s/not owner/not the owner?


pkg/util/log/test_log_scope.go, line 55 at r4 (raw file):

// like ScopeWithoutShowLogs().
//
// The the documentation of ScopeWithoutShowLogs() for API usage and

s/The/See

craig bot pushed a commit that referenced this pull request Aug 5, 2020
52221: cli: make fewer assumptions about the origin of `stderr` in tests r=irfansharif a=knz

This further removes a dependency on the knowledge that the `log`
package knows anything about stderr, in CLI tests.

Found this while working on #52200.

Release note: None

Co-authored-by: Raphael 'kena' Poss <knz@thaumogen.net>
knz added 6 commits August 5, 2020 19:50
... for clarity while reading the code.

Release note: None
The functions that manage the status of file buffers
are all in `sync_buffer.go` now.

Release note: None
Prior to this patch, the file logging code would close the current log
file before opening a new one.

This was problematic, because if opening a new file would fail, there
was nothing left over to work with. In particular, if the following
conditions all held:

- file logging was enabled
- internal stderr writes (direct writes to stderr) were redirected
  to files
- some error would occur asynchronously, or a panic initiated
  inside the Go runtime

then the details of the error condition would be written to a
now-closed file, and lost.

This patch "does the right thing" by opening the new file before
switching over whatever redirection may have been set up from the old
file to the new one.

This incidentally brings us one step closer to reducing the scope of
the logger mutex, because with the newer logic it becomes safe(er) to
let writes happen asynchronously with the file rotation logic.

The commit also adds a few clarifying comments for the existing
functions.

Release note: None
Prior to this patch, TestLogScope attempted to be too clever for its
own good:

- it was trying to "do its work" by saving/restoring arbitrarily
  complex logging configurations.
- it was also setting up redirection of stderr writes by the Go
  runtime (direct writes to fd 2 / os.Stderr, like what
  servers do).

The first point above was complex because of the existence of
secondary loggers. The code was also incorrect because it did not
properly restore the state of secondary loggers, if they differed from
the main logger.

The code was also inadequate/surprising because it would take over
file logging even if a surrounding test had logging redirected
to files already. This is undesirable because, arguably, if testing
code has set up file logging already (this is not the default) then
the original files are the desired destination for log events. It's
not clear that TestLogScope legitimately can take that over.

This complexity was thus removed and the doc clarified that
TestLogScope is only valid under the default configuration, when there
was no file output configured yet and no secondary loggers active yet.

The second point above (internal stderr redirect) was overly complex
because the TestLogScope.Close() method was trying to "restore"
whatever was there when the scope was open, using a stack-like state
preservation structure. The problem here is that this code was
invalid; the stderr redirection initialization in
`SetupRedactionAndStderrRedirects()` really does not work well in
recursive contexts.

After removing the first point of complexity above (removing the
stack-like behavior of TestLogScope), it could be theoretically
possible to preserve the redirect of internal stderr writes; without
the complexity of state preservation the implementation would likely
be simpler. However we also have other reasons to dislike the
stderr redirect in tests, so I decided to remove it here instead
of conjuring new code.

Finally, this patch removes the NoFileTestLogScope which was only good
for the `demo` code. Instead of creating API complexity which many
readers are bound to find and read about it, this patch transfers
the complexity to the two unit tests that need it.

Release note: None
Now that TestLogScope is not too clever any more, this test can work
properly without error.

Release note: None
Prior to this patch, `closeFileLocked()` would inconditionally reset
the redirection of internal stderr writes back to the original
stderr stream (`OrigStderr`).

This was incorrect, because of the existence of secondary loggers.

To understand why, consider a setup where client code wants
internal stderr writes to be redirected to a logger B, while
a non-stderr A logger is active. Then the following sequence can
occur:

- A creates a file.
- B creates a file, and redirects internal stderr writes to it.
  At this point, the behavior is as expected: stderr writes are
  redirected.
- now the code causes A.closeFileLocked() to be called.

This causes the call on A to cancel the effects of the redirects on
logger B - that is, unwanted "action at a distance".

To correct this, this patch only restores the stderr redirect
if the logger was also owning the redirection to start with.

Furthermore, the entire idea of "redirecting stderr writes to a file"
would be a little confusing if two or more loggers had the boolean set
to do it. In that case, two loggers could non-deterministically
compete to capture stderr when rotating their files alternatively.

In order to clarify the API and prevent client code from
accidentally relying on this poorly-defined behavior, the patch adds a
new method `takeOverInternalStderr()` which enforces the invariant
that at most one logger may take over stderr.

Finally, for context, be it known that none of the irregularities
fixed by this patch are currently impacting the remainder of the
CockroachDB code base: `closeFileLocked()` is only called upon
`TestLogScope.Close()`, and the previous patches in this sequence
have ensured that no stderr is being redirected in the context
of a TestLogScope. This patch thus only exists for the benefit
of future uses of the API, or readers of the code.

Release note: None
Copy link
Contributor Author

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll admit that if you had intentionally added bugs in r4 and r6 I would not have caught them.

There's an authority you have as reviewer: that to draw the line and say "I want to break this further up and/or I want more unit tests to demonstrate what is being done" before you approve the PR.

Do you want to exercise this authority in this case?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)


pkg/cli/cli_test.go, line 445 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

[nit] double space after period.

Done.


pkg/util/log/clog.go, line 131 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

Out of curiosity, why does it do that?

Because os.Stderr is an *os.File, and *os.File is a struct, and so the runtime would need to issue a memory barrier before it can access it, and it's not safe to issue a memory barrier while reporting an error happening e.g. during heap GC or while recovering from a segfault.


pkg/util/log/flags.go, line 157 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/int/in

Done.


pkg/util/log/flags.go, line 158 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/call/use?

Done.


pkg/util/log/stderr_redirect.go, line 112 at r6 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/not owner/not the owner?

the message would remain the same if there were multiple owners


pkg/util/log/test_log_scope.go, line 55 at r4 (raw file):

Previously, irfansharif (irfan sharif) wrote…

s/The/See

Done.

Copy link
Contributor

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All's fair in love, war, and testing code - @RaduBerinde.

I'm ok with this PR as is.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @irfansharif and @tbg)

@knz
Copy link
Contributor Author

knz commented Aug 6, 2020

thanks!

bors r=irfansharif

@craig
Copy link
Contributor

craig bot commented Aug 6, 2020

Build succeeded:

@craig craig bot merged commit 62b11e1 into cockroachdb:master Aug 6, 2020
@knz knz deleted the 20200731-log-stderr branch August 6, 2020 09:39
@tbg
Copy link
Member

tbg commented Aug 6, 2020 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
4 participants