Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

util/log: don't panic #17871

Merged
merged 1 commit into from
Sep 5, 2017
Merged

util/log: don't panic #17871

merged 1 commit into from
Sep 5, 2017

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Aug 23, 2017

Previously, log.outputLogEntry could panic while holding the log mutex.
This would deadlock any goroutine that logged while recovering from the
panic, which is approximately all of the recover routines. Most
annoyingly, the crash reporter would deadlock, swallowing the cause of
the panic.

Avoid panicking while holding the log mutex and use l.exit instead,
which exists for this very purpose. In the process, enforce the
invariant that l.mu is held when l.exit is called. (The previous
behavior was, in fact, incorrect, as l.flushAll should not be called
without holding l.mu.)

Also add a Tcl test to ensure this doesn't break in the future.

@benesch benesch requested review from danhhz and tbg August 23, 2017 21:07
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@benesch
Copy link
Contributor Author

benesch commented Aug 23, 2017

Wait, this will also deadlock because the mutex is already held when l.exit() is called. Grr.

@knz
Copy link
Contributor

knz commented Aug 24, 2017

I think this should be backported to 1.0.6 if we release one.

@tbg
Copy link
Member

tbg commented Aug 24, 2017

I think we'll release one, given the recent bugs that we fixed.

@tbg
Copy link
Member

tbg commented Aug 24, 2017

@benesch Imk when I can take a look.

@benesch benesch force-pushed the log-defer-unlock branch 2 times, most recently from db3909e to edbd714 Compare August 31, 2017 19:44
@benesch benesch requested a review from a team as a code owner August 31, 2017 19:44
@benesch benesch requested a review from knz August 31, 2017 19:45
@benesch
Copy link
Contributor Author

benesch commented Aug 31, 2017

@tschottdorf @knz mind taking another look today? Turns out this was more broken than I thought.

@benesch benesch changed the title util/log: ensure log mutex is unlocked after panic util/log: don't panic Aug 31, 2017
@benesch
Copy link
Contributor Author

benesch commented Aug 31, 2017

Specifically, @petermattis made l.exit acquire l.mu in 9fcfe5e, and @tschottdorf fixed the deadlock this caused in ed6a100 by unlocking l.mu before calling l.exit.

The fix I've applied here is to instead require that l.exit's caller acquire l.mu. This was both the original expectation when @tschottdorf imported clog in b0d884b and makes l.outputLogEntry much cleaner because it can defer the unlock of l.mu.

@knz
Copy link
Contributor

knz commented Aug 31, 2017

This looks good, but I am a bit sad about introducing a defer in the logging hot path. We've tried hard to avoid that until now. Tobias what do you think?


Reviewed 1 of 2 files at r1.
Review status: 1 of 2 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@benesch
Copy link
Contributor Author

benesch commented Aug 31, 2017

Ha, I had no idea defer was slow! I only added the defer to protect against someone adding a panic, since it's very confusing to panic while holding a lock unless you defer the unlock. But it's not strictly necessary now that I've removed all the panics from the code path.


Review status: 1 of 2 files reviewed at latest revision, all discussions resolved, some commit checks failed.


Comments from Reviewable

@petermattis
Copy link
Collaborator

Review status: 1 of 2 files reviewed at latest revision, 3 unresolved discussions, some commit checks failed.


pkg/util/log/clog.go, line 713 at r1 (raw file):

	// TODO(tschottdorf): this is a pretty horrible critical section.
	l.mu.Lock()
	defer l.mu.Unlock()

Rather than the defer, I think you should leave the explicit l.mu.Unlock on the two returns.


pkg/util/log/clog.go, line 803 at r1 (raw file):

	done := make(chan bool, 1)
	go func() {
		logging.flushAll()

This can get run without logging.mu being held. I'm not understanding this portion of the change given that timeoutFlush was only called from one place.


pkg/util/log/clog.go, line 841 at r1 (raw file):

// to stderr. It flushes the logs and exits the program; there's no point in
// hanging around. l.mu is held.
func (l *loggingT) exit(err error) {

Now that the lock is required, let's rename this exitLocked.


Comments from Reviewable

@tbg
Copy link
Member

tbg commented Sep 1, 2017 via email

@benesch
Copy link
Contributor Author

benesch commented Sep 1, 2017

@tschottdorf, ack. @petermattis @knz PTAL.


Review status: 1 of 2 files reviewed at latest revision, 3 unresolved discussions, some commit checks pending.


pkg/util/log/clog.go, line 713 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Rather than the defer, I think you should leave the explicit l.mu.Unlock on the two returns.

Done.


pkg/util/log/clog.go, line 803 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

This can get run without logging.mu being held. I'm not understanding this portion of the change given that timeoutFlush was only called from one place.

It was changed to work with the defer l.mu.Unlock(); not necessary now. Reverted.


pkg/util/log/clog.go, line 841 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Now that the lock is required, let's rename this exitLocked.

Done.


Comments from Reviewable

@petermattis
Copy link
Collaborator

:lgtm:


Review status: 1 of 2 files reviewed at latest revision, all discussions resolved, some commit checks pending.


Comments from Reviewable

@benesch benesch force-pushed the log-defer-unlock branch 2 times, most recently from 3e89c81 to f0c4fa4 Compare September 5, 2017 13:55
Previously, log.outputLogEntry could panic while holding the log mutex.
This would deadlock any goroutine that logged while recovering from the
panic, which is approximately all of the recover routines. Most
annoyingly, the crash reporter would deadlock, swallowing the cause of
the panic.

Avoid panicking while holding the log mutex and use l.exit instead,
which exists for this very purpose. In the process, enforce the
invariant that l.mu is held when l.exit is called. (The previous
behavior was, in fact, incorrect, as l.flushAll should not be called
without holding l.mu.)

Also add a Tcl test to ensure this doesn't break in the future.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants