pkg/util/log: unify explicit flush of network and file sinks #102772

abarganier · 2023-05-04T17:27:17Z

both for files and network sinks, such as fluent-servers.

I received some feedback that we shouldn't divide these flush operations based on sink type, and instead we should unify the flush operation to flush both (as the crash reporter already does).

The existing function to flush just the file sinks is quite widely used. Introducing flushing of network sinks begs the question, "What if this adds to the runtime of code using this explicit flush mechanism?"

Well, the existing FlushFileSinks calls fsync() 1 with F_FULLFSYNC 2. IIUC, this means that it already is a blocking operation as the fsync() call will wait for the buffered data to be written to permanent storage (not just the disk cache). Given that, I think any caller of this function already assumes it's a blocking operation and therefore should be tolerant of that.

Nonetheless, we implement a timeout mechanism for the flushing of the buffered network sinks as an added protection.

Fixes: #101369

cockroach-teamcity · 2023-05-04T17:27:28Z

This change is

knz

Two thoughts come to mind:

the function could be called just log.Flush, like it was originally, and this would be easier on the eye.
we probably still have excessive calls to the Sync() method on file sinks.

Combining Sync() with Flush() calls is very ancient practice in crdb, and predates even my time. At the very beginning I didn't understand it; but then at some point Ben and I acknowledged it was excessive (see the commit message for 6a1f010). Yet, at the time, we still kept quite a few Sync calls, like the one you're looking at now.

In hindsight, we learned that we only need to call Sync() for file sinks with the "audit" flag set, which means "tamper-resistance" -- where we want the data to reach disk even if the computer is switched off quickly after a log write. We don't need it in other cases -- the OS will sync the data eventually and this is good enough, even for concurrent access to the log file by another process on the machine (which also uses OS buffers).

Removing the unnecessary calls to Sync() would be a good investment, as it would reduce IOPS pressure generally.

I do not necessarily recommend doing this in this PR, but I think this should be queued as a followup cleanup task.

Reviewed 37 of 37 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)

pkg/util/log/log_flush.go line 62 at r1 (raw file):

				// destination sink anyway (there's a good chance we're flushing
				// as part of handling a panic). Display the error.
				fmt.Printf("Error draining buffered log sink: %v\n", err)

make this at least fmt.Fprintf(OrigStderr)
and if possible use the logging format.

pkg/util/log/log_flush.go line 70 at r1 (raw file):

		case <-doneCh:
		case <-timer.C:
			fmt.Printf("Timed out draining buffered log sink: %T\n", bs.child)

ditto

abarganier

After some further testing, I noticed that making the calls to output() on the buffered sinks with forceSync: true in separate goroutines, in order to introduce a timeout, creates some challenges.

log.Flush() is called in many places throughout the code (which I agree, we should clean up). This causes us to make output calls on the buffered log sinks in separate goroutines. Because of this, it's possible for one goroutine to call output() with forceSync: true on a buffered log sink before the previous output call has completed. This triggers a panic within the buffered log sink, which I think is justified.

I think that leaves us with a few options.

Get rid of the separate goroutines to make the output calls, and do away with the timeout.
Introduce some additional mechanism to prevent output calls if a flush is already in progress.
Settle for the state of the original PR pkg/util/log: flush bufferedSinks as part of crash reporter process #101562 and keep separate functions for flushing all buffered sinks, vs. only flushing the file sinks.

@knz I'm inclined to options #1 or #3, as I don't want to introduce additional complexity here. However, please let me know if you have any other ideas!

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

knz · 2023-05-08T20:44:53Z

My suggestion would be to:

make an atomic bool variable that is set to true while Flush is currently running
make Flush avoid execution entirely when it observes another Flush is running already

abarganier

@knz thank you for the suggestion - it works well. PTAL when you have some time.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)

pkg/util/log/log_flush.go line 62 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

make this at least fmt.Fprintf(OrigStderr)
and if possible use the logging format.

I was able to make fmt.Fprintf(OrigStderr) work without issue.

However, manually constructing an unstructured entry and attempting to write it to the logger's testingFd2CaptureLogger didn't seem to work the way I would have liked, as I didn't see the logs output anywhere useful.

Let me know if you think I'm using this incorrectly.

pkg/util/log/log_flush.go line 70 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ditto

Done.

knz

Reviewed 38 of 38 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)

pkg/util/log/log_flush.go line 73 at r2 (raw file):

				// destination sink anyway (there's a good chance we're flushing
				// as part of handling a panic). Display the error.
				fmt.Fprintf(OrigStderr, "Error draining buffered log sink: %v\n", err)

nit: errors do not start with a capital letter. Ditto below.

Ref: cockroachdb#101562 both for files *and* network sinks, such as fluent-servers. I received some feedback that we shouldn't divide these flush operations based on sink type, and instead we should unify the flush operation to flush both (as the crash reporter already does). The existing function to flush just the file sinks is quite widely used. Introducing flushing of network sinks begs the question, "What if this adds to the runtime of code using this explicit flush mechanism?" Well, the existing FlushFileSinks calls fsync() [1] with F_FULLFSYNC [2]. IIUC, this means that it already is a blocking operation as the fsync() call will wait for the buffered data to be written to permanent storage (not just the disk cache). Given that, I think any caller of this function already assumes it's a blocking operation and therefore should be tolerant of that. [1]: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html [2]: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fcntl.2.html#//apple_ref/doc/man/2/fcntl Nonetheless, we implement a timeout mechanism for the flushing of the buffered network sinks as an added protection.

abarganier

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)

pkg/util/log/log_flush.go line 73 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: errors do not start with a capital letter. Ditto below.

Ack - I wasn't sure in this case since we're printing to OrigStderr. I've switched to lower case. Thanks for the guidance 🙏

abarganier · 2023-05-15T16:27:44Z

TFTR!

bors r=knz

craig · 2023-05-15T16:58:24Z

Build succeeded:

Bazel Essential CI (Cockroach)

abarganier marked this pull request as ready for review May 4, 2023 17:27

abarganier requested a review from a team May 4, 2023 17:27

abarganier requested review from a team as code owners May 4, 2023 17:27

abarganier requested a review from a team May 4, 2023 17:27

abarganier requested review from a team as code owners May 4, 2023 17:27

abarganier requested a review from a team May 4, 2023 17:27

abarganier requested review from a team as code owners May 4, 2023 17:27

abarganier requested a review from a team May 4, 2023 17:27

abarganier requested a review from a team as a code owner May 4, 2023 17:27

abarganier requested review from a team, benbardin and yuzefovich and removed request for a team May 4, 2023 17:27

abarganier requested review from shermanCRL and removed request for a team May 4, 2023 17:27

abarganier requested review from knz and dhartunian and removed request for a team, benbardin, yuzefovich and shermanCRL May 4, 2023 17:27

abarganier mentioned this pull request May 4, 2023

release-23.1: pkg/util/log: flush bufferedSinks as part of crash reporter process #102144

Merged

knz reviewed May 4, 2023

View reviewed changes

abarganier commented May 8, 2023

View reviewed changes

abarganier force-pushed the log-flush-consolidate branch 2 times, most recently from 8466bd1 to 7022ca5 Compare May 10, 2023 21:19

abarganier commented May 10, 2023

View reviewed changes

abarganier requested a review from knz May 11, 2023 17:33

knz approved these changes May 12, 2023

View reviewed changes

abarganier force-pushed the log-flush-consolidate branch from 7022ca5 to 1629e11 Compare May 15, 2023 16:15

abarganier commented May 15, 2023

View reviewed changes

craig bot merged commit 58fbcfb into cockroachdb:master May 15, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pkg/util/log: unify explicit flush of network and file sinks #102772

pkg/util/log: unify explicit flush of network and file sinks #102772

abarganier commented May 4, 2023

cockroach-teamcity commented May 4, 2023

knz left a comment •

edited

Loading

abarganier left a comment

knz commented May 8, 2023

abarganier left a comment

knz left a comment

abarganier left a comment

abarganier commented May 15, 2023

craig bot commented May 15, 2023

pkg/util/log: unify explicit flush of network and file sinks #102772

pkg/util/log: unify explicit flush of network and file sinks #102772

Conversation

abarganier commented May 4, 2023

cockroach-teamcity commented May 4, 2023

knz left a comment • edited Loading

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

knz commented May 8, 2023

abarganier left a comment

Choose a reason for hiding this comment

knz left a comment

Choose a reason for hiding this comment

abarganier left a comment

Choose a reason for hiding this comment

abarganier commented May 15, 2023

craig bot commented May 15, 2023

knz left a comment •

edited

Loading