Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pkg/util/log: unify explicit flush of network and file sinks #102772

Merged
merged 1 commit into from
May 15, 2023

Conversation

abarganier
Copy link
Contributor

Ref: #101562

both for files and network sinks, such as fluent-servers.

I received some feedback that we shouldn't divide these flush operations based on sink type, and instead we should unify the flush operation to flush both (as the crash reporter already does).

The existing function to flush just the file sinks is quite widely used. Introducing flushing of network sinks begs the question, "What if this adds to the runtime of code using this explicit flush mechanism?"

Well, the existing FlushFileSinks calls fsync() 1 with F_FULLFSYNC 2. IIUC, this means that it already is a blocking operation as the fsync() call will wait for the buffered data to be written to permanent storage (not just the disk cache). Given that, I think any caller of this function already assumes it's a blocking operation and therefore should be tolerant of that.

Nonetheless, we implement a timeout mechanism for the flushing of the buffered network sinks as an added protection.

Fixes: #101369

@abarganier abarganier marked this pull request as ready for review May 4, 2023 17:27
@abarganier abarganier requested a review from a team May 4, 2023 17:27
@abarganier abarganier requested review from a team as code owners May 4, 2023 17:27
@abarganier abarganier requested a review from a team May 4, 2023 17:27
@abarganier abarganier requested review from a team as code owners May 4, 2023 17:27
@abarganier abarganier requested a review from a team May 4, 2023 17:27
@abarganier abarganier requested review from a team as code owners May 4, 2023 17:27
@abarganier abarganier requested a review from a team May 4, 2023 17:27
@abarganier abarganier requested a review from a team as a code owner May 4, 2023 17:27
@abarganier abarganier requested review from a team, benbardin and yuzefovich and removed request for a team May 4, 2023 17:27
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@abarganier abarganier requested review from shermanCRL and removed request for a team May 4, 2023 17:27
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two thoughts come to mind:

  1. the function could be called just log.Flush, like it was originally, and this would be easier on the eye.
  2. we probably still have excessive calls to the Sync() method on file sinks.

Combining Sync() with Flush() calls is very ancient practice in crdb, and predates even my time. At the very beginning I didn't understand it; but then at some point Ben and I acknowledged it was excessive (see the commit message for 6a1f010). Yet, at the time, we still kept quite a few Sync calls, like the one you're looking at now.

In hindsight, we learned that we only need to call Sync() for file sinks with the "audit" flag set, which means "tamper-resistance" -- where we want the data to reach disk even if the computer is switched off quickly after a log write. We don't need it in other cases -- the OS will sync the data eventually and this is good enough, even for concurrent access to the log file by another process on the machine (which also uses OS buffers).

Removing the unnecessary calls to Sync() would be a good investment, as it would reduce IOPS pressure generally.

I do not necessarily recommend doing this in this PR, but I think this should be queued as a followup cleanup task.

Reviewed 37 of 37 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


pkg/util/log/log_flush.go line 62 at r1 (raw file):

				// destination sink anyway (there's a good chance we're flushing
				// as part of handling a panic). Display the error.
				fmt.Printf("Error draining buffered log sink: %v\n", err)

make this at least fmt.Fprintf(OrigStderr)
and if possible use the logging format.


pkg/util/log/log_flush.go line 70 at r1 (raw file):

		case <-doneCh:
		case <-timer.C:
			fmt.Printf("Timed out draining buffered log sink: %T\n", bs.child)

ditto

Copy link
Contributor Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After some further testing, I noticed that making the calls to output() on the buffered sinks with forceSync: true in separate goroutines, in order to introduce a timeout, creates some challenges.

log.Flush() is called in many places throughout the code (which I agree, we should clean up). This causes us to make output calls on the buffered log sinks in separate goroutines. Because of this, it's possible for one goroutine to call output() with forceSync: true on a buffered log sink before the previous output call has completed. This triggers a panic within the buffered log sink, which I think is justified.

I think that leaves us with a few options.

  1. Get rid of the separate goroutines to make the output calls, and do away with the timeout.
  2. Introduce some additional mechanism to prevent output calls if a flush is already in progress.
  3. Settle for the state of the original PR pkg/util/log: flush bufferedSinks as part of crash reporter process #101562 and keep separate functions for flushing all buffered sinks, vs. only flushing the file sinks.

@knz I'm inclined to options #1 or #3, as I don't want to introduce additional complexity here. However, please let me know if you have any other ideas!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

@knz
Copy link
Contributor

knz commented May 8, 2023

My suggestion would be to:

  1. make an atomic bool variable that is set to true while Flush is currently running
  2. make Flush avoid execution entirely when it observes another Flush is running already

@abarganier abarganier force-pushed the log-flush-consolidate branch 2 times, most recently from 8466bd1 to 7022ca5 Compare May 10, 2023 21:19
Copy link
Contributor Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@knz thank you for the suggestion - it works well. PTAL when you have some time.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/util/log/log_flush.go line 62 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

make this at least fmt.Fprintf(OrigStderr)
and if possible use the logging format.

I was able to make fmt.Fprintf(OrigStderr) work without issue.

However, manually constructing an unstructured entry and attempting to write it to the logger's testingFd2CaptureLogger didn't seem to work the way I would have liked, as I didn't see the logs output anywhere useful.

Let me know if you think I'm using this incorrectly.


pkg/util/log/log_flush.go line 70 at r1 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

ditto

Done.

@abarganier abarganier requested a review from knz May 11, 2023 17:33
Copy link
Contributor

@knz knz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 38 of 38 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)


pkg/util/log/log_flush.go line 73 at r2 (raw file):

				// destination sink anyway (there's a good chance we're flushing
				// as part of handling a panic). Display the error.
				fmt.Fprintf(OrigStderr, "Error draining buffered log sink: %v\n", err)

nit: errors do not start with a capital letter. Ditto below.

Ref: cockroachdb#101562

both for files *and* network sinks, such as fluent-servers.

I received some feedback that we shouldn't divide these
flush operations based on sink type, and instead we should
unify the flush operation to flush both (as the crash reporter
already does).

The existing function to flush just the file sinks is
quite widely used. Introducing flushing of network sinks
begs the question, "What if this adds to the runtime of
code using this explicit flush mechanism?"

Well, the existing FlushFileSinks calls fsync() [1] with
F_FULLFSYNC [2]. IIUC, this means that it already is a
blocking operation as the fsync() call will wait for the
buffered data to be written to permanent storage (not just
the disk cache). Given that, I think any caller of this
function already assumes it's a blocking operation and
therefore should be tolerant of that.

[1]: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html
[2]: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fcntl.2.html#//apple_ref/doc/man/2/fcntl

Nonetheless, we implement a timeout mechanism for the
flushing of the buffered network sinks as an added
protection.
Copy link
Contributor Author

@abarganier abarganier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)


pkg/util/log/log_flush.go line 73 at r2 (raw file):

Previously, knz (Raphael 'kena' Poss) wrote…

nit: errors do not start with a capital letter. Ditto below.

Ack - I wasn't sure in this case since we're printing to OrigStderr. I've switched to lower case. Thanks for the guidance 🙏

@abarganier
Copy link
Contributor Author

TFTR!

bors r=knz

@craig craig bot merged commit 58fbcfb into cockroachdb:master May 15, 2023
@craig
Copy link
Contributor

craig bot commented May 15, 2023

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pkg/util/log: route panics through network sinks for SQL pods
3 participants