-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
pkg/util/log: unify explicit flush of network and file sinks #102772
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Two thoughts come to mind:
- the function could be called just
log.Flush
, like it was originally, and this would be easier on the eye. - we probably still have excessive calls to the
Sync()
method on file sinks.
Combining Sync() with Flush() calls is very ancient practice in crdb, and predates even my time. At the very beginning I didn't understand it; but then at some point Ben and I acknowledged it was excessive (see the commit message for 6a1f010). Yet, at the time, we still kept quite a few Sync calls, like the one you're looking at now.
In hindsight, we learned that we only need to call Sync()
for file sinks with the "audit" flag set, which means "tamper-resistance" -- where we want the data to reach disk even if the computer is switched off quickly after a log write. We don't need it in other cases -- the OS will sync the data eventually and this is good enough, even for concurrent access to the log file by another process on the machine (which also uses OS buffers).
Removing the unnecessary calls to Sync() would be a good investment, as it would reduce IOPS pressure generally.
I do not necessarily recommend doing this in this PR, but I think this should be queued as a followup cleanup task.
Reviewed 37 of 37 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
pkg/util/log/log_flush.go
line 62 at r1 (raw file):
// destination sink anyway (there's a good chance we're flushing // as part of handling a panic). Display the error. fmt.Printf("Error draining buffered log sink: %v\n", err)
make this at least fmt.Fprintf(OrigStderr)
and if possible use the logging format.
pkg/util/log/log_flush.go
line 70 at r1 (raw file):
case <-doneCh: case <-timer.C: fmt.Printf("Timed out draining buffered log sink: %T\n", bs.child)
ditto
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
After some further testing, I noticed that making the calls to output()
on the buffered sinks with forceSync: true
in separate goroutines, in order to introduce a timeout, creates some challenges.
log.Flush()
is called in many places throughout the code (which I agree, we should clean up). This causes us to make output calls on the buffered log sinks in separate goroutines. Because of this, it's possible for one goroutine to call output()
with forceSync: true
on a buffered log sink before the previous output call has completed. This triggers a panic within the buffered log sink, which I think is justified.
I think that leaves us with a few options.
- Get rid of the separate goroutines to make the
output
calls, and do away with the timeout. - Introduce some additional mechanism to prevent
output
calls if a flush is already in progress. - Settle for the state of the original PR pkg/util/log: flush bufferedSinks as part of crash reporter process #101562 and keep separate functions for flushing all buffered sinks, vs. only flushing the file sinks.
@knz I'm inclined to options #1 or #3, as I don't want to introduce additional complexity here. However, please let me know if you have any other ideas!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
My suggestion would be to:
|
8466bd1
to
7022ca5
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@knz thank you for the suggestion - it works well. PTAL when you have some time.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/util/log/log_flush.go
line 62 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
make this at least
fmt.Fprintf(OrigStderr)
and if possible use the logging format.
I was able to make fmt.Fprintf(OrigStderr)
work without issue.
However, manually constructing an unstructured entry and attempting to write it to the logger's testingFd2CaptureLogger
didn't seem to work the way I would have liked, as I didn't see the logs output anywhere useful.
Let me know if you think I'm using this incorrectly.
pkg/util/log/log_flush.go
line 70 at r1 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
ditto
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 38 of 38 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @dhartunian)
pkg/util/log/log_flush.go
line 73 at r2 (raw file):
// destination sink anyway (there's a good chance we're flushing // as part of handling a panic). Display the error. fmt.Fprintf(OrigStderr, "Error draining buffered log sink: %v\n", err)
nit: errors do not start with a capital letter. Ditto below.
Ref: cockroachdb#101562 both for files *and* network sinks, such as fluent-servers. I received some feedback that we shouldn't divide these flush operations based on sink type, and instead we should unify the flush operation to flush both (as the crash reporter already does). The existing function to flush just the file sinks is quite widely used. Introducing flushing of network sinks begs the question, "What if this adds to the runtime of code using this explicit flush mechanism?" Well, the existing FlushFileSinks calls fsync() [1] with F_FULLFSYNC [2]. IIUC, this means that it already is a blocking operation as the fsync() call will wait for the buffered data to be written to permanent storage (not just the disk cache). Given that, I think any caller of this function already assumes it's a blocking operation and therefore should be tolerant of that. [1]: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fsync.2.html [2]: https://developer.apple.com/library/archive/documentation/System/Conceptual/ManPages_iPhoneOS/man2/fcntl.2.html#//apple_ref/doc/man/2/fcntl Nonetheless, we implement a timeout mechanism for the flushing of the buffered network sinks as an added protection.
7022ca5
to
1629e11
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian and @knz)
pkg/util/log/log_flush.go
line 73 at r2 (raw file):
Previously, knz (Raphael 'kena' Poss) wrote…
nit: errors do not start with a capital letter. Ditto below.
Ack - I wasn't sure in this case since we're printing to OrigStderr
. I've switched to lower case. Thanks for the guidance 🙏
TFTR! bors r=knz |
Build succeeded: |
Ref: #101562
both for files and network sinks, such as fluent-servers.
I received some feedback that we shouldn't divide these flush operations based on sink type, and instead we should unify the flush operation to flush both (as the crash reporter already does).
The existing function to flush just the file sinks is quite widely used. Introducing flushing of network sinks begs the question, "What if this adds to the runtime of code using this explicit flush mechanism?"
Well, the existing FlushFileSinks calls fsync() 1 with F_FULLFSYNC 2. IIUC, this means that it already is a blocking operation as the fsync() call will wait for the buffered data to be written to permanent storage (not just the disk cache). Given that, I think any caller of this function already assumes it's a blocking operation and therefore should be tolerant of that.
Nonetheless, we implement a timeout mechanism for the flushing of the buffered network sinks as an added protection.
Fixes: #101369