-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reopen journald reader when journal is corrupt #26116
Conversation
Pinging @elastic/agent (Team:Agent) |
💚 Build Succeeded
Expand to view the summary
Build stats
Test stats 🧪
Trends 🧪💚 Flaky test reportTests succeeded. Expand to view the summary
Test stats 🧪
|
if errors.Is(err, syscall.EBADMSG) { | ||
reader.Close() | ||
goto OPEN_JOURNAL | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reopening a corrupted journal does it mean that a new file is generated on disk? Is some contents to be skipped in the existing file?
How does the reopen affect the Seek
call right after reopening? The checkpoint
still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When reopening a corrupted journal does it mean that a new file is generated on disk? Is some contents to be skipped in the existing file?
I am not sure I understand your question. When a file becomes corrupted and journald realizes it, it rotates the file. Rotation does mean creating a new file and saving the corrupt one with a new name. The new file is going to be empty until new log lines are written to it.
How does the reopen affect the
Seek
call right after reopening? Thecheckpoint
still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?
I have added seeking to the head of the journal just to be sure. But after further investigation with more corrupted journals I am not sure if this is the right thing to do. It seems that the reader is already handling these errors and skipping bad messages.
I open a journal, start tailing it, mess with the file to cause data corruption, I get 2-3 bad messages errors and then Journalbeat reads events as if nothing has happened. Maybe we should rather only log one bad messages error and suppress others until the reading of the events is normalized again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your goto OPEN_JOURNAL
triggers an open + a seek instructions with seek
being based the statestore state + the input configuration. If one configured the input to always restart collection on seek, we will restart collection from the oldest available logs. If one configured to always start collection from the latest available message, we have lost logs now. If one has configured to continue from the 'checkpoint', we will continue from the last 'offset', which is what we might want always.
What I wonder is: Do we need to ensure that the seek after reopening does not use the users initial configuration?
Closing in favour of #26224 |
What does this PR do?
When a Beat encounters a corrupt journal, it reopens the journal reader.
Why is it important?
When running Journalbeat for a while when it encounters a corrupt journal, it fills up the log with error messages. Now it requires manual intervention to clean up the status.
Checklist
- [ ] I have made corresponding changes to the documentation- [ ] I have made corresponding change to the default configuration files- [ ] I have added tests that prove my fix is effective or that my feature worksCHANGELOG.next.asciidoc
orCHANGELOG-developer.next.asciidoc
.Related issues
Closes #23627