
Reopen journald reader when journal is corrupt #26116

Closed

Conversation

@kvch (Contributor) commented on Jun 2, 2021

What does this PR do?

When a Beat encounters a corrupt journal, it reopens the journal reader.

Why is it important?

When Journalbeat has been running for a while and encounters a corrupt journal, it fills up its log with error messages. Currently, manual intervention is required to recover from this state.
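
At a high level, the change wraps the reader in a retry: when reading fails with EBADMSG (the error this PR checks for a corrupt journal), the reader is closed and a new one is opened. Below is a minimal sketch of that behaviour, not the actual Journalbeat code: openJournal, readUntilError, and the stub types are hypothetical stand-ins, and only the EBADMSG handling mirrors this PR.

package main

import (
    "errors"
    "io"
    "log"
    "syscall"
)

// journalReader is a hypothetical stand-in for the input's journald reader.
type journalReader interface {
    Close() error
}

type stubReader struct{}

func (stubReader) Close() error { return nil }

// openJournal stands in for opening the reader and seeking to the configured position.
func openJournal() (journalReader, error) { return stubReader{}, nil }

// readUntilError stands in for the read loop; it returns the error that stopped it.
func readUntilError(r journalReader) error { return io.EOF }

// run keeps reading and, on a corrupt journal (EBADMSG), reopens the reader
// instead of looping on the same error.
func run() error {
    for {
        reader, err := openJournal()
        if err != nil {
            return err
        }
        readErr := readUntilError(reader)
        reader.Close()
        if errors.Is(readErr, syscall.EBADMSG) {
            // journald rotates the corrupt file away, so a freshly opened
            // reader continues from a clean file.
            log.Println("journal is corrupt, reopening the reader")
            continue
        }
        return readErr
    }
}

func main() {
    if err := run(); err != nil && !errors.Is(err, io.EOF) {
        log.Fatal(err)
    }
}

The actual change is only the EBADMSG branch, shown in the review diff further down.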

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Closes #23627

@kvch added the Journalbeat and Team:Elastic-Agent (Label for the Agent team) labels on Jun 2, 2021
@kvch requested a review from urso on June 2, 2021 at 16:15
@elasticmachine (Collaborator)

Pinging @elastic/agent (Team:Agent)

@botelastic (bot) added and then removed the needs_team (Indicates that the issue/PR needs a Team:* label) label on Jun 2, 2021
@kvch added the backport-v7.14.0 and backport-v7.13.0 (Automated backport with mergify) labels on Jun 2, 2021
@elasticmachine (Collaborator) commented on Jun 2, 2021

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #26116 event

  • Start Time: 2021-06-02T16:19:33.713+0000

  • Duration: 120 min 37 sec

  • Commit: 33915e1

Test stats 🧪

Test Results: 0 failed, 13930 passed, 2296 skipped, 16226 total

Trends 🧪: build time and test count trend charts (images not shown)

💚 Flaky test report

Tests succeeded.


if errors.Is(err, syscall.EBADMSG) {
    reader.Close()
    goto OPEN_JOURNAL
}

Review comment:

When a corrupted journal is reopened, does that mean a new file is generated on disk? Is some of the content in the existing file skipped?

How does the reopen affect the Seek call right after reopening? The checkpoint still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?

@kvch (Contributor, PR author) replied:

> When a corrupted journal is reopened, does that mean a new file is generated on disk? Is some of the content in the existing file skipped?

I am not sure I understand your question. When a file becomes corrupted and journald notices it, it rotates the file. Rotation means creating a new file and saving the corrupt one under a new name. The new file is empty until new log lines are written to it.

> How does the reopen affect the Seek call right after reopening? The checkpoint still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?

I have added seeking to the head of the journal just to be sure. But after further investigation with more corrupted journals, I am not sure this is the right thing to do. It seems that the reader already handles these errors and skips bad messages.

I open a journal, start tailing it, and mess with the file to cause data corruption; I get 2-3 "bad message" errors and then Journalbeat reads events as if nothing had happened. Maybe we should instead log only one "bad message" error and suppress the rest until reading returns to normal?
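
A minimal sketch of that "log one error and suppress the rest" idea, assuming a small stateful wrapper around the read loop; this is not part of the PR, and badMessageSuppressor and onRead are hypothetical names.

package sketch

import (
    "errors"
    "log"
    "syscall"
)

// badMessageSuppressor logs the first "bad message" error and stays quiet
// until a read succeeds again.
type badMessageSuppressor struct {
    suppressing bool
}

// onRead is called with the result of every read attempt.
func (s *badMessageSuppressor) onRead(err error) {
    switch {
    case err == nil:
        // Reading has normalised; the next corruption may be logged again.
        s.suppressing = false
    case errors.Is(err, syscall.EBADMSG):
        if !s.suppressing {
            log.Println("journal contains bad messages; suppressing further reports until reading recovers")
            s.suppressing = true
        }
    default:
        log.Printf("error while reading journal: %v", err)
    }
}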

Review comment:

Your goto OPEN_JOURNAL triggers an open plus a seek, with the seek based on the statestore state and the input configuration. If the input is configured to always restart collection (seek to the head), we will restart collection from the oldest available logs. If it is configured to always start from the latest available message (seek to the tail), we have now lost logs. If it is configured to continue from the 'checkpoint', we will continue from the last 'offset', which is probably what we always want here.

What I wonder is: do we need to ensure that the seek after reopening does not use the user's initial configuration?
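
One way to make that concrete, sketched under the assumption of a head/tail/cursor style seek setting; the seekMode type and effectiveSeek helper are illustrative names, not the actual input configuration.

package sketch

// seekMode mirrors a head/tail/cursor style seek configuration (illustrative only).
type seekMode int

const (
    seekHead   seekMode = iota // start from the oldest available entries
    seekTail                   // start from the newest entries only
    seekCursor                 // continue from the saved checkpoint cursor
)

// effectiveSeek honours the configured mode on a normal start; after a
// corruption-triggered reopen it prefers the checkpoint cursor, so events are
// neither re-read from the beginning nor silently skipped.
func effectiveSeek(configured seekMode, reopenedAfterCorruption bool, cursor string) seekMode {
    if reopenedAfterCorruption && cursor != "" {
        return seekCursor
    }
    return configured
}

Whether the reopen should behave like this or always honour the configuration is exactly the open question above.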

@kvch (Contributor, PR author) commented on Jun 23, 2021

Closing in favour of #26224

@kvch closed this on Jun 23, 2021
Labels
backport-v7.13.0 (Automated backport with mergify), backport-v7.14.0 (Automated backport with mergify), Journalbeat, Team:Elastic-Agent (Label for the Agent team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Journalbeat reopen reader on corrupted journal
3 participants