
Reopen journald reader when journal is corrupt #26116

Closed

Conversation

@kvch (Contributor) commented on Jun 2, 2021

What does this PR do?

When a Beat encounters a corrupt journal, it reopens the journal reader.

Why is it important?

When Journalbeat has been running for a while and encounters a corrupt journal, it fills up its log with error messages. Currently, manual intervention is required to recover from this state.
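
At a high level, the change wraps the reader in a retry: when reading fails with EBADMSG (the error this PR checks for a corrupt journal), the reader is closed and a new one is opened. Below is a minimal sketch of that behaviour, not the actual Journalbeat code: openJournal, readUntilError, and the stub types are hypothetical stand-ins, and only the EBADMSG handling mirrors this PR.

package main

import (
    "errors"
    "io"
    "log"
    "syscall"
)

// journalReader is a hypothetical stand-in for the input's journald reader.
type journalReader interface {
    Close() error
}

type stubReader struct{}

func (stubReader) Close() error { return nil }

// openJournal stands in for opening the reader and seeking to the configured position.
func openJournal() (journalReader, error) { return stubReader{}, nil }

// readUntilError stands in for the read loop; it returns the error that stopped it.
func readUntilError(r journalReader) error { return io.EOF }

// run keeps reading and, on a corrupt journal (EBADMSG), reopens the reader
// instead of looping on the same error.
func run() error {
    for {
        reader, err := openJournal()
        if err != nil {
            return err
        }
        readErr := readUntilError(reader)
        reader.Close()
        if errors.Is(readErr, syscall.EBADMSG) {
            // journald rotates the corrupt file away, so a freshly opened
            // reader continues from a clean file.
            log.Println("journal is corrupt, reopening the reader")
            continue
        }
        return readErr
    }
}

func main() {
    if err := run(); err != nil && !errors.Is(err, io.EOF) {
        log.Fatal(err)
    }
}

The actual change is only the EBADMSG branch, shown in the review diff further down.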

Checklist

  • [x] My code follows the style guidelines of this project
  • [x] I have commented my code, particularly in hard-to-understand areas
  • [ ] I have made corresponding changes to the documentation
  • [ ] I have made corresponding changes to the default configuration files
  • [ ] I have added tests that prove my fix is effective or that my feature works
  • [x] I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

Related issues

Closes #23627

@kvch added the Journalbeat and Team:Elastic-Agent (Label for the Agent team) labels on Jun 2, 2021
@kvch requested a review from urso on June 2, 2021 at 16:15
@elasticmachine (Collaborator)

Pinging @elastic/agent (Team:Agent)

@botelastic (bot) added and then removed the needs_team (Indicates that the issue/PR needs a Team:* label) label on Jun 2, 2021
@kvch added the backport-v7.14.0 and backport-v7.13.0 (Automated backport with mergify) labels on Jun 2, 2021
@elasticmachine (Collaborator) commented on Jun 2, 2021

💚 Build Succeeded


Build stats

  • Build Cause: Pull request #26116 event

  • Start Time: 2021-06-02T16:19:33.713+0000

  • Duration: 120 min 37 sec

  • Commit: 33915e1

Test stats 🧪

Test Results: 0 failed, 13930 passed, 2296 skipped, 16226 total

Trends 🧪: build time and test count trend charts (images not shown)

💚 Flaky test report

Tests succeeded.


if errors.Is(err, syscall.EBADMSG) {
    reader.Close()
    goto OPEN_JOURNAL
}

Review comment:

When a corrupted journal is reopened, does that mean a new file is generated on disk? Is some of the content in the existing file skipped?

How does the reopen affect the Seek call right after reopening? The checkpoint still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?

@kvch (Contributor, PR author) replied:

> When a corrupted journal is reopened, does that mean a new file is generated on disk? Is some of the content in the existing file skipped?

I am not sure I understand your question. When a file becomes corrupted and journald notices it, it rotates the file. Rotation means creating a new file and saving the corrupt one under a new name. The new file is empty until new log lines are written to it.

> How does the reopen affect the Seek call right after reopening? The checkpoint still points to the last reported event. Do we need to reset the checkpoint, or set another seek mode on reopen?

I have added seeking to the head of the journal just to be sure. But after further investigation with more corrupted journals, I am not sure this is the right thing to do. It seems that the reader already handles these errors and skips bad messages.

I open a journal, start tailing it, and mess with the file to cause data corruption; I get 2-3 "bad message" errors and then Journalbeat reads events as if nothing had happened. Maybe we should instead log only one "bad message" error and suppress the rest until reading returns to normal?
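
A minimal sketch of that "log one error and suppress the rest" idea, assuming a small stateful wrapper around the read loop; this is not part of the PR, and badMessageSuppressor and onRead are hypothetical names.

package sketch

import (
    "errors"
    "log"
    "syscall"
)

// badMessageSuppressor logs the first "bad message" error and stays quiet
// until a read succeeds again.
type badMessageSuppressor struct {
    suppressing bool
}

// onRead is called with the result of every read attempt.
func (s *badMessageSuppressor) onRead(err error) {
    switch {
    case err == nil:
        // Reading has normalised; the next corruption may be logged again.
        s.suppressing = false
    case errors.Is(err, syscall.EBADMSG):
        if !s.suppressing {
            log.Println("journal contains bad messages; suppressing further reports until reading recovers")
            s.suppressing = true
        }
    default:
        log.Printf("error while reading journal: %v", err)
    }
}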

Review comment:

Your goto OPEN_JOURNAL triggers an open plus a seek, with the seek based on the statestore state and the input configuration. If the input is configured to always restart collection (seek to the head), we will restart collection from the oldest available logs. If it is configured to always start from the latest available message (seek to the tail), we have now lost logs. If it is configured to continue from the 'checkpoint', we will continue from the last 'offset', which is probably what we always want here.

What I wonder is: do we need to ensure that the seek after reopening does not use the user's initial configuration?
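
One way to make that concrete, sketched under the assumption of a head/tail/cursor style seek setting; the seekMode type and effectiveSeek helper are illustrative names, not the actual input configuration.

package sketch

// seekMode mirrors a head/tail/cursor style seek configuration (illustrative only).
type seekMode int

const (
    seekHead   seekMode = iota // start from the oldest available entries
    seekTail                   // start from the newest entries only
    seekCursor                 // continue from the saved checkpoint cursor
)

// effectiveSeek honours the configured mode on a normal start; after a
// corruption-triggered reopen it prefers the checkpoint cursor, so events are
// neither re-read from the beginning nor silently skipped.
func effectiveSeek(configured seekMode, reopenedAfterCorruption bool, cursor string) seekMode {
    if reopenedAfterCorruption && cursor != "" {
        return seekCursor
    }
    return configured
}

Whether the reopen should behave like this or always honour the configuration is exactly the open question above.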

@kvch (Contributor, PR author) commented on Jun 23, 2021

Closing in favour of #26224

@kvch closed this on Jun 23, 2021
Labels
backport-v7.13.0 (Automated backport with mergify), backport-v7.14.0 (Automated backport with mergify), Journalbeat, Team:Elastic-Agent (Label for the Agent team)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Journalbeat reopen reader on corrupted journal
3 participants