[Journald] Restart journalctl if it exits unexpectedly #40558
Conversation
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed 93230ef to 6507efe
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed 6507efe to 7b93e74
Pinging @elastic/elastic-agent-data-plane (Team:Elastic-Agent-Data-Plane)
logger.Infof("journalctl started with PID %d", cmd.Process.Pid)

go func() {
	if err := cmd.Wait(); err != nil {
I'm wondering if we should also watch for the dataChan and errChan to close. For example, if we failed to read stdout for some reason, that goroutine loop would exit.
This goroutine is here only to log any error returned by cmd.Wait; if the process exits unexpectedly, its stderr and stdout will be closed and the reader goroutines will get an EOF or error. At the moment I'm doing my best to read and ship/log all data without overcomplicating the code and risking a deadlock.
Force-pushed cc66c5d to 7037dc2
func (j *journalctl) Next(cancel input.Canceler) ([]byte, error) {
	select {
	case <-cancel.Done():
If cancel.Done() fires, do we have to kill the process?
This input.Canceler is in the context of the Next call, so I don't think we should kill the journalctl process. If the journald input is stopped, it will call Close on the Reader, and the reader will kill the journalctl process.
I added some comments explaining that Kill needs to be called.
This pull request is now in conflicts. Could you fix it? 🙏
Fully isolates the journalctl process into a new type called journalctl, so the reader does not know about it and can simply call a Next method on the new type.
If journalctl exits for any reason, the reader now restarts it.
This commit adds a test to the journald reader that ensures it can restart the journalctl process in case it exits. It is tested by mocking the type abstracting the calls to journalctl; if there is an error reading the next message, we ensure a new journalctl is created.
This commit refactors TestEventWithNonStringData to use the new types and re-enables the test.
This commit adds an integration test for the journald input. The test ensures that if the journalctl process exits unexpectedly it is restarted.
Fix publishing an empty message when journalctl crashes and needs to be restarted. Now when the reader restarts journalctl it returns ErrRestarting and the input can correctly handle this situation. The tests are updated and the code formatted (mage fmt).
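The ErrRestarting handling described above can be sketched like this. The sentinel name matches the PR description, but the surrounding shouldPublish helper is illustrative, not the actual input code: the point is only that a restart is treated as a benign condition rather than as a message to publish.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrRestarting is the sentinel described in the PR: the reader returns it
// while journalctl is being restarted so the input can skip the (empty)
// message instead of publishing it.
var ErrRestarting = errors.New("journalctl is restarting")

// shouldPublish decides whether a message read from the reader should be
// sent to the output pipeline. A restart is benign: skip and read again.
func shouldPublish(err error) bool {
	if errors.Is(err, ErrRestarting) {
		return false
	}
	return err == nil
}

func main() {
	fmt.Println(shouldPublish(ErrRestarting)) // false
	fmt.Println(shouldPublish(nil))           // true
}
```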
Test the bytes written to systemd-cat on every call, properly handle errors when closing its stdin, and improve error reporting.
This commit removes two wait groups that were not being used correctly and improves error handling. Now if journalctl writes to stderr, the lines are read and directly logged. Lint warnings are also fixed.
Make sure the logger name is consistent everywhere in the reader and its subcomponents.
When restarting journalctl, an exponential backoff will be used if the last restart was less than 5s ago. In the unlikely case journalctl crashes right after being installed, the exponential backoff will make Filebeat restart journalctl at most once every two seconds.
Force-pushed 3a5aaec to e7d2260
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error. There is an exponential backoff that caps at 1 restart every 2s. (cherry picked from commit a9fb9fa) Co-authored-by: Tiago Queiroz <tiago.queiroz@elastic.co>
Proposed commit message
If journalctl exits unexpectedly the journald input will restart it and set the cursor to the last known position. Any error/non-zero return code is logged at level error.
Checklist
- I have made corresponding changes to the documentation
- I have made corresponding changes to the default configuration files
- I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc

Disruptive User Impact
Author's Checklist
How to test this PR locally
filebeat.yml
potato
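A minimal filebeat.yml for exercising the journald input locally might look like the fragment below. The journald input type is from the Filebeat documentation, but the id value and the console output are illustrative choices for a quick local test, not the configuration used in the PR.

```yaml
filebeat.inputs:
  - type: journald
    id: my-journald-input

output.console:
  pretty: true
```

With this running, killing the journalctl child process by hand (e.g. with kill -9) should show the input restarting it and resuming from the last cursor.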
Related issues
Tests
I ran a test overnight where I used a mock to simulate constant failures and restarts of journalctl and monitored the host and the Filebeat process to ensure there weren't any goroutine/resource leaks. The screenshot below shows the CPU and memory usage from Filebeat as well as counters for the log entries stating journalctl crashed and was restarted.
Use cases
Screenshots
Logs