Add support for copytruncate method when rotating input logs with an external tool in `filestream` input #23457
Conversation
Force-pushed from `0be2e9b` to `68d8934`.
Pinging @elastic/agent (Team:Agent)

jenkins run tests
Maybe we can clarify the PR description. FB will not rotate the files, but this change will make it possible to correctly follow log files that are rotated via copytruncate. Limitation (for docs): copytruncate via logrotate (or others) can lead to data loss, as the file is copied while it is still written to. This can lead to log lines that cannot be parsed correctly...
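For reference, a minimal logrotate stanza that produces the copytruncate behavior discussed here might look like the following (illustrative only; the path and rotation count are made up):

```
/var/log/boot.log {
    daily
    rotate 7
    copytruncate
    compress
    delaycompress
}
```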
Force-pushed from `f634df5` to `235e17a`.
> `count` is the number of rotated files. {beatname_uc} needs this information to clean the
> registry when the rotated file has been purged.
What happens if we keep this `0`? Will files be removed based on ttl? Shall we automatically set `clean_removed` for rotated files?
Regex and count seem to be helpful for other log rotation strategies (move files) as well, as they allow us to close files early and remove state from the registry early.

Closing files on EOF reduces the number of FDs we need to keep open, while removing state from the registry earlier helps with inode reuse issues.

Apropos inode reuse: if we find a state being reused where we know for sure we already reached EOF (rotated log file), then we can tell for sure that the inode was reused and the state must be deleted in favor of the new state.
> What happens if we keep this `0`? Will files be removed based on ttl? Shall we automatically set `clean_removed` for rotated files?
`0` is an invalid setting, as I use this number when creating a slice for `rotatedFiles`.

Rotated files that are removed are immediately stopped and cleaned up, so we keep the number of open files as low as possible and the size of the registry as small as possible. However, I am open to keeping the files open until they stop themselves and letting the registry clean the state if you think that is better.
> Regex and count seem to be helpful for other log rotation strategies (move files) as well, as it allows us to close files early and remove state from the registry early. Closing files on EOF reduces the number of FDs we need to keep + while removing state from registry more early helps with inode reuse issues.

Indeed, that's why I enable `close_eof` and remove the state of rotated files when they are deleted.
> `0` is an invalid setting as I use this number when creating a slice for `rotatedFiles`.
Hm, I wonder if we really need the `count` setting at all. For slices we should use `append`, but we should always be ready to allow slices to grow. If someone updates the application settings to produce more logs when trying to hunt down an issue, the user should not have to remember to update/change the Beats setting as well.
> Rotated files that are removed are immediately stopped and cleaned up, so we keep the number of open files as low as possible and the size of the registry as small as possible. However, I am open to keeping the files open until they stop themselves and let the registry clean the state if you think that is better.
This behavior should be configurable. Normally this is what `close_removed` and `clean_removed` have been used for. The number of file descriptors can indeed grow due to bandwidth limitations, but we don't want to drop logs just because a file was rotated and we have to wait 1 min for an ACK from Elasticsearch when the whole environment is low EPS only. A more sophisticated alternative is to configure a limit on the number of removed files that should be kept open. By default we would keep up to 2 removed files open per log stream, for example. Once the limit is reached, we close and clean the files we have kept open the longest (oldest files).
I have kept removing and cleaning configurable. Also, removed `count`.
Do you still need to update the docs? I still see `count` being mentioned.
```yaml
rotation.external.strategy.copytruncate:
  suffix_regex: \-\d{6}$
  count: 7
  dateformat: -20060504
```
What is `dateformat` used for? Shouldn't the regex already find the file? Do I still need `count` in that case?

Given this example format, is it `{year}{month}{day}` or `{year}{day}{month}`? Depending on the logger used we might also see formats like `{day}-{month}-{year}`; even time might be included. If we want ordering, we do require more 'intelligent' timestamp parsing, so we can sort based on actual time.
This reminds me, we have a dtfmt package, but unfortunately we never added parsing support. Another use-case for timestamp parsing would be JSON. We have 2 hard coded timestamp formats for JSON parsing, but it would be nice to allow users to configure a custom pattern.
> What is `dateformat` used for? Shouldn't the regex already find the file? Do I still need `count` in that case?
`dateformat` is used to tell Filebeat how to parse the date at the end of the file name so the input can sort the rotated files.
> Given this example format, is it `{year}{month}{day}` or `{year}{day}{month}`? Depending on the logger used we might also see formats like `{day}-{month}-{year}`; even time might be included. If we want ordering, we do require a more 'intelligent' timestamp parsing, thus we can sort based on actual time.
Yes, that is why we need `dateformat`. These are the example settings for the different suffixes you had listed above:
| suffix format | `dateformat` |
|---|---|
| `{year}{month}{day}` | `20060102` |
| `{year}{day}{month}` | `20060201` |
| `{day}-{month}-{year}` | `02-01-2006` |
| `{year}{month}{day}{hour}{minute}` | `200601021504` |
> This reminds me, we have a dtfmt package, but unfortunately we never added parsing support. Another use-case for timestamp parsing would be JSON. We have 2 hard coded timestamp formats for JSON parsing, but it would be nice to allow users to configure a custom pattern.
I agree, I am happy to add it in a follow-up PR.
👍

> `200601021504`

Wow... I wouldn't call this intuitive :)

> `{year}{month}{day}{hour}{minute}`

I kinda like this more human-friendly, readable format. `dtfmt` splits parsing and executing date format patterns. Maybe (as another follow-up) we want to add this to `dtfmt` as well... :)
```go
if err != nil {
	log.Errorf("Cannot continue file, error while getting the information of the original file: %+v", err)
	log.Debugf("Starting possibly rotated file from the beginning: %s", fe.NewPath)
	hg.Start(ctx, src)
```
Hm, shall we really try to collect if `Stat` failed?
I chose to collect it to make sure we are not losing any logs before the user notices the error message above. However, I am open to skipping the file; I am just worried about the implications.
What are the reasons for `err != nil`? Most often I would assume that the file has been removed, or we just don't have permissions. Can you think of others? In general, the harvester would also fail to collect, unless we run into some other issue.

I wonder about network shares. Maybe we just failed because of network errors. Shall we really try to restart from offset 0?
In general, removing those copytruncated log files is a mistake. But I see where you are coming from. Then what happens when the file shows up in the end? Shall we scan for rotated files again to try to `Continue` again?
Tricky. I would lean towards not collecting and seeing what happens (if users run into issues).

On the other hand, the harvester will fail as well, and if the harvester does not fail we still have the inode check (and in the future maybe some fingerprint?). Given that the harvester should detect if something is odd, we can keep the start here. => Let's keep the code you have written.
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed from `e11374c` to `a7775b9`.
@urso The PR is ready for another round of review. I would like to address these things in a follow-up PR:
This pull request is now in conflicts. Could you fix it? 🙏
Force-pushed from `52ce161` to `63e8211`.
The jenkins errors suggest that your recent changes broke the build.
```go
switch fe.Op {
case loginp.OpCreate, loginp.OpWrite:
	if fe.Op == loginp.OpCreate {
		log.Debugf("A new file %s has been found", fe.NewPath)
```
nit: structured logging. We have quite a few debug messages here that are quite short and maybe difficult to grep for if the path is interspersed. In order to ease grepping, and even correlating events for a single source, it would be helpful to create a temporary logger for the current source. The logger would contain the path and, if available, the inode + internal file ID as context that is always printed. The log message would become `A new file has been found`.

For example: https://github.com/elastic/beats/blob/master/filebeat/input/log/input.go#L242

That approach did help us understand some bugs, as it enabled us to correlate log messages to a single inode.
`filebeat/input/filestream/input.go` (outdated)
```diff
 identifier, err := newFileIdentifier(config.FileIdentity)
 if err != nil {
-	return nil, nil, fmt.Errorf("error while creating file identifier: %v", err)
+	return nil, nil, fmt.Errorf("cannot create prospector: %+v", err)
```
Use `%w` in order to capture the error.
Force-pushed from `e45d20d` to `8a8a73c`.
LGTM. Please mark the feature as experimental.
…external tool in `filestream` input (#23457) ## What does this PR do? The PR makes the `filestream` log rotation aware to make sure Filebeat can cooperate better with external log rotation tools. The first supported strategy is `copytruncate`. When `logrotate` rotates e.g. `boot.log` with `copytruncate` the following things happen: 1. all archived files are renamed e.g. `boot.log.2` is renamed `boot.log.3` until `boot.log.1` no longer exists 2. `boot.log` is copied to `boot.log.1` 3. `boot.log` is truncated You can see my tests on my machine: Before rotation: ``` root@sleipnir:/home/n# ls -lisaht /var/log/boot.log* 130476 30K -rw------- 1 root root 28K Jan 29 08:59 /var/log/boot.log 130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.1 130657 60K -rw------- 1 root root 57K Jan 7 09:51 /var/log/boot.log.2 ``` After rotation: ``` root@sleipnir:/home/n# ls -lisaht /var/log/boot.log* 130476 0 -rw------- 1 root root 0 May 25 12:41 /var/log/boot.log 130430 30K -rw------- 1 root root 28K May 25 12:41 /var/log/boot.log.1 130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.2 130657 60K -rw------- 1 root root 57K Jan 7 09:51 /var/log/boot.log.3 ``` On rotation, the active file is continued and archived files are kept open until EOF is reached. ### Configuration ```yaml rotation.external.strategy.copytruncate: suffix_regex: \.\d$ count: 10 ``` Note: when Filebeat will be able to rotate input logs, its configuration will be under `rotation.internal.*`. ## Why is it important? Previously, Filebeat was not able to cooperate with external log rotation tools that used `copytruncate` method. (cherry picked from commit 1f5198e)
* master: (32 commits) [Metricbeat] Change Account ID to Project ID in `gcp.billing` module (elastic#26412) update libbeat fields.ecs.yml file and ecsVersion to 1.10.0 (elastic#26121) [Filebeat] Update AWS ELB ingest pipeline (elastic#26441) [FIlebeat] add strict_date_optional_time_nanos date format to PanOS module (elastic#26158) Fix the irregular and typo on prometheus module. (elastic#25726) [Filebeat] Parse additonal debug data fields for Okta module (elastic#25818) fix: update MSSQL Server linux image's Docker registry (elastic#26440) Update indexing.go godocs (elastic#26408) Do not close filestream harvester if an unexpected error is returned when close.on_state_change.* is enabled (elastic#26411) Add support for copytruncate method when rotating input logs with an external tool in `filestream` input (elastic#23457) Allow fields with ip_range datatype (elastic#26444) Add Anomali ThreatStream support to threatintel module (elastic#26350) fix: use the right param type (elastic#26469) [Automation] Update elastic stack version to 8.0.0-7640093f for testing (elastic#26460) Set SM Filebeat modules as GA (elastic#26226) Fix rfc5464 date parsing in the syslog input (elastic#26419) Add linked account information into billing metricset (elastic#26285) [Filebeat] Update HA Proxy log grok patterns (elastic#25835) disable metricbeat logstash test_node_stats (elastic#26436) chore: pass BEAT_VERSION when running E2E tests (elastic#26291) ...
What does this PR do?

The PR makes the `filestream` input log rotation aware, to make sure Filebeat can cooperate better with external log rotation tools. The first supported strategy is `copytruncate`.

When `logrotate` rotates e.g. `boot.log` with `copytruncate`, the following things happen:

1. All archived files are renamed, e.g. `boot.log.2` is renamed `boot.log.3`, until `boot.log.1` no longer exists
2. `boot.log` is copied to `boot.log.1`
3. `boot.log` is truncated

You can see my tests on my machine.

Before rotation:

```
root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476 30K -rw------- 1 root root 28K Jan 29 08:59 /var/log/boot.log
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.1
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.2
```

After rotation:

```
root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476   0 -rw------- 1 root root   0 May 25 12:41 /var/log/boot.log
130430 30K -rw------- 1 root root 28K May 25 12:41 /var/log/boot.log.1
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.2
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.3
```

On rotation, the active file is continued and archived files are kept open until EOF is reached.

Configuration

```yaml
rotation.external.strategy.copytruncate:
  suffix_regex: \.\d$
  count: 10
```

Note: when Filebeat becomes able to rotate input logs itself, that configuration will live under `rotation.internal.*`.

Why is it important?

Previously, Filebeat was not able to cooperate with external log rotation tools that use the `copytruncate` method.

Checklist

- My changes are added to `CHANGELOG.next.asciidoc` or `CHANGELOG-developer.next.asciidoc`.