Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for copytruncate method when rotating input logs with an external tool in filestream input #23457

Merged
merged 34 commits into from
Jun 24, 2021

Conversation

kvch
Copy link
Contributor

@kvch kvch commented Jan 12, 2021

What does this PR do?

The PR makes the filestream log rotation aware to make sure Filebeat can cooperate better with external log rotation tools. The first supported strategy is copytruncate.

When logrotate rotates e.g. boot.log with copytruncate the following things happen:

  1. all archived files are renamed e.g. boot.log.2 is renamed boot.log.3 until boot.log.1 no longer exists
  2. boot.log is copied to boot.log.1
  3. boot.log is truncated

You can see my tests on my machine:

Before rotation:

root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476 30K -rw------- 1 root root 28K Jan 29 08:59 /var/log/boot.log
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.1
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.2

After rotation:

root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476   0 -rw------- 1 root root   0 May 25 12:41 /var/log/boot.log
130430 30K -rw------- 1 root root 28K May 25 12:41 /var/log/boot.log.1
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.2
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.3

On rotation, the active file is continued and archived files are kept open until EOF is reached.

Configuration

rotation.external.strategy.copytruncate:
  suffix_regex: \.\d$
  count: 10

Note: when Filebeat will be able to rotate input logs, its configuration will be under rotation.internal.*.

Why is it important?

Previously, Filebeat was not able to cooperate with external log rotation tools that used copytruncate method.

Checklist

  • My code follows the style guidelines of this project
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have made corresponding change to the default configuration files
  • I have added tests that prove my fix is effective or that my feature works
  • I have added an entry in CHANGELOG.next.asciidoc or CHANGELOG-developer.next.asciidoc.

@botelastic botelastic bot added the needs_team Indicates that the issue/PR needs a Team:* label label Jan 12, 2021
@elasticmachine
Copy link
Collaborator

elasticmachine commented Jan 12, 2021

💚 Build Succeeded

the below badges are clickable and redirect to their specific view in the CI or DOCS
Pipeline View Test View Changes Artifacts preview

Expand to view the summary

Build stats

  • Build Cause: Pull request #23457 updated

  • Start Time: 2021-06-24T11:53:32.530+0000

  • Duration: 106 min 1 sec

  • Commit: c2d872d

Test stats 🧪

Test Results
Failed 0
Passed 14223
Skipped 2311
Total 16534

Trends 🧪

Image of Build Times

Image of Tests

💚 Flaky test report

Tests succeeded.

Expand to view the summary

Test stats 🧪

Test Results
Failed 0
Passed 14223
Skipped 2311
Total 16534

@andresrc andresrc added the Team:Elastic-Agent Label for the Agent team label Jan 12, 2021
@botelastic botelastic bot removed the needs_team Indicates that the issue/PR needs a Team:* label label Jan 12, 2021
@kvch kvch force-pushed the feature-filebeat-log-rotation-filestream branch from 0be2e9b to 68d8934 Compare May 17, 2021 10:16
@kvch kvch marked this pull request as ready for review May 26, 2021 16:24
@elasticmachine
Copy link
Collaborator

Pinging @elastic/agent (Team:Agent)

@kvch
Copy link
Contributor Author

kvch commented May 26, 2021

jenkins run tests

@kvch kvch requested a review from urso May 26, 2021 16:24
@urso
Copy link

urso commented May 26, 2021

Maybe we can clarify the PR description. FB will not rotate the files, but this change will make it possible to correctly follow log files that are rotated via copytruncate. Pretty cool!

Limitation (for docs): copytruncate via logrotate (or others) leads to data loss, as the file is copied while it is still written too. This can lead to log lines that can not be parsed correctly...

@kvch kvch changed the title Add inital support for copytruncate method in filestream input Add support for copytruncate method when rotating input logs with an external tool in filestream input May 27, 2021
@kvch kvch force-pushed the feature-filebeat-log-rotation-filestream branch from f634df5 to 235e17a Compare May 31, 2021 10:20
---

`count` is the number of rotated files. {beatname_uc} needs this information to clean the
registry when the rotated file has been purged.
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we keep this 0? Will files be removed based on ttl? Shall we automatically set clean_removed for rotated files?

Regex and count seem to be helpful for other log rotation strategies (move files) as well, as it allows us to close files early and remove state from the registry early.
Closing files on EOF reduces the number of FDs we need to keep + while removing state from registry more early helps with inode reuse issues.

Apropos inode reuse: If we find a state to be used where we know for sure we did already reach EOF (rotated log file), then we can tell for sure that the inode was reused and the state must be deleted in favor of the new state.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens if we keep this 0? Will files be removed based on ttl? Shall we automatically set clean_removed for rotated files?

0 is an invalid setting as I use this number to when creating a slice for rotatedFiles.

Rotated files that are removed are immediately stopped and cleaned up, so we keep number of open files as low as possible and the size of the registry as small as possible. However, I am open to keeping the files open until they stop themselves and let the registry clean the state if you think that is better.

Regex and count seem to be helpful for other log rotation strategies (move files) as well, as it allows us to close files early and remove state from the registry early.
Closing files on EOF reduces the number of FDs we need to keep + while removing state from registry more early helps with inode reuse issues.

Indeed, that's why I enable close_eof and remove the state of rotated files when they are deleted.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0 is an invalid setting as I use this number to when creating a slice for rotatedFiles.

Hm, I wonder if we really the count setting at all. For slices we should use append, but we should always be ready to allow slices to grow. If someone updates the application settings to produce more logs when trying to hunt down an issue, the user should not have to remember to update/change the beats setting as well.

Rotated files that are removed are immediately stopped and cleaned up, so we keep number of open files as low as possible and the size of the registry as small as possible. However, I am open to keeping the files open until they stop themselves and let the registry clean the state if you think that is better.

This behavior should be configurable. Normally this is what close_removed and clean_removed had been used for. The number of file descriptors can indeed grow due to bandwidth limitation, but we don't want to drops logs, just because a file was rotated and we have to wait 1min for ACK from Elasticsearch if the hole environment is low EPS only. More sophisticated alternative is to configure a limit of removed files that should be kept open. By default we would keep open up to 2 removed files for a log-stream for example. Once the limit is reached, we delete the close and clean the files we have kept open the longest (oldest files).

Copy link
Contributor Author

@kvch kvch Jun 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have kept removing and cleaning configurable. Also, removed count.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you still need to update the doc? I still see count being mentioned.

rotation.external.strategy.copytruncate:
suffix_regex: \-\d{6}$
count: 7
dateformat: -20060504
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is dateformat used for? Shouldn't the regex already find the file? Do I still need count in that case?

Given this example format, is it {year}{month}{day} or {year}{day}{month}. Depending on the logger used we might also see formats like {day}-{month}-{year}, even time might be included. If we want ordering, we do require a more 'intelligent' timestamp parsing, thus we can sort based on actual time.

This reminds me, we have a dtfmt package, but unfortunately we never added parsing support. Another use-case for timestamp parsing would be JSON. We have 2 hard coded timestamp formats for JSON parsing, but it would be nice to allow users to configure a custom pattern.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is dateformat used for? Shouldn't the regex already find the file? Do I still need count in that case?

dateformat is used to tell Filebeat how to parse the date at the end of the file so the input can sort the rotated files.

Given this example format, is it {year}{month}{day} or {year}{day}{month}. Depending on the logger used we might also see formats like {day}-{month}-{year}, even time might be included. If we want ordering, we do require a more 'intelligent' timestamp parsing, thus we can sort based on actual time.

Yes, that is why we need dateformat. These are the example settings for the different suffixes you had listed above:

suffix format dateformat
{year}{month}{day} 20060102
{year}{day}{month} 20060201
{day}-{month}-{year} 02-01-2006
{year}{month}{day}{hour}{minute} 200601021514

This reminds me, we have a dtfmt package, but unfortunately we never added parsing support. Another use-case for timestamp parsing would be JSON. We have 2 hard coded timestamp formats for JSON parsing, but it would be nice to allow users to configure a custom pattern.

I agree, I am happy to add it in a follow-up PR.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

200601021514

Wow... I wouldn't call this intuitive :)

{year}{month}{day}{hour}{minute}

I kinda like this more human friendly readable format. dtfmt splits parsing and executing dataformat patterns. Maybe (as another follow up), we want to add this to dtfmt as well... :)

filebeat/input/filestream/copytruncate_prospector.go Outdated Show resolved Hide resolved
filebeat/input/filestream/copytruncate_prospector.go Outdated Show resolved Hide resolved
filebeat/input/filestream/copytruncate_prospector.go Outdated Show resolved Hide resolved
filebeat/input/filestream/copytruncate_prospector.go Outdated Show resolved Hide resolved
if err != nil {
log.Errorf("Cannot continue file, error while getting the information of the original file: %+v", err)
log.Debugf("Starting possibly rotated file from the beginning: %s", fe.NewPath)
hg.Start(ctx, src)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, shall we really try to collect if Stat failed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I choose to collect it to make sure we are not losing any logs until the user notices the error message above. However, I am open to skipping the file, I am just worried about the implications.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are the reasons err != nil. Most often I would assume that the file has been removed, or we just don't have permissions. Can you think of others. In general, the harvester would also fail to collect, unless we run into some other issue.

I wonder about network shares. Maybe we just failed because of network errors. Shall we really try to restart from offset 0?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, removing those copytruncated log files is a mistake. But I see where you are coming from. Then what happens when the file shows up in the end? Shall we scan for rotated files again to try to Continue again?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tricky. I would lean towards not collecting and see what happens (if users run into issues).

On the other hand the harvester will fail as well, and if the harvester does not fail we still have the inode check (and in the future maybe some fingerprint?). Given that the harvester should detect if something is odd, we can keep the start here. => Let's keep the code you have written.

@kvch kvch removed the needs_docs label Jun 1, 2021
@mergify
Copy link
Contributor

mergify bot commented Jun 2, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-log-rotation-filestream upstream/feature-filebeat-log-rotation-filestream
git merge upstream/master
git push upstream feature-filebeat-log-rotation-filestream

@kvch kvch force-pushed the feature-filebeat-log-rotation-filestream branch from e11374c to a7775b9 Compare June 2, 2021 10:22
@kvch
Copy link
Contributor Author

kvch commented Jun 2, 2021

@urso The PR is ready for another round of review.

I would like to address these things in a follow-up PR:

  • add integration tests
  • use dtfmt when setting dateformat
  • more efficient sorting

@kvch kvch requested a review from urso June 4, 2021 11:11
@mergify
Copy link
Contributor

mergify bot commented Jun 7, 2021

This pull request is now in conflicts. Could you fix it? 🙏
To fixup this pull request, you can check out it locally. See documentation: https://help.github.com/articles/checking-out-pull-requests-locally/

git fetch upstream
git checkout -b feature-filebeat-log-rotation-filestream upstream/feature-filebeat-log-rotation-filestream
git merge upstream/master
git push upstream feature-filebeat-log-rotation-filestream

@kvch kvch force-pushed the feature-filebeat-log-rotation-filestream branch 2 times, most recently from 52ce161 to 63e8211 Compare June 7, 2021 16:11
@urso
Copy link

urso commented Jun 7, 2021

The jenkins errors suggest that your recent changes did break the build.

switch fe.Op {
case loginp.OpCreate, loginp.OpWrite:
if fe.Op == loginp.OpCreate {
log.Debugf("A new file %s has been found", fe.NewPath)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit; structured logging. We have quite a few debug message here that are quire short and maybe difficult to grep for if the path is interspersed. In order to ease support for grepping and even correlating events for a single source, it would be helpful to create a temporary logger for the current source. The logger would contain the path and if available the inode + internal file ID as context that is always printed. The log message would become A new file has been found.

For example: https://github.com/elastic/beats/blob/master/filebeat/input/log/input.go#L242

That approach did help us to understand some bugs, as it did enable us to correlate log message to a single inode.

identifier, err := newFileIdentifier(config.FileIdentity)
if err != nil {
return nil, nil, fmt.Errorf("error while creating file identifier: %v", err)
return nil, nil, fmt.Errorf("cannot create prospector: %+v", err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use %w in order to capture the error.

@kvch kvch force-pushed the feature-filebeat-log-rotation-filestream branch from e45d20d to 8a8a73c Compare June 24, 2021 11:35
@urso
Copy link

urso commented Jun 24, 2021

LGTM. Please mark the feature as experimental.

@kvch kvch added the backport-v7.14.0 Automated backport with mergify label Jun 24, 2021
@kvch kvch merged commit 1f5198e into elastic:master Jun 24, 2021
mergify bot pushed a commit that referenced this pull request Jun 24, 2021
…external tool in `filestream` input (#23457)

## What does this PR do?

The PR makes the `filestream` log rotation aware to make sure Filebeat can cooperate better with external log rotation tools. The first supported strategy is `copytruncate`.

When `logrotate` rotates e.g. `boot.log` with `copytruncate` the following things happen:
1. all archived files are renamed e.g. `boot.log.2` is renamed `boot.log.3` until `boot.log.1` no longer exists
2. `boot.log` is copied to `boot.log.1`
3. `boot.log` is truncated

You can see my tests on my machine:

Before rotation:
```
root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476 30K -rw------- 1 root root 28K Jan 29 08:59 /var/log/boot.log
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.1
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.2
```
After rotation:
```
root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476   0 -rw------- 1 root root   0 May 25 12:41 /var/log/boot.log
130430 30K -rw------- 1 root root 28K May 25 12:41 /var/log/boot.log.1
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.2
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.3
```

On rotation, the active file is continued and archived files are kept open until EOF is reached.

### Configuration

```yaml
rotation.external.strategy.copytruncate:
  suffix_regex: \.\d$
  count: 10
```

Note: when Filebeat will be able to rotate input logs, its configuration will be under `rotation.internal.*`.

## Why is it important?

Previously, Filebeat was not able to cooperate with external log rotation tools that used `copytruncate` method.

(cherry picked from commit 1f5198e)
kvch added a commit that referenced this pull request Jun 24, 2021
…external tool in `filestream` input (#23457) (#26475)

## What does this PR do?

The PR makes the `filestream` log rotation aware to make sure Filebeat can cooperate better with external log rotation tools. The first supported strategy is `copytruncate`.

When `logrotate` rotates e.g. `boot.log` with `copytruncate` the following things happen:
1. all archived files are renamed e.g. `boot.log.2` is renamed `boot.log.3` until `boot.log.1` no longer exists
2. `boot.log` is copied to `boot.log.1`
3. `boot.log` is truncated

You can see my tests on my machine:

Before rotation:
```
root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476 30K -rw------- 1 root root 28K Jan 29 08:59 /var/log/boot.log
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.1
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.2
```
After rotation:
```
root@sleipnir:/home/n# ls -lisaht /var/log/boot.log*
130476   0 -rw------- 1 root root   0 May 25 12:41 /var/log/boot.log
130430 30K -rw------- 1 root root 28K May 25 12:41 /var/log/boot.log.1
130577 36K -rw------- 1 root root 34K Jan 29 08:59 /var/log/boot.log.2
130657 60K -rw------- 1 root root 57K Jan  7 09:51 /var/log/boot.log.3
```

On rotation, the active file is continued and archived files are kept open until EOF is reached.

### Configuration

```yaml
rotation.external.strategy.copytruncate:
  suffix_regex: \.\d$
  count: 10
```

Note: when Filebeat will be able to rotate input logs, its configuration will be under `rotation.internal.*`.

## Why is it important?

Previously, Filebeat was not able to cooperate with external log rotation tools that used `copytruncate` method.

(cherry picked from commit 1f5198e)

Co-authored-by: Noémi Ványi <kvch@users.noreply.github.com>
mdelapenya added a commit to mdelapenya/beats that referenced this pull request Jun 28, 2021
* master: (32 commits)
  [Metricbeat] Change Account ID to Project ID in `gcp.billing` module (elastic#26412)
  update libbeat fields.ecs.yml file and ecsVersion to 1.10.0 (elastic#26121)
  [Filebeat] Update AWS ELB ingest pipeline (elastic#26441)
  [FIlebeat] add strict_date_optional_time_nanos date format to PanOS module (elastic#26158)
  Fix the irregular and typo on prometheus module. (elastic#25726)
  [Filebeat] Parse additonal debug data fields for Okta module (elastic#25818)
  fix: update MSSQL Server linux image's Docker registry (elastic#26440)
  Update indexing.go godocs (elastic#26408)
  Do not close filestream harvester if an unexpected error is returned when close.on_state_change.* is enabled (elastic#26411)
  Add support for copytruncate method when rotating input logs with an external tool in `filestream` input (elastic#23457)
  Allow fields with ip_range datatype (elastic#26444)
  Add Anomali ThreatStream support to threatintel module (elastic#26350)
  fix: use the right param type (elastic#26469)
  [Automation] Update elastic stack version to 8.0.0-7640093f for testing (elastic#26460)
  Set SM Filebeat modules as GA (elastic#26226)
  Fix rfc5464 date parsing in the syslog input (elastic#26419)
  Add linked account information into billing metricset (elastic#26285)
  [Filebeat] Update HA Proxy log grok patterns (elastic#25835)
  disable metricbeat logstash test_node_stats (elastic#26436)
  chore: pass BEAT_VERSION when running E2E tests (elastic#26291)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport-v7.14.0 Automated backport with mergify Team:Elastic-Agent Label for the Agent team
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants