Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

in_tail: Manage tail watchers that are rorate_wait state too #4334

Merged
merged 6 commits into from
Mar 8, 2024

Conversation

ashie
Copy link
Member

@ashie ashie commented Oct 31, 2023

Which issue(s) this PR fixes:
N/A (Follow up for #4327 (comment))

What this PR does / why we need it:
After a tail watcher transitions to rotate_wait state, the rotate_wait timer is no longer managed by in_tail, it might cause unexpected behaviour. e.g.)

  • It's never unwatched when shutdown occurs before rotate_wait passed.
  • Needless rotate_wait timers are executed when it detects more rotations.

This patch fixes such unexpected behaviour.

Docs Changes:
N/A

Release Note:
in_tail: Fix tail watchers in rotate_wait state not being managed.

@ashie ashie force-pushed the in-tail-manage-rotate-wait branch from 0840433 to 99af9a0 Compare October 31, 2023 01:29
@ashie
Copy link
Member Author

ashie commented Oct 31, 2023

* It's never unwatched when shutdown occurs before `rotate_wait` passed.
* Needless `rotate_wait` timers are executed when it detects more rotations.

Although they are unexpected behaviour, probably not harmful.
So we might skip merging this patch into the next release.
This PR is just for remind & discuss this issue.

@ashie ashie force-pushed the in-tail-manage-rotate-wait branch from 99af9a0 to 49e5420 Compare October 31, 2023 01:58
@ashie
Copy link
Member Author

ashie commented Oct 31, 2023

* It's never unwatched when shutdown occurs before `rotate_wait` passed.

BTW I'm suspecting that some unstable in_tail tests are caused by it.

@ashie ashie force-pushed the in-tail-manage-rotate-wait branch 2 times, most recently from 13b91bd to 94e867b Compare October 31, 2023 05:21
Copy link

This PR has been automatically marked as stale because it has been open 30 days with no activity. Remove stale label or comment or this PR will be closed in 7 days

@github-actions github-actions bot added the stale label Nov 30, 2023
@daipom daipom added bug Something isn't working and removed stale labels Nov 30, 2023
@ashie
Copy link
Member Author

ashie commented Dec 20, 2023

Sometimes we get an error regarding unmanaged rotated tail watchers on shutdown:

e.g.)

[error]: #0 [in_tail] closed stream
[error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin/in_tail/position_file.rb:228:in `pos='
[error]: #0 /opt/fluent/lib/ruby/gems/3.2.0/gems/fluentd-1.16.3/lib/fluent/plugin/in_tail/position_file.rb:228:in `block in update_pos'
...

We should merge this fix to v1.16.

@ashie ashie added this to the v1.16.4 milestone Dec 20, 2023
@ashie ashie marked this pull request as ready for review December 20, 2023 02:30
@ashie ashie changed the title in_tail: Manage tail watchers under rorate_wait state in_tail: Manage tail watchers that are rorate_wait state too Dec 20, 2023
@ashie ashie force-pushed the in-tail-manage-rotate-wait branch from 94e867b to 9663478 Compare January 12, 2024 04:36
After a tail watcher transitions to `rotate_wait` state, the
`rotate_wait` timer is no longer managed by in_tail, it might cause
unexpected behaviour. e.g.)

* It's never unwatched when shutdown occurs before `rotate_wait` passed.
* Needless `rotate_wait` timers are executed when it detects more
  rotations.

This patch fixes such unexpected behaviour.

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
@ashie ashie force-pushed the in-tail-manage-rotate-wait branch from 9663478 to ab81c16 Compare January 16, 2024 05:35
@daipom daipom self-requested a review January 24, 2024 10:01
@daipom
Copy link
Contributor

daipom commented Jan 24, 2024

Thanks for this fix!

It's never unwatched when shutdown occurs before rotate_wait passed.

At this point, this fix basically looks good to me.

Needless rotate_wait timers are executed when it detects more rotations.

At this point, we need more modifications.
Maybe they should be done in other PRs, but I'd like to clarify this point at this time.

I think this occurs with follow_inodes, and it makes duplicate TailWatchers and handle-leak.

When a very fast rotation occurs, a TailWatcher in the state of rotate_wait can call update_watcher() again.
That is the issue we should be concerned about.

Without follow_inodes, this check in update_watcher() prevents the duplicate update_watcher() process.

if @pf
pe_inode = pe.read_inode
target_info_from_position_entry = TargetInfo.new(path, pe_inode)
unless pe_inode == @pf[target_info_from_position_entry].read_inode
log.warn "Skip update_watcher because watcher has been already updated by other inotify event",
path: path, inode: pe.read_inode, inode_in_pos_file: @pf[target_info_from_position_entry].read_inode
return
end
end

It is because @pf[target_info_from_position_entry].read_inode returns the inode of the current file without follow_inodes.
So, if the TailWatcher is old, those inodes do not match.

However, with follow_inodes, this check does not work, and executes update_watcher() process duplicately.

This fix will prevent duplicate timer execution for detaching.
However, it is not enough.
We have to prevent the overall duplicate update_watcher() process.
Maybe it should be done in another PR.

=> #4334 (comment)

@daipom
Copy link
Contributor

daipom commented Jan 25, 2024

Needless rotate_wait timers are executed when it detects more rotations.

At this point, we need more modifications. Maybe they should be done in other PRs, but I'd like to clarify this point at this time.

I think this occurs with follow_inodes, and it makes duplicate TailWatchers and handle-leak.

...

However, with follow_inodes, this check does not work, and executes update_watcher() process duplicately.

This fix will prevent duplicate timer execution for detaching. However, it is not enough. We have to prevent the overall duplicate update_watcher() process. Maybe it should be done in another PR.

Hmm, I may be wrong about this.
When using follow_inodes, the following logic prevents making duplicate TailWatcher.

new_position_entry = @pf[new_target_info]
# If `refresh_watcher` find the new file before, this will not be zero.
# In this case, only we have to do is detaching the current tail_watcher.
if new_position_entry.read_inode == 0

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom
Copy link
Contributor

daipom commented Jan 25, 2024

I have added tests for fast rotation.

@daipom
Copy link
Contributor

daipom commented Jan 25, 2024

@ashie Can I add a commit about the points I am commenting on?

@ashie
Copy link
Member Author

ashie commented Jan 26, 2024

@ashie Can I add a commit about the points I am commenting on?

Of course.

Logic inside `detach_watcher()` should be independent of
`rotate_wait`.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom
Copy link
Contributor

daipom commented Jan 31, 2024

@ashie Can I add a commit about the points I am commenting on?

Of course.

Done.

The new tests on macOS seem to be unstable. I will check them.

@daipom
Copy link
Contributor

daipom commented Jan 31, 2024

The new tests are unstable on Windows or macOS.
We need to check if we need to fix these tests or if in_tail on Windows or macOS can not collect files that are rotated very fast.

@daipom
Copy link
Contributor

daipom commented Mar 7, 2024

I'm sorry for the interval.
I'm checking again on the new tests on Windows and macOS.

@daipom
Copy link
Contributor

daipom commented Mar 7, 2024

I will merge this if it is confirmed that the newly added tests are stable.
Please let me know if anyone has concerns.

@daipom
Copy link
Contributor

daipom commented Mar 7, 2024

The new tests are now stable.
However, I have noticed that we need to consider rotate_wait 0 case.

@daipom
Copy link
Contributor

daipom commented Mar 8, 2024

However, I have noticed that we need to consider rotate_wait 0 case.

I am concerned about the possibility of the key being deleted before the key is added.
Although I can't confirm such a behavior, I have added the countermeasure just in case.

8b15478

@daipom
Copy link
Contributor

daipom commented Mar 8, 2024

However, I have noticed that we need to consider rotate_wait 0 case.

I am concerned about the possibility of the key being deleted before the key is added. Although I can't confirm such a behavior, I have added the countermeasure just in case.

8b15478

@ashie
Could you briefly check this?

lib/fluent/plugin/in_tail.rb Outdated Show resolved Hide resolved
@daipom daipom force-pushed the in-tail-manage-rotate-wait branch from 8b15478 to f1f7866 Compare March 8, 2024 05:26
@daipom
Copy link
Contributor

daipom commented Mar 8, 2024

I will remove this fix, and check the following codes as well.

https://github.com/fluent/fluentd/pull/4334/files#diff-456fdeb51bc472beb48891caac0d063e0073655dba7ac2b72e6fdc67dc6ac802R272-R275

The callbacks of timers are not called after the Stop phase.
So we don't need these codes.
I will simplify those codes too.

Note: The callbacks of timers are not called after stop

checker = ->(){ @_timer_running }
detacher = ->(watcher){ event_loop_detach(watcher) }
timer = TimerWatcher.new(title, interval, repeat, log, checker, detacher, &block)

def stop
super
@_timer_running = false
end

@callback.call if @checker.call

daipom added 3 commits March 8, 2024 14:58
Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
The comments was added in 76f246a.
At that time, closing was done by event-loop.
Now, the situation is completely different.

Currently, there are comments about the `close_io` option in
`shutdown()` and `close()`.
So, it would be enough to remove these comments.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
`sleep 1s` is too short for `enable_stat_watcher false`.

Signed-off-by: Daijiro Fukuda <fukuda@clear-code.com>
@daipom daipom force-pushed the in-tail-manage-rotate-wait branch from f1f7866 to d28f32e Compare March 8, 2024 05:59
@daipom
Copy link
Contributor

daipom commented Mar 8, 2024

The code for conflicts with timer execution is unnecessary, so I have removed them.
@ashie Thanks and sorry I overlooked this.

Copy link
Member Author

@ashie ashie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM
(It seems that I can't change the status of this pull request, because it's originally mine 🤪)

@daipom
Copy link
Contributor

daipom commented Mar 8, 2024

Thanks! I will merge this soon!

@daipom daipom merged commit 83273e0 into master Mar 8, 2024
16 checks passed
@daipom daipom deleted the in-tail-manage-rotate-wait branch March 8, 2024 09:53
daipom added a commit that referenced this pull request Mar 8, 2024
After a tail watcher transitions to `rotate_wait` state, the
`rotate_wait` timer is no longer managed by in_tail, it might cause
unexpected behaviour. e.g.)

* It's never unwatched when shutdown occurs before `rotate_wait` passed.
* Needless `rotate_wait` timers are executed when it detects more
  rotations.

This patch fixes such unexpected behaviour.

Note: The comment about `detach_watcher` was added in 76f246a.
At that time, closing was done by event-loop.
Now, the situation is completely different, so it should be removed.

---------

Signed-off-by: Takuro Ashie <ashie@clear-code.com>
Co-authored-by: Daijiro Fukuda <fukuda@clear-code.com>
ashie added a commit that referenced this pull request Mar 11, 2024
Backport (v1.16): in_tail: Manage tail watchers that are `rorate_wait` state too (#4334)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants