
Prevent prospector scans from racing and leaking harvesters #2539

Closed
wants to merge 1 commit into from

Conversation

@elasticsearch-release

Jenkins standing by to test this. If you aren't a maintainer, you can ignore this comment. Someone with commit access, please review this and clear it for Jenkins to run.


@ruflin
Member

ruflin commented Sep 13, 2016

@viveklak Thanks for taking this on. Can you share some more thoughts on your solution and on the problem you fixed?

@@ -179,6 +183,7 @@ func (p *ProspectorLog) scan() {
// Decides if previous state exists
if lastState.IsEmpty() {
logp.Debug("prospector", "Start harvester for new file: %s", newState.Source)
p.Prospector.states.Update(newState)
Member

New states should not be persisted before the harvester is started: during harvester startup some additional checks are made, and only after those pass is the state sent to the prospector.
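
To make the ordering concrete, here is a minimal Go sketch of the flow described above (hypothetical types and names, not the actual filebeat code): the harvester runs its startup checks first, and only then publishes its initial state to the prospector.

```go
package sketch

// State is the per-file state a harvester reports (hypothetical, simplified).
type State struct {
	Source string
	Offset int64
}

// Harvester reads one file and reports its state over a shared channel.
type Harvester struct {
	state  State
	states chan<- State
}

// setup stands in for the checks done during harvester startup
// (opening the file, validating the offset, ...).
func (h *Harvester) setup() error {
	return nil
}

// run performs the startup checks and only then sends the first state,
// which is why the scan itself should not persist the state beforehand.
func (h *Harvester) run() {
	if err := h.setup(); err != nil {
		return // setup failed: no state is ever published
	}
	h.states <- h.state // the prospector processes the state only now
	// ... read lines and publish events ...
}
```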

@ruflin
Member

ruflin commented Sep 13, 2016

One thing that I could see happening: all harvesters are started in a goroutine and only persist the state afterwards. If persisting all the new states takes too long, the next scan happens and a new harvester is started for the same file again. This would require a rather low scan_frequency. There are some system tests that try to emulate this, and so far it seems this "race" didn't happen. The reason your PR above would prevent this case is not the mutex but the state update. Unfortunately, that will bring some other problems.

It would be really great if we could find a way to reproduce this more or less reliably. I will try to write a test case with a very low scan_frequency and a very large number of files already present on startup.
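
For illustration, a rough Go sketch of that race (hypothetical names and timings, not the actual filebeat code): the first state is persisted asynchronously, so with a low enough scan_frequency the next scan runs before the state lands and starts another harvester for the same file.

```go
package sketch

import (
	"fmt"
	"sync"
	"time"
)

// Race simulates the window between starting a harvester and persisting its
// first state. With scan_frequency much lower than that window, several
// harvesters end up started for the same file.
func Race() {
	var mu sync.Mutex
	known := map[string]bool{} // states the prospector has already processed
	started := 0

	scan := func(path string) {
		mu.Lock()
		seen := known[path]
		mu.Unlock()
		if seen {
			return // state is known, nothing to start
		}
		started++ // "start a harvester" for this file
		go func() {
			time.Sleep(50 * time.Millisecond) // setup + state channel delay
			mu.Lock()
			known[path] = true // first state finally persisted
			mu.Unlock()
		}()
	}

	for i := 0; i < 5; i++ { // scan_frequency of 10ms vs. a 50ms persist delay
		scan("/var/log/app.log")
		time.Sleep(10 * time.Millisecond)
	}
	fmt.Printf("harvesters started for one file: %d\n", started)
}
```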

@ruflin
Member

ruflin commented Sep 13, 2016

I think I can also see how the report in https://discuss.elastic.co/t/filebeat-5-0-0-alpha5-multiple-harvesters-for-same-file-sending-the-same-messages-over-and-over/59866 relates to a blocking kafka output. As the channel through which the state updates are sent is blocked by the output, the new states are not sent from the harvesters which are started. As no states are sent but scans continue, it will start the same harvester for the file again and again because the state was never processed.
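
As a rough illustration of that scenario (hypothetical names, not the actual filebeat code), here is a Go sketch in which nothing drains the state channel because the output is blocked, so every newly started harvester hangs on its first state send:

```go
package sketch

import "time"

// BlockedStateChannel: nobody drains the state channel because the output
// pipeline is stalled, so every newly started harvester parks on its first
// state send and the prospector never learns the file is being handled.
func BlockedStateChannel() {
	states := make(chan string) // unbuffered; the stalled output applies backpressure

	for i := 0; i < 3; i++ { // three scans each start a "harvester" for the same file
		go func() {
			states <- "/var/log/app.log" // blocks forever, state never processed
		}()
		time.Sleep(10 * time.Millisecond) // scan_frequency
	}
	// All three goroutines are still parked on the send: the same file now
	// has three harvesters attached to it.
}
```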

@viveklak
Author

@ruflin You understood correctly. The core of the problem is that the registration of the harvester with the prospector happens asynchronously (the harvester itself is started in a go routine and sends its state on a shared channel to the prospector). If there is a significant enough delay in the harvester returning its initial state to the prospector, the prospector may start another harvester for the same source. The delay could be due to a large number of existing harvesters competing on the same channel to send state back, a blocking output, etc.

Registering the initial state within the prospector seems to have helped in my case, but as you said, it might cause other issues.

@ruflin
Member

ruflin commented Sep 14, 2016

@viveklak I opened a PR with a potential fix: #2541

The main downside / upside now is that scans can be blocked if the output is blocked. On the bright side, this prevents further harvesters from being started and kept open. On the less bright side, it means that once the output becomes available again, old files will finish sending while new files may have been rotated away in the meantime.

It would be great if you could test my PR to see if it helps in your case.

ruflin added a commit to ruflin/beats that referenced this pull request Sep 14, 2016
In case newly started harvesters did not persist their first state before the next scan started, it could have happened that multiple harvesters were started for the same file. This could have been caused by a large number of files or by the output blocking.

The problem is solved by making the Setup step of the Harvester synchronous so that it blocks the scan. Part of this is also updating the first state as part of the prospector.

The side effect of this change is that a scan now blocks if the channel is blocked, which means the output is probably not responding. If the output is not responding, scans will not continue and new files will not be discovered until the output is available again.

The code can be further simplified in the future by merging create/startHarvester. This will be done in a second step to keep the backport commit to a minimum.

See also elastic#2539
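
As an illustration of the approach described in the commit message, a simplified Go sketch (hypothetical names, not the actual filebeat code): the harvester's Setup step runs synchronously inside the scan, and the first state is recorded by the prospector before the scan continues, so the next scan cannot start a duplicate harvester.

```go
package sketch

type fileState struct {
	Source string
	Offset int64
}

type harvester struct{ state fileState }

// Setup opens the file and validates the offset; it runs synchronously and
// may fail, in which case no state is recorded.
func (h *harvester) Setup() error { return nil }

// Run reads lines and publishes events; it still runs in its own goroutine.
func (h *harvester) Run() {}

type prospector struct {
	states map[string]fileState
}

// startHarvester blocks the scan until Setup is done and the first state is
// recorded. The trade-off described above: if recording the state blocks
// because the output is blocked, the scan blocks too.
func (p *prospector) startHarvester(s fileState) error {
	h := &harvester{state: s}
	if err := h.Setup(); err != nil {
		return err
	}
	p.states[s.Source] = s // recorded before the scan continues
	go h.Run()
	return nil
}
```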
tsg pushed a commit that referenced this pull request Sep 14, 2016
In case newly started harvesters did not persist their first state before the next scan started, it could have happened that multiple harvesters were started for the same file. This could have been caused by a large number of files or by the output blocking.

The problem is solved by making the Setup step of the Harvester synchronous so that it blocks the scan. Part of this is also updating the first state as part of the prospector.

The side effect of this change is that a scan now blocks if the channel is blocked, which means the output is probably not responding. If the output is not responding, scans will not continue and new files will not be discovered until the output is available again.

The code can be further simplified in the future by merging create/startHarvester. This will be done in a second step to keep the backport commit to a minimum.

See also #2539
@viveklak
Author

@ruflin Awesome! I will update with the results of my test. Thanks for the quick turnaround!

@ruflin
Member

ruflin commented Sep 15, 2016

I'm closing this PR as #2541 was merged.

@viveklak Keep me posted here about your tests.

ruflin closed this Sep 15, 2016
@viveklak
Author

@ruflin Ran a test overnight and didn't see the issue anymore. Thanks again for jumping on the problem!

@ruflin
Member

ruflin commented Sep 16, 2016

@viveklak Great to hear.
