
Fix duplicate batches in retry queue #5520

Merged
ruflin merged 2 commits into elastic:5.6 from urso:fix/duplicates-in-retry-queue on Nov 14, 2017

Conversation

@urso urso commented Nov 6, 2017

On error in the Logstash output sender, failed batches might get enqueued twice. This can lead to multiple resends and ACKs for the same events.

Filebeat and winlogbeat, which wait for ACKs from the output, expect at most one ACK per batch. With potentially multiple ACKs (especially after multiple consecutive IO errors), a deadlock in the output's ACK handler can occur.

This PR ensures batches cannot be returned to the retry queue via two code paths (removing the race between competing workers):

  • the async output worker does not return events back into the retry queue
  • async clients are required to always report retriable errors via
    callbacks (see the sketch after this list)
  • more detailed debug logs are added to the LS output, which can help in
    identifying ACKed batches that are still being retried
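
Below is a minimal, hypothetical Go sketch of the callback-only retry contract described above (the batch type, done, and retry names are illustrative, not the actual Beats types): a batch may be handed back to the retry queue by exactly one code path, guarded here by sync.Once.

// Hypothetical sketch, not the actual Beats implementation.
package main

import (
	"errors"
	"fmt"
	"sync"
)

type batch struct {
	events []string
	once   sync.Once    // guards against double-enqueue from competing code paths
	retry  func(*batch) // hands the batch back to the retry queue
}

// done is the only way the outcome of a batch may be reported. Retriable
// errors must go through this callback; the worker never re-enqueues on a
// returned error, so the batch cannot end up in the retry queue twice.
func (b *batch) done(err error) {
	if err == nil {
		return // ACKed, nothing to do
	}
	b.once.Do(func() { b.retry(b) })
}

func main() {
	retryQueue := make(chan *batch, 2)
	b := &batch{
		events: []string{"e1", "e2"},
		retry:  func(b *batch) { retryQueue <- b },
	}

	// Two competing code paths report the same failure; only one enqueue happens.
	b.done(errors.New("io error"))
	b.done(errors.New("io error"))

	fmt.Println("batches queued for retry:", len(retryQueue)) // prints 1
}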

@urso urso force-pushed the fix/duplicates-in-retry-queue branch 2 times, most recently from 0f981b7 to f7d82c4 Compare November 6, 2017 15:40
@ruflin
Contributor

ruflin commented Nov 6, 2017

This change seems to cause some issues: https://travis-ci.org/elastic/beats/jobs/298049838#L642

err: nil,
}

debug("msgref(%p) new: batch=%p, cb=%p", r, &r.batch[0], cb)
Contributor

This will create a lot of debug messages. I know this info is for debugging, but I'm starting to get the feeling we need something in between info and debug, something like info --verbose :-)

Author

Yeah, some extra-debug level would be nice to have. Normally this adds about 4 new debug messages per batch (2048 events). Not too bad, but super helpful for identifying potential issues, should we still face some problems.

Contributor

Feels more like a trace level to me, but I am OK with debug since the noise won't be that much.

urso added 2 commits November 7, 2017 14:07
On error in the Logstash output sender, failed batches might get enqueued twice. This can lead to multiple resends and ACKs for the same events.

In filebeat/winlogbeat, which wait for ACKs from the output, at most one ACK per batch is expected. With potentially multiple ACKs (especially after multiple consecutive IO errors), a deadlock in the output's ACK handler can occur.

This PR ensures batches cannot be returned to the retry queue via two code paths (removing the race between competing workers):
- the async output worker does not return events back into the retry queue
- async clients are required to always report retriable errors via
callbacks
- add some more detailed debug logs to the LS output, that can help in
  identifying ACKed batches still being retried
@urso urso force-pushed the fix/duplicates-in-retry-queue branch from a65f57a to 22c6540 Compare November 7, 2017 13:08
Contributor

@ruflin ruflin left a comment

Change LGTM, but I have a hard time getting the full picture to understand the potential side effects of this PR. Would be glad if we could have some additional eyes on this one.

@ruflin ruflin requested a review from andrewkroh November 7, 2017 22:11
}

return err
return w.client.AsyncPublishEvents(w.handleResults(msg), msg.data)
Author

The change in semantics is here. We always require the w.client instance to use the callback built via handleResults to report success/failure within a batch. This allows the output client to decide on sync or async error reporting.
The async client as provided by go-lumber requires full async reporting, but the Logstash output did some sync reporting as well, leading to duplicates. The PR changes this so go-lumber only triggers the async reporting, indirectly via msgRef.
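
As a rough illustration of that contract (hypothetical types and names, not the real go-lumber or Beats API): AsyncPublishEvents reports retriable failures only through the provided callback, never by returning them, so the caller has a single code path for putting a batch back into the retry queue.

// Hypothetical sketch of the async reporting contract, not real Beats code.
package main

import (
	"errors"
	"fmt"
)

type asyncClient struct{}

// AsyncPublishEvents accepts a batch and reports its outcome via cb. A
// returned error here means the send could not even be started; anything
// that happens afterwards is reported asynchronously through cb only.
func (c *asyncClient) AsyncPublishEvents(cb func(remaining []string, err error), events []string) error {
	go func() {
		// Simulated IO failure: reported through the callback only.
		cb(events, errors.New("connection reset"))
	}()
	return nil
}

func main() {
	client := &asyncClient{}
	done := make(chan struct{})

	// handleResults-style callback: the single place that decides ACK vs. retry.
	handleResults := func(remaining []string, err error) {
		if err != nil {
			fmt.Println("retrying", len(remaining), "events:", err)
		} else {
			fmt.Println("batch ACKed")
		}
		close(done)
	}

	if err := client.AsyncPublishEvents(handleResults, []string{"e1", "e2"}); err != nil {
		fmt.Println("publish not started:", err)
		return
	}
	<-done
}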

@ph
Contributor

ph commented Nov 8, 2017

@urso I am aware of the possible deadlock, but it could also cause just pure duplicates on retry, right? I remember seeing a few cases recently concerning duplicates. Do we also have an idea of since when this problem has occurred? It might help when debugging past cases in the future.

@ph ph self-requested a review November 8, 2017 13:54
@urso
Author

urso commented Nov 8, 2017

@ph The bug could produce duplicates on retry if pipelining is configured (by default it's disabled), but it will always result in a deadlock in filebeat -> filebeat needs to be restarted. The number of total batches transferred before the deadlock depends on the pipelining setting. Adding load-balancing with multiple endpoints can potentially prolong/hide the deadlock, as the other output workers might still publish events.
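
For reference, these are the output.logstash settings being discussed; the values below are only an illustration, not a recommendation:

output.logstash:
  hosts: ["ls1:5044", "ls2:5044"]
  loadbalance: true   # multiple output workers can hide the deadlock for a while
  pipelining: 2       # disabled (0) by default; values > 0 could surface duplicates on retry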

Contributor

@ph ph left a comment

@urso LGTM, took a bit longer. I don't see any potential for bad side effects with this PR.

@@ -28,6 +28,8 @@ https://github.com/elastic/beats/compare/v5.6.4...5.6[Check the HEAD diff]

*Affecting all Beats*

- Fix duplicate batches of events in retry queue. {pull}5520[5520]
Contributor

I would probably mention the deadlock; duplicates can be seen in other scenarios.

Member

@andrewkroh andrewkroh left a comment

LGTM.

@ruflin ruflin merged commit f806e3b into elastic:5.6 Nov 14, 2017
@ruflin
Contributor

ruflin commented Nov 14, 2017

@urso I merged this one based on all the 👍. The comments by @ph can still be addressed in a follow-up PR.

@urso urso deleted the fix/duplicates-in-retry-queue branch February 19, 2019 18:40