Fix duplicate batches in retry queue #5520
Conversation
Force-pushed from 0f981b7 to f7d82c4
This change seems to cause some issues: https://travis-ci.org/elastic/beats/jobs/298049838#L642
 	err: nil,
 }
 
+	debug("msgref(%p) new: batch=%p, cb=%p", r, &r.batch[0], cb)
This will create a lot of debug messages. I know that this info is for debugging, but I start to get the feeling we need something in between info and debug, something like info --verbose :-)
Yeah, some extra-debug level would be nice to have. Normally this adds about 4 new debug messages per batch (2048 events). Not too bad, but super helpful to identify potential issues, should we still face some problems.
Feels more like a trace level to me, but I am ok with debug since the noise won't be that much.
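For context on the noise concern: in that generation of Beats, debug output is gated per selector, so such messages only show up when the selector is explicitly enabled. Below is a minimal sketch of how a gated logger for these lines could be wired up; it assumes the logp package of that era (logp.MakeDebug and the -d selector flag), and the helper and field names are illustrative, not the actual output code.

package logstash

import "github.com/elastic/beats/libbeat/logp"

// debug is gated by the "logstash" selector: messages are suppressed
// unless the Beat is started with the selector enabled (e.g. -d "logstash"),
// so the extra msgref logging stays out of normal debug output.
var debug = logp.MakeDebug("logstash")

// msgRef is a stand-in with just enough fields for the log call.
type msgRef struct {
	batch []interface{}
}

func traceNewMsgRef(r *msgRef) {
	// Roughly 4 such lines are emitted per published batch (see the
	// discussion above). Assumes a non-empty batch, as at the original
	// call site.
	debug("msgref(%p) new: batch=%p", r, &r.batch[0])
}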
On error in the Logstash output sender, failed batches might get enqueued two times. This can lead to multiple resends and ACKs for the same events. In filebeat/winlogbeat, which wait for an ACK from the output, at most one ACK is expected. With potentially multiple ACKs (especially with multiple consecutive IO errors), a deadlock in the output's ACK handler can occur.

This PR ensures batches cannot be returned to the retry queue via 2 code paths (removing the race between competing workers):
- the async output worker does not return events back into the retry queue
- async clients are required to always report retriable errors via callbacks
- add some more detailed debug logs to the LS output that can help in identifying ACKed batches still being retried
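To make the "single code path" point concrete, here is a minimal, simplified sketch of the intended shape after the fix. This is not the actual Beats code; the types and names (worker, retryer, asyncClient, AsyncPublish) are illustrative stand-ins. The key property is that the worker never re-enqueues a batch itself and relies solely on the client's callback.

package main

// batch and retryer are simplified stand-ins for the real pipeline types.
type batch []string

type retryer struct{ ch chan batch }

func (r *retryer) retry(b batch) { r.ch <- b }

// asyncClient reports per-batch success/failure only through the callback
// it is given; it never expects the caller to re-enqueue the batch itself.
type asyncClient struct{}

func (c *asyncClient) AsyncPublish(cb func(failed batch, err error), b batch) error {
	go func() {
		// Network send elided; on an IO error the callback reports
		// the failed events exactly once.
		cb(nil, nil)
	}()
	return nil
}

type worker struct {
	client *asyncClient
	retry  *retryer
}

func (w *worker) publish(b batch) error {
	cb := func(failed batch, err error) {
		if err != nil {
			// Single retry path: only the callback returns events
			// to the retry queue.
			w.retry.retry(failed)
		}
	}
	// Before the fix, a worker could *also* re-enqueue b when this call
	// returned an error, so the same batch could end up in the retry
	// queue twice (once via the callback, once via the worker).
	return w.client.AsyncPublish(cb, b)
}

func main() {}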
Force-pushed from a65f57a to 22c6540
Change LGTM, but I have a hard time getting the full picture to understand potential side effects of this PR. Would be glad if we could have some additional eyes on this one.
 }
 
-	return err
+	return w.client.AsyncPublishEvents(w.handleResults(msg), msg.data)
Change in semantics is here. We always require the w.client instance to use the callback built via handleResults to report success/failure within a batch. This allows the output client to decide on sync or async error reporting.
The async client as provided by go-lumber does require full async reporting, but the Logstash output did some sync reporting as well, leading to duplicates. The PR changes this so that go-lumber only triggers the async reporting, indirectly via msgRef.
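A rough sketch of that callback contract follows, under the assumption that a batch may be split into several partial sends (windows) and that every partial send reports back through the same reference. This mirrors the msgRef idea but is not the actual handleResults/msgRef implementation; all names here are illustrative.

package main

import "sync/atomic"

type batch []string

// msgRef tracks one published batch that may be split into several partial
// sends. Whatever mix of ACK and failure callbacks the client delivers,
// the batch completes exactly once: fully ACKed or handed back for retry,
// never both and never twice. Callbacks for one batch are assumed to be
// invoked sequentially by the client; otherwise failed/err need locking.
type msgRef struct {
	count  int32 // outstanding partial sends
	failed batch // events collected from failure callbacks, to be retried
	err    error
	onACK  func()
	onFail func(batch, error)
}

func newMsgRef(parts int32, onACK func(), onFail func(batch, error)) *msgRef {
	return &msgRef{count: parts, onACK: onACK, onFail: onFail}
}

// fail records a failed partial send; the retry decision is deferred to dec,
// so a batch with several windows is still only retried once.
func (r *msgRef) fail(failedEvents batch, err error) {
	r.failed = append(r.failed, failedEvents...)
	if r.err == nil {
		r.err = err
	}
	r.dec()
}

// ack records a successful partial send.
func (r *msgRef) ack() { r.dec() }

func (r *msgRef) dec() {
	if atomic.AddInt32(&r.count, -1) > 0 {
		return // other partial sends are still in flight
	}
	if r.err != nil {
		r.onFail(r.failed, r.err) // single retry path
		return
	}
	r.onACK()
}

func main() {}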
@urso I am aware of the possible deadlock, but it could also cause just pure duplicates on retry, right? I remember seeing a few cases recently concerning duplicates. Do we also have an idea since when this problem occurs? It might help debugging future and past cases.
@ph The bug could produce duplicates on retry if
@urso LGTM, took a bit longer. I don't see any potential for bad side effects with this PR.
 @@ -28,6 +28,8 @@ https://github.com/elastic/beats/compare/v5.6.4...5.6[Check the HEAD diff]
 
 *Affecting all Beats*
 
+- Fix duplicate batches of events in retry queue. {pull}5520[5520]
I would probably mention the deadlock; duplicates can be seen in other scenarios.
LGTM.