Fix duplicate batches in retry queue #5520
Conversation
Force-pushed from 0f981b7 to f7d82c4
This change seems to cause some issues: https://travis-ci.org/elastic/beats/jobs/298049838#L642
 	err: nil,
 }
 
+	debug("msgref(%p) new: batch=%p, cb=%p", r, &r.batch[0], cb)
This will create a lot of debug messages. I know that this info is for debugging, but I start to get the feeling we need something in between info and debug, something like info --verbose :-)
Yeah, some extra-debug level would be nice to have. Normally this adds about 4 new debug messages per batch (2048 events). Not too bad, but super helpful to identify potential issues, should we still face some problems.
Feels more like a trace level to me, but I am ok with debug since the noise won't be that much.
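For context on the noise concern: in that generation of Beats, debug output is gated per selector, so such messages only show up when the selector is explicitly enabled. Below is a minimal sketch of how a gated logger for these lines could be wired up; it assumes the logp package of that era (logp.MakeDebug and the -d selector flag), and the helper and field names are illustrative, not the actual output code.

package logstash

import "github.com/elastic/beats/libbeat/logp"

// debug is gated by the "logstash" selector: messages are suppressed
// unless the Beat is started with the selector enabled (e.g. -d "logstash"),
// so the extra msgref logging stays out of normal debug output.
var debug = logp.MakeDebug("logstash")

// msgRef is a stand-in with just enough fields for the log call.
type msgRef struct {
	batch []interface{}
}

func traceNewMsgRef(r *msgRef) {
	// Roughly 4 such lines are emitted per published batch (see the
	// discussion above). Assumes a non-empty batch, as at the original
	// call site.
	debug("msgref(%p) new: batch=%p", r, &r.batch[0])
}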
On error in the Logstash output sender, failed batches might get enqueued two times. This can lead to multiple resends and ACKs for the same events. In filebeat/winlogbeat, which wait for an ACK from the output, at most one ACK is expected. With potentially multiple ACKs (especially with multiple consecutive IO errors), a deadlock in the output's ACK handler can occur.

This PR ensures batches cannot be returned to the retry queue via 2 code paths (removing the race between competing workers):
- the async output worker does not return events back into the retry queue
- async clients are required to always report retriable errors via callbacks
- add some more detailed debug logs to the LS output that can help in identifying ACKed batches still being retried
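To make the "single code path" point concrete, here is a minimal, simplified sketch of the intended shape after the fix. This is not the actual Beats code; the types and names (worker, retryer, asyncClient, AsyncPublish) are illustrative stand-ins. The key property is that the worker never re-enqueues a batch itself and relies solely on the client's callback.

package main

// batch and retryer are simplified stand-ins for the real pipeline types.
type batch []string

type retryer struct{ ch chan batch }

func (r *retryer) retry(b batch) { r.ch <- b }

// asyncClient reports per-batch success/failure only through the callback
// it is given; it never expects the caller to re-enqueue the batch itself.
type asyncClient struct{}

func (c *asyncClient) AsyncPublish(cb func(failed batch, err error), b batch) error {
	go func() {
		// Network send elided; on an IO error the callback reports
		// the failed events exactly once.
		cb(nil, nil)
	}()
	return nil
}

type worker struct {
	client *asyncClient
	retry  *retryer
}

func (w *worker) publish(b batch) error {
	cb := func(failed batch, err error) {
		if err != nil {
			// Single retry path: only the callback returns events
			// to the retry queue.
			w.retry.retry(failed)
		}
	}
	// Before the fix, a worker could *also* re-enqueue b when this call
	// returned an error, so the same batch could end up in the retry
	// queue twice (once via the callback, once via the worker).
	return w.client.AsyncPublish(cb, b)
}

func main() {}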
Force-pushed from a65f57a to 22c6540
Change LGTM, but I have a hard time getting the full picture to understand potential side effects of this PR. Would be glad if we could have some additional eyes on this one.
 }
 
-	return err
+	return w.client.AsyncPublishEvents(w.handleResults(msg), msg.data)
Change in semantics is here. We always require the w.client instance to use the callback built via handleResults to report success/failure within a batch. This allows the output client to decide on sync or async error reporting.
The async client as provided by go-lumber does require full async reporting, but the Logstash output did some sync reporting as well, leading to duplicates. The PR changes this so that go-lumber only triggers the async reporting, indirectly via msgRef.
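A rough sketch of that callback contract follows, under the assumption that a batch may be split into several partial sends (windows) and that every partial send reports back through the same reference. This mirrors the msgRef idea but is not the actual handleResults/msgRef implementation; all names here are illustrative.

package main

import "sync/atomic"

type batch []string

// msgRef tracks one published batch that may be split into several partial
// sends. Whatever mix of ACK and failure callbacks the client delivers,
// the batch completes exactly once: fully ACKed or handed back for retry,
// never both and never twice. Callbacks for one batch are assumed to be
// invoked sequentially by the client; otherwise failed/err need locking.
type msgRef struct {
	count  int32 // outstanding partial sends
	failed batch // events collected from failure callbacks, to be retried
	err    error
	onACK  func()
	onFail func(batch, error)
}

func newMsgRef(parts int32, onACK func(), onFail func(batch, error)) *msgRef {
	return &msgRef{count: parts, onACK: onACK, onFail: onFail}
}

// fail records a failed partial send; the retry decision is deferred to dec,
// so a batch with several windows is still only retried once.
func (r *msgRef) fail(failedEvents batch, err error) {
	r.failed = append(r.failed, failedEvents...)
	if r.err == nil {
		r.err = err
	}
	r.dec()
}

// ack records a successful partial send.
func (r *msgRef) ack() { r.dec() }

func (r *msgRef) dec() {
	if atomic.AddInt32(&r.count, -1) > 0 {
		return // other partial sends are still in flight
	}
	if r.err != nil {
		r.onFail(r.failed, r.err) // single retry path
		return
	}
	r.onACK()
}

func main() {}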
@urso I am aware of the possible deadlock, but it could also cause just pure duplicates on retry, right? I remember seeing a few cases recently concerning duplicates. Do we also have an idea since when this problem occurs? It might help debugging future and past cases.
@ph The bug could produce duplicates on retry if
@urso LGTM, took a bit longer. I don't see any potential for bad side effects with this PR.
 @@ -28,6 +28,8 @@ https://github.com/elastic/beats/compare/v5.6.4...5.6[Check the HEAD diff]
 
 *Affecting all Beats*
 
+- Fix duplicate batches of events in retry queue. {pull}5520[5520]
I would probably mention the deadlock; duplicates can be seen in other scenarios.
LGTM.