Producer hangs when Retry.Max is 0 #294
Comments
The following test will pass/fail with this issue; it should be copied into the test suite. Edit: it has now been moved into the test suite with the appropriate skip.
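For reference, a minimal sketch of what a hang-detection test for this bug could look like, assuming the github.com/Shopify/sarama import path of the era, a SyncProducer, a broker at localhost:9092, and a placeholder topic and timeout. This is not the test referenced above, just an illustration of the failure mode.

```go
package producer

import (
	"testing"
	"time"

	"github.com/Shopify/sarama"
)

// TestProducerWithZeroRetries sends a message with Retry.Max = 0 and
// fails if the send does not return (success or error) within a bound.
func TestProducerWithZeroRetries(t *testing.T) {
	config := sarama.NewConfig()
	config.Producer.Retry.Max = 0           // the configuration under test
	config.Producer.Return.Successes = true // required so SendMessage blocks until acked

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		t.Fatal(err)
	}
	defer producer.Close()

	done := make(chan error, 1)
	go func() {
		// In the bug scenario the partition leader has just gone away;
		// with Retry.Max = 0 this call used to block forever.
		_, _, err := producer.SendMessage(&sarama.ProducerMessage{
			Topic: "test-topic",
			Value: sarama.StringEncoder("hello"),
		})
		done <- err
	}()

	select {
	case <-done:
		// An error is acceptable here; the point is that we returned at all.
	case <-time.After(30 * time.Second):
		t.Fatal("producer hung instead of returning an error")
	}
}
```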
I feel the solution should be that we have a way to actively invalidate (remove) metadata as soon as we encounter a problem, so the subsequent message has no choice but to request fresh metadata. Maybe we should invalidate all cached metadata for a broker in …
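A hypothetical sketch of that "actively invalidate" idea, not Sarama's actual implementation: a cache keyed by topic and partition that the caller clears the moment a broker-level error is seen, so the next lookup has to refresh from the cluster.

```go
package metadata

import "sync"

// leaderCache is a stand-in for the client's cached topic metadata:
// topic -> partition -> ID of the broker currently believed to lead it.
type leaderCache struct {
	mu      sync.Mutex
	leaders map[string]map[int32]int32
}

// InvalidateBroker drops every cached leadership entry pointing at the
// failed broker, so the next lookup has no choice but to request fresh
// metadata from the cluster.
func (c *leaderCache) InvalidateBroker(brokerID int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, partitions := range c.leaders {
		for partition, leader := range partitions {
			if leader == brokerID {
				delete(partitions, partition)
			}
		}
	}
}
```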
The problem is not actually in the client; the metadata in the client is correctly invalidated in this case. The problem is that the `leaderDispatcher` never learns that it needs to ask for the fresh metadata.
Once #300 happens there are several easy ways to solve this. One of them is to fetch the …
Skip it until the bug is fixed, but at least it will keep up with the API changes now. Before, it was inline in the ticket and was falling behind.
(The issue title was changed from "Producer hangs when MaxRetries is 0" to "Producer hangs when Retries.Max is 0", and later to "Producer hangs when Retry.Max is 0".)
Hello. So it looks like my issue #509 is a duplicate of this bug. Do you have any plan to fix it? For now I am trying to configure the built-in retry mechanism, but I have a question: are you retrying network and circuit-breaker errors? I am using a round-robin partitioner, and it would be great if you could automatically reassign the partition in case of a circuit-breaker or network error.
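For reference, a small configuration sketch of the retry knobs in question, assuming the Producer.Retry / Metadata.Retry layout that the issue title converged on; the values and the newRetryConfig helper are purely illustrative.

```go
package config

import (
	"time"

	"github.com/Shopify/sarama"
)

// newRetryConfig returns a config with the retry knobs set explicitly;
// the numbers are examples, not recommendations.
func newRetryConfig() *sarama.Config {
	c := sarama.NewConfig()
	c.Producer.Retry.Max = 3                          // retries for failed produce requests
	c.Producer.Retry.Backoff = 250 * time.Millisecond // pause between those retries
	c.Metadata.Retry.Max = 3                          // metadata fetches are retried at the client level
	return c
}
```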
Yes, it's just slow going as it mostly depends on the refactoring in #300. That is almost done however.
The producer retries only errors resulting from actual Produce requests. Errors from metadata requests and similar are not retried in the producer because they are already retried at the client level. Errors from circuit-breakers are not retried, because that would defeat the entire purpose of having the circuit-breaker.
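A conceptual sketch of that policy, using a hypothetical shouldProducerRetry helper and hypothetical error values rather than any real Sarama API:

```go
package retrypolicy

import "errors"

// Hypothetical error values standing in for the three classes of
// failure discussed above; they are not Sarama error types.
var (
	errProduceFailed  = errors.New("produce request failed")  // retried by the producer
	errMetadataFailed = errors.New("metadata request failed") // already retried by the client
	errBreakerOpen    = errors.New("circuit breaker is open") // retrying would defeat the breaker
)

// shouldProducerRetry encodes the policy: only failed Produce requests
// are retried at the producer level.
func shouldProducerRetry(err error) bool {
	return errors.Is(err, errProduceFailed)
}
```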
And what about network errors when producing a message? Is this the same case?
You could automatically assign an available partition when using a random or round-robin partitioner; ordering is not important for those messages.
Network errors when producing a message are retried.
Hmm, yes, that would be a good improvement. Perhaps file a separate enhancement request to track this idea.
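A rough sketch of that enhancement idea, not something Sarama ships: a round-robin partitioner that skips partitions the application has marked as unavailable. It assumes the Partitioner interface with the Partition(*ProducerMessage, int32) signature; the unavailable-set bookkeeping is entirely hypothetical and would have to be maintained by the caller.

```go
package partitioner

import (
	"errors"
	"sync"

	"github.com/Shopify/sarama"
)

// skippingRoundRobin hands out partitions in round-robin order but
// skips any partition the application has flagged as unavailable.
type skippingRoundRobin struct {
	mu          sync.Mutex
	next        int32
	unavailable map[int32]bool // maintained by the caller, e.g. after network errors
}

// Partition implements sarama.Partitioner.
func (p *skippingRoundRobin) Partition(msg *sarama.ProducerMessage, numPartitions int32) (int32, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := int32(0); i < numPartitions; i++ {
		candidate := p.next % numPartitions
		p.next = (p.next + 1) % numPartitions
		if !p.unavailable[candidate] {
			return candidate, nil
		}
	}
	return -1, errors.New("no available partition")
}

// RequiresConsistency is false: a message may land on any available
// partition, which is exactly the round-robin use case discussed above.
func (p *skippingRoundRobin) RequiresConsistency() bool { return false }
```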
Hmmm, doesn't seem so. I'm going to close it, but feel free to reopen if you think otherwise 🙌
The `leaderDispatcher` function only knows to check for a new leader when it receives a message with a new retry value, but there are at least two cases (possibly more?) where, for various reasons, the `flusher` fails but never sends a retriable message to the `leaderDispatcher`.

In both of the above cases, we can end up with a `leaderDispatcher` whose broker is toast, but which doesn't know it and so doesn't try to get a new one.

In the normal case, the next message (not at the retry limit) will get spun around anyway and kick things off the way it should; this is less than ideal (we waste a retry for the message(s) in question) but not a very severe problem, as by default we have three retries configured and typically only need one. (This is also poor behaviour in other configurations, in that it can "waste" some of the configured retries, but is otherwise not a problem: it doesn't break ordering or anything.)

However, in the case where someone sets `MaxRetries` to 0, every subsequent message is immediately returned to the user and the producer ends up stuck.
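A conceptual sketch of that failure mode, not the real dispatcher code; the types and the refreshLeader callback are invented for illustration.

```go
package sketch

// message and leaderDispatcher are simplified stand-ins for the real
// producer internals; refreshLeader represents a metadata lookup.
type message struct {
	retries int
	payload []byte
}

type brokerLink struct{ addr string }

type leaderDispatcher struct {
	highWatermark int
	broker        *brokerLink // may already be dead
}

func (d *leaderDispatcher) dispatch(msg *message, refreshLeader func() *brokerLink) {
	// The only trigger for picking a new leader is a message that has
	// been retried more times than anything seen so far.
	if msg.retries > d.highWatermark {
		d.highWatermark = msg.retries
		d.broker = refreshLeader()
	}
	// With MaxRetries = 0 the flusher hands failed messages straight
	// back to the user instead of re-queueing them, so msg.retries is
	// always 0, the branch above never fires, and d.broker is never
	// replaced after the original leader dies.
	send(d.broker, msg)
}

func send(b *brokerLink, msg *message) { /* network send elided */ }
```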