Producer hangs when Retry.Max is 0 #294
Comments
The following test will pass/fail with this issue; it should be copied into the test suite. Edit: it has now been moved into the test suite with the appropriate skip.
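For reference, a minimal sketch of what a hang-detection test for this bug could look like, assuming the github.com/Shopify/sarama import path of the era, a SyncProducer, a broker at localhost:9092, and a placeholder topic and timeout. This is not the test referenced above, just an illustration of the failure mode.

```go
package producer

import (
	"testing"
	"time"

	"github.com/Shopify/sarama"
)

// TestProducerWithZeroRetries sends a message with Retry.Max = 0 and
// fails if the send does not return (success or error) within a bound.
func TestProducerWithZeroRetries(t *testing.T) {
	config := sarama.NewConfig()
	config.Producer.Retry.Max = 0           // the configuration under test
	config.Producer.Return.Successes = true // required so SendMessage blocks until acked

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		t.Fatal(err)
	}
	defer producer.Close()

	done := make(chan error, 1)
	go func() {
		// In the bug scenario the partition leader has just gone away;
		// with Retry.Max = 0 this call used to block forever.
		_, _, err := producer.SendMessage(&sarama.ProducerMessage{
			Topic: "test-topic",
			Value: sarama.StringEncoder("hello"),
		})
		done <- err
	}()

	select {
	case <-done:
		// An error is acceptable here; the point is that we returned at all.
	case <-time.After(30 * time.Second):
		t.Fatal("producer hung instead of returning an error")
	}
}
```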
I feel the solution should be that we have a way to actively invalidate (remove) metadata as soon as we encounter a problem, so the subsequent message has no choice but to request fresh metadata. Maybe we should invalidate all cached metadata for a broker in …
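A hypothetical sketch of that "actively invalidate" idea, not Sarama's actual implementation: a cache keyed by topic and partition that the caller clears the moment a broker-level error is seen, so the next lookup has to refresh from the cluster.

```go
package metadata

import "sync"

// leaderCache is a stand-in for the client's cached topic metadata:
// topic -> partition -> ID of the broker currently believed to lead it.
type leaderCache struct {
	mu      sync.Mutex
	leaders map[string]map[int32]int32
}

// InvalidateBroker drops every cached leadership entry pointing at the
// failed broker, so the next lookup has no choice but to request fresh
// metadata from the cluster.
func (c *leaderCache) InvalidateBroker(brokerID int32) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, partitions := range c.leaders {
		for partition, leader := range partitions {
			if leader == brokerID {
				delete(partitions, partition)
			}
		}
	}
}
```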
The problem is not actually in the client; the metadata in the client is correctly invalidated in this case. The problem is that the `leaderDispatcher` never learns that it needs to ask for the fresh metadata.
Once #300 happens there are several easy ways to solve this. One of them is to fetch the …
Skip it until the bug is fixed, but at least it will keep up with the API changes now. Before, it was inline in the ticket and was falling behind.
(The issue title was changed from "Producer hangs when MaxRetries is 0" to "Producer hangs when Retries.Max is 0", and later to "Producer hangs when Retry.Max is 0".)
Hello. So it looks like my issue #509 is a duplicate of this bug. Do you have any plan to fix it? For now I am trying to configure the built-in retry mechanism, but I have a question: are you retrying network and circuit-breaker errors? I am using a round-robin partitioner, and it would be great if you could automatically reassign the partition in case of a circuit-breaker or network error.
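For reference, a small configuration sketch of the retry knobs in question, assuming the Producer.Retry / Metadata.Retry layout that the issue title converged on; the values and the newRetryConfig helper are purely illustrative.

```go
package config

import (
	"time"

	"github.com/Shopify/sarama"
)

// newRetryConfig returns a config with the retry knobs set explicitly;
// the numbers are examples, not recommendations.
func newRetryConfig() *sarama.Config {
	c := sarama.NewConfig()
	c.Producer.Retry.Max = 3                          // retries for failed produce requests
	c.Producer.Retry.Backoff = 250 * time.Millisecond // pause between those retries
	c.Metadata.Retry.Max = 3                          // metadata fetches are retried at the client level
	return c
}
```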
Yes, it's just slow going as it mostly depends on the refactoring in #300. That is almost done however.
The producer retries only errors resulting from actual Produce requests. Errors from metadata requests and similar are not retried in the producer because they are already retried at the client level. Errors from circuit-breakers are not retried, because that would defeat the entire purpose of having the circuit-breaker.
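A conceptual sketch of that policy, using a hypothetical shouldProducerRetry helper and hypothetical error values rather than any real Sarama API:

```go
package retrypolicy

import "errors"

// Hypothetical error values standing in for the three classes of
// failure discussed above; they are not Sarama error types.
var (
	errProduceFailed  = errors.New("produce request failed")  // retried by the producer
	errMetadataFailed = errors.New("metadata request failed") // already retried by the client
	errBreakerOpen    = errors.New("circuit breaker is open") // retrying would defeat the breaker
)

// shouldProducerRetry encodes the policy: only failed Produce requests
// are retried at the producer level.
func shouldProducerRetry(err error) bool {
	return errors.Is(err, errProduceFailed)
}
```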
And what about network errors when producing a message? Is this the same case?
You could automatically assign an available partition when using a random or round-robin partitioner; ordering is not important for those messages.
Network errors when producing a message are retried.
Hmm, yes, that would be a good improvement. Perhaps file a separate enhancement request to track this idea.
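A rough sketch of that enhancement idea, not something Sarama ships: a round-robin partitioner that skips partitions the application has marked as unavailable. It assumes the Partitioner interface with the Partition(*ProducerMessage, int32) signature; the unavailable-set bookkeeping is entirely hypothetical and would have to be maintained by the caller.

```go
package partitioner

import (
	"errors"
	"sync"

	"github.com/Shopify/sarama"
)

// skippingRoundRobin hands out partitions in round-robin order but
// skips any partition the application has flagged as unavailable.
type skippingRoundRobin struct {
	mu          sync.Mutex
	next        int32
	unavailable map[int32]bool // maintained by the caller, e.g. after network errors
}

// Partition implements sarama.Partitioner.
func (p *skippingRoundRobin) Partition(msg *sarama.ProducerMessage, numPartitions int32) (int32, error) {
	p.mu.Lock()
	defer p.mu.Unlock()
	for i := int32(0); i < numPartitions; i++ {
		candidate := p.next % numPartitions
		p.next = (p.next + 1) % numPartitions
		if !p.unavailable[candidate] {
			return candidate, nil
		}
	}
	return -1, errors.New("no available partition")
}

// RequiresConsistency is false: a message may land on any available
// partition, which is exactly the round-robin use case discussed above.
func (p *skippingRoundRobin) RequiresConsistency() bool { return false }
```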
Hmmm, doesn't seem so. I'm going to close it, but feel free to reopen if you think otherwise 🙌
The `leaderDispatcher` function only knows to check for a new leader when it receives a message with a new retry value, but there are at least two cases (possibly more?) where, for various reasons, the `flusher` fails but never sends a retriable message to the `leaderDispatcher`.

In both of the above cases, we can end up with a `leaderDispatcher` whose broker is toast, but which doesn't know it and so doesn't try to get a new one.

In the normal case, the next message (not at the retry limit) will get spun around anyway and kick things off the way it should; this is less than ideal (we waste a retry for the message(s) in question) but not a very severe problem, as by default we have three retries configured and typically only need one. (This is also poor behaviour in other configurations, in that it can "waste" some of the configured retries, but is otherwise not a problem: it doesn't break ordering or anything.)

However, in the case where someone sets `MaxRetries` to 0, every subsequent message is immediately returned to the user and the producer ends up stuck.
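A conceptual sketch of that failure mode, not the real dispatcher code; the types and the refreshLeader callback are invented for illustration.

```go
package sketch

// message and leaderDispatcher are simplified stand-ins for the real
// producer internals; refreshLeader represents a metadata lookup.
type message struct {
	retries int
	payload []byte
}

type brokerLink struct{ addr string }

type leaderDispatcher struct {
	highWatermark int
	broker        *brokerLink // may already be dead
}

func (d *leaderDispatcher) dispatch(msg *message, refreshLeader func() *brokerLink) {
	// The only trigger for picking a new leader is a message that has
	// been retried more times than anything seen so far.
	if msg.retries > d.highWatermark {
		d.highWatermark = msg.retries
		d.broker = refreshLeader()
	}
	// With MaxRetries = 0 the flusher hands failed messages straight
	// back to the user instead of re-queueing them, so msg.retries is
	// always 0, the branch above never fires, and d.broker is never
	// replaced after the original leader dies.
	send(d.broker, msg)
}

func send(b *brokerLink, msg *message) { /* network send elided */ }
```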