Consumers are getting dropped off on offset commit failures during prolonged kafka rebalancing #394
Hi @garyrussell, @chemicL, @pderop, @artembilan
The logic there is like this: […]
So, that […] How is your use-case handled in a non-reactive scenario? See […]
So, as you said: if we fail to call […] Try with […]
The removal of the mentioned code is not a solution, though.
Hi @artembilan, we have used the doOnError operator to log the error, but the message was not found in the logs. Retry also didn't work.

.doOnError(KafkaConsumer::logEventConsumptionFailure)
.retryWhen(getRetryStrategy())
.onErrorResume(KafkaConsumer::handleErrorOnEventConsumption)
.
.
.repeat()
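For reference, a fuller version of such a chain might look like the sketch below, assuming a KafkaReceiver-based pipeline. The topic, logger, handler, and retry policy are illustrative stand-ins for the project's own logEventConsumptionFailure and getRetryStrategy(); only KafkaReceiver and ReceiverOptions come from reactor-kafka itself.

import java.time.Duration;
import java.util.Collections;
import java.util.Map;

import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import reactor.core.publisher.Flux;
import reactor.kafka.receiver.KafkaReceiver;
import reactor.kafka.receiver.ReceiverOptions;
import reactor.kafka.receiver.ReceiverRecord;
import reactor.util.retry.Retry;

public class ReactiveConsumerSketch {

    private static final Logger log = LoggerFactory.getLogger(ReactiveConsumerSketch.class);

    Flux<String> consume(Map<String, Object> consumerConfig, String topic) {
        ReceiverOptions<String, String> options =
                ReceiverOptions.<String, String>create(consumerConfig)
                        .subscription(Collections.singleton(topic));

        return KafkaReceiver.create(options)
                .receive()
                .doOnNext(record -> record.receiverOffset().acknowledge())
                .map(ReceiverRecord::value)
                .doOnError(e -> log.error("Event consumption failure", e)) // stand-in for logEventConsumptionFailure
                .retryWhen(Retry.backoff(3, Duration.ofSeconds(5)))        // stand-in for getRetryStrategy()
                .repeat();                                                  // resubscribe after completion
    }
}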
Non-reactive example:

KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerConfig);
consumer.subscribe(Arrays.asList(topic));
while (true) {
    try {
        ConsumerRecords<String, String> records = consumer.poll(pollTimeout);
        for (ConsumerRecord<String, String> record : records) {
            log.info("offset = {}, key = {}, value = {}", record.offset(), record.key(), record.value());
        }
    } catch (Throwable e) {
        log.error("KafkaConsumerTask exited with exception: ", e);
        try {
            // clean up and shut down the consumer if it exists
            if (consumer != null) {
                consumer.close();
                log.info("KafkaConsumerTask exception: consumer closed");
            }
            // wait a minute before recreating the consumer
            Thread.sleep(60000);
            // recreate the consumer and resubscribe
            consumer = new KafkaConsumer<>(consumerConfig);
            consumer.subscribe(Arrays.asList(topic));
            log.info("KafkaConsumerTask exception: consumer started again");
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt(); // restore the interrupt flag
            log.error("KafkaConsumerTask thread interrupted: ", ie);
        }
    }
}

We have created a custom actuator endpoint that checks the value of the isConsumerActive flag and restarts the consumer if it is false. However, in this case, "Kafka consumer got terminated" was not found in the log, and isConsumerActive was not set to false.

doOnTerminate(() -> {
    isConsumerActive = false;
    log.error("Kafka consumer got terminated");
})

Attaching logs
OK. So you do loop yourself, but you still ask for something to be done in this library. Otherwise, please elaborate what exactly we could do over here, but really not that dropping for the […]
Hi @artembilan, I will add the onErrorContinue operator, but my doubt is that we already have the doOnError operator in place, yet the message was not found in the logs.
OK. Any chance to get a simple project from you to let us reproduce and play with it on our side?
Sure @artembilan. One way I reproduced this issue in our staging environment earlier was by pushing a few million messages to the Kafka topic, having 100 pods connected to it, and performing continuous rolling restarts, which prolonged the rebalancing process. I solved the issue by adding retry and repeat, but in this special case the flow does not even reach the doOnError operator.
Yeah... That is not what I'm going to do here locally.
@artembilan
Expected Behavior
When rebalancing occurs during a commit and all offset commit retries are exhausted, the Reactor Kafka library should poll again and process uncommitted messages. The Kafka consumer should not be dropped and should continue processing the next batch of messages.
Actual Behavior
We have a distributed system with a Kafka topic containing 200 partitions and a corresponding set of consumers. Due to network issues, latency, or other reasons, rebalancing may be triggered. If an offset commit fails during rebalancing and the rebalancing continues beyond the retry period, Kafka consumers are removed from the consumer group.
I reviewed the Reactor Kafka library and found that the asyncCleanup in the withHandler method stops the Consumer Event Loop. In a non-reactive Kafka consumer implementation, there is usually an infinite loop for poll(), where exceptions are caught, and the consumer continues to process the next set of messages. However, in reactive Kafka, the consumer event loop itself gets closed.
I have used the repeat/retry workaround and increased the offset commit retry attempts, but it is still not working.
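One shape such a workaround can take (a sketch under my own assumptions, not code from this issue): wrap the receiver creation in Flux.defer so that every retry/repeat resubscription builds a brand-new KafkaReceiver, and therefore a new consumer event loop, instead of reusing the one that asyncCleanup has already closed. The restart delay is illustrative.

import java.time.Duration;
import java.util.function.Supplier;

import reactor.core.publisher.Flux;
import reactor.kafka.receiver.KafkaReceiver;
import reactor.kafka.receiver.ReceiverOptions;
import reactor.kafka.receiver.ReceiverRecord;
import reactor.util.retry.Retry;

public class RestartingConsumer {

    Flux<ReceiverRecord<String, String>> receiveForever(
            Supplier<ReceiverOptions<String, String>> optionsFactory) {
        // defer: each (re)subscription calls the factory and gets a fresh receiver
        return Flux.defer(() -> KafkaReceiver.create(optionsFactory.get()).receive())
                .retryWhen(Retry.fixedDelay(Long.MAX_VALUE, Duration.ofSeconds(60)))
                .repeat(); // also resubscribe if the flux completes without an error
    }
}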
Steps to Reproduce
Kafka Properties
kafka.session.timeout.ms=300000
kafka.heartbeat.interval.ms=30000
kafka.request.timeout.ms=180000
kafka.max.poll.records=500
kafka.max.poll.interval.ms=300000
Kafka consumer retry config (mapped onto ReceiverOptions in the sketch after this list)
max.commit.attempts=200
commit.retry.interval=5000
max.delay.rebalance.ms=240000
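If I read the reactor-kafka API correctly, these three retry settings are not plain Kafka client properties but ReceiverOptions settings on a recent 1.3.x, roughly as below; the properties map and topic name are placeholders.

import java.time.Duration;
import java.util.Collections;
import java.util.Map;

import reactor.kafka.receiver.ReceiverOptions;

public class ReceiverOptionsSketch {

    ReceiverOptions<String, String> build(Map<String, Object> kafkaProps) {
        return ReceiverOptions.<String, String>create(kafkaProps)
                .maxCommitAttempts(200)                       // max.commit.attempts
                .commitRetryInterval(Duration.ofMillis(5000)) // commit.retry.interval
                .maxDelayRebalance(Duration.ofMillis(240000)) // max.delay.rebalance.ms
                .subscription(Collections.singleton("my-topic"));
    }
}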
Logs
Possible Solution
The Consumer Event Loop should not be closed during cleanup. Instead, it should continue polling for messages.
Your Environment
Other relevant libraries versions (eg. netty, ...): Spring Boot WebFlux - 3.2.7
JVM version (java -version): 17