
Automatically destroy/recreate consumer client when connection is stale #899

Closed
tatianafrank opened this issue Jun 17, 2020 · 5 comments
Labels
investigate further It's unclear what the issue is at this time but there is enough interest to look into it

Comments

@tatianafrank

Description

I have a script that runs an infinite while loop, continually consuming messages from a Kafka topic using v1.4 of the confluent-kafka library. Sometimes, either due to too many connection timeout errors or some unknown issue, I need to stop and restart the entire script so that the Kafka client object is destroyed and recreated and starts working again. My question is whether there is a way to detect in code when the client object needs to be recreated, without having to manually detect the problem and restart the entire script.
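One way to approach the detection half of this question, while waiting on a proper fix in the client, is to track recent transport errors in the consume loop and treat a burst of them as "connection is stale." The helper below is a sketch of that idea; it is not part of confluent-kafka, and the threshold and window values are illustrative assumptions.

```python
import time

class StaleConnectionDetector:
    """Hypothetical helper: flags a consumer as stale once it has seen
    max_errors transport errors within a sliding window_seconds window."""

    def __init__(self, max_errors=5, window_seconds=60.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._errors = []  # timestamps of recent transport errors

    def record_error(self, now=None):
        # Call this when poll() returns a transport-level KafkaError.
        self._errors.append(time.monotonic() if now is None else now)

    def record_success(self):
        # A successfully delivered message means the connection is alive.
        self._errors.clear()

    def should_recreate(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop errors that have aged out of the sliding window.
        self._errors = [t for t in self._errors
                        if now - t <= self.window_seconds]
        return len(self._errors) >= self.max_errors
```

In the consume loop you would call `record_error()` whenever `msg.error()` reports a transport failure, `record_success()` on a good message, and tear down and rebuild the `Consumer` whenever `should_recreate()` returns True.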

How to reproduce

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 1.4.0
  • Apache Kafka broker version:
  • Client configuration: {...}
  • Operating system:
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue
@mhowlett
Contributor

The intended behavior is that the client should automatically recover from every problem where recovery is possible; if it doesn't, that's a bug. I'm not 100% sure whether there are any known outstanding issues related to this (there don't seem to be any related fixes post-1.4.0 in the librdkafka release notes). Any more information you can provide would be appreciated, in particular debug logs, though they are very verbose and so difficult to collect in production if this is a very intermittent issue.
cc: @edenhill

@mhowlett mhowlett added the investigate further It's unclear what the issue is at this time but there is enough interest to look into it label Jun 19, 2020
@edenhill
Contributor

edenhill commented Jul 7, 2020

This might be "produce/consume hang after partition goes away and comes back, such as when a topic is deleted and re-created." which is fixed in v1.4.2:
https://github.com/edenhill/librdkafka/blob/master/CHANGELOG.md#librdkafka-v142

@tatianafrank
Author

OK, I deployed 1.4.2, so hopefully that helps. I'll have to wait and see if the issue appears again.

@holyachon

holyachon commented Jul 20, 2020

@tatianafrank Hello, I think I have the same issue in v1.4.2. Was the issue fixed for you after upgrading?

@tatianafrank
Author

Saw this error a few times before the connection went stale again:
ERROR: Consumer error: KafkaError{code=_TRANSPORT,val=-195,str="FindCoordinator response error: Local: Broker transport failure"}
Our Kafka broker was down and was then rebooted, but instead of the service using the Confluent client going back to working on its own, we had to restart the service because the consumer just stopped trying to connect.
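Until the underlying recovery bug is resolved, a workaround in this situation is a consume loop that counts consecutive `_TRANSPORT` errors (`val=-195`, as in the log line above) and rebuilds the consumer once a threshold is hit. This is a sketch, not an official confluent-kafka pattern; `make_consumer`, `process`, and the threshold are placeholders, and `max_polls` exists only to bound the loop for testing.

```python
TRANSPORT_ERROR = -195  # KafkaError._TRANSPORT, as seen in the error above

def run(make_consumer, process, max_consecutive_errors=10, max_polls=None):
    """Consume forever, destroying and recreating the consumer after
    max_consecutive_errors transport errors in a row.

    make_consumer: zero-arg factory returning a fresh, subscribed consumer
    process:       callback invoked with each good message
    max_polls:     optional bound on the loop (testing only)
    """
    consumer = make_consumer()
    errors = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        err = msg.error()
        if err is not None:
            if err.code() == TRANSPORT_ERROR:
                errors += 1
                if errors >= max_consecutive_errors:
                    consumer.close()           # destroy the stale client
                    consumer = make_consumer() # recreate from scratch
                    errors = 0
            continue
        errors = 0  # a good message resets the error streak
        process(msg)
    return consumer
```

With the real library, `make_consumer` would build a `confluent_kafka.Consumer` from your config and call `subscribe()` on it; note that recreating the consumer triggers a group rebalance, so this is a blunt instrument, reasonable only because the client otherwise never reconnects.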
