
Automatically destroy/recreate consumer client when connection is stale #899

Closed
tatianafrank opened this issue Jun 17, 2020 · 5 comments
Labels
investigate further It's unclear what the issue is at this time but there is enough interest to look into it

Comments

@tatianafrank

Description

I have a script that runs an infinite while loop, continually consuming messages from a Kafka topic using v1.4 of the confluent-kafka library. Sometimes, either due to too many connection timeout errors or some unknown issue, I need to stop and restart the entire script so that the Kafka client object is destroyed and recreated and starts working again. My question is whether there is a way to detect in code when the client object needs to be recreated, without having to manually detect the problem and restart the entire script.
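One way to approach the detection half of this question, while waiting on a proper fix in the client, is to track recent transport errors in the consume loop and treat a burst of them as "connection is stale." The helper below is a sketch of that idea; it is not part of confluent-kafka, and the threshold and window values are illustrative assumptions.

```python
import time

class StaleConnectionDetector:
    """Hypothetical helper: flags a consumer as stale once it has seen
    max_errors transport errors within a sliding window_seconds window."""

    def __init__(self, max_errors=5, window_seconds=60.0):
        self.max_errors = max_errors
        self.window_seconds = window_seconds
        self._errors = []  # timestamps of recent transport errors

    def record_error(self, now=None):
        # Call this when poll() returns a transport-level KafkaError.
        self._errors.append(time.monotonic() if now is None else now)

    def record_success(self):
        # A successfully delivered message means the connection is alive.
        self._errors.clear()

    def should_recreate(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop errors that have aged out of the sliding window.
        self._errors = [t for t in self._errors
                        if now - t <= self.window_seconds]
        return len(self._errors) >= self.max_errors
```

In the consume loop you would call `record_error()` whenever `msg.error()` reports a transport failure, `record_success()` on a good message, and tear down and rebuild the `Consumer` whenever `should_recreate()` returns True.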

How to reproduce

Checklist

Please provide the following information:

  • confluent-kafka-python and librdkafka version (confluent_kafka.version() and confluent_kafka.libversion()): 1.4.0
  • Apache Kafka broker version:
  • Client configuration: {...}
  • Operating system:
  • Provide client logs (with 'debug': '..' as necessary)
  • Provide broker log excerpts
  • Critical issue
@mhowlett
Contributor

The intended behavior is that the client should automatically recover from every problem where recovery is possible; if it doesn't, that's a bug. I'm not 100% sure whether there are any known outstanding issues related to this (there don't seem to be any related fixes post-1.4.0 in the librdkafka release notes). Any more information you can provide would be appreciated, in particular debug logs, though they are very verbose and so difficult to collect in production if this is a very intermittent issue.
cc: @edenhill

@mhowlett mhowlett added the investigate further It's unclear what the issue is at this time but there is enough interest to look into it label Jun 19, 2020
@edenhill
Contributor

edenhill commented Jul 7, 2020

This might be "produce/consume hang after partition goes away and comes back, such as when a topic is deleted and re-created." which is fixed in v1.4.2:
https://github.com/edenhill/librdkafka/blob/master/CHANGELOG.md#librdkafka-v142

@tatianafrank
Author

OK, I deployed 1.4.2, so hopefully that helps. I'll have to wait and see if the issue appears again.

@holyachon

holyachon commented Jul 20, 2020

@tatianafrank Hello, I think I have the same issue in v1.4.2. Was the issue fixed for you after upgrading?

@tatianafrank
Author

Saw this error a few times before the connection went stale again:
ERROR: Consumer error: KafkaError{code=_TRANSPORT,val=-195,str="FindCoordinator response error: Local: Broker transport failure"}
Our Kafka broker was down and was then rebooted, but instead of the service using the Confluent client going back to working on its own, we had to restart the service because the consumer just stopped trying to connect.
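Until the underlying recovery bug is resolved, a workaround in this situation is a consume loop that counts consecutive `_TRANSPORT` errors (`val=-195`, as in the log line above) and rebuilds the consumer once a threshold is hit. This is a sketch, not an official confluent-kafka pattern; `make_consumer`, `process`, and the threshold are placeholders, and `max_polls` exists only to bound the loop for testing.

```python
TRANSPORT_ERROR = -195  # KafkaError._TRANSPORT, as seen in the error above

def run(make_consumer, process, max_consecutive_errors=10, max_polls=None):
    """Consume forever, destroying and recreating the consumer after
    max_consecutive_errors transport errors in a row.

    make_consumer: zero-arg factory returning a fresh, subscribed consumer
    process:       callback invoked with each good message
    max_polls:     optional bound on the loop (testing only)
    """
    consumer = make_consumer()
    errors = 0
    polls = 0
    while max_polls is None or polls < max_polls:
        polls += 1
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        err = msg.error()
        if err is not None:
            if err.code() == TRANSPORT_ERROR:
                errors += 1
                if errors >= max_consecutive_errors:
                    consumer.close()           # destroy the stale client
                    consumer = make_consumer() # recreate from scratch
                    errors = 0
            continue
        errors = 0  # a good message resets the error streak
        process(msg)
    return consumer
```

With the real library, `make_consumer` would build a `confluent_kafka.Consumer` from your config and call `subscribe()` on it; note that recreating the consumer triggers a group rebalance, so this is a blunt instrument, reasonable only because the client otherwise never reconnects.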
