rd_kafka_query_watermark_offsets API hangs forever #2588
Hi,

diff --git a/src/rdkafka.c b/src/rdkafka.c
index 86d347f8..797ee937 100644
--- a/src/rdkafka.c
+++ b/src/rdkafka.c
@@ -2592,6 +2592,8 @@ rd_kafka_query_watermark_offsets (rd_kafka_t *rk, const char *topic,
         struct rd_kafka_partition_leader *leader;
         rd_list_t leaders;
         rd_kafka_resp_err_t err;
+        int tmout;
+        int cnt;

         partitions = rd_kafka_topic_partition_list_new(1);
         rktpar = rd_kafka_topic_partition_list_add(partitions,
@@ -2641,10 +2643,12 @@ rd_kafka_query_watermark_offsets (rd_kafka_t *rk, const char *topic,

         /* Wait for reply (or timeout) */
         while (state.err == RD_KAFKA_RESP_ERR__IN_PROGRESS &&
-               rd_kafka_q_serve(rkq, 100, 0, RD_KAFKA_Q_CB_CALLBACK,
-                                rd_kafka_poll_cb, NULL) !=
-               RD_KAFKA_OP_RES_YIELD)
-                ;
+               !rd_timeout_expired((tmout = rd_timeout_remains(ts_end)))) {
+                cnt = rd_kafka_q_serve(rkq, tmout, 0, RD_KAFKA_Q_CB_CALLBACK,
+                                       rd_kafka_poll_cb, NULL);
+                if (cnt == RD_KAFKA_OP_RES_YIELD)
+                        break;
+        }

         rd_kafka_q_destroy_owner(rkq);
Does this make sense, or does it involve other risks?
@edenhill
I've encountered a similar issue; it appears
Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ
Description
The rd_kafka_query_watermark_offsets API hangs forever when network access to part of the Kafka cluster is restricted (network isolation).
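For context, here is a minimal sketch of mine (not part of the original report) showing how the API is typically called; the broker list, topic name, partition id, and the 5000 ms timeout are placeholder assumptions. Per this report, the call never returns when the partition leader is unreachable, even though a finite timeout_ms is passed:

#include <stdio.h>
#include <inttypes.h>
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        /* Placeholder broker list (instance A's brokers in the repro setup below). */
        rd_kafka_conf_set(conf, "bootstrap.servers",
                          "instance-a:9093,instance-a:9094,instance-a:9095",
                          errstr, sizeof(errstr));

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf, errstr, sizeof(errstr));
        if (!rk) {
                fprintf(stderr, "rd_kafka_new failed: %s\n", errstr);
                return 1;
        }

        int64_t low, high;
        /* Expected: returns within ~5 s (e.g. RD_KAFKA_RESP_ERR__TIMED_OUT)
         * when partition 2's leader is unreachable.
         * Observed (this issue): the call hangs forever. */
        rd_kafka_resp_err_t err = rd_kafka_query_watermark_offsets(
                rk, "test", 2, &low, &high, 5000 /* timeout_ms */);

        printf("err=%s low=%" PRId64 " high=%" PRId64 "\n",
               rd_kafka_err2str(err), low, high);

        rd_kafka_destroy(rk);
        return 0;
}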
How to reproduce
I can reproduce this problem with the latest librdkafka version:

1. Launch two VM/Docker instances, A and B (my local OS is CentOS 6).
2. Install confluent-oss on instance A and start Kafka with 3 broker services.
3. Create a topic "test" with 3 partitions and replication-factor 1, so each broker holds a unique partition id; assume the "test" topic has the following composition:
4. On instance B, deploy the test program: main.go.zip
5. Enable the iptables service on instance A so that it rejects instance B's access to port 9095.
6. Now run the test program on instance B (it exercises the QueryWatermarkOffsets API): it hangs (partition 2's broker is alive but not reachable from instance B).

If we use the OffsetsForTimes API instead, the program exits when the timeout expires (see the sketch below).
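As a rough contrast, here is a minimal C sketch of mine (not the attached Go test program); it reuses the rk handle from the sketch above, and the topic, partition, timestamp, and timeout are placeholders. The equivalent OffsetsForTimes lookup does honor its timeout parameter:

rd_kafka_topic_partition_list_t *parts = rd_kafka_topic_partition_list_new(1);
rd_kafka_topic_partition_t *par = rd_kafka_topic_partition_list_add(parts, "test", 2);
par->offset = 1570000000000; /* input: timestamp (ms) to look up; placeholder value */

/* Returns once timeout_ms elapses (e.g. with RD_KAFKA_RESP_ERR__TIMED_OUT),
 * even when partition 2's leader is unreachable. */
rd_kafka_resp_err_t err2 = rd_kafka_offsets_for_times(rk, parts, 5000 /* timeout_ms */);
fprintf(stderr, "OffsetsForTimes: %s\n", rd_kafka_err2str(err2));

rd_kafka_topic_partition_list_destroy(parts);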
Conclusion: I think the issue can easily be reproduced whenever a partition's leader (broker) is network-isolated.
The infinite looping code is shown below:
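(Quoted from the pre-patch rd_kafka_query_watermark_offsets() in src/rdkafka.c, as it appears in the diff earlier in this thread.)

        /* Wait for reply (or timeout) */
        while (state.err == RD_KAFKA_RESP_ERR__IN_PROGRESS &&
               rd_kafka_q_serve(rkq, 100, 0, RD_KAFKA_Q_CB_CALLBACK,
                                rd_kafka_poll_cb, NULL) !=
               RD_KAFKA_OP_RES_YIELD)
                ;

The serve timeout is a fixed 100 ms and ts_end is never consulted, so as long as the leader never answers, state.err stays RD_KAFKA_RESP_ERR__IN_PROGRESS and the loop never exits.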
IMPORTANT: Always try to reproduce the issue on the latest released version (see https://github.com/edenhill/librdkafka/releases), if it can't be reproduced on the latest version the issue has been fixed.
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version (release number or git tag): v0.11.6
Apache Kafka version: confluent-oss-5.0.0-2.11
librdkafka client configuration: "session.timeout.ms": 10000
Operating system: CentOS 6
Provide logs (with debug=.. as necessary) from librdkafka