Kafka consumer (v1.4.2) gets stuck (NOT_COORDINATOR) whilst rejoining a group after a broker rolling update #2944
Comments
This is a fantastic bug report! 💯
Broker 0 responds to FindCoordinator requests by saying that broker 0 (itself) is the coordinator for the given group, but on JoinGroup it responds with Not coordinator. It would be useful to have this reproduced with `debug=all` to see exactly which requests are being sent to which broker.
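For anyone wanting to capture such a trace, here is a minimal sketch of a Go consumer with full librdkafka debug output, assuming confluent-kafka-go; the broker addresses, group id and topic are placeholders:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	// debug=all makes librdkafka log every request/response (FindCoordinator,
	// JoinGroup, Metadata, ...) per broker connection to stderr.
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers": "broker-0:9092,broker-1:9092,broker-2:9092", // placeholders
		"group.id":          "debug-repro-group",                         // placeholder
		"debug":             "all",
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.Subscribe("some-topic", nil); err != nil { // placeholder topic
		panic(err)
	}
	for {
		if ev := c.Poll(1000); ev != nil {
			fmt.Println(ev)
		}
	}
}
```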
The other interesting difference between this issue and #2933 is that this one seems to rely on the auto-commit, whereas I couldn't reproduce #2933 with auto-commit enabled.
Thanks for the quick response! I have been trying all day to reproduce this with the same binary (same artifact) with debug logging enabled.
This shouldn't be possible (apart from a bug in Kafka itself) because of how the rollout is done: a broker instance (a Kubernetes pod) is terminated, and only then is a new one created, with a new IP and the same persistent volume attached to the new pod. However, whilst the IPs will differ, the hostname and FQDN will be the same -- is it possible that there's some client-side caching / temporary blacklisting that confuses rdkafka? That said, it should be fine, as `broker.address.ttl` is at the default 1 second. The advertised listeners on 9092 and 9094 are IP-based, but on our usual listener (which I can't use in my test binary) they are FQDN-based. So far I have ruled this out because I was able to get that one consumer stuck on :9092 and :9094, but please let me know if I've missed something.
But would this be the case over many hours? We monitor the brokers' health and there are no issues after the rollout. The JoinGroup error starts happening whilst there are still under-replicated partitions (replication factor 3, min.insync.replicas=2), which I assume includes __consumer_offsets, but those catch up within a minute or two after the last broker has been booted.

As I said, I managed to reproduce it once. However, this actually happened with librdkafka 1.2.1! I think it's possible that over the half a year we've used 1.2.1 we've had a small handful of occurrences where this could have been the root cause without us noticing; however, this is nowhere near as frequent as it is with 1.4.2, which we observed more or less daily for 1-2 consumers. So perhaps something changed in newer versions to make this more probable.

Also, what I've captured, if I'm reading it correctly, is whilst the consumer is in state "up"; the original consumer's state seems to be "join-state started". Do you think this could result in NOT_COORDINATOR errors during heartbeats / commits? During my experiments, I have now started seeing the 1.2.1 environments report such errors (but no JoinGroup request errors in those environments).

Here's a breakdown of the rollout which I captured (for some irrelevant reasons, the rollout in this env was slower today). The only interesting bits I've managed to find in the sea of broker logs are around broker-2 handing over the coordinator role, and soon after, the newly recovered broker-0 reporting that the CG is empty. For our real consumer groups which we've detected stuck, I haven't been able to confirm/reject whether the below is always emitted, but I have seen it; in some cases I have also seen the group declared dead. Broker logs:
Some consumer logs (with debug enabled):
Full consumer logs (577k lines): Hopefully this helps.
Hi @edenhill, is the above useful or do you need more information? Thanks!
If the broker id changes but the address:port remains the same, then librdkafka will associate that address:port with the old broker id until metadata is refreshed (up to `topic.metadata.refresh.interval.ms`), so this definitely sounds like it could be the issue here.
The broker ID doesn't change AFAIK, as the new instance reuses the storage volume of the old instance and therefore adopts that old ID.
Does this mean that the old IP might be reused for a broker with a different broker id?
It's definitely possible but I think unlikely, as the IP CIDR blocks are large; from looking at the last 10+ rollouts, this hasn't happened. I'll keep an eye on that and report if it correlates, however.
I am hitting the same issue with 1.4.2 and it's happening only in our prod environment. It's been difficult to collect logs at debug level since we have too many client instances running. @dimpavloff, is this issue still repro'ing for you? Did you add any resiliency on the client side? @edenhill, any suggestions here?
@ajbarb Not sure about 1.5.2, but in 1.4.2, yes, I was getting the issue. I implemented a reconnect for when this is detected.
How did you detect this issue on the client side? Do you restart the client or just reconnect the broker connections under the hood?
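Not speaking for the earlier commenter, but the public Go API has no call to drop or reconnect an individual broker connection, so client-side "reconnect" in practice usually means closing and recreating the consumer. A rough sketch, with hypothetical names and simplified error handling:

```go
package kafkautil // hypothetical helper package

import "github.com/confluentinc/confluent-kafka-go/kafka"

// RecreateConsumer closes a consumer that looks stuck and builds a fresh one,
// re-subscribing to the same topics. newConf is assumed to return a freshly
// built configuration each time; offset management is left to auto-commit.
func RecreateConsumer(old *kafka.Consumer, newConf func() *kafka.ConfigMap, topics []string) (*kafka.Consumer, error) {
	if old != nil {
		old.Close() // leaves the group and tears down all broker connections
	}
	c, err := kafka.NewConsumer(newConf())
	if err != nil {
		return nil, err
	}
	if err := c.SubscribeTopics(topics, nil); err != nil {
		c.Close()
		return nil, err
	}
	return c, nil
}
```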
Seems to be related to https://issues.apache.org/jira/browse/KAFKA-9752
I have been able to confirm that the client actually connects to the wrong broker IP when the consumer group coordinator switch is happening, because the cached broker address still points to the earlier broker:

Broker/ip 12-25-2020 13:32:48[ProcessId: 63772] Level: Debug, Message: [thrd:main]: GroupCoordinator/1685982883: Broker nodename changed from "100.126.14.175:9092" to ""

I was able to repro this locally with these steps:
The problem seems to be in the code below (rdkafka_broker.c), where, if the client observes multiple group coordinator switches within the `broker.address.ttl` window, it uses the cached address:
@edenhill, what would be the right fix here?
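Not an answer to the question above, but for anyone experimenting in the meantime: both settings involved in this code path are plain client configuration, so their windows can be shrunk from the application side. This is a sketch only; the values are arbitrary examples rather than recommendations, and the addresses/group id are placeholders:

```go
package kafkautil // hypothetical helper package

import "github.com/confluentinc/confluent-kafka-go/kafka"

// NewTunedConsumer shortens the broker address cache TTL and the metadata
// refresh interval so that stale address/broker associations age out sooner,
// at the cost of more frequent lookups and metadata requests.
func NewTunedConsumer() (*kafka.Consumer, error) {
	return kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":                  "broker-0:9092,broker-1:9092,broker-2:9092", // placeholders
		"group.id":                           "my-group",                                  // placeholder
		"broker.address.ttl":                 500,   // librdkafka default: 1000 ms
		"topic.metadata.refresh.interval.ms": 60000, // librdkafka default: 300000 ms
	})
}
```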
@ajbarb, looks like a very nice investigation, thanks for this! One thing that slightly worries me is that setting the value to

Btw, here's a reminder of the env we have, hopefully useful for debugging:
Yep
EDIT2: Nevermind, it turned out to be due to some race conditions on our side, in the partitions revoked / assigned callbacks.
I hit some similar error-loop situation where I'd see these debug logs for the consumer:
It might be apparent that these are AWS MSK brokers. What's not obvious is that the MSK cluster lives in a different VPC and I'm connecting using private DNS resolution for those hostnames, which resolve to VPC Private Links exposed to my consumer (the setup described in the blog post "Secure connectivity patterns to access Amazon MSK across AWS Regions").

My error situation was a mistake on my part: the Route53 private domain aliases for each of the bootstrap servers were wrong for all 3 VPC endpoint DNS routes. As a result, some connectivity appeared to be fine while joining a group would fail in a loop; the reason is that the DNS aliases in my case were actually pointing to the wrong destination advertised by the broker for the group coordinator. As such, the client would connect to the wrong broker. The problem melted away when I straightened out the DNS aliases to match the broker addresses' VPC endpoints, as one might expect.

I'm sharing this on this thread since the hints about DNS and IP changes caused me to triple-check my endpoint configurations, and indeed the issue was that the aliases were mixed up (b-1 was resolving to b-3, etc.).
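In case it helps anyone with a similar Private Link setup, a quick sanity check is to resolve each bootstrap alias and compare the result with the VPC endpoint it is supposed to point at. A small sketch; the hostnames are placeholders for your own Route53 aliases:

```go
package main

import (
	"fmt"
	"net"
)

func main() {
	// Placeholder aliases; replace with the private-zone names used as
	// bootstrap.servers / advertised listeners.
	aliases := []string{
		"b-1.example.internal",
		"b-2.example.internal",
		"b-3.example.internal",
	}
	for _, host := range aliases {
		addrs, err := net.LookupHost(host)
		if err != nil {
			fmt.Printf("%s: lookup failed: %v\n", host, err)
			continue
		}
		// A mix-up here (b-1 resolving to b-3's endpoint, etc.) shows up on the
		// client as JoinGroup looping with NOT_COORDINATOR.
		fmt.Printf("%s -> %v\n", host, addrs)
	}
}
```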
Hi,
First off, my company is a long-time user of rdkafka, thanks for all the work!
tl;dr We've had consumers stall indefinitely after rolling update of Kafka brokers with "not coordinator" errors returned to JoinGroup requests. The issue is present in 1.4.0 and 1.4.2 but not in 1.2.0 and happens fairly often after a rollout. This seems similar to #2791 but that should have been fixed in 1.4.2 according to that thread. Further, we've tried not subscribing to rebalance events without any difference. The issue may be related to autocommits. This looks similar to #2933, too; however, there the reporter says the issue does not exist in 1.4.0 whilst for us it does, as far as we can tell, and also there's no SESSTMOUT reported in their logs.
Description
Background:
We use the Go and Python bindings for rdkafka; we build these ourselves, including rdkafka, which we compile against `musl`. The Python and Go configuration we have is almost identical, with the main exception that in Go we use `enable.auto.commit` and make calls to `StoreOffsets`, whilst in Python we call `consumer.commit()` (in either case, the goal is at-least-once delivery to the business code). I mention this because this issue seems to have never been detected in our Python consumers whilst it frequently happens to the Go ones (we do have more Go consumers, however). In Go, we periodically use `Poll()`, just like in Python.
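For context, here is roughly what that Go pattern looks like. This is a sketch only, assuming `enable.auto.offset.store=false` so that the auto-commit timer only ever commits offsets that were explicitly stored after processing; broker, group and topic names are placeholders:

```go
package main

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

func main() {
	c, err := kafka.NewConsumer(&kafka.ConfigMap{
		"bootstrap.servers":        "broker-0:9092", // placeholder
		"group.id":                 "my-group",      // placeholder
		"enable.auto.commit":       true,            // background commit of stored offsets
		"enable.auto.offset.store": false,           // store offsets explicitly after processing
	})
	if err != nil {
		panic(err)
	}
	defer c.Close()

	if err := c.Subscribe("some-topic", nil); err != nil { // placeholder topic
		panic(err)
	}

	for {
		ev := c.Poll(100)
		msg, ok := ev.(*kafka.Message)
		if !ok || msg.TopicPartition.Error != nil {
			continue
		}

		process(msg) // business logic; at-least-once because we store only afterwards

		// Mark the *next* offset as consumed; the auto-commit timer flushes it.
		tp := msg.TopicPartition
		tp.Offset++
		if _, err := c.StoreOffsets([]kafka.TopicPartition{tp}); err != nil {
			fmt.Println("StoreOffsets failed:", err)
		}
	}
}

func process(m *kafka.Message) { fmt.Printf("processed %v\n", m.TopicPartition) }
```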
We perform rolling updates of our 3 Kafka brokers fairly regularly. The whole rollout takes about 12-13 minutes -- ~30s to shutdown, ~30s to attach the persistent volume to the new instance (a Kubernetes pod), 180s grace period before moving on to the next one. (This process could be improved to be signal-based and to be mindful of the current controller :) ).
The Issue:
With rdkafka 1.2, we can see the brokers handing over the coordinator role and the clients rejoining the consumer groups -- sometimes there are temporary NOT_COORDINATOR responses, which can be expected during the rollout. Here is an example of the broker JoinGroup responses during a rolling update:
Here's the same rollout but on an environment with rdkafka 1.4 clients:
Note the pre-existing "NOT COORDINATOR" errors already present and how these persist long after the 3 peaks of successful JoinGroup responses (matching the 3 brokers getting replaced). These errors persist until the consumers which are stuck get restarted. The way we identify the stuck consumers is by having a goroutine report the length of `consumer.Assignment()`; if that is `0` for a prolonged period of time (this is adjusted for other factors that could lead to 0 assignment, ofc; see the watchdog sketch below), the consumer is flagged. In the logs of these affected consumers, there's always a

%4|1593263797.864|SESSTMOUT|rdkafka#consumer-1| [thrd:main]: Consumer group session timed out (in join-state started) after 10488 ms without a successful response from the group coordinator (broker 0, last error was Broker: Not coordinator): revoking assignment and rejoining group

or a variation of this with `last error was Success`.
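For reference, a rough sketch of such a watchdog goroutine (hypothetical helper; the interval, threshold and reaction are placeholders, and real code also has to account for legitimately empty assignments, e.g. more consumers than partitions):

```go
package kafkautil // hypothetical helper package

import (
	"log"
	"time"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// WatchAssignment polls the size of the consumer's assignment and flags the
// consumer as potentially stuck once it has had zero partitions for longer
// than maxEmpty. The reaction here is just a log line.
func WatchAssignment(c *kafka.Consumer, maxEmpty time.Duration) {
	var emptySince time.Time
	ticker := time.NewTicker(30 * time.Second)
	defer ticker.Stop()

	for range ticker.C {
		parts, err := c.Assignment()
		if err != nil {
			log.Printf("assignment query failed: %v", err)
			continue
		}
		if len(parts) > 0 {
			emptySince = time.Time{}
			continue
		}
		if emptySince.IsZero() {
			emptySince = time.Now()
		} else if time.Since(emptySince) > maxEmpty {
			log.Printf("empty assignment for over %s -- consumer likely stuck", maxEmpty)
			// e.g. alert here, or close and recreate the consumer
		}
	}
}
```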
The issue (which I assume starts with SESSTMOUT) seems to happen a bit after the last broker has been replaced. It seems that the NOT_COORDINATOR is always, or most of the time, returned by the first broker that was replaced during the rollout. To me, it seems that the consumer still thinks the coordinator is that broker, missing the fact that the last broker to be replaced has successfully rejoined the cluster and become the group coordinator in the meantime. However, this may just be heavily correlated due to the constant speed at which the rollout happens and how that interacts with some consumer timeout parameters.
Of course, SESSTMOUT alone isn't the issue as long as rdkafka recovers from it. Our rdkafka 1.4 Python consumers do seem to recover from it, which is why we've been suspicious of: our Go code, the `confluent-kafka-go` wrapper, and rdkafka. We've ruled out the first two and confirmed it's the latter by reverting to 1.2, where the issue disappeared.

How to reproduce
Have an idle consumer (single consumer in a CG is enough), then trigger a rolling update.
What we have noticed is that the issue does not occur if this consumer does not have any offsets stored (i.e. all offsets are -1001); otherwise, the consumer can get stuck after a broker rolling update. This is why I am suspicious of autocommit with `StoreOffsets` versus explicitly calling `commit()`.
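To tell which of the two states a test consumer is in, the committed offsets for its current assignment can be queried. A sketch assuming the Go client, where `kafka.OffsetInvalid` is the -1001 sentinel mentioned above and the 10 s timeout is arbitrary:

```go
package kafkautil // hypothetical helper package

import (
	"fmt"

	"github.com/confluentinc/confluent-kafka-go/kafka"
)

// HasStoredOffsets reports whether any partition currently assigned to the
// consumer has a committed offset (i.e. is not -1001 / OffsetInvalid).
func HasStoredOffsets(c *kafka.Consumer) (bool, error) {
	assigned, err := c.Assignment()
	if err != nil {
		return false, err
	}
	committed, err := c.Committed(assigned, 10000) // 10 s broker round-trip timeout
	if err != nil {
		return false, err
	}
	for _, tp := range committed {
		if tp.Offset != kafka.OffsetInvalid {
			fmt.Printf("%s[%d] has committed offset %v\n", *tp.Topic, tp.Partition, tp.Offset)
			return true, nil
		}
	}
	return false, nil
}
```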
We've ruled out the following:
- The `confluent-kafka-go` wrapper. However, rolling it back to 1.1 (whilst still building it against rdkafka 1.4) did not make a difference.
- `go.application.rebalance.enable`, with either a `nil` callback or a rebalanceCb function passed when calling Go's `consumer.Subscribe()`. In all cases, this did not make a difference.
- Setting `topic.metadata.refresh.interval.ms` to something like 150s did not make a difference (as expected, as that shouldn't be necessary).

Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- librdkafka version: `1.4.0`, `1.4.2`
- Apache Kafka version: `2.4.1` (with `log.message.format.version=2.1`)
- librdkafka client configuration:
  - Python (no issue observed):
  - Go (issue observed):
- Operating system: `alpine linux container`, kernel `4.14.165`
- Logs (with `debug=..` as necessary) from librdkafka: Using `debug=consumer,cgrp,topic,metadata` for a consumer that has StoredOffsets, starting right before a rolling update starts. Below is an excerpt; here's a full gist (3.5k lines): https://gist.github.com/dimpavloff/f7c8ee9975eb96b1bfe89ec385d78178