"empty" ProducerMessage may lead to the producer hang #2150

Closed
hxiaodon opened this issue Feb 16, 2022 · 1 comment

@hxiaodon commented:

@slaunay
On my local laptop (macOS 10.15.7), I think I have reproduced the hang problem again :). I believe your fix does work and covers many scenarios, which makes the hang much harder to trigger.

Thanks for providing those details.
I am not sure whether this is linked to the current issue (the deadlock regression), but something is indeed not right, and we should probably create another issue for that particular scenario.

Here are some things I found:

  • the brokerProducer is blocked sending a success:
3 @ 0x1038d16 0x1006925 0x10064dd 0x1301574 0x130038a 0x1326142 0x12ff72f 0x12ff65f 0x12fe85f 0x132e9de 0x10688a1
#	0x1301573	github.com/Shopify/sarama.(*asyncProducer).returnSuccesses+0xb3		/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/async_producer.go:1155
#	0x1300389	github.com/Shopify/sarama.(*brokerProducer).handleSuccess.func1+0x469	/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/async_producer.go:974
#	0x1326141	github.com/Shopify/sarama.(*produceSet).eachPartition+0x101		/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/produce_set.go:211
#	0x12ff72e	github.com/Shopify/sarama.(*brokerProducer).handleSuccess+0x8e		/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/async_producer.go:950
#	0x12ff65e	github.com/Shopify/sarama.(*brokerProducer).handleResponse+0x3e		/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/async_producer.go:938
#	0x12fe85e	github.com/Shopify/sarama.(*brokerProducer).run+0x1be			/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/async_producer.go:872
#	0x132e9dd	github.com/Shopify/sarama.withRecover+0x3d				/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/utils.go:43

https://github.com/Shopify/sarama/blob/f1bc44e541eecf45f935b97db6a457740aaa073e/async_producer.go#L1155

  • the syncProducer successes goroutine is blocked forwarding a success (and therefore blocking the brokerProducer):
1 @ 0x1038d16 0x10065f6 0x10064dd 0x132da0d 0x132e9de 0x10688a1
#	0x132da0c	github.com/Shopify/sarama.(*syncProducer).handleSuccesses+0x8c	/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/sync_producer.go:132
#	0x132e9dd	github.com/Shopify/sarama.withRecover+0x3d			/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/utils.go:43

https://github.com/Shopify/sarama/blob/f1bc44e541eecf45f935b97db6a457740aaa073e/sync_producer.go#L132

  • one producer goroutine is blocked waiting for the current success/error:
8 @ 0x1038d16 0x10077cc 0x10071f8 0x132d910 0x1334d50 0x1334a0c 0x10688a1
#	0x132d90f	github.com/Shopify/sarama.(*syncProducer).SendMessage+0x8f	/Users/tiger/go/pkg/mod/github.com/slaunay/sarama@v1.31.2-0.20220211051606-f1bc44e541ee/sync_producer.go:96
#	0x1334d4f	main.(*SyncProducer).SendMessage+0x22f				/Users/tiger/Downloads/working/kafka-cluster/producer/producer.go:41
#	0x1334a0b	main.main.func3+0x1cb						/Users/tiger/Downloads/working/kafka-cluster/producer/main.go:59

https://github.com/Shopify/sarama/blob/f1bc44e541eecf45f935b97db6a457740aaa073e/sync_producer.go#L96
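
The dumps above look like aggregated goroutine profiles (the `N @ pc ...` format). A minimal sketch of how such dumps can be collected, assuming the producer process exposes a net/http/pprof endpoint (the port is illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose pprof on a side port; an aggregated goroutine dump in the
	// format shown above can then be fetched with:
	//   curl 'http://localhost:6060/debug/pprof/goroutine?debug=1'
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}
```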

Now what is really interesting is that the expectation field on a ProducerMessage used by the syncProducer is a channel that is always buffered with a capacity of 1.
So, in theory, it should never block the syncProducer successes goroutine, yet that is what appears to be happening.
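
For reference, a minimal sketch of that pattern (paraphrased — the real expectation field is unexported, so the type below is only illustrative): a channel buffered with capacity 1 lets a single success or error be delivered without blocking the sender.

```go
package main

import "fmt"

// producerError stands in for sarama's *ProducerError; the actual
// expectation field on ProducerMessage is unexported.
type producerError struct{ err error }

func main() {
	// Buffered with capacity 1: a single send never blocks.
	expectation := make(chan *producerError, 1)

	expectation <- nil // deliver a "success" without blocking the sender
	if res := <-expectation; res == nil {
		fmt.Println("message acknowledged")
	}
}
```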

The null-key/null-value record you see in the topic makes me think that:

  • a mostly "empty" ProducerMessage ends up being sent to the remote broker
  • the same message, with a nil expectation field, ends up being delivered as a success to the syncProducer successes goroutine
  • sending to a nil channel blocks forever, which blocks the brokerProducer and prevents further records from being produced (see the sketch after this list)

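As a standalone illustration of that last point (not sarama code): a send on a nil channel blocks forever, while a send on the buffered expectation channel returns immediately.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	buffered := make(chan error, 1) // like a properly initialised expectation channel
	var nilChan chan error          // zero value: nil, like a missing expectation

	buffered <- nil // returns immediately thanks to the buffer
	fmt.Println("send on the buffered channel returned")

	// A send on a nil channel blocks forever; the timeout is only here so
	// that the demo terminates.
	select {
	case nilChan <- nil:
		fmt.Println("unreachable")
	case <-time.After(time.Second):
		fmt.Println("send on the nil channel is still blocked after 1s")
	}
}
```
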
Now, such an "empty" ProducerMessage used by the AsyncProducer can be one of its internal bookkeeping messages, for example the fin messages used during retries.

Since fin messages are used during retries, they might be the root cause of the hang if they somehow escape the AsyncProducer and end up both on the broker and on a success channel.

It would be great to confirm this is the case and ideally have a simple unit test for that scenario.
If you reduce connections.max.idle.ms (on the broker) and Net.ReadTimeout (on the client), you might be able to reproduce it faster.
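
For example, something along these lines might speed up reproduction (a sketch with illustrative values, assuming the github.com/Shopify/sarama import path and a local broker):

```go
package main

import (
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	// Broker side (server.properties), illustrative value:
	//   connections.max.idle.ms=10000

	cfg := sarama.NewConfig()
	cfg.Producer.Return.Successes = true  // required by the SyncProducer
	cfg.Net.ReadTimeout = 5 * time.Second // well below the 30s default

	producer, err := sarama.NewSyncProducer([]string{"localhost:9092"}, cfg)
	if err != nil {
		panic(err)
	}
	defer producer.Close()
	// ... produce in a loop and watch for SendMessage calls that never return.
}
```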

@hxiaodon Would you mind creating another issue?

Originally posted by @slaunay in #2133 (comment)

@hxiaodon (author) commented:

@slaunay
I created this new issue to track the new hang problem discussed in the closed PR #2133.

niamster added a commit to niamster/sarama that referenced this issue Mar 20, 2022
@dnwe closed this as completed in d5f076b Mar 30, 2022
dnwe added a commit that referenced this issue Mar 30, 2022
dnwe added a commit that referenced this issue Mar 30, 2022
docmerlin added a commit to influxdata/kapacitor that referenced this issue May 24, 2022
docmerlin added a commit to influxdata/kapacitor that referenced this issue May 24, 2022
* fix(kafka): updated kafka client to fix a bug in the library

This bug can cause the producer to hang IBM/sarama#2150

* chore: go mod tidy