brokerProducer is nil, panic #2184

Closed
simon28082 opened this issue Mar 20, 2022 · 19 comments

@simon28082

simon28082 commented Mar 20, 2022

Versions
Sarama: 1.32.0
Kafka: 2.8.7
Go: 1.18
Problem Description

This does not happen every time, but it happens frequently.
It is a crash triggered by brokerProducer being nil.
Test scenario:
A message is sent every second, and the network connection is interrupted while messages are being sent. During the process, the producer will asynchronously close the producer, and a panic occurs in some cases after about 3-5 minutes.
At line 560 of github.com/Shopify/sarama/async_producer.go:

if msg.retries > pp.highWatermark {
	// a new, higher, retry level; handle it and then back off
	pp.newHighWatermark(msg.retries)
	pp.backoff(msg.retries)
}

func (pp *partitionProducer) newHighWatermark(hwm int) {
	Logger.Printf("producer/leader/%s/%d state change to [retrying-%d]\n", pp.topic, pp.partition, hwm)
	pp.highWatermark = hwm

	pp.retryState[pp.highWatermark].expectChaser = true
	pp.parent.inFlight.Add(1)
	// Note: this send panics, because pp.brokerProducer is nil at this point
	pp.brokerProducer.input <- &ProducerMessage{Topic: pp.topic, Partition: pp.partition, flags: fin, retries: pp.highWatermark - 1}

	Logger.Printf("producer/leader/%s/%d abandoning broker %d\n", pp.topic, pp.partition, pp.leader.ID())
	pp.parent.unrefBrokerProducer(pp.leader, pp.brokerProducer)
	pp.brokerProducer = nil
}

I also took a quick look at the dispatch method, but could not figure out the specific cause.
This problem only occurs in 1.32.0; I have tested other versions and they do not have this problem.
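
For reference, a minimal sketch of the kind of send loop used in the test scenario above (broker address, topic name, and error handling are illustrative assumptions, not the original reproduction code):

package main

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

func main() {
	config := sarama.NewConfig()
	config.Producer.Return.Errors = true // default, but relied on below

	// Hypothetical broker address and topic, used only for illustration.
	producer, err := sarama.NewAsyncProducer([]string{"localhost:9092"}, config)
	if err != nil {
		log.Fatal(err)
	}
	defer producer.AsyncClose()

	// Log produce errors so retries/failures are visible while the network is interrupted.
	go func() {
		for perr := range producer.Errors() {
			log.Printf("produce error: %v", perr.Err)
		}
	}()

	// Send one message per second; the panic was observed after interrupting
	// the network connection for a few minutes while this loop is running.
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for range ticker.C {
		producer.Input() <- &sarama.ProducerMessage{
			Topic: "test-topic",
			Value: sarama.StringEncoder("payload"),
		}
	}
}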

@niamster
Contributor

Hi @crcms, can you please test this with #2182? Thank you!

@slaunay
Contributor

slaunay commented Mar 25, 2022

Line 560 for 1.32.0 does not seem to dereference a brokerProducer:
https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L560

Would you be able to post the panic trace, confirm the version and ideally provide some Sarama logs?

Also, what do you mean by "During the process, the producer will asynchronously close the producer"?

@simon28082
Author

@slaunay I am just disconnecting the network while the process is running.

@slaunay
Contributor

slaunay commented Mar 31, 2022

@crcms if you can share the panic trace that would go a long way.

@simon28082
Author

@slaunay Here is the panic stack trace.

goroutine 58 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc000218600, 0x1)
        vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc000218600)
        vendor/github.com/Shopify/sarama/async_producer.go:562 +0x4f7
github.com/Shopify/sarama.withRecover(0x0)
        vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1fb

This bug is reliably reproducible: as long as you interrupt the network while the process is running, the problem occurs within about 3-5 minutes. The producer's retry count is left at the default.

@simon28082
Author

Hi @crcms Can you please test this with #2182 ? Thank you!

I am already testing with it, and the problem still occurs.

@slaunay
Contributor

slaunay commented Apr 1, 2022

Thanks @crcms, the stack points us to line 618 for the nil dereferencing rather than line 560:
https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L615-L619

One scenario that would lead to the brokerProducer being nil prior to calling newHighWatermark would be if the following case <-pp.brokerProducer.abandoned matches:
https://github.com/Shopify/sarama/blob/9baf7a651a68631e143039fafd7a7bb13ebdceca/async_producer.go#L547-L558

We need a unit test case to reproduce the issue in order to implement the proper fix.
Such a test case can also be used to confirm whether it still happens with #2182.
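
One possible shape for such a test, sketched against the mock broker that ships in the sarama package (topic name, retry settings, and timings are assumptions; whether this actually triggers the race would still need to be verified):

package sarama_test

import (
	"testing"
	"time"

	"github.com/Shopify/sarama"
)

func TestPartitionProducerRetryAfterBrokerLoss(t *testing.T) {
	// Single mock broker that leads partition 0 and always fails produce
	// requests, forcing the retry path in the partitionProducer.
	broker := sarama.NewMockBroker(t, 1)
	broker.SetHandlerByMap(map[string]sarama.MockResponse{
		"MetadataRequest": sarama.NewMockMetadataResponse(t).
			SetBroker(broker.Addr(), broker.BrokerID()).
			SetLeader("my_topic", 0, broker.BrokerID()),
		"ProduceRequest": sarama.NewMockProduceResponse(t).
			SetError("my_topic", 0, sarama.ErrNotLeaderForPartition),
	})

	config := sarama.NewConfig()
	config.Producer.Return.Errors = true
	config.Producer.Retry.Max = 5
	config.Producer.Retry.Backoff = 50 * time.Millisecond

	producer, err := sarama.NewAsyncProducer([]string{broker.Addr()}, config)
	if err != nil {
		t.Fatal(err)
	}
	defer producer.AsyncClose()

	// Keep feeding messages so that retries at increasing retry levels keep
	// calling newHighWatermark while the broker goes away.
	go func() {
		for i := 0; i < 100; i++ {
			producer.Input() <- &sarama.ProducerMessage{
				Topic: "my_topic",
				Value: sarama.StringEncoder("payload"),
			}
			time.Sleep(10 * time.Millisecond)
		}
	}()

	// Take the broker down mid-flight to mimic the network interruption.
	time.Sleep(200 * time.Millisecond)
	broker.Close()

	// Drain errors for a while; without a fix this may panic inside
	// newHighWatermark instead of failing gracefully.
	deadline := time.After(5 * time.Second)
	for {
		select {
		case <-producer.Errors():
		case <-deadline:
			return
		}
	}
}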

@simon28082
Author

@slaunay Thank you very much, but I have tried #2182 and the problem is still there.

@NewbieOrange

NewbieOrange commented Jun 15, 2022

We are having the same issue, any updates on this?

edit: sarama v1.34.0 (go 1.19) with kafka v2.4.1

@qitta

qitta commented Oct 4, 2022

As this seems to be related, on our production system (sarama v1.36.0, go 1.19; kafka v2.8.0) a pod just crashed with:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x877330]

goroutine 1168 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40008e38c0, 0x2)
	/go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:620 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40008e38c0)
	/go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:564 +0x4a4
github.com/Shopify/sarama.withRecover(0xdf5190?)
	/go/pkg/mod/github.com/!shopify/sarama@v1.36.0/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	/go/pkg/mod/github.com/!shopify/sarama@v1.36.0/async_producer.go:515 +0x208

Are there any updates on this?

@crcms You mentioned that this bug is reliably reproducible? So is this panic the expected behavior if there is a connection issue for a few minutes?

@yusronhanan

(quoting the panic trace from the comment above)

I face the same issue.

@thothothotho

Got the same issue with sarama@1.37.2 in production when we lost our Kafka cluster:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xfca3b0]
goroutine 469 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0x40006bc5a0, 0x1)
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/async_producer.go:708 +0x1e0
github.com/Shopify/sarama.(*partitionProducer).dispatch(0x40006bc5a0)
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/async_producer.go:648 +0x4a4
github.com/Shopify/sarama.withRecover(0x0?)
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/utils.go:43 +0x40
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
        /home/jenkins/go/pkg/mod/github.com/!shopify/sarama@v1.37.2/async_producer.go:599 +0x208

@Quuxplusone

We also see this segfault when the broker goes away. Sarama version 1.31.1 (2022-02-01).
Issue #2322 is a duplicate of this issue #2184, right? Can we mark one or the other as a duplicate?

Dec 10 11:07:19 dut2 OUR: panic: runtime error: invalid memory address or nil pointer dereference
Dec 10 11:07:19 dut2 OUR: [signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1865bfd]
Dec 10 11:07:19 dut2 OUR: goroutine 131 [running]:
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc001361380, 0x1)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:618 +0x19d
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc001361380)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:562 +0x50a
Dec 10 11:07:19 dut2 OUR: github.com/Shopify/sarama.withRecover(0xc0004a5400?)
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/utils.go:43 +0x3e
Dec 10 11:07:19 dut2 OUR: created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
Dec 10 11:07:19 dut2 OUR: /u0/REDACTED/source/telegraf-1.22.1/vendor/github.com/Shopify/sarama/async_producer.go:513 +0x1f6
Dec 10 11:07:19 dut2 systemd: OUR.service: main process exited, code=exited, status=2/INVALIDARGUMENT

@flgg2yybb

flgg2yybb commented Dec 15, 2022

Just hit the issue on sarama v1.29.1; any updates on a fix?

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x1ded78f]

goroutine 799 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0045287e0, 0x1)
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:617 +0x1cf
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0045287e0)
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:561 +0x589
github.com/Shopify/sarama.withRecover(0xc003fcde00)
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/utils.go:43 +0x49
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	/root/go/src/git.garena.com/shopee/seller-server/crm/crm-proactive-push/vendor/github.com/Shopify/sarama/async_producer.go:512 +0x1aa

@edoger

edoger commented Dec 19, 2022

I have a mitigation that currently works: set a panic handler and rebuild all producers whenever a panic is caught. Because a panic handler is set, the producer does not automatically close when it crashes internally; instead, its input channel quickly fills up and blocks, so a producer rebuild is also required when that condition is detected.
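
A rough sketch of that mitigation, assuming sarama's package-level PanicHandler hook; the rebuild channel, helper names, and timeout are illustrative, and the code that actually recreates the producer is not shown:

package producerguard

import (
	"log"
	"time"

	"github.com/Shopify/sarama"
)

// rebuild is a hypothetical signal; code elsewhere (not shown) is expected to
// drain it, close the old AsyncProducer, and create a new one.
var rebuild = make(chan struct{}, 1)

func requestRebuild() {
	select {
	case rebuild <- struct{}{}:
	default: // a rebuild is already pending
	}
}

func init() {
	// With PanicHandler set, sarama recovers panics in its internal goroutines
	// instead of crashing the process; the producer therefore does not close
	// itself, but the dead goroutine means its input channel can fill up.
	sarama.PanicHandler = func(r interface{}) {
		log.Printf("sarama goroutine panicked: %v; scheduling producer rebuild", r)
		requestRebuild()
	}
}

// TrySend enqueues a message, treating a prolonged block on the input channel
// as a sign that the producer internals have died and a rebuild is needed.
func TrySend(producer sarama.AsyncProducer, msg *sarama.ProducerMessage) bool {
	select {
	case producer.Input() <- msg:
		return true
	case <-time.After(2 * time.Second):
		requestRebuild()
		return false
	}
}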

@asddongmen

Got the same issue:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0x12b2609]

goroutine 74332 [running]:
github.com/Shopify/sarama.(*partitionProducer).newHighWatermark(0xc0c18b9140, 0x3)
	github.com/Shopify/sarama@v1.36.0/async_producer.go:620 +0x1a9
github.com/Shopify/sarama.(*partitionProducer).dispatch(0xc0c18b9140)
	github.com/Shopify/sarama@v1.36.0/async_producer.go:564 +0x537
github.com/Shopify/sarama.withRecover(0xc1ea43e580?)
	github.com/Shopify/sarama@v1.36.0/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*asyncProducer).newPartitionProducer
	github.com/Shopify/sarama@v1.36.0/async_producer.go:515 +0x1ea

@asddongmen

Seems this issue was fixed by: fix: add nil check while producer is retrying (

@github-actions

Thank you for taking the time to raise this issue. However, it has not had any activity on it in the past 90 days and will be closed in 30 days if no updates occur.
Please check if the main branch has already resolved the issue since it was raised. If you believe the issue is still valid and you would like input from the maintainers then please comment to ask for it to be reviewed.

github-actions bot added the "stale" label (Issues and pull requests without any recent activity) on Aug 21, 2023
@dnwe
Collaborator

dnwe commented Aug 21, 2023

Closing as fixed by #2387

dnwe closed this as completed on Aug 21, 2023
dnwe removed the "stale" label on Aug 21, 2023