panic during rapid send of many large-sized payloads: "panic: send on closed channel" #239
Comments
Thanks @jinfengnarvar for the feedback. It seems that we did not close the channel properly somewhere, causing the application to continue writing data to a closed channel.
I will try to reproduce the issue; you are also welcome to provide reproducible code.
@wolfstudy thx. I'm also trying to find a (more) deterministic way to repro it - right now it only occurs in our prod env service that processes multiple large files at the same time. The crash occurs frequently but not in a deterministic way. Also, I'm wondering if anyone has any insight into what the repeated blocks of "failed to write...; connection closed...; reconnecting to broker....;" failure logs indicate? Apparently these logs come from within the pulsar go client. My question is: we're in the midst of writing to producer partitions, so what would usually cause the connection to be severed? The broker being overloaded (due to volume or payload size)? Can anything on the Pulsar server side be configured to reveal the reason for the connection drop?
A few more findings after placing some logs in our code:
Notes:
Now the logging we've obtained:
As you can see:
Update: I enabled more logging and found that the payload we're sending right before the crash occurs is 27MB!! That object is incorrect; it shouldn't be that big. That said, do we want to make the pulsar go client more robust, so that it fails and returns an err to us instead of panicking/crashing?
This means that the current connection between the client and the broker is closed, and the client attempts to establish a connection with the broker again within the
The log level is
So the root of the problem is the failure of
By default, the maximum message size allowed by the broker is 5MB, as follows:
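Presumably this refers to the `maxMessageSize` entry in `broker.conf`; the value shown below (in bytes) is an assumption based on the stock default:

```
# broker.conf (assumed default): maximum size of a single message, in bytes (5MB)
maxMessageSize=5242880
```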
The best approach is to check the size of the message in the Go client; if it exceeds 5MB, we will not process it, following the Java client logic: https://github.com/apache/pulsar/blob/master/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ClientCnx.java#L294-L302
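A minimal sketch of what such a check might look like on the Go side, mirroring the intent of the linked Java logic; the constant, function name, and error message are illustrative, not the client's actual API:

```
package pulsarsize

import "fmt"

// maxMessageSize mirrors the broker's default 5MB limit (illustrative constant,
// not the client's actual configuration).
const maxMessageSize = 5 * 1024 * 1024

// checkMessageSize rejects payloads the broker would refuse anyway, so the
// caller gets an error back instead of a dropped connection or a panic.
func checkMessageSize(payload []byte) error {
	if len(payload) > maxMessageSize {
		return fmt.Errorf("message of %d bytes exceeds the %d byte limit", len(payload), maxMessageSize)
	}
	return nil
}
```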
Yes, having a size check in the client library is good. But I'm still concerned about the crash. Even without the check, when we (our code) ask the client to send something illegal (in this case an over-the-limit payload), it should fail gracefully with no panic. It seems like there is some race condition in the client that caused this panic.
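For illustration, the calling pattern being described is roughly the following; the topic, client setup, payload size, and the `Send` signature returning `(MessageID, error)` are assumptions (older client versions returned only an error). The point is that an over-the-limit payload should surface as `err` here rather than panic inside the client:

```
package main

import (
	"context"
	"log"

	"github.com/apache/pulsar-client-go/pulsar"
)

func main() {
	client, err := pulsar.NewClient(pulsar.ClientOptions{URL: "pulsar://localhost:6650"})
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	producer, err := client.CreateProducer(pulsar.ProducerOptions{Topic: "orders"})
	if err != nil {
		log.Fatal(err)
	}
	defer producer.Close()

	// A payload in the size range described above (~150-300KB after protobuf encoding).
	payload := make([]byte, 200*1024)

	// Expectation: an illegal payload (e.g. over the broker's limit) comes back
	// as err, never as a panic inside the client library.
	if _, err := producer.Send(context.Background(), &pulsar.ProducerMessage{Payload: payload}); err != nil {
		log.Printf("send failed: %v", err)
	}
}
```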
Yes, I agree with you; even if we checked the size of the message, the error
Fixes #239

### Motivation

As @wolfstudy pointed out here #239 (comment), there is a race on the callbacks of `pendingReqs` when closing the connection while the run loop is still running, which can lead to a callback being called up to 2 times: https://github.com/apache/pulsar-client-go/blob/e7f1673350f208b5063823282d14906d70d66904/pulsar/internal/connection.go#L669-L671

### Modifications

Introduce a `runLoopStoppedCh` to make sure that the run loop has already stopped before the callbacks in `pendingReqs` are cleaned up in `Close()`.
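A rough sketch of that synchronization, heavily simplified from the real `connection` type (the `closeCh` field and the `sync.Once` guard are illustrative assumptions, not the actual implementation):

```
package connsketch

import (
	"errors"
	"sync"
)

// request is a simplified stand-in for a pending request and its completion callback.
type request struct {
	callback func(err error)
}

type connection struct {
	pendingLock      sync.Mutex
	pendingReqs      map[uint64]*request
	closeOnce        sync.Once
	closeCh          chan struct{} // signals the run loop to exit
	runLoopStoppedCh chan struct{} // closed once run() has fully returned
}

// run is the event loop; closing runLoopStoppedCh on exit tells Close that no
// other goroutine is still touching pendingReqs.
func (c *connection) run() {
	defer close(c.runLoopStoppedCh)
	for {
		select {
		case <-c.closeCh:
			return
			// case req := <-c.incomingRequestsCh: ... handle the request ...
		}
	}
}

// Close waits for the run loop to stop before failing the pending callbacks,
// so each callback fires exactly once instead of racing with the run loop.
func (c *connection) Close() {
	c.closeOnce.Do(func() { close(c.closeCh) })
	<-c.runLoopStoppedCh

	c.pendingLock.Lock()
	defer c.pendingLock.Unlock()
	for id, req := range c.pendingReqs {
		req.callback(errors.New("connection closed"))
		delete(c.pendingReqs, id)
	}
}
```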
Signed-off-by: xiaolongran <xiaolongran@tencent.com>

### Motivation

In `internalSendRequest`, we add the request to be sent to the `pendingReqs` map even when the current connection status is `connectionClosed`, which causes the current request's callback to be called twice.

First:

```
func (c *connection) internalSendRequest(req *request) {
	c.pendingLock.Lock()
	if req.id != nil {
		c.pendingReqs[*req.id] = req
	}
	c.pendingLock.Unlock()
	if c.getState() == connectionClosed {
		c.log.Warnf("internalSendRequest failed for connectionClosed")
		// In Here, call req.callback *************
		if req.callback != nil {
			req.callback(req.cmd, ErrConnectionClosed)
		}
	} else {
		c.writeCommand(req.cmd)
	}
}
```

Second:

```
func (c *connection) run() {
	// All reads come from the reader goroutine
	go c.reader.readFromConnection()
	go c.runPingCheck()

	c.log.Debugf("Connection run starting with request capacity=%d queued=%d",
		cap(c.incomingRequestsCh), len(c.incomingRequestsCh))

	defer func() {
		// all the accesses to the pendingReqs should be happened in this run loop thread,
		// including the final cleanup, to avoid the issue #239
		c.pendingLock.Lock()
		for id, req := range c.pendingReqs {
			// In Here, call req.callback **********
			req.callback(nil, errors.New("connection closed"))
			delete(c.pendingReqs, id)
		}
		c.pendingLock.Unlock()

		c.Close()
	}()

	....
}
```

In fact, when the current connection is in the `connectionClosed` state, we don't need to append the request to the `pendingReqs` map, so we don't need to process the request when it's closed.

### Modifications

When the connection is closed, the current request to be sent is not added to the `pendingReqs` map.
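A sketch of the described modification, condensed from the snippet above (it reuses the same identifiers and is not the verbatim patch): the closed-state check happens before the request is registered, so the cleanup loop in `run()` can never see it and the callback fires only once.

```
func (c *connection) internalSendRequest(req *request) {
	c.pendingLock.Lock()
	if c.getState() == connectionClosed {
		c.pendingLock.Unlock()
		// Fail fast: the request is never added to pendingReqs, so the run
		// loop's cleanup cannot invoke its callback a second time.
		c.log.Warnf("internalSendRequest failed for connectionClosed")
		if req.callback != nil {
			req.callback(req.cmd, ErrConnectionClosed)
		}
		return
	}
	if req.id != nil {
		c.pendingReqs[*req.id] = req
	}
	c.pendingLock.Unlock()

	c.writeCommand(req.cmd)
}
```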
Expected behavior
Sending a large number of large payloads via `Producer.Send()` shouldn't panic/crash.

Actual behavior
Need some insight from you guys as to whether this rings any bell. We've been using the pure golang client for a few weeks now with good results, until we recently onboarded a new customer who sends in large order files, each of which contains many large records.
Typically a record after protobuf encoding is about 150KB to 300KB, and we have millions of these records. Although the send is called in a single thread right now, when we start rapidly calling Send, our logs suddenly show this:
Steps to reproduce
Not sure how this can be repro'ed on your side. I'm reaching out to your rep's slack channel within our company to investigate as well.
System configuration
Pulsar version: github.com/apache/pulsar-client-go v0.0.0-20200316114055-4dc7855f99dc
Go version: 1.12