Clean callbacks of connection after run loop stopped (#239) #248

rueian · 2020-05-13T18:43:28Z

Fixes #239

Motivation

As @wolfstudy pointed out here #239 (comment)

There is a race on callbacks of pendingReqs when closing the connection while the run loop is still running, which will lead to calling a callback up to 2 times:

pulsar-client-go/pulsar/internal/connection.go

Lines 669 to 671 in e7f1673

    
           for _, req := range c.pendingReqs { 
        
           	req.callback(nil, errors.New("connection closed")) 
        
           }

Modifications

Introducing a runLoopStoppedCh to make sure that the run loop has already stopped when cleaning callbacks of pendingReqs in the Close()

cckellogg · 2020-05-14T00:15:49Z

pulsar/internal/connection.go

 	c.pingTicker.Stop()
 	c.pingCheckTicker.Stop()

 	for _, listener := range c.listeners {
 		listener.ConnectionClosed()
 	}

-	for _, req := range c.pendingReqs {
+	if c.runLoopStoppedCh != nil {


can you reuse the close channel here? Why create another channel?

Do you mean reusing the closeCh?

It seems that the closeCh is a signal to instruct the run() loop to exit. However we would not know when the run() has finished its job without a dedicated signal.

Therefore, the runLoopStoppedCh is the dedicated signal and is responsible to indicating the run() loop stopped.
When runLoopStoppedCh closed, we are able do further cleanup resources shared with the run() loop.

Why does the run loop have to exit before removing the pending callback? How could a callback be called twice?

As @wolfstudy pointed out, the problem comes from theinternalWriteData may call on Close()
https://github.com/apache/pulsar-client-go/blob/master/pulsar/internal/connection.go#L331

While the run loop will go to next iteration no matter that the Close() is called inside the loop or from another goroutine. At that point, the run loop can go into incomingCmdCh branch even if the c.closeCh had been closed.

If the run loop goes into the incomingCmdCh branch, then the callbacks in the pendingReqs being called in the Close() will be called second time.

So the issue is connection.Close() is called multiple times?

After Close() is the connection struct ever used again?

If we want to ensure Close() is only called once can we use a sync.Once and do something similar to what the consumers do?

https://github.com/apache/pulsar-client-go/blob/master/pulsar/consumer_impl.go#L362

I'm not sure if order is needed (pendingReqs close callbacks must happen before the others c.listeners and c.consumerHandlers callbacks) maybe @merlimat would know better.

I think moving just the pendingReqs into the run loops is needed. The other structures are protected with locks. Have you tried moving the pendingReqs code from Close() to the run loop and then running some tests?

Yes, moving just the pendingReqs into the run loops will still pass the existing tests, but it is really hard to write new race tests without to be flaky.

I would suggest we keep cleanup code at only one place, and they would be better triggered by the run loop. However the Close() would become an async operation.

Hi @merlimat, would you have any feedback?

After tracing the call path of the Close(), I found that it is called directly by the Close() in the client_impl.go.

When performing a graceful shutdown, although normally we will first call the producer.Close()/consumer.Close() and then call client.Close() at the end. I think it is still a good idea to keep the Close() as a sync operation to remain similar behavior between the producer.Close()/consumer.Close().

We could move the cleanup code of c.pendingReqs, c.listeners and c.consumerHandlers into the run loop. However in that case, I would suggest keeping the runLoopStoppedCh channel to inform the Close() that the cleanup process has finished.

What do you think? @cckellogg

Do you see the run loop ever getting stuck for a period of time when close is called? I think we should probably only pendingReqs cleanup in the run loop since that map is only accessed in that go routine . We could keep the others where they are since they are protected by locks. We just need to add some comments on why this is the case and why the runLoopStoppedCh is needed.

Thoughts?

Do you see the run loop ever getting stuck for a period of time when close is called?

No, they are separated originally. To keep the original behavior, the runLoopStoppedCh is not needed if we move the pendingReqs cleanup at the end of run loop.

I just updated the PR which moves pendingReqs cleanup into the run loop and adds some notices.

Please let me know if it looks good to you.

wolfstudy · 2020-06-09T01:22:56Z

The change LGTM +1, ping @cckellogg @merlimat PTAL

cckellogg · 2020-06-09T07:06:38Z

pulsar/internal/connection.go

-	if c.cnx != nil {
-		c.cnx.Close()
-	}
+	c.TriggerClose()


If a client calls this won't it trigger another call to Close() if the run loop is still going. This would cause all the listeners to be triggered again. Are there potential issues with that?

This method is wrapped with a mutex and a c.state == connectionClosed check.
I think another call to Close() will have no effect.

wolfstudy · 2020-06-11T10:12:28Z

Move the milestones to 0.2.0

rueian force-pushed the call-pending-callback-after-run-loop branch 2 times, most recently from a6b5467 to 38a16c2 Compare May 13, 2020 19:22

cckellogg reviewed May 14, 2020

View reviewed changes

Call pending callbacks of connection after run loop stopped (apache#239)

9702aff

rueian force-pushed the call-pending-callback-after-run-loop branch from 38a16c2 to 9702aff Compare May 14, 2020 10:14

wolfstudy assigned rueian May 15, 2020

wolfstudy requested review from merlimat and wolfstudy May 15, 2020 05:37

Fixed race condition in TriggerClose with sync.Once

682abb4

rueian changed the title ~~Call pending callbacks of connection after run loop stopped (#239)~~ Clean pending callbacks of connection after run loop stopped (#239) May 19, 2020

rueian changed the title ~~Clean pending callbacks of connection after run loop stopped (#239)~~ Clean callbacks of connection after run loop stopped (#239) May 19, 2020

Move cleanup of connection.pendingReqs to the end of the run()

533441b

wolfstudy added this to the 0.1.1 milestone Jun 9, 2020

wolfstudy approved these changes Jun 9, 2020

View reviewed changes

cckellogg reviewed Jun 9, 2020

View reviewed changes

wolfstudy modified the milestones: 0.1.1, 0.2.0 Jun 11, 2020

wolfstudy merged commit 5d57012 into apache:master Aug 17, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clean callbacks of connection after run loop stopped (#239) #248

Clean callbacks of connection after run loop stopped (#239) #248

rueian commented May 13, 2020 •

edited

Loading

cckellogg May 14, 2020

rueian May 14, 2020 •

edited

Loading

cckellogg May 14, 2020

rueian May 14, 2020 •

edited

Loading

cckellogg May 16, 2020

cckellogg May 18, 2020

rueian May 18, 2020

rueian May 22, 2020

cckellogg May 25, 2020

rueian May 27, 2020

wolfstudy commented Jun 9, 2020 •

edited

Loading

cckellogg Jun 9, 2020 •

edited

Loading

rueian Jun 28, 2020

wolfstudy commented Jun 11, 2020

	for _, req := range c.pendingReqs {
	req.callback(nil, errors.New("connection closed"))
	}

Clean callbacks of connection after run loop stopped (#239) #248

Clean callbacks of connection after run loop stopped (#239) #248

Conversation

rueian commented May 13, 2020 • edited Loading

Motivation

Modifications

Choose a reason for hiding this comment

rueian May 14, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

rueian May 14, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wolfstudy commented Jun 9, 2020 • edited Loading

cckellogg Jun 9, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

wolfstudy commented Jun 11, 2020

rueian commented May 13, 2020 •

edited

Loading

rueian May 14, 2020 •

edited

Loading

rueian May 14, 2020 •

edited

Loading

wolfstudy commented Jun 9, 2020 •

edited

Loading

cckellogg Jun 9, 2020 •

edited

Loading