Simpler offset management, fixed minor race #1127
Conversation
@eapache pls see comments below
offset_manager.go (Outdated)
om.pomsMu.Lock()
defer om.pomsMu.Unlock()

for _, topicManagers := range om.poms {
Should global commit errors be propagated to each individual POM or just to the first one?
🤔 all of them still, I think
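A minimal sketch of the "all of them" behaviour, using simplified stand-ins for the real structs (handleError here is an assumption for illustration, not necessarily the actual method name):

// Simplified sketch, not the actual sarama types.
type partitionOffsetManager struct {
    topic     string
    partition int32
    err       error // last commit error, for illustration only
}

func (pom *partitionOffsetManager) handleError(err error) {
    pom.err = err // hypothetical: a real POM would forward this to its Errors() channel
}

// Propagate a global commit error to every POM, not just the first one found.
func propagateError(poms map[string]map[int32]*partitionOffsetManager, err error) {
    for _, topicManagers := range poms {
        for _, pom := range topicManagers {
            pom.handleError(err)
        }
    }
}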
CoordinatorHost: "127.0.0.1",
CoordinatorPort: newCoordinator.Port(),
})
// No error, no need to refresh coordinator
I changed the logic to only RefreshCoordinator after errors and not on every commit attempt.
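Roughly, the new flow is "commit first, refresh only if that fails". A sketch of that idea, with coordinator(), handleError() and handleResponse() as assumed helper names:

func (om *offsetManager) flushToBroker() {
    // use the cached coordinator; no RefreshCoordinator on the happy path
    broker, err := om.coordinator()
    if err != nil {
        om.handleError(err)
        return
    }

    resp, err := broker.CommitOffset(om.constructRequest())
    if err != nil {
        om.handleError(err)
        // only now force a coordinator refresh, so the next attempt re-dispatches
        om.releaseCoordinator(broker)
        _ = broker.Close()
        return
    }

    om.handleResponse(broker, resp)
}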
pom.lock.Lock()
defer pom.lock.Unlock()

for pom.dirty {
So I think this has actually never worked properly, as you are creating a deadlock here. The state is blocked for as long as pom.dirty is set, but the flag will never actually change to false, nor will pom.clean ever fire, because pom.updateCommitted will never be able to acquire pom.lock.Lock().
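To make the deadlock concrete, a stripped-down illustration (clean as a channel is an assumption; the real field may differ):

type pomState struct {
    lock  sync.Mutex
    dirty bool
    clean chan struct{}
}

func (p *pomState) waitForClean() {
    p.lock.Lock()
    defer p.lock.Unlock() // the lock is held for the entire wait

    for p.dirty {
        <-p.clean // blocks while still holding p.lock ...
    }
}

func (p *pomState) updateCommitted() {
    p.lock.Lock() // ... so this Lock() never succeeds,
    defer p.lock.Unlock()

    p.dirty = false // dirty never becomes false,
    close(p.clean)  // and clean never fires
}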
🤦♂️ nice find
offset_manager.go (Outdated)
poms     map[string]map[int32]*partitionOffsetManager
boms     map[*Broker]*brokerOffsetManager
broker   *Broker
brokerMu sync.RWMutex
For code style consistency I'd prefer these be called brokerLock, pomsLock, etc.
offset_manager.go (Outdated)
delete(om.boms, bom.broker)
for _, topicManagers := range om.poms {
Just for organization I'd prefer handleResponse or something as a separate method to mirror constructRequest.
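Something along these lines, presumably (a sketch; pomsLock, the topic/partition/offset/metadata fields and the per-error handling are assumptions):

func (om *offsetManager) handleResponse(broker *Broker, resp *OffsetCommitResponse) {
    om.pomsLock.RLock()
    defer om.pomsLock.RUnlock()

    for _, topicManagers := range om.poms {
        for _, pom := range topicManagers {
            err, ok := resp.Errors[pom.topic][pom.partition]
            if !ok {
                continue // this POM was not part of the request
            }

            switch err {
            case ErrNoError:
                pom.updateCommitted(pom.offset, pom.metadata)
            case ErrNotLeaderForPartition, ErrLeaderNotAvailable,
                ErrConsumerCoordinatorNotAvailable, ErrNotCoordinatorForConsumer:
                // not a critical error, we just need to redispatch
                om.releaseCoordinator(broker)
            default:
                pom.handleError(err)
            }
        }
    }
}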
offset_manager.go (Outdated)
om.asyncClosePOMs()

// flush one last time
for retries := om.conf.Metadata.Retry.Max; true; {
I think it makes more sense to define a separate set of Retry configurations in Consumer.Offsets?
I'm not sure; the main reason this would fail is a coordinator change, which means we need to refresh metadata and try again. I can add more config options, but it seems like overkill.
Hmm, good point, that wasn't obvious. However, the metadata refresh itself will retry this many times internally within the client, so by using this value here you're technically allowing the square of it as retries. You're also not sleeping the way the Metadata.Retry struct might imply. I still think this might make the most sense as Consumer.Offsets.MaxShutdownAttempts or something; it's not the same thing as refreshing metadata.
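For illustration, such a dedicated setting could look roughly like this (these names are only the ones floated in the discussion, not an existing option at this point):

// Hypothetical sketch of a dedicated setting under Consumer.Offsets,
// separate from Metadata.Retry.
type OffsetsConfig struct {
    CommitInterval time.Duration

    // Retry budget for the final flush performed by OffsetManager.Close()
    // (or, alternatively, a single MaxShutdownAttempts field).
    Retry struct {
        Max int
    }
}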
select {
case <-om.ticker.C:
    om.flushToBroker()
    om.releasePOMs(false)
It took me a while to figure out what this was doing, and I'm not much of a fan of this pattern. An updateSubscriptions channel (as in the old BOM) seems easier to follow and less likely to result in races/deadlocks than polling a piece of state in all the POMs?
I thought about updateSubscriptions, but in my opinion it doesn't make things any simpler. The problem is that the eviction of POMs is closely linked to the flush cycles, and we still need to coordinate those on OffsetManager.Close, i.e. keep retrying the flush until either Retry.Max is reached or all POMs are clean. I don't see how to implement this with an updateSubscriptions channel; we would have to do something like:
func (om *offsetManager) Close() error {
    om.closeOnce.Do(func() {
        // exit the mainLoop
        close(om.closing)
        <-om.closed

        // mark all POMs as closed
        om.asyncClosePOMs()

        deadline := time.NewTimer(10 * time.Second) // just an example
        defer deadline.Stop()
        retry := time.NewTicker(100 * time.Millisecond) // just an example
        defer retry.Stop()

    FlushLoop:
        for {
            om.flushToBroker()

            select {
            case <-deadline.C:
                break FlushLoop
            case pom := <-om.updateSubscription:
                om.pomsLock.Lock()
                delete(om.poms[pom.topic], pom.partition)
                if len(om.poms[pom.topic]) == 0 {
                    delete(om.poms, pom.topic)
                }
                remaining := len(om.poms)
                om.pomsLock.Unlock()

                if remaining == 0 {
                    break FlushLoop
                }
            case <-retry.C:
            }
        }

        // TODO: abandon any remaining POMs
        om.brokerLock.Lock()
        om.broker = nil
        om.brokerLock.Unlock()
    })
    return nil
}
Even with a helper method, I don't think this is easier to follow than the polling.
For OffsetManager.Close I think you're overthinking things. Instead of exiting the mainloop and then manually looping the flush, just call pom.AsyncClose() on all of them (as you already do), set a flag, and then wait for a waitgroup. The mainloop can respect the flag to count down the number of remaining flushes.
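A rough sketch of that suggestion (closingFlag, closeWait and remainingPOMs are assumptions for illustration, not the PR's actual fields; closeWait.Add(1) is assumed to happen before mainLoop is started):

func (om *offsetManager) Close() error {
    om.closeOnce.Do(func() {
        om.asyncClosePOMs()                   // mark all POMs as closed, as before
        atomic.StoreInt32(&om.closingFlag, 1) // tell the mainloop we are shutting down
        om.closeWait.Wait()                   // wait until the mainloop has flushed everything
    })
    return nil
}

func (om *offsetManager) mainLoop() {
    defer om.closeWait.Done()

    for range om.ticker.C {
        om.flushToBroker()
        om.releasePOMs(false)

        // once Close() has set the flag, exit as soon as no POMs remain
        if atomic.LoadInt32(&om.closingFlag) == 1 && om.remainingPOMs() == 0 {
            return
        }
    }
}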
Letting mainLoop do the work after calling Close() could delay things significantly, depending on CommitInterval. I'm not a fan of this; I prefer exiting the mainLoop to attempt an immediate final flush (with instant retries). The current approach will also simplify the integration with ClusterClient.
OK. It's not clear to me how this will simplify the ClusterClient integration, but I don't feel too strongly about this.
case ErrNotLeaderForPartition, ErrLeaderNotAvailable,
    ErrConsumerCoordinatorNotAvailable, ErrNotCoordinatorForConsumer:
    // not a critical error, we just need to redispatch
    om.releaseCoordinator(broker)
In cases like this I wonder if we should be doing something to trigger another dispatch before the next timer tick? There's no point in waiting really.
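One possible shape for that (a hypothetical buffered retrigger channel, not something in this PR):

// Error path: ask for an immediate re-dispatch instead of waiting for the ticker.
func (om *offsetManager) scheduleFlush() {
    select {
    case om.retrigger <- struct{}{}: // retrigger is assumed to be a buffered chan struct{}
    default: // a flush is already pending, nothing to do
    }
}

// mainLoop would then select on both the ticker and the retrigger channel:
//
//   select {
//   case <-om.ticker.C:
//   case <-om.retrigger:
//   }
//   om.flushToBroker()
//   om.releasePOMs(false)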
I think it's not the end of the world if we miss a cycle and wait for mainLoop to trigger the next flush. At the same time, we currently already retry immediately if triggered by Close.
I'm pretty happy with this now, if it suits your needs. CI has detected a race condition though.
Nice, thanks!
This is a necessary prerequisite for #1099, please also see the inline comments in the diff.