
fix: prevent deadlock between subscription manager and consumer goroutines #2194

Merged
merged 4 commits into IBM:main from fix-consumer-subscription-manager on Apr 13, 2022

Conversation

niamster
Contributor

Since the subscription manager was improved to batch subscriptions (see dadcd80),
a deadlock was introduced when new subscriptions are added after a rebalance.

After a rebalance, the subscription manager gets stuck pushing to the wait channel, which is drained only when there are no active subscriptions.
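
To make the failure mode concrete, here is a minimal, self-contained sketch of that pattern (illustrative only, not Sarama's actual code; all names are made up):

package main

import (
	"fmt"
	"time"
)

func main() {
	newSubscriptions := make(chan string) // partition consumers register here
	wait := make(chan struct{})           // manager -> consumer "batch ready" signal

	active := 1 // pretend one subscription is still active after the rebalance

	// subscriptionManager: accept a subscription, then signal the consumer.
	go func() {
		for range newSubscriptions {
			wait <- struct{}{} // blocks forever: wait is drained only when active == 0
		}
	}()

	// subscriptionConsumer: drains wait only while it has no active subscriptions.
	go func() {
		for {
			if active == 0 {
				<-wait
			}
			time.Sleep(10 * time.Millisecond) // otherwise busy "fetching"
		}
	}()

	newSubscriptions <- "topic/0" // accepted; the manager is now stuck on wait <-

	select {
	case newSubscriptions <- "topic/1": // can never be delivered
		fmt.Println("no deadlock")
	case <-time.After(time.Second):
		fmt.Println("deadlock: subscriptionManager stuck pushing to wait")
	}
}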

consumer: avoid obscure sync between subscriptionManager and subscriptionConsumer threads

since the subscription manager was improved to batch subscriptions (see IBM@dadcd80)
it created a deadlock when new subscriptions are added after a rebalance
@niamster niamster changed the title consumer: avoid obscure sync between subscriptionManager and subscriptionConsumer threads consumer: fix deadlock between subscription Manager and Consumer threads Mar 23, 2022
@lavoiesl
Contributor

I was trying to troubleshoot another bug and wrote this:

import (
	"context"
	"fmt"
	"math/rand"
	"strconv"
	"testing"
	"time"

	"github.com/stretchr/testify/require"
)

type inlineHandler func(sess ConsumerGroupSession, claim ConsumerGroupClaim) error

func (inlineHandler) Setup(_ ConsumerGroupSession) error   { return nil }
func (inlineHandler) Cleanup(_ ConsumerGroupSession) error { return nil }
func (h inlineHandler) ConsumeClaim(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
	return h(sess, claim)
}

func TestConsumerGroupRebalance(t *testing.T) {
	const topic = "test-topic"

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	config := NewConfig()
	config.Producer.Return.Successes = true
	config.Consumer.Offsets.Initial = OffsetOldest
	c, err := NewClient([]string{"localhost:9092"}, config)
	require.NoError(t, err)

	admin, err := NewClusterAdminFromClient(c)
	require.NoError(t, err)

	// recreate the topic; errors are only logged since the topic may not exist yet
	err = admin.DeleteTopic(topic)
	t.Log(err)

	err = admin.CreateTopic(topic, &TopicDetail{NumPartitions: 4, ReplicationFactor: 1}, false)
	t.Log(err)

	p, err := NewSyncProducerFromClient(c)
	require.NoError(t, err)

	g, err := NewConsumerGroupFromClient("group", c)
	require.NoError(t, err)

	timer := time.NewTimer(1 * time.Minute)
	defer timer.Stop()

	// resets tells one handler goroutine to return, forcing a rebalance
	resets := make(chan struct{})

	go func() {
		for {
			select {
			case <-ctx.Done():
				return
			case <-timer.C:
				timer.Reset(time.Duration(rand.Float64()*100_000) * time.Microsecond)
				resets <- struct{}{}
			}
		}
	}()

	ch := make(chan string) // consumed message values, collected by the main goroutine
	for i := 0; i < 10; i++ {
		go func(g ConsumerGroup) {
			for {
				if ctx.Err() != nil { // context was cancelled
					t.Log("consumer-group stopped")
					break
				}

				err := g.Consume(ctx, []string{topic}, inlineHandler(func(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
					for {
						select {
						case <-resets:
							return nil
						case <-ctx.Done():
							return ctx.Err()
						case message, ok := <-claim.Messages():
							if !ok {
								return nil
							}
							ch <- string(message.Value)
							sess.MarkMessage(message, "")
						}
					}
				}))

				t.Logf("consumer-group disconnected: %s", err)
			}
		}(g)
	}

	time.Sleep(2 * time.Second)
	t.Log("booted")
	timer.Reset(0)

	n := 1000

	go func() {
		for i := 0; i < n; i++ {
			time.Sleep(time.Duration(rand.Float64()*1_000) * time.Microsecond)
			m := &ProducerMessage{
				Topic: topic,
				Value: StringEncoder(strconv.FormatInt(int64(i), 10)),
			}
			_, _, err := p.SendMessage(m)
			if err != nil {
				t.Error(err) // require.FailNow must not be called from a non-test goroutine
				return
			}
			//p.Input() <- m
		}
	}()

	m := make(map[string]struct{}, n)
	for len(m) < n {
		line := <-ch
		m[line] = struct{}{}
		t.Logf("received: %s, count: %d", line, len(m))
	}

	cancel()

	err = g.Close()
	require.NoError(t, err)

	err = p.Close()
	require.NoError(t, err)

	err = c.Close()
	require.NoError(t, err)
}

It introduces chaos, randomly sleeping and exiting claims to trigger rebalances.
The expected behaviour is to receive all 1,000 messages, with a bunch of random disconnects while consuming:

    consumer_group_test.go:110: booted
...
    consumer_group_test.go:143: received: 185, count: 186
    consumer_group_test.go:143: received: 186, count: 187
    consumer_group_test.go:103: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:103: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:143: received: 187, count: 188
    consumer_group_test.go:143: received: 189, count: 189
    consumer_group_test.go:103: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:103: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:143: received: 188, count: 190
    consumer_group_test.go:143: received: 193, count: 191
    consumer_group_test.go:143: received: 196, count: 192
...
    consumer_group_test.go:103: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:82: consumer-group stopped
    consumer_group_test.go:103: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:82: consumer-group stopped

When broken, it just gets stuck in a disconnect loop:

    consumer_group_test.go:110: booted
    consumer_group_test.go:104: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:104: consumer-group disconnected: %!s(<nil>)
    consumer_group_test.go:104: consumer-group disconnected: %!s(<nil>)
...

I verified that:

  • It works fine on v1.32.0
  • It fails on latest main (f07b7b8)
  • It works fine on latest main, but with dadcd80 reverted

However, I tried applying this patch and it didn't seem to fix the issue; the test just gets stuck in the "disconnected" loop.

@niamster
Contributor Author

Thanks @lavoiesl for writing a unit test ❤️. I will try to reproduce the problem with it and see what is wrong.

@niamster
Contributor Author

@lavoiesl If I understand correctly, cg.Consume should not be called concurrently from multiple goroutines, because it uses the same member ID (and some internals of the cg structures are shared). I'm afraid you might be seeing a different problem.
Were you able to reproduce this issue with a real program with and without this patch?

Since I'm back on this PR, I will try to write a functional test that illustrates the original problem I was trying to fix.
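
For reference, the usual pattern, following the consumer-group example shipped with Sarama, calls Consume in a loop from a single goroutine per ConsumerGroup. A sketch using unqualified names, as in the in-package tests above (assumes "context" is imported):

// consumeLoop re-enters Consume after every rebalance, but only ever from
// one goroutine per ConsumerGroup; Consume itself blocks for a whole session.
func consumeLoop(ctx context.Context, g ConsumerGroup, topics []string, h ConsumerGroupHandler) error {
	for {
		// Consume returns when the session ends, e.g. on a rebalance.
		if err := g.Consume(ctx, topics, h); err != nil {
			return err
		}
		if ctx.Err() != nil { // context cancelled: shut down for good
			return ctx.Err()
		}
	}
}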

@niamster
Contributor Author

Unfortunately, I was not able to create a simple test case that robustly reproduces the issue with consumer groups.
However, I was able to reproduce it consistently with a generic consumer.

With Kafka launched via docker-compose, if I run

TOXIPROXY_ADDR=localhost:8474 KAFKA_VERSION=3.1.0 DEBUG=true go test -v -timeout 20s -tags functional -run TestConsumerGroupDeadlock ./

I can reproduce the deadlock:

goroutine 53 [chan send]:
github.com/Shopify/sarama.(*brokerConsumer).subscriptionManager(0xc0001ba5f0)
        /Users/dmytro.milinevsky/dev/sarama/consumer.go:915 +0x266
github.com/Shopify/sarama.withRecover(0x0)
        /Users/dmytro.milinevsky/dev/sarama/utils.go:43 +0x3e
created by github.com/Shopify/sarama.(*consumer).newBrokerConsumer
        /Users/dmytro.milinevsky/dev/sarama/consumer.go:869 +0x19f

@pavius @dnwe can you please review this PR 🙏 ?

@lavoiesl
Contributor

Fair point, here's a similar version, but with different clients:

	p, err := NewSyncProducerFromClient(c)
	require.NoError(t, err)

	n := 1000

	ch := make(chan string, n)
	for i := 0; i < 27; i++ {
		go func(i int) {
			time.Sleep(time.Duration(i) * time.Second)
			for {
				if ctx.Err() != nil { // context was cancelled
					t.Log("consumer-group stopped")
					break
				}

				config := NewConfig()
				config.ClientID = fmt.Sprintf("test-%d", i)
				config.Consumer.Offsets.Initial = OffsetOldest
				config.Consumer.Group.Heartbeat.Interval = 100 * time.Millisecond
				config.Consumer.Group.Rebalance.Timeout = 100 * time.Millisecond
				g, err := NewConsumerGroup([]string{"sarama.railgun:9092"}, "group", config)
				if err != nil {
					t.Error(err) // avoid require.FailNow from a non-test goroutine
					return
				}

				alive := true

				for alive {
					err = g.Consume(ctx, []string{topic}, inlineHandler(func(sess ConsumerGroupSession, claim ConsumerGroupClaim) error {
						for {
							select {
							case <-ctx.Done():
								return ctx.Err()
							case message, ok := <-claim.Messages():
								if !ok {
									return nil
								}
								if rand.Float64() < 0.05 {
									alive = false
									return errors.New("reset")
								}
								ch <- fmt.Sprintf("%d-%s", i, message.Value)
								sess.MarkMessage(message, "")
							}
						}
					}))
					t.Logf("consumer-group disconnected: %s", err)
				}

				if err := g.Close(); err != nil {
					t.Error(err) // avoid require.FailNow from a non-test goroutine
				}
			}
		}(i)
	}

	time.Sleep(5 * time.Second)
	t.Log("booted")

	go func() {
		for i := 0; i < n; i++ {
			m := &ProducerMessage{
				Topic: topic,
				Value: StringEncoder(strconv.FormatInt(int64(i), 10)),
			}
			_, _, err := p.SendMessage(m)
			t.Logf("send message %d: %s", i, err)
		}
	}()

	m := make(map[string]struct{}, n)
	for len(m) < n {
		line := <-ch
		m[line] = struct{}{}
		t.Logf("received: %s, count: %d", line, len(m))
	}

	cancel()

	err = p.Close()
	require.NoError(t, err)

	err = c.Close()
	require.NoError(t, err)

It eventually succeeds, but with zillions of disconnects.
On my machine, the test completes in 13s on v1.32.0, but takes 1m40s on latest main.


No, I do not have a real-world example of this happening; I don't have any production apps that use Sarama's unreleased versions.

@niamster
Contributor Author

@lavoiesl thanks! I think the extra delay I introduced (250ms) causes the test to run slower. I think it's fair to lower the delay to, say, 100ms.
I will think about what other kind of synchronization mechanism could be implemented so that future code changes can't easily break this. Although I'm afraid that would require a bigger refactoring of this code.
Let's see what the Sarama maintainers say :). What do you think?

@niamster
Contributor Author

@lavoiesl I think I've spotted another regression: the loop creates a new timer each time a new subscription is added, so the overall "wait" time grows by 250ms with every subscription.
I've pushed another commit that fixes this; additionally, I've reduced the nap time to 100ms (I think that's reasonable).
Can you please give it a try with your test? Thanks!
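
To illustrate the regression (a hypothetical sketch of the batching loop, not the PR's actual diff; batchSubscriptions, flush, and batchWait are made-up names): re-arming the timer on every arrival keeps pushing the flush deadline out, so a steady trickle of subscriptions postpones the batch indefinitely, whereas arming it once per batch bounds the wait at batchWait.

// batchSubscriptions collects incoming subscriptions and flushes them in
// batches; assumes "time" is imported.
func batchSubscriptions(subs <-chan string, flush func([]string)) {
	const batchWait = 100 * time.Millisecond // this PR also lowers the delay from 250ms

	var (
		pending []string
		timer   *time.Timer
		timeout <-chan time.Time // nil channel blocks in select until a batch starts
	)
	for {
		select {
		case sub, ok := <-subs:
			if !ok {
				return
			}
			pending = append(pending, sub)
			if timer == nil {
				timer = time.NewTimer(batchWait) // fixed: arm once per batch
				timeout = timer.C
			}
			// the regression effectively re-armed the timer here on every
			// subscription, extending the wait by batchWait each time
		case <-timeout:
			flush(pending)
			pending, timer, timeout = nil, nil, nil
		}
	}
}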

@dnwe dnwe force-pushed the fix-consumer-subscription-manager branch from 21d1414 to e0c263a Compare April 13, 2022 21:34
@dnwe dnwe added the fix label Apr 13, 2022
@dnwe dnwe changed the title consumer: fix deadlock between subscription Manager and Consumer threads fix: prevent deadlock between subscription manager and consumer goroutines Apr 13, 2022
@dnwe
Collaborator

LGTM — the changes look reasonable to me now, let's go ahead and merge this and continue to perform testing on main prior to cutting the next release

@dnwe dnwe merged commit 8f8d8da into IBM:main Apr 13, 2022
@niamster
Contributor Author

LGTM — the changes look reasonable to me now, let's go ahead and merge this and continue to perform testing on main prior to cutting the next release

Thanks a lot Dominic
