Panic in memberlist client #3732

Closed
Logiraptor opened this issue Dec 14, 2022 · 5 comments · Fixed by #3746
Labels
bug (Something isn't working), memberlist

Comments

@Logiraptor
Contributor

Logiraptor commented Dec 14, 2022

Describe the bug

Observed a store-gateway panic during a recent internal incident:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x10 pc=0xb19a04]

goroutine 291 [running]:
github.com/hashicorp/memberlist.(*TransmitLimitedQueue).queueBroadcast(0x0, {0x2e90340, 0xc000691bc0}, 0x0)
	/backend-enterprise/vendor/github.com/hashicorp/memberlist/queue.go:183 +0x64
github.com/hashicorp/memberlist.(*TransmitLimitedQueue).QueueBroadcast(...)
	/backend-enterprise/vendor/github.com/hashicorp/memberlist/queue.go:165
github.com/grafana/dskit/kv/memberlist.(*KV).queueBroadcast(0xc00086b000, {0xc000c4fc10?, 0xc000c4fc10?}, {0xc0006c4130?, 0xc000383b80?, 0x4a?}, 0x72?, {0xc0004a79d0, 0x63, 0x63})
	/backend-enterprise/vendor/github.com/grafana/dskit/kv/memberlist/memberlist_client.go:1098 +0x191
github.com/grafana/dskit/kv/memberlist.(*KV).broadcastNewValue(0xc00086b000, {0xc000c4fc10, 0xb}, {0x2e94900?, 0xc00011cdc0}, 0x1, {0x2e90310, 0xc000873020})
	/backend-enterprise/vendor/github.com/grafana/dskit/kv/memberlist/memberlist_client.go:982 +0x813
github.com/grafana/dskit/kv/memberlist.(*KV).processValueUpdate(0xc00086b000, 0xc0006919e0, {0xc000c4fc10, 0xb})
	/backend-enterprise/vendor/github.com/grafana/dskit/kv/memberlist/memberlist_client.go:1073 +0x3ab
created by github.com/grafana/dskit/kv/memberlist.(*KV).getKeyWorkerChannel
	/backend-enterprise/vendor/github.com/grafana/dskit/kv/memberlist/memberlist_client.go:1036 +0x16a

To Reproduce

Not totally sure how to reproduce yet. This occurred during a period of extreme pressure on the read path. We also observed an overall increase in network-related failures during this time, so it's possible that's related.

Expected behavior

The store-gateway should not panic

Environment

  • Infrastructure: Kubernetes
  • Deployment tool: jsonnet
  • Observed in both r215 and r216 releases

Additional Context

(Grafana-internal link) https://ops.grafana-ops.net/a/grafana-incident-app/incidents/462

@pracucci added the bug and memberlist labels Dec 15, 2022
@pracucci
Collaborator

github.com/hashicorp/memberlist.(*TransmitLimitedQueue).queueBroadcast(0x0, {0x2e90340, 0xc000691bc0}, 0x0)
	/backend-enterprise/vendor/github.com/hashicorp/memberlist/queue.go:183 +0x64

In r215, at line 183 there's this:

q.mu.Lock()

Given q.mu is a plain sync.Mutex field and can never be nil itself, I guess the function queueBroadcast() was called on a nil instance of TransmitLimitedQueue (the q receiver), and the panic comes from dereferencing that nil receiver to lock the mutex.
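
For illustration, here's a minimal standalone sketch (a simplified stand-in type, not the real memberlist code) showing how a pointer-receiver method called on a nil receiver panics exactly like this as soon as it touches the embedded mutex:

package main

import "sync"

// queue is a simplified stand-in for memberlist.TransmitLimitedQueue, used
// only to illustrate the nil-receiver panic.
type queue struct {
	numNodes int
	mu       sync.Mutex
}

func (q *queue) queueBroadcast() {
	// Locking the mutex dereferences q; with a nil receiver this is the
	// "invalid memory address or nil pointer dereference" from the stack trace.
	q.mu.Lock()
	defer q.mu.Unlock()
}

func main() {
	var q *queue       // nil, like KV.broadcasts before KV.starting() has run
	q.queueBroadcast() // panic: runtime error: invalid memory address or nil pointer dereference
}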

So I think we have to look at the dskit memberlist.KV implementation, particularly this call:
https://github.com/grafana/dskit/blob/7cb51905074143259de6916be4bf3b54af47bdcc/kv/memberlist/memberlist_client.go#L1098

Can m.broadcasts be nil? m.broadcasts gets initialized in KV.starting() and is never reset back to nil:
https://github.com/grafana/dskit/blob/7cb51905074143259de6916be4bf3b54af47bdcc/kv/memberlist/memberlist_client.go#L461-L464

Following this theory, I'm wondering whether we could call KV.queueBroadcast() before KV.starting() has returned. KV.queueBroadcast() is called either when we KV.CAS() or when we receive an update from a peer (and rebroadcast it).

KV.CAS() has protection against that:
https://github.com/grafana/dskit/blob/7cb51905074143259de6916be4bf3b54af47bdcc/kv/memberlist/memberlist_client.go#L896-L900

KV.MergeRemoteState() also has protection: it waits until initialization is done before proceeding to merge a remote state:
https://github.com/grafana/dskit/blob/7cb51905074143259de6916be4bf3b54af47bdcc/kv/memberlist/memberlist_client.go#L1186-L1189

However, I can't see such protection in KV.NotifyMsg() (which in turn routes the message to a worker, which could then call broadcastNewValue()). KV.NotifyMsg() is called when a single message is received from a peer.

Conclusion: if my analysis is correct, then I think we need the same protection in KV.NotifyMsg() too, but I also think this race condition should be very difficult to trigger, so it's likely the issue is somewhere else.
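
If that analysis holds, the missing guard would be the same one-line wait at the top of the message handler. A hypothetical sketch, reusing the kv type from the sketch above (not the actual dskit change):

// NotifyMsg is called by memberlist when a single message arrives from a peer.
// Waiting here would ensure the key worker it feeds can never reach a nil
// m.broadcasts via broadcastNewValue() -> queueBroadcast().
func (m *kv) NotifyMsg(msg []byte) {
	m.initWG.Wait() // don't process peer messages until starting() has finished
	// ... decode the message and route it to the per-key worker ...
}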

@pstibrany You're the memberlist expert. Could you take a look too, please?

@colega
Contributor

colega commented Dec 15, 2022

I guess the function queueBroadcast() was called on a nil instance of TransmitLimitedQueue (the q receiver).

Yes, the stack trace says the receiver is 0x0:

github.com/hashicorp/memberlist.(*TransmitLimitedQueue).queueBroadcast(0x0, {0x2e90340, 0xc000691bc0}, 0x0)
	/backend-enterprise/vendor/github.com/hashicorp/memberlist/queue.go:183 +0x64

@pstibrany
Member

pstibrany commented Dec 15, 2022

Conclusion: If my analysis is correct, then I think we need a protection in KV.NotifyMsg() too, but I also think triggering this race condition should be very difficult, so it's likely the issue is somewhere else.

I agree with you. The call to m.initWG.Wait(), which protects m.broadcasts, used to be there, but got lost in grafana/dskit#110 (I reviewed it 😞).

@pstibrany
Member

I agree with you. The call to m.initWG.Wait(), which protects m.broadcasts, used to be there, but got lost in grafana/dskit#110 (I reviewed it 😞).

I'm working on a fix in dskit.

@pstibrany
Member

Fix: grafana/dskit#244
