Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Bug Report: Deadlock in health check and topology watcher #16994

Closed
GuptaManan100 opened this issue Oct 18, 2024 · 0 comments · Fixed by #16995
Closed

Bug Report: Deadlock in health check and topology watcher #16994

GuptaManan100 opened this issue Oct 18, 2024 · 0 comments · Fixed by #16995

Comments

@GuptaManan100
Copy link
Member

Overview of the Issue

We found out a case where the topology watcher and the health check can deadlock -

these two pieces of logic interact:
https://github.com/vitessio/vitess/blob/release-20.0/go/vt/discovery/topology_watcher.go#L126-L130
https://github.com/vitessio/vitess/blob/release-20.0/go/vt/discovery/healthcheck.go#L550

loadTablets() in the topology watcher can call into healthcheck methods that need to acquire its mutex and then updateHealth method can get stuck holding that same mutex waiting for a chan send on the loadTabletsTrigger that the topology watcher goroutine is supposed to be draining.

Here are the stack traces for the deadlock -

goroutine 10 [chan send]:
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).updateHealth(0x140004eb680, 0x140004a3ef0, 0x140001bf260, 0x0, 0x0)
	/Users/manangupta/vitess/go/vt/discovery/healthcheck.go:542 +0x480

goroutine 11 [sync.Mutex.Lock]:
sync.runtime_SemacquireMutex(0x2?, 0x0?, 0x101c10870?)
	/Users/manangupta/.gvm/pkgsets/go1.23.0/global/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.2.darwin-arm64/src/runtime/sema.go:95 +0x28
sync.(*Mutex).lockSlow(0x140004eb6a8)
	/Users/manangupta/.gvm/pkgsets/go1.23.0/global/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.2.darwin-arm64/src/sync/mutex.go:173 +0x174
sync.(*Mutex).Lock(...)
	/Users/manangupta/.gvm/pkgsets/go1.23.0/global/pkg/mod/golang.org/toolchain@v0.0.1-go1.23.2.darwin-arm64/src/sync/mutex.go:92
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).deleteTablet(0x140004eb680, 0x140000d0840)
	/Users/manangupta/vitess/go/vt/discovery/healthcheck.go:455 +0xf8
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).RemoveTablet(...)
	/Users/manangupta/vitess/go/vt/discovery/healthcheck.go:444
vitess.io/vitess/go/vt/discovery.(*HealthCheckImpl).ReplaceTablet(0x140004eb680, 0x140005565d0?, 0x1400021c000)
	/Users/manangupta/vitess/go/vt/discovery/healthcheck.go:449 +0x28
vitess.io/vitess/go/vt/discovery.(*TopologyWatcher).loadTablets(0x140000d06c0)

This problem has existed since the trigger logic was introduced in #12893.

Reproduction Steps

Run Vitess health check and topology watcher and hope to see the deadlock, or write a unit test :3!

Binary Version

v18 onwards

Operating System and Environment details

-

Log Fragments

No response

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant