raft/c: fix an indefinite hang in transfer leadership #24404
/ci-repeat 1
3845f24 to b21b15a
b21b15a to 36b86f1
Retry command for Build#59208: please wait until all jobs are finished before running the slash command
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#0193900d-362c-4a1e-ac0e-951f86073cd9:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#0193900d-362e-4948-ba03-87047c5cbc15:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#01939012-94b1-4626-870d-7023ed923a22:
non flaky failures in https://buildkite.com/redpanda/redpanda/builds/59208#01939012-94b3-4515-9a2c-ea0f2873d859:
This typically happens when there is a stepdown and the downstream consumers like recovery need to know about it.
36b86f1 to d0a562b
/backport v24.3.x
/backport v24.2.x
/backport v24.1.x
This PR fixes a timeout in the raft leadership transfer request. The timeout is caused by a race condition in raft that leaves recovery_stm stuck even after the node has lost leadership. Exact sequence of events below:
This PR signals the condition variable (CV) upon step down to unblock recovery_stm, which then detects that it has to shut down because of the step down (loss of leadership). This automatically unblocks the transfer leadership request that was waiting for it.
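To illustrate the idea, here is a minimal sketch of the wake-on-step-down pattern. It is not the code from this PR: it uses standard library threads and `std::condition_variable` instead of Seastar primitives, and `follower_state`, `wait_for_work`, `step_down`, and `simulate_step_down_during_recovery` are hypothetical names chosen for the example.

```cpp
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>

// Hypothetical stand-in for the state shared by the leader logic and a
// recovery loop. This is not Redpanda's real type.
struct follower_state {
    std::mutex mtx;
    std::condition_variable cv; // the recovery loop parks on this
    bool is_leader = true;
    bool has_work = false;
};

enum class wake_reason { work_available, lost_leadership, timed_out };

// Simplified recovery loop body: block until there is work to do or
// leadership has been lost.
wake_reason wait_for_work(follower_state& st, std::chrono::milliseconds timeout) {
    std::unique_lock<std::mutex> lk(st.mtx);
    bool ready = st.cv.wait_for(
      lk, timeout, [&st] { return st.has_work || !st.is_leader; });
    if (!ready) {
        return wake_reason::timed_out;
    }
    return st.is_leader ? wake_reason::work_available
                        : wake_reason::lost_leadership;
}

// Step down: flip the flag and signal the CV. If the notify were omitted,
// a parked waiter would not observe the change until its timeout expired
// (or never, for an untimed wait).
void step_down(follower_state& st) {
    {
        std::lock_guard<std::mutex> lk(st.mtx);
        st.is_leader = false;
    }
    st.cv.notify_all();
}

// A step down arrives while recovery is (or is about to be) parked.
wake_reason simulate_step_down_during_recovery() {
    follower_state st;
    auto recovery = std::async(std::launch::async, [&st] {
        return wait_for_work(st, std::chrono::seconds(30));
    });
    step_down(st);
    return recovery.get();
}
```

In this sketch the waiter returns `wake_reason::lost_leadership` whether the step down happens before or after it parks: the predicate is checked before blocking, and the notification covers the case where it is already blocked. Anything waiting on the recovery loop to finish, such as a leadership transfer, can then proceed.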
An unrelated (but deeper) issue this PR avoids fixing is why the step down happens when a transfer is already in progress. In this case it was initiated by an STM invariant that requires a step down upon first failure (required for correctness). One can argue that stepping down while a transfer is already in progress is not the right behavior. The fix for it is likely invasive and has correctness implications at the state machine level (e.g. idempotency), so this PR avoids it for now.
Backports Required
Release Notes