rm_stm/idempotency: fix the producer lock scope #16706

bharathv · 2024-02-26T05:55:42Z

If replication of the current sequence fails, the code issues a manual leader step down so that subsequent requests cannot make progress, since that would violate idempotency guarantees.

This is done by holding a mutex while the request is in progress. In such cases the mutex is incorrectly released before the step down is issued, which may theoretically let other requests make progress before the step down actually happens. The race sequence looks like this:

seq=5 replication_error
seq=6 makes progress
seq=5 issues a stepdown

This bug was identified by just eyeballing the code but couldn't be verified due to the lack of trace logs in the many partitions test. Seems like something that should be tightened regardless.
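To make the ordering concrete, here is a minimal single-threaded sketch of the race, not the actual rm_stm code. Every name in it (`partition_model`, `fail_request`, `lock_scope`, the `interleave` callback) is hypothetical and exists only for illustration. The callback stands in for whatever the scheduler might run between the replication error and the step down, and "blocked" is a simplification of a real request waiting on the mutex.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical model of one partition leader. These names are assumptions
// made for illustration; they do not appear in rm_stm.
struct partition_model {
    bool is_leader = true;
    bool producer_locked = false;
    std::vector<std::string> log;

    // A later request makes progress only if this node is still the leader
    // and no earlier request holds the producer lock.
    bool try_replicate(int seq) {
        if (!is_leader || producer_locked) {
            log.push_back("seq=" + std::to_string(seq) + " blocked");
            return false;
        }
        log.push_back("seq=" + std::to_string(seq) + " makes progress");
        return true;
    }

    void step_down(int seq) {
        is_leader = false;
        log.push_back("seq=" + std::to_string(seq) + " issues a stepdown");
    }
};

enum class lock_scope { released_before_step_down, held_across_step_down };

// Handles a replication error for `seq`. `interleave` runs in the window
// between the error and the step down.
inline void fail_request(
  partition_model& p,
  int seq,
  lock_scope scope,
  const std::function<void()>& interleave) {
    assert(p.is_leader); // only a leader handles produce requests
    p.producer_locked = true;
    p.log.push_back("seq=" + std::to_string(seq) + " replication_error");

    if (scope == lock_scope::released_before_step_down) {
        // Buggy ordering: the lock is dropped while this node still leads.
        p.producer_locked = false;
    }
    interleave();
    p.step_down(seq);
    // Safe to release now: leadership is gone, so nothing can slip through.
    p.producer_locked = false;
}
```

With `released_before_step_down`, a `try_replicate(6)` passed as the callback succeeds, reproducing the sequence above. With `held_across_step_down` the same call is blocked, and any request arriving after the step down fails the leadership check.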

Deployed the patch on a 3 node cluster with a 500MB/s OMB run; no noticeable perf changes.

Backports Required

Release Notes

Bug Fixes

Fixes a plausible correctness issue with idempotent requests during replication failures.


nvartolomei · 2024-02-26T09:58:26Z

Fixes #16657?

mmaslankaprv · 2024-02-26T12:38:48Z

src/v/cluster/rm_stm.cc

@@ -1182,6 +1181,7 @@ ss::future<result<kafka_result>> rm_stm::do_idempotent_replicate(
        req_ptr->set_value<ret_t>(result.error());
        co_return result.error();
    }
+    units.return_all();


It looks like this release is still too early, as the step down happens on line 1129, outside of do_idempotent_replicate. Should we keep the units alive there?

I think it does? do_idempotent_replicate() takes a reference to the units as the input parameter.

Now I see it; it is a bit confusing.
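The point resolved in this thread can be sketched as follows. This is a simplified, synchronous model under stated assumptions, not the real code: `model_units` is a hypothetical stand-in for the units object, and `replicate_model` and `handle_request_model` are invented names for the callee and its caller. The only behavior taken from the diff is that the callee returns on error before reaching `return_all()`, so the units stay held by whoever owns them.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a units object: holds one unit of a counter
// until return_all() is called or the object is destroyed.
class model_units {
public:
    explicit model_units(int& available)
      : _available(&available) {
        assert(available > 0); // a unit must be free to take
        --*_available;
    }
    model_units(const model_units&) = delete;
    model_units& operator=(const model_units&) = delete;
    ~model_units() { return_all(); }

    void return_all() {
        if (_held) {
            ++*_available;
            _held = false;
        }
    }
    bool held() const { return _held; }

private:
    int* _available;
    bool _held = true;
};

// Callee (assumed shape): borrows the caller's units by reference.
// On error it returns early and leaves the units held.
inline bool replicate_model(model_units& units, bool replication_ok) {
    if (!replication_ok) {
        return false; // units untouched; the caller still owns the lock
    }
    units.return_all(); // success path: release as early as possible
    return true;
}

// Caller (assumed shape): owns the units, so on error they are still held
// at the point where the step down would be issued.
inline std::string handle_request_model(int& available, bool replication_ok) {
    model_units units(available);
    if (!replicate_model(units, replication_ok)) {
        return units.held() ? "step down with lock held"
                            : "step down without lock";
    }
    return "replicated";
    // units goes out of scope here and returns anything still held
}
```

Because the units are passed by reference, the early return in the callee does not release them; they are returned only when the caller's object is destroyed, which in this model is after the step down decision.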

bharathv · 2024-02-26T16:26:23Z

Fixes #16657?

That's the hope, but I'm not 100% convinced (due to lack of diagnostics). I can loop your test a few more times on this and see what happens.

vbotbuildovich · 2024-02-27T15:59:39Z

/backport v23.3.x

bharathv requested a review from mmaslankaprv February 26, 2024 05:55

github-actions bot added the area/redpanda label Feb 26, 2024

mmaslankaprv reviewed Feb 26, 2024

bharathv requested a review from mmaslankaprv February 26, 2024 16:26

mmaslankaprv approved these changes Feb 27, 2024

bharathv merged commit a316400 into redpanda-data:dev Feb 27, 2024
17 checks passed

vbotbuildovich mentioned this pull request Feb 27, 2024

[v23.3.x] rm_stm/idempotency: fix the producer lock scope #16749

Merged

bharathv deleted the fix_idem branch February 27, 2024 16:25

bharathv mentioned this pull request Mar 5, 2024

Revert "rm_stm/idempotency: fix the producer lock scope" #16907

Merged

renovate bot mentioned this pull request May 4, 2024

feat(github-release)!: Update redpanda-operator to v24.1.6 otosky/home-ops#1232

Merged
