
rm_stm/idempotency: fix the producer lock scope #16706

Merged · 1 commit into redpanda-data:dev on Feb 27, 2024

Conversation

bharathv
Contributor

In case of a replication error for the current sequence, the code issues a manual leader step down to prevent subsequent requests from making progress, as that would violate idempotency guarantees.

This is done by holding a mutex while the request is in progress. The mutex is incorrectly released before the step down is issued in such cases, which may theoretically let other requests make progress before the step down actually happens. The race sequence looks like this:

seq=5 replication_error
seq=6 makes progress
seq=5 issues a stepdown

This bug was identified just by eyeballing the code and couldn't be verified due to the lack of trace logs in the many-partitions test, but it seems like something that should be tightened regardless.

Deployed the patch on a 3-node cluster with a 500MB/s OMB run; no noticeable perf changes.
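To illustrate the lock-scope issue in isolation, here is a minimal, self-contained sketch. It uses plain std::mutex rather than the actual rm_stm/Seastar code, and producer_state, replicate and step_down are placeholder names, not real Redpanda symbols; it only contrasts the buggy and fixed lock scopes.

```cpp
#include <mutex>

struct producer_state {
    std::mutex seq_lock;     // serializes requests for a single producer id
    int last_acked_seq = -1;
};

// Placeholders standing in for replication and the raft step down.
bool replicate(int /*seq*/) { return false; } // pretend replication failed
void step_down() { /* relinquish leadership */ }

// Buggy shape: the lock is dropped before step_down(), so another request
// (e.g. seq=6) can acquire seq_lock and make progress first.
void handle_request_buggy(producer_state& p, int seq) {
    {
        std::unique_lock lk(p.seq_lock);
        if (replicate(seq)) {
            p.last_acked_seq = seq;
            return;
        }
    }            // lock released here -- too early
    step_down(); // seq=6 may already have raced past
}

// Fixed shape: keep the lock held until the step down has been issued.
void handle_request_fixed(producer_state& p, int seq) {
    std::unique_lock lk(p.seq_lock);
    if (replicate(seq)) {
        p.last_acked_seq = seq;
        return;
    }
    step_down(); // still holding seq_lock; no later seq can sneak in
}

int main() {
    producer_state p;
    handle_request_fixed(p, 5);
}
```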

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x
  • v23.1.x

Release Notes

Bug Fixes

  • Fixes a plausible correctness issue with idempotent requests during replication failures.

@nvartolomei
Contributor

Fixes #16657?

```diff
@@ -1182,6 +1181,7 @@ ss::future<result<kafka_result>> rm_stm::do_idempotent_replicate(
         req_ptr->set_value<ret_t>(result.error());
         co_return result.error();
     }
+    units.return_all();
```
Member

It looks like this release is still too early, as the step down happens on line 1129, outside of do_idempotent_replicate. Should we keep the units alive there?

Contributor Author

I think it does? do_idempotent_replicate() takes a reference to the units as an input parameter.

Member

Now I see it; it is a bit confusing.
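For reference, a minimal sketch of the ownership point settled in this thread (placeholder names, not the real rm_stm API): because the callee borrows the lock units by reference, releasing them inside the callee affects the caller's units, and on the error path the caller still holds the lock when it issues the step down.

```cpp
#include <mutex>

// The callee borrows the caller's lock by reference; it only releases it
// on the success path (analogous to units.return_all() in the patch).
bool do_replicate_step(std::unique_lock<std::mutex>& units) {
    bool ok = false; // pretend replication failed
    if (ok) {
        units.unlock();
    }
    return ok;
}

void caller(std::mutex& producer_mutex) {
    std::unique_lock units(producer_mutex);
    if (!do_replicate_step(units)) {
        // `units` is still held here: it was passed by reference and the
        // callee did not release it on the error path, so the step down can
        // happen under the lock before `units` unlocks at scope exit.
    }
}

int main() {
    std::mutex m;
    caller(m);
}
```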

@bharathv
Contributor Author

> Fixes #16657?

That's the hope, but I'm not 100% convinced (due to lack of diagnostics). I can loop your test a few more times on this and see what happens.

@bharathv bharathv merged commit a316400 into redpanda-data:dev Feb 27, 2024
17 checks passed
@vbotbuildovich
Collaborator

/backport v23.3.x
