-
Notifications
You must be signed in to change notification settings - Fork 3.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix loop of OffsetForLeaderEpoch requests on quick leader changes #4433
Conversation
Thanks for the PR! We have a pretty reliable scenario for triggering #4425 (we do a reassignment on a small topic so that there are at least two partition state changes in short succession, the first and then all subsequent OffsetForLeaderEpoch requests fail with FencedLeaderEpoch). We ran the consumers from this branch and confirmed that we don't see the issue with this patch. |
Thanks for confirming it with some independent testing @mjd95! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Just a question and a nit.
tests/0139-offset_validation_mock.c
Outdated
rd_kafka_mock_broker_push_request_error_rtts( | ||
mcluster, 2, RD_KAFKAP_OffsetForLeaderEpoch, 1, | ||
RD_KAFKA_RESP_ERR_KAFKA_STORAGE_ERROR, 900); | ||
|
||
rd_kafka_mock_broker_push_request_error_rtts( | ||
mcluster, 2, RD_KAFKAP_OffsetForLeaderEpoch, 1, | ||
RD_KAFKA_RESP_ERR_NO_ERROR, 1000); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can merge these two.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
* | ||
* See #4425. | ||
*/ | ||
static void do_test_two_leader_changes(void) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you confirm that this test fails with the old code?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, it fails with a timeout after we enter the infinite loop
* upstream/master: librdkafka v2.3.0 (confluentinc#4455) Fix for idempotent producer fatal errors, triggered after a possibly persisted message state (confluentinc#4438) Move can_q_contain_fetched_msgs inside q_serve (confluentinc#4431) [KIP-580] Exponential Backoff with Mock Broker Changes to Automate Testing. (confluentinc#4422) Update only the mklove version of OpenSSL to 3.0.11 (confluentinc#4454) Permanent errors during offset validation should be retried (confluentinc#4447) Increased flexver request size for Metadata request to include topic_id size (confluentinc#4453) Fix loop of OffsetForLeaderEpoch requests on quick leader changes (confluentinc#4433) Fix for stored offsets not being committed if they lacked the leader epoch (confluentinc#4442) Add leader epoch to control messages (confluentinc#4434) Refactored tmpabuf and fixed an insufficient buffer allocation (confluentinc#4449) Work around KIP-700 restrictions for DescribeCluster [KIP-430] [admin] KIP-430: Add authorized operations to describe API Fix segfault if assignor state is NULL, (confluentinc#4381)
@emasab Do you have an ETA on when 2.3 will be released? We believe this might the issue we are experiencing. |
Fixes #4425