Fix stepping down on timeout #24590

Open
wants to merge 7 commits into
base: dev

mmaslankaprv
Member

@mmaslankaprv mmaslankaprv commented Dec 17, 2024

When a follower is busy, it may fail fast when processing full heartbeat
requests sent by the leader. In this case the follower's RPC handler sets
the follower_busy result in heartbeat_reply. The leader should still treat
the follower replica as online, because the node hosting the replica must
be online to reply with the follower_busy error.

This prevents overly eager leader step-downs when follower replicas
are slow.
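
The leader-side behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in this PR: `follower_state`, `on_heartbeat_reply`, `is_online`, and `has_online_majority` are hypothetical names, the `reply_result` enum is a simplified stand-in, and a missing reply is modelled as `std::nullopt`. The point is that any reply, including `follower_busy`, refreshes the liveness timestamp, so a slow follower still counts towards the majority the leader needs to keep leadership.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical, simplified stand-ins for illustration only.
enum class reply_result { success, failure, follower_busy };

using clock_type = std::chrono::steady_clock;

struct follower_state {
    // Last time any reply arrived from the node hosting the replica.
    clock_type::time_point last_reply;
    // Last time the follower actually processed a heartbeat.
    clock_type::time_point last_successful_heartbeat;
};

// `reply` is std::nullopt when the RPC produced no reply at all.
inline void on_heartbeat_reply(
  follower_state& f,
  std::optional<reply_result> reply,
  clock_type::time_point now) {
    if (!reply) {
        return; // no reply: nothing proves the node is alive
    }
    // Any reply, including follower_busy, proves the node is online.
    f.last_reply = now;
    if (*reply == reply_result::success) {
        f.last_successful_heartbeat = now;
    }
}

inline bool is_online(
  const follower_state& f,
  clock_type::time_point now,
  clock_type::duration timeout) {
    assert(timeout > clock_type::duration::zero());
    return now - f.last_reply < timeout;
}

// The leader counts itself and keeps leadership only while a majority of
// the group (followers plus the leader) is considered online.
inline bool has_online_majority(
  const std::vector<follower_state>& followers,
  clock_type::time_point now,
  clock_type::duration timeout) {
    std::size_t online = 1;
    for (const auto& f : followers) {
        if (is_online(f, now, timeout)) {
            ++online;
        }
    }
    return online * 2 > followers.size() + 1;
}
```

In this sketch a follower that keeps answering `follower_busy` never advances `last_successful_heartbeat`, but it does keep `last_reply` fresh, which is the only input to the step-down decision.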

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v24.3.x
  • v24.2.x
  • v24.1.x

Release Notes

Improvements

  • stable leadership under load

@vbotbuildovich
Collaborator

Retry command for Build#59862

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/scaling_up_test.py::ScalingUpTest.test_fast_node_addition
tests/rptest/tests/datalake/partition_movement_test.py::PartitionMovementTest.test_cross_core_movements@{"cloud_storage_type":1}

@vbotbuildovich
Collaborator

vbotbuildovich commented Dec 17, 2024

CI test results

test results on build#59862

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| coordinator_rpunit.coordinator_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| datalake_translation_tests_rpunit.datalake_translation_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05c-4ce5-a326-067904a6399d | FAIL | 0/2 |
| datalake_translation_tests_rpunit.datalake_translation_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| distributed_kv_stm_tests_rpunit.distributed_kv_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| gtest_archival_rpunit.gtest_archival_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| gtest_raft_rpunit.gtest_raft_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| id_allocator_stm_test_rpunit.id_allocator_stm_test_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| partition_properties_stm_test_rpunit.partition_properties_stm_test_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be | FAIL | 0/6 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59862#0193d5a4-9a48-4d45-8031-616226f6505a | FLAKY | 4/6 |
| rptest.tests.scaling_up_test.ScalingUpTest.test_fast_node_addition | ducktape | https://buildkite.com/redpanda/redpanda/builds/59862#0193d591-faa4-44b3-86d0-308c5f1678be | FAIL | 0/1 |
| tm_stm_tests_rpunit.tm_stm_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59862#0193d54c-b05e-47b4-b171-e0d90039066a | FAIL | 0/2 |

test results on build#59902

| test_id | test_kind | job_url | test_status | passed |
| --- | --- | --- | --- | --- |
| datalake_translation_tests_rpunit.datalake_translation_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b161-4c06-ada3-c542ebe6df9a | FAIL | 0/2 |
| datalake_translation_tests_rpunit.datalake_translation_tests_rpunit | unit | https://buildkite.com/redpanda/redpanda/builds/59902#0193d8cb-b162-4156-9369-57ef78449e35 | FAIL | 0/2 |
| rptest.tests.cloud_retention_test.CloudRetentionTest.test_cloud_retention.max_consume_rate_mb=None.cloud_storage_type=CloudStorageType.ABS | ducktape | https://buildkite.com/redpanda/redpanda/builds/59902#0193d913-3ccd-4237-a5c3-aa10b1fd3682 | FAIL | 0/6 |
| rptest.tests.datalake.partition_movement_test.PartitionMovementTest.test_cross_core_movements.cloud_storage_type=CloudStorageType.S3 | ducktape | https://buildkite.com/redpanda/redpanda/builds/59902#0193d925-6325-4c4e-a39e-b2f4af0d69c2 | FLAKY | 4/6 |

`raft::reply_result::follower_busy` indicates that the follower
was unable to process the heartbeat fast enough to generate a response.
Renaming the reply from `timeout` makes it less confusing for the
reader and differentiates the error code from an RPC timeout.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
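
To illustrate why the rename helps, here is a small sketch that keeps the two conditions in separate types. It is an assumption-laden illustration rather than Redpanda's code: `rpc_error`, `heartbeat_outcome`, and `describe` are hypothetical names, and the `reply_result` enum is a simplified stand-in. A `follower_busy` value travels inside a reply the follower actually sent, whereas an RPC timeout means no reply arrived at all.

```cpp
#include <cassert>
#include <string_view>
#include <variant>

// Hypothetical, simplified stand-ins for illustration only.
// Result carried inside a reply that the follower did send.
enum class reply_result { success, failure, follower_busy };

// Transport-level failure: no reply reached the leader.
enum class rpc_error { timeout, disconnected };

// A heartbeat attempt ends in exactly one of the two categories.
using heartbeat_outcome = std::variant<reply_result, rpc_error>;

inline std::string_view describe(const heartbeat_outcome& outcome) {
    if (const auto* err = std::get_if<rpc_error>(&outcome)) {
        return *err == rpc_error::timeout
                 ? "rpc timeout: no reply received"
                 : "rpc disconnected: no reply received";
    }
    assert(std::holds_alternative<reply_result>(outcome));
    switch (std::get<reply_result>(outcome)) {
    case reply_result::success:
        return "success";
    case reply_result::failure:
        return "failure";
    case reply_result::follower_busy:
        return "follower busy: reply received, heartbeat not processed";
    }
    return "unknown";
}
```

With a result named `timeout` on the reply side, both branches would read as "timeout" in logs and code, even though only one of them says anything about whether the remote node is reachable.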
Signed-off-by: Michał Maślanka <michal@redpanda.com>
Wired the Raft RPC service handler into the Raft fixture to make the tests
more accurate and to cover the service code with tests.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Propagating the timeout to the node sending the RPC request is crucial for
accurately testing the Raft implementation.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
Added a wrapper around `storage::log` that allows us to inject storage
layer failures in Raft fixture tests.

Signed-off-by: Michał Maślanka <michal@redpanda.com>
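
A failure-injecting wrapper of this kind is commonly built as a decorator over the log interface. The sketch below shows the general pattern only and makes several assumptions: `log_interface`, `memory_log`, `failure_injecting_log`, and `fail_next_appends` are hypothetical names, and the interface is synchronous and far smaller than the real `storage::log`.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical minimal log interface for illustration only.
class log_interface {
public:
    virtual ~log_interface() = default;
    virtual void append(std::string record) = 0;
    virtual std::size_t size() const = 0;
};

// Trivial in-memory implementation standing in for a real log.
class memory_log final : public log_interface {
public:
    void append(std::string record) override {
        _records.push_back(std::move(record));
    }
    std::size_t size() const override { return _records.size(); }

private:
    std::vector<std::string> _records;
};

// Decorator that forwards to the wrapped log unless a failure is armed.
class failure_injecting_log final : public log_interface {
public:
    explicit failure_injecting_log(std::unique_ptr<log_interface> inner)
      : _inner(std::move(inner)) {
        assert(_inner != nullptr);
    }

    // Arm the wrapper so that the next `n` appends throw.
    void fail_next_appends(std::size_t n) { _appends_to_fail = n; }

    void append(std::string record) override {
        if (_appends_to_fail > 0) {
            --_appends_to_fail;
            throw std::runtime_error("injected append failure");
        }
        _inner->append(std::move(record));
    }

    std::size_t size() const override { return _inner->size(); }

private:
    std::unique_ptr<log_interface> _inner;
    std::size_t _appends_to_fail = 0;
};
```

Because the wrapper implements the same interface as the log it wraps, a test can hand it to the code under test unchanged and arm failures at chosen points without touching the real storage layer.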
When a follower is busy, it may fail fast when processing full heartbeat
requests sent by the leader. In this case the follower's RPC handler sets
the `follower_busy` result in heartbeat_reply. The leader should still treat
the follower replica as online, because the node hosting the replica must
be online to reply with the `follower_busy` error.

This prevents overly eager leader step-downs when follower replicas
are slow.
Signed-off-by: Michał Maślanka <michal@redpanda.com>
Signed-off-by: Michał Maślanka <michal@redpanda.com>
@mmaslankaprv mmaslankaprv force-pushed the fix-stepping-down-on-timeout branch from c321e29 to e203f89 Compare December 18, 2024 08:02
@vbotbuildovich
Collaborator

Retry command for Build#59902

please wait until all jobs are finished before running the slash command

/ci-repeat 1
tests/rptest/tests/cloud_retention_test.py::CloudRetentionTest.test_cloud_retention@{"cloud_storage_type":2,"max_consume_rate_mb":null}

2 participants