
[v24.2.x] CORE-8394 cluster: consider shard0 reserve in check_cluster_limits #24462

Merged

Conversation

@vbotbuildovich (Collaborator) commented Dec 5, 2024

Backport of PR #24378 and #24409

@vbotbuildovich added this to the v24.2.x-next milestone Dec 5, 2024
@vbotbuildovich added the `kind/backport` (PRs targeting a stable branch) label Dec 5, 2024
@vbotbuildovich (Collaborator, Author) commented Dec 5, 2024

the below tests from https://buildkite.com/redpanda/redpanda/builds/59322#019398af-4577-4ca0-898b-9406fa159cf7 have failed and will be retried

partition_allocator_tests_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/59322#019398af-457c-4781-aec1-b1e977d9f5df have failed and will be retried

partition_allocator_tests_rpunit
tx_compaction_tests_rpunit


@pgellert (Contributor) commented Dec 9, 2024

I've cherry-picked the changes of PR #24409 as well, since the changes don't pass CI without it and it makes sense to backport both.

@pgellert pgellert self-assigned this Dec 9, 2024
Improve the user-facing error feedback when the
`topic_partitions_reserve_shard0` cluster config is used and a user
tries to allocate a topic that is above the partition limits.

Previously this check was only applied as part of the
`max_final_capacity` hard constraint, which meant the Kafka error
message was vague ("No nodes are available to perform allocation
after hard constraints were solved") and there were no clear broker
logs to indicate the cause.

Now this is also checked inside `check_cluster_limits`, which leads to
more specific error messages both on the Kafka API ("unable to create
topic with 20 partitions due to hardware constraints") and in the broker
logs:

```
WARN  2024-11-29 13:18:13,907 [shard 0:main] cluster - partition_allocator.cc:183 - Refusing to create 20 partitions as total partition count 20 would exceed the core-based limit 18 (per-shard limit: 20, shard0 reservation: 2)
```

(cherry picked from commit b632190)
Pure refactor. Extract for reuse in the next commit.

(cherry picked from commit 4b4f6a2)
Internal topics are excluded from checks to prevent allocation failures
when creating them. This is to ensure that lazy-allocated internal
topics (eg. the transactions topic) can always be created.

This excludes them from the global `check_cluster_limits`. A fixture
test already existed to verify that internal topics are excluded from
the limit checks; however, it erroneously relied on the fact that the
shard0 reservations were not considered in `check_cluster_limits` to
allow the test to pass. (See `allocation_over_capacity` and the
previous commit.)

This adds a new test to validate that internal topics can be created
even with partitions that are above the global shard0 reservation.

(cherry picked from commit 19bc4f2)
@pgellert force-pushed the backport-pr-24378-v24.2.x-590 branch from 725dfe7 to 5fe9620 on December 9, 2024
@pgellert merged commit ee0c765 into redpanda-data:v24.2.x Dec 10, 2024 (16 checks passed)
Labels
area/redpanda kind/backport PRs targeting a stable branch