
[v24.2.x] CORE-8394 cluster: consider shard0 reserve in check_cluster_limits #24462

Merged

Conversation

@vbotbuildovich (Collaborator) commented Dec 5, 2024

Backport of PR #24378 and #24409

@vbotbuildovich added this to the v24.2.x-next milestone Dec 5, 2024
@vbotbuildovich added the `kind/backport` (PRs targeting a stable branch) label Dec 5, 2024
@vbotbuildovich (Collaborator, Author) commented Dec 5, 2024

the below tests from https://buildkite.com/redpanda/redpanda/builds/59322#019398af-4577-4ca0-898b-9406fa159cf7 have failed and will be retried

partition_allocator_tests_rpunit

the below tests from https://buildkite.com/redpanda/redpanda/builds/59322#019398af-457c-4781-aec1-b1e977d9f5df have failed and will be retried

partition_allocator_tests_rpunit
tx_compaction_tests_rpunit


@pgellert (Contributor) commented Dec 9, 2024

I've cherry-picked the changes of PR #24409 as well, since the changes don't pass CI without it and it makes sense to backport both.

@pgellert pgellert self-assigned this Dec 9, 2024
Improve the user-facing error feedback when the
`topic_partitions_reserve_shard0` cluster config is used and a user
tries to allocate a topic that is above the partition limits.

Previously this check was only applied as part of the
`max_final_capacity` hard constraint, which meant the Kafka error
message was vague ("No nodes are available to perform allocation
after hard constraints were solved") and there were no clear broker
logs to indicate the cause.

Now this is also checked inside `check_cluster_limits`, which leads to
more specific error messages both on the Kafka API ("unable to create
topic with 20 partitions due to hardware constraints") and in the broker
logs:

```
WARN  2024-11-29 13:18:13,907 [shard 0:main] cluster - partition_allocator.cc:183 - Refusing to create 20 partitions as total partition count 20 would exceed the core-based limit 18 (per-shard limit: 20, shard0 reservation: 2)
```

(cherry picked from commit b632190)
Pure refactor. Extract for reuse in the next commit.

(cherry picked from commit 4b4f6a2)
Internal topics are excluded from checks to prevent allocation failures
when creating them. This is to ensure that lazy-allocated internal
topics (eg. the transactions topic) can always be created.

This excludes them from the global `check_cluster_limits`. A fixture
test already existed to verify that internal topics are excluded from
the limit checks; however, it erroneously relied on the fact that the
shard0 reservations were not considered in `check_cluster_limits` to
allow the test to pass. (See `allocation_over_capacity` and the
previous commit.)

This adds a new test to validate that internal topics can be created
even with partitions that are above the global shard0 reservation.

(cherry picked from commit 19bc4f2)
@pgellert force-pushed the backport-pr-24378-v24.2.x-590 branch from 725dfe7 to 5fe9620 on December 9, 2024
@pgellert merged commit ee0c765 into redpanda-data:v24.2.x Dec 10, 2024 (16 checks passed)
Labels
area/redpanda kind/backport PRs targeting a stable branch