cluster: invoke config_frontend methods on controller shard #17088

Merged
merged 2 commits into from
Mar 19, 2024

Conversation

pgellert
Contributor

@pgellert pgellert commented Mar 14, 2024

Various places in the code were calling do_patch directly without regard
to the requirement that do_patch has to be called on the controller
shard.

This caused a fixture test to fail because it tried to invoke do_patch
on all shards and this violates the assertion in do_patch of
config_frontend.cc, causing it to fail with the error message "Must be
called on version_shard".

This fixes it by changing config_frontend::patch() to invoke do_patch on
the controller shard, and changing all callers of do_patch to call patch
instead.
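
The dispatch described above can be illustrated with a minimal sketch that does not use Seastar. Everything here is hypothetical: `frontend_sketch`, `cluster_sketch`, and the choice of shard 0 as the controller shard are stand-ins, a shard is modelled as a plain index, and a thrown exception replaces the real assertion. It is not the code from this PR; it only shows the shape of a public `patch` that forwards to a private `do_patch` on one fixed shard.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Assumption: shard 0 plays the role of the controller shard in this sketch.
constexpr std::size_t controller_shard = 0;

// Hypothetical per-shard service instance (not the real config_frontend).
class frontend_sketch {
public:
    frontend_sketch(std::size_t shard, std::vector<frontend_sketch>* peers)
      : _shard(shard)
      , _peers(peers) {}

    // Public entry point: callable on any shard's instance. It forwards the
    // work to the controller shard's instance, standing in for invoke_on.
    int patch(const std::string& update) {
        return _peers->at(controller_shard).do_patch(update);
    }

    int version() const { return _version; }

private:
    // Private worker: only valid on the controller shard. The exception is a
    // testable stand-in for the assertion mentioned in the description.
    int do_patch(const std::string& update) {
        if (_shard != controller_shard) {
            throw std::logic_error("do_patch called off the controller shard");
        }
        _last_update = update;
        return ++_version;
    }

    std::size_t _shard;
    std::vector<frontend_sketch>* _peers;
    std::string _last_update;
    int _version = 0;
};

// Hypothetical container that owns one instance per simulated shard.
struct cluster_sketch {
    std::vector<frontend_sketch> shards;

    explicit cluster_sketch(std::size_t n) {
        assert(n > controller_shard);
        shards.reserve(n);
        for (std::size_t i = 0; i < n; ++i) {
            shards.emplace_back(i, &shards);
        }
    }
    // Instances hold a pointer to `shards`, so copying is disabled.
    cluster_sketch(const cluster_sketch&) = delete;
    cluster_sketch& operator=(const cluster_sketch&) = delete;
};
```

In this sketch, calling `patch` on the instance of any simulated shard updates only the controller shard's state, and because `do_patch` is private, callers cannot reach it directly.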

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v23.3.x
  • v23.2.x

Release Notes

Bug Fixes

  • Fixes a bug where config_frontend methods were called on shards other than the controller shard.

@pgellert
Contributor Author

I am not sure why this test started failing now, because I can't see any recent changes that would explain it. But based on the code it seems to me that this is how we should fix the failing test.

@vbotbuildovich
Collaborator

vbotbuildovich commented Mar 14, 2024

@pgellert pgellert force-pushed the fix-fixture-test-invoke branch 2 times, most recently from dfc93a2 to 79fd14d Compare March 14, 2024 12:58
@pgellert pgellert changed the title from "kafka/test: invoke patch on controller shard" to "cluster: invoke do_patch on controller shard" Mar 14, 2024
@pgellert pgellert force-pushed the fix-fixture-test-invoke branch from 79fd14d to 28ed6b7 Compare March 14, 2024 13:26
@pgellert pgellert changed the title from "cluster: invoke do_patch on controller shard" to "cluster: invoke config_frontend methods on controller shard" Mar 14, 2024
@pgellert pgellert requested a review from BenPope March 14, 2024 13:27
Just like patch, we also have to call do_set_next_version on the same
shard because config_frontend::set_next_version might be called from a
background fiber.
@pgellert pgellert force-pushed the fix-fixture-test-invoke branch from 28ed6b7 to ed04fc3 Compare March 14, 2024 13:29
@pgellert pgellert self-assigned this Mar 14, 2024
Member

@BenPope BenPope left a comment

LGTM

Maybe @dotnwat has an opinion, too?

It might be worth backporting this, the change in the metrics reporter might not be inconsequential.

@pgellert
Contributor Author

Makes sense. I've updated the description now to make this a backport + added a bug-fix description.

@pgellert pgellert requested a review from dotnwat March 15, 2024 15:22
- co_return co_await do_patch(std::move(update), timeout);
+ co_return co_await container().invoke_on(
Member

why are we switching cores here to invoke do_patch rather than requiring the caller to invoke patch on the correct core, like in cluster::service::config_update?

to be clear, i'm not saying it should be one way or another. but i do think it should be consistent. i'd probably choose whichever pattern aligned with a majority of the callers, and then add a comment expressing the expectation on callers (if there are any), and an assertion to check it.
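
For contrast, the other pattern mentioned here (callers must already be on the correct core, with a documented expectation and a check) might look roughly like the sketch below. The names `strict_frontend_sketch` and `rejected`, and shard 0 as the controller shard, are assumptions made for illustration; a thrown exception stands in for an assertion so that the failure can be observed without aborting.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Assumption: shard 0 plays the role of the controller shard in this sketch.
constexpr std::size_t controller_shard = 0;

// Hypothetical per-shard service instance where the burden is on the caller.
class strict_frontend_sketch {
public:
    explicit strict_frontend_sketch(std::size_t shard)
      : _shard(shard) {}

    // Expectation on callers: this must be invoked on the instance that lives
    // on controller_shard. The check below enforces that expectation.
    int patch() {
        if (_shard != controller_shard) {
            throw std::logic_error("patch must be called on the controller shard");
        }
        ++_version;
        assert(_version > 0); // versions start at 1 in this sketch
        return _version;
    }

private:
    std::size_t _shard;
    int _version = 0;
};

// Helper for the usage example: reports whether a call was rejected.
inline bool rejected(strict_frontend_sketch& f) {
    try {
        f.patch();
        return false;
    } catch (const std::logic_error&) {
        return true;
    }
}
```

With this shape, a call on the controller shard's instance succeeds, while a call on any other instance is rejected, so every caller has to know which core to use.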

Member

@BenPope BenPope Mar 16, 2024

It clearly has been used incorrectly; I can't imagine a downside of making do_patch private and dispatching to the correct core within patch, but I may have missed something.

The primary advantage is that it makes the API harder to misuse.

As far as I can tell, requiring the caller to know which core to call it on is error prone and has no advantage.

I may be missing something, though.

Contributor Author

Agree with what Ben said above.

This also improves consistency locally, by making all of the public methods of config_frontend callable from any shard, not just set_status.

Globally across *_frontend.h public methods, there is an inconsistency where some methods are callable from any shard while some aren't. I would think that since invoke_on seems cheap, we should standardise making it the *_frontend classes' responsibility to delegate work to the correct core. But that's a larger undertaking.

Member

Yes, allowing the method to be invoked on any core is great.

Internally we should also be consistent:

    auto leader = _leaders.local().get_leader(model::controller_ntp);
    if (!leader) {
        co_return patch_result{
          .errc = errc::no_leader_controller, .version = config_version_unset};
    }
    if (leader == _self) {
        co_return co_await do_patch(std::move(update), timeout);
        co_return co_await container().invoke_on(

here we are combining state from different cores. it's benign in this case, but in general, it should be consistent.
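
One way to read this point is sketched below, under loud assumptions: each simulated shard keeps its own copy of the leader information, the copies are allowed to diverge purely for illustration, and `shard_state_sketch`, `patch_mixed`, and `patch_consistent` are hypothetical names. The first function reads the leader on the calling shard and applies the change on the controller shard; the second does both steps against the controller shard's state. This is not necessarily the change that was made in the PR.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <vector>

// Assumption: shard 0 plays the role of the controller shard in this sketch.
constexpr std::size_t controller_shard = 0;

// Hypothetical per-shard state: a local view of the leader and a version.
struct shard_state_sketch {
    std::optional<int> leader;
    int version = 0;
};

enum class result_sketch { applied, no_leader };

// Mixed: the leader check uses the caller's shard, the update uses the
// controller shard, so one decision combines state from two shards.
inline result_sketch
patch_mixed(std::vector<shard_state_sketch>& shards, std::size_t caller) {
    assert(controller_shard < shards.size());
    if (!shards.at(caller).leader) {
        return result_sketch::no_leader;
    }
    ++shards.at(controller_shard).version;
    return result_sketch::applied;
}

// Consistent: both the leader check and the update use the controller
// shard's state, regardless of which shard the caller is on.
inline result_sketch
patch_consistent(std::vector<shard_state_sketch>& shards, std::size_t /*caller*/) {
    assert(controller_shard < shards.size());
    auto& state = shards.at(controller_shard);
    if (!state.leader) {
        return result_sketch::no_leader;
    }
    ++state.version;
    return result_sketch::applied;
}
```

If the caller's shard has no leader recorded while the controller shard does, `patch_mixed` reports `no_leader` and `patch_consistent` applies the update. As noted above, the mix is benign in the actual code; the divergence here is contrived to make the difference visible.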

@pgellert pgellert requested a review from dotnwat March 18, 2024 13:00
@pgellert pgellert merged commit 09304e5 into redpanda-data:dev Mar 19, 2024
17 checks passed
@vbotbuildovich
Collaborator

/backport v23.3.x

@vbotbuildovich
Collaborator

/backport v23.2.x

@vbotbuildovich
Collaborator

Failed to create a backport PR to v23.2.x branch. I tried:

git remote add upstream https://github.com/redpanda-data/redpanda.git
git fetch --all
git checkout -b backport-pr-17088-v23.2.x-498 remotes/upstream/v23.2.x
git cherry-pick -x b7202f11b98f555a5403c77ff65f91e5a2f24f67 ed04fc3eba2b2e66ed68861d84ecb0a26bc2055a

Workflow run logs.

4 participants