cluster: fix reactor stalls during shutdown #5151
Conversation
-    co_await ss::parallel_for_each(
-      partitions, [this](auto& e) { return do_shutdown(e.second); });
+    co_await ss::max_concurrent_for_each(
+      partitions, 1024, [this](auto& e) { return do_shutdown(e.second); });
1024 is an arbitrary number, but this feels like something that isn't worth creating a full configuration property for.
this is hot! i like this. cc: @travisdowns
src/v/cluster/partition_manager.cc (outdated)
    {
        auto current = _raft_table.begin();
        while (current != _raft_table.end()) {
            current = _raft_table.erase(current, ++current);
Is this safe? erase(current, ++current) both modifies current and uses it in another argument, which I believe is indeterminately sequenced (this changed in C++17 from being UB to "indeterminately sequenced", but that stronger guarantee isn't very useful here). That is, you might end up with (current + 1), (current + 1) (an empty range) or (current), (current + 1) (what you want). In any case, why use this range erase overload at all over simply _raft_table.erase(current++)?
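To make the hazard concrete, here is a minimal, self-contained sketch using std::map as a stand-in (illustrative only; the safe post-increment idiom is the same for the hash map in the PR):

#include <map>

int main() {
    std::map<int, int> m{{1, 1}, {2, 2}, {3, 3}};
    auto current = m.begin();

    // Hazard (illustrative): the two arguments are indeterminately
    // sequenced, so ++current may run before or after the first argument
    // is read, erasing either an empty range or the intended element.
    // current = m.erase(current, ++current);

    // Safe: current++ hands the old iterator to erase and advances past
    // it before the erase invalidates it.
    m.erase(current++);
}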
src/v/cluster/partition_manager.cc (outdated)
@@ -162,10 +163,26 @@ ss::future<> partition_manager::stop_partitions() {
     co_await _gate.close();
     // prevent partitions from being accessed
     auto partitions = std::exchange(_ntp_table, {});
-    _raft_table.clear();
+
+    {
The idea here is to break up the destructor by basically putting a yield point between destroying each element, right?
We use this twice here already, and maybe elsewhere; could be worth wrapping it up in a utility function?
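A minimal sketch of what such a utility could look like, assuming a Seastar coroutine context (async_clear is a hypothetical name here; the helper the PR eventually adds may differ):

#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Hypothetical utility: erase one element at a time, yielding to the
// reactor between erases so clearing a large container cannot stall it.
template<typename Container>
ss::future<> async_clear(Container& c) {
    while (!c.empty()) {
        c.erase(c.begin());
        co_await ss::coroutine::maybe_yield();
    }
}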
src/v/cluster/partition_manager.cc (outdated)
        auto current = _raft_table.begin();
        while (current != _raft_table.end()) {
            current = _raft_table.erase(current, ++current);
            co_await ss::coroutine::maybe_yield();
I wonder what the order-of-magnitude cost of calling this for every element is. If we had a helper, it could do N elements before doing a yield (though you'd need an estimate from the caller of how expensive each deletion is to get N right).
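A hedged sketch of that batched variant (the name and the default batch size are illustrative, not from the PR):

#include <cstddef>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Hypothetical variant: erase in chunks of n elements, yielding once per
// chunk rather than once per element.
template<typename Container>
ss::future<> async_clear_batched(Container& c, std::size_t n = 128) {
    while (!c.empty()) {
        auto last = c.begin();
        for (std::size_t i = 0; i < n && last != c.end(); ++i) {
            ++last;
        }
        c.erase(c.begin(), last);
        co_await ss::coroutine::maybe_yield();
    }
}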
src/v/cluster/partition_manager.cc (outdated)
        auto current = partitions.begin();
        while (current != partitions.end()) {
            current = partitions.erase(current, ++current);
            co_await ss::coroutine::maybe_yield();
        }
Thinking about current, ++current, and the erase return value seems overly complicated? This could simply be:

while (!partitions.empty()) {
    partitions.erase(partitions.begin());
    co_await ss::coroutine::maybe_yield();
}
Revised this to create an async_clear helper (for flat_hash_map) that avoids the repetition -- this is basically revisiting #4860 now that we have a few more usage sites, whereas at the time of that PR we were only using the helper in one place.

Calling maybe_yield every iteration wasn't very expensive (it only checks a boolean for whether any other tasks are waiting), but we can call it less often by batching our erases into ranges and only calling maybe_yield once per erased range. This also gives the underlying container a chance to apply any efficiencies it may have for bulk erases vs. single-element erases.
This is for clearing large containers without causing reactor stalls.

These objects are all potentially very large, so:
- Must not destruct them in one shot (the overhead of all the item destructors is enough to cause an issue)
- Must not use ss::parallel_for_each, it's unsafe on large collections.
This ran into the llvm templates+coroutines bug llvm/llvm-project#49689, so I've wrapped the async_clear helper in a class to work around it (seems to work when compiling locally, at least).
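For illustration, one shape such a workaround can take is hoisting the coroutine from a free function template into a class template's member function (a hedged sketch only, with hypothetical names; not the PR's actual code):

#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Hypothetical shape of the workaround: the coroutine body lives in a
// class template's member function rather than a free function template,
// which sidesteps the miscompile on affected clang versions.
template<typename Container>
struct async_clearer {
    Container& container;

    ss::future<> operator()() {
        while (!container.empty()) {
            container.erase(container.begin());
            co_await ss::coroutine::maybe_yield();
        }
    }
};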
LGTM
lgtm
Cover letter

These objects are all potentially very large, so:
- Must not destruct them in one shot (the overhead of all the item destructors is enough to cause an issue)
- Must not use ss::parallel_for_each, it's unsafe on large collections.
Release notes