cluster: fix reactor stalls during shutdown #5151
Conversation
-    co_await ss::parallel_for_each(
-      partitions, [this](auto& e) { return do_shutdown(e.second); });
+    co_await ss::max_concurrent_for_each(
+      partitions, 1024, [this](auto& e) { return do_shutdown(e.second); });
1024 is an arbitrary number, but this feels like something that isn't worth creating a full configuration property for.
this is hot! i like this. cc: @travisdowns
src/v/cluster/partition_manager.cc (outdated)
    {
        auto current = _raft_table.begin();
        while (current != _raft_table.end()) {
            current = _raft_table.erase(current, ++current);
Is this safe? erase(current, ++current) both modifies current and uses it in another argument, which I believe is indeterminately sequenced (this changed in C++17 from being UB to "indeterminately sequenced", but that stronger guarantee isn't very useful here). That is, you might end up with (current + 1), (current + 1) (an empty range) or (current), (current + 1) (what you want). In any case, why use this range erase overload at all over simply _raft_table.erase(current++)?
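To make the hazard concrete, here is a minimal, self-contained sketch using std::map as a stand-in (illustrative only; the safe post-increment idiom is the same for the hash map in the PR):

#include <map>

int main() {
    std::map<int, int> m{{1, 1}, {2, 2}, {3, 3}};
    auto current = m.begin();

    // Hazard (illustrative): the two arguments are indeterminately
    // sequenced, so ++current may run before or after the first argument
    // is read, erasing either an empty range or the intended element.
    // current = m.erase(current, ++current);

    // Safe: current++ hands the old iterator to erase and advances past
    // it before the erase invalidates it.
    m.erase(current++);
}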
src/v/cluster/partition_manager.cc (outdated)
@@ -162,10 +163,26 @@ ss::future<> partition_manager::stop_partitions() {
     co_await _gate.close();
     // prevent partitions from being accessed
     auto partitions = std::exchange(_ntp_table, {});
-    _raft_table.clear();
+
+    {
The idea here is to break up the destructor by basically putting a yield point between destroying each element, right?
We use this twice here already, and maybe elsewhere; could be worth wrapping it up in a utility function?
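A minimal sketch of what such a utility could look like, assuming a Seastar coroutine context (async_clear is a hypothetical name here; the helper the PR eventually adds may differ):

#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Hypothetical utility: erase one element at a time, yielding to the
// reactor between erases so clearing a large container cannot stall it.
template<typename Container>
ss::future<> async_clear(Container& c) {
    while (!c.empty()) {
        c.erase(c.begin());
        co_await ss::coroutine::maybe_yield();
    }
}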
src/v/cluster/partition_manager.cc (outdated)
        auto current = _raft_table.begin();
        while (current != _raft_table.end()) {
            current = _raft_table.erase(current, ++current);
            co_await ss::coroutine::maybe_yield();
I wonder what the order-of-magnitude cost of calling this for every element is. If we had a helper, it could do N elements before doing a yield (though you'd need an estimate from the caller of how expensive each deletion is to get N right).
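A hedged sketch of that batched variant (the name and the default batch size are illustrative, not from the PR):

#include <cstddef>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Hypothetical variant: erase in chunks of n elements, yielding once per
// chunk rather than once per element.
template<typename Container>
ss::future<> async_clear_batched(Container& c, std::size_t n = 128) {
    while (!c.empty()) {
        auto last = c.begin();
        for (std::size_t i = 0; i < n && last != c.end(); ++i) {
            ++last;
        }
        c.erase(c.begin(), last);
        co_await ss::coroutine::maybe_yield();
    }
}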
src/v/cluster/partition_manager.cc (outdated)
        auto current = partitions.begin();
        while (current != partitions.end()) {
            current = partitions.erase(current, ++current);
            co_await ss::coroutine::maybe_yield();
        }
Thinking about current, ++current, and the erase return value seems overly complicated? This could simply be:

while (!partitions.empty()) {
    partitions.erase(partitions.begin());
    co_await ss::coroutine::maybe_yield();
}
Revised this to create an async_clear helper (for flat_hash_map) that avoids the repetition -- this is basically revisiting #4860 now that we have a few more usage sites, whereas at the time of that PR we were only using the helper in one place.

Calling maybe_yield every iteration wasn't very expensive (it only checks a boolean for whether any other tasks are waiting), but we can call it less often by batching our erases into ranges and only calling maybe_yield once per erased range. This also gives the underlying container a chance to apply any efficiencies it may have for bulk erases vs. single-element erases.
This is for clearing large containers without causing reactor stalls.

These objects are all potentially very large, so:
- Must not destruct them in one shot (the overhead of all the item destructors is enough to cause an issue)
- Must not use ss::parallel_for_each, it's unsafe on large collections.
This ran into the llvm templates+coroutines bug llvm/llvm-project#49689, so I've wrapped the async_clear helper in a class to work around it (seems to work when compiling locally, at least).
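For illustration, one shape such a workaround can take is hoisting the coroutine from a free function template into a class template's member function (a hedged sketch only, with hypothetical names; not the PR's actual code):

#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

namespace ss = seastar;

// Hypothetical shape of the workaround: the coroutine body lives in a
// class template's member function rather than a free function template,
// which sidesteps the miscompile on affected clang versions.
template<typename Container>
struct async_clearer {
    Container& container;

    ss::future<> operator()() {
        while (!container.empty()) {
            container.erase(container.begin());
            co_await ss::coroutine::maybe_yield();
        }
    }
};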
LGTM
lgtm
Cover letter

These objects are all potentially very large, so:
- Must not destruct them in one shot (the overhead of all the item destructors is enough to cause an issue)
- Must not use ss::parallel_for_each, it's unsafe on large collections.
Release notes