Implement `ssx::sharded_abort_source` using `std::shared_ptr` #16534

ballard26 · 2024-02-08T00:16:22Z

The current implementation of `ssx::sharded_abort_source` has two potential race conditions:

- If it is stopped, `.local()` is no longer safe to call, since `.stop()` deletes all local objects. However, other shards have no way of knowing whether the abort_source was stopped, and hence no way of knowing whether `.local()` is safe to call. The new implementation avoids this by having a shared_ptr delete the local objects rather than `.stop()`.
- If the parent abort_source aborts, the subscription makes cross-shard calls to every shard-local abort_source. However, on the original shard the `ssx::sharded_abort_source` may be stopped, and the shard-local abort_sources deleted, before the cross-shard calls finish, resulting in a use-after-free. The new implementation fixes this in the same way as the first issue.
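
To illustrate the lifetime argument, here is a minimal sketch using only the standard library. It is not the code from this PR: `shared_abort_flags`, its members, and `copy_outlives_original` are hypothetical names, and plain atomics stand in for Seastar's shard-local `ss::abort_source` objects. The point it tries to show is that once all state sits behind a `std::shared_ptr`, destroying the original handle cannot invalidate a copy that is still in use.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for ssx::sharded_abort_source; not the PR's code.
class shared_abort_flags {
public:
    explicit shared_abort_flags(std::size_t shard_count)
      : _state(std::make_shared<state>(shard_count)) {}

    // Marks every per-shard flag as aborted.
    void request_abort() {
        for (auto& flag : _state->flags) {
            flag.store(true, std::memory_order_release);
        }
    }

    // Valid on any copy, even after the original handle was destroyed.
    bool abort_requested(std::size_t shard) const {
        return _state->flags.at(shard).load(std::memory_order_acquire);
    }

    // Number of handles currently sharing the state.
    long owners() const { return _state.use_count(); }

private:
    struct state {
        explicit state(std::size_t n) : flags(n) {}
        std::vector<std::atomic<bool>> flags; // value-initialized to false
    };

    std::shared_ptr<state> _state;
};

// Destroys the original handle first, then keeps using a surviving copy.
bool copy_outlives_original() {
    auto original = std::make_unique<shared_abort_flags>(4);
    shared_abort_flags copy = *original; // shares the same state
    original.reset();                    // the "owner" goes away
    copy.request_abort();                // still valid: state is kept alive
    return copy.abort_requested(3) && copy.owners() == 1;
}
```

In this sketch the state is freed only when the last copy is destroyed, which is the property both race conditions above rely on.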

May fix #14149

Backports Required

Release Notes

none

travisdowns · 2024-02-08T00:31:58Z

src/v/ssx/abort_source.h

+    }
+
+    // Subscribes to the parent abort_source.
+    ss::future<> start(ss::abort_source& parent) {


suggestion: this method can return void now

(so it could probably all be the constructor, which avoids other problems like calling start twice)

Was keeping it as ss::future<> to reduce the number of code changes introduced by this PR. Though being able to move everything to the constructor would be nice and avoid some pitfalls. Let me give it a shot.

Removed the start method and moved everything to the constructor.
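
For readers following along, a rough sketch of the "subscribe in the constructor" idea, under the assumption that subscribing is synchronous. `parent_source` and `child_source` are hypothetical stand-ins written with the standard library only; they are not Seastar types and not the code in this PR.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical parent that records callbacks; stands in for ss::abort_source.
class parent_source {
public:
    void subscribe(std::function<void()> cb) {
        _callbacks.push_back(std::move(cb));
    }

    // Runs every registered callback once.
    void abort() {
        for (auto& cb : _callbacks) {
            cb();
        }
    }

    std::size_t subscriber_count() const { return _callbacks.size(); }

private:
    std::vector<std::function<void()>> _callbacks;
};

// Subscribes in the constructor, so there is no start() to forget or to call
// twice. The callback holds its own shared_ptr to the flag, so it stays valid
// even if this object is destroyed before the parent aborts.
class child_source {
public:
    explicit child_source(parent_source& parent)
      : _aborted(std::make_shared<bool>(false)) {
        parent.subscribe([flag = _aborted] { *flag = true; });
    }

    bool aborted() const { return *_aborted; }

private:
    std::shared_ptr<bool> _aborted;
};

// Constructs a child, aborts the parent, and reports what the child observed.
bool constructed_child_sees_abort() {
    parent_source parent;
    child_source child(parent);
    const bool subscribed_once = parent.subscriber_count() == 1;
    parent.abort();
    return subscribed_once && child.aborted();
}
```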

dotnwat · 2024-02-08T00:33:48Z

src/v/ssx/abort_source.h

-    std::optional<ss::abort_source::subscription> _sub;
+    struct sharded_abort_source_internal {
+        ss::shard_id orig_shard;
+        std::vector<ss::abort_source> as;


I haven't thought much yet about the general strategy of this PR.

But if we end up going with this general approach, would it be worth allocating these abort sources so that they avoid false sharing, which presumably is happening since each core is indexing into its own slot in this vector?

Taking things further you could probably roll your own reference counting that either avoided the atomics (using x-core messages) or used some relaxed / sloppy counting techniques, but it might be overkill.

which presumably is happening each core is indexing into its own slot in this vector?

False sharing requires mutation: if two (or more) threads access objects on the same cache line but those accesses are all reads, the cache line is simply shared among all the cores, which is an efficient and common pattern.

So there could be some false sharing here, but only when the source is actually aborted, which is already a fairly expensive operation.

but only when the source is actually aborted

ahh, thanks. of course!
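
If padding ever did turn out to be worthwhile, one common way to do it is sketched below. This is an illustration only: `padded_flag`, `make_flags`, and `slots_on_distinct_lines` are hypothetical names, the 64-byte line size is an assumption that holds on many x86-64 parts but not everywhere, and C++17 is assumed so that the standard allocator honors the over-alignment.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assumed cache-line size; 64 bytes is common but not universal.
constexpr std::size_t cache_line_size = 64;

// Hypothetical per-shard slot, aligned and padded to a full cache line so
// that a write to one slot does not invalidate a neighbouring slot's line.
struct alignas(cache_line_size) padded_flag {
    std::atomic<bool> aborted{false};
};

static_assert(sizeof(padded_flag) == cache_line_size);
static_assert(alignof(padded_flag) == cache_line_size);

// One slot per shard, each on its own cache line.
std::vector<padded_flag> make_flags(std::size_t shard_count) {
    return std::vector<padded_flag>(shard_count);
}

// Checks that every pair of adjacent slots is exactly one line apart.
bool slots_on_distinct_lines(const std::vector<padded_flag>& flags) {
    for (std::size_t i = 1; i < flags.size(); ++i) {
        const auto prev = reinterpret_cast<std::uintptr_t>(&flags[i - 1]);
        const auto curr = reinterpret_cast<std::uintptr_t>(&flags[i]);
        if (curr - prev != cache_line_size) {
            return false;
        }
    }
    return true;
}
```

The cost is memory: each slot grows to a full line, which, as noted above, may not pay off when writes only happen on abort.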

travisdowns · 2024-02-08T00:47:49Z

src/v/ssx/abort_source.h

+ * A sharded version of `ss::abort_source` that allows any shard to request an
+ * abort on all other shards.
+ *
+ * Note that is class's members are wrapped in a shared_ptr. So the object


Good idea to comment on this but I think it could be a bit clearer, maybe something along the lines of:

Sharing one instance of this class across threads raises difficult lifetime and thread safety issues, so the intended usage pattern for this class is to create it on one shard, then make a copy of this original object on every shard where it will be used. All internal state is wrapped in a shared pointer so these copies reference the same underlying state.

Good point, switched to the recommended comment.

travisdowns

oblig comment

The previous implementation of `ssx::sharded_abort_source` had two potential race conditions:

- If it is stopped, `.local()` is no longer safe to call, since `.stop()` deletes all local objects. However, other shards have no way of knowing whether the abort_source was stopped, and hence no way of knowing whether `.local()` is safe to call. The new implementation avoids this by having a shared_ptr delete the local objects rather than `.stop()`.
- If the parent abort_source aborts, the subscription makes cross-shard calls to every shard-local abort_source. However, on the original shard the `ssx::sharded_abort_source` may be stopped, and the shard-local abort_sources deleted, before the cross-shard calls finish, resulting in a use-after-free. The new implementation fixes this in the same way as the first issue.

BenPope

Looks good to me.

Any reason not to use ss::optimized_optional?

--- a/src/v/ssx/abort_source.h
+++ b/src/v/ssx/abort_source.h
@@ -39,15 +39,12 @@ public:
         std::vector<ss::abort_source>(ss::smp::count),
         std::nullopt)) {
         auto dex = parent.get_default_exception();
-        auto sub = parent.subscribe(
+        _internal->sub = parent.subscribe(
           [as = *this,
            dex](std::optional<std::exception_ptr> const& ex) mutable noexcept {
               dex = ex.value_or(dex);
               ssx::background = as.request_abort_ex(dex);
           });
-        if (sub) {
-            _internal->sub.emplace(std::move(*sub));
-        }
     }
 
     // Returns a reference to an abort_source local to the calling shard
@@ -102,7 +99,7 @@ public:
           ss::this_shard_id() == _internal->orig_shard,
           "sharded_abort_source must be stopped on its original shard");
 
-        _internal->sub.reset();
+        _internal->sub = std::nullopt;
         return request_abort();
     }
 
@@ -110,7 +107,7 @@ private:
     struct sharded_abort_source_internal {
         ss::shard_id orig_shard;
         std::vector<ss::abort_source> as;
-        std::optional<ss::abort_source::subscription> sub;
+        ss::optimized_optional<ss::abort_source::subscription> sub;
     };
 
     std::shared_ptr<sharded_abort_source_internal> _internal;

ballard26 requested review from dotnwat, BenPope, travisdowns and StephanDollberg February 8, 2024 00:16

github-actions bot added the area/redpanda label Feb 8, 2024

travisdowns reviewed Feb 8, 2024


dotnwat reviewed Feb 8, 2024


travisdowns reviewed Feb 8, 2024


ballard26 force-pushed the safe-sub-as branch from 022bcf4 to f6f257a Compare March 28, 2024 19:22

ballard26 requested a review from travisdowns March 28, 2024 19:22

ballard26 force-pushed the safe-sub-as branch from f6f257a to 53f0cc6 Compare March 28, 2024 20:06

BenPope reviewed Apr 23, 2024


ballard26 marked this pull request as draft July 19, 2024 17:53
