Implement async update method. Improve the performance of update by parallelising reads. #2087

vasil-pashov · 2024-12-19T15:28:38Z

Reference Issues/PRs

Implement an async_update_impl function that returns a future. The synchronous update simply calls it and waits on the future, just as append does.

This keeps most of the update code the same; however, instead of calling .get on futures, it chains them and returns a future. In the process, the reads needed by update were made parallel, so the regular update will also have improved performance.
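As a rough illustration of why issuing the reads together helps, here is a minimal sketch using only the standard library. It is not the PR's code: `read_segment`, `read_all_sequential` and `read_all_parallel` are hypothetical names, and `std::async` stands in for the store's future-returning reads.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Hypothetical stand-in for a storage read; not an ArcticDB API.
inline std::string read_segment(int key) {
    return "segment-" + std::to_string(key);
}

// Sequential: each read completes before the next one starts.
inline std::vector<std::string> read_all_sequential(const std::vector<int>& keys) {
    std::vector<std::string> out;
    out.reserve(keys.size());
    for (const int key : keys) {
        out.push_back(read_segment(key));
    }
    return out;
}

// Parallel: start every read first, then collect the results in key order.
inline std::vector<std::string> read_all_parallel(const std::vector<int>& keys) {
    std::vector<std::future<std::string>> pending;
    pending.reserve(keys.size());
    for (const int key : keys) {
        pending.push_back(std::async(std::launch::async, read_segment, key));
    }
    std::vector<std::string> out;
    out.reserve(pending.size());
    for (auto& fut : pending) {
        out.push_back(fut.get());  // blocks only until this particular read is done
    }
    assert(out.size() == keys.size());
    return out;
}
```

Both functions return the same values in the same order; the parallel version only changes when the reads are started, which is where a latency gain would come from on a real store.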

Slight refactor of the C++ unit tests: use std::array instead of std::vector for fixed-size collections and add const and constexpr specifiers. No functional changes.
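The kind of test refactor described above might look like the following sketch. The fixture name `expected_rows` and the helper `sum` are made up for illustration and do not come from the ArcticDB tests.

```cpp
#include <array>
#include <cstddef>

// Hypothetical test fixture: the element count is fixed and known at compile time.
constexpr std::array<int, 3> expected_rows{10, 20, 30};

// The size is part of the type, so it can be checked by the compiler.
static_assert(expected_rows.size() == 3, "fixture must hold three values");

// constexpr helper that sums a fixed-size array; usable at compile time or run time.
template <std::size_t N>
constexpr int sum(const std::array<int, N>& values) {
    int total = 0;
    for (std::size_t i = 0; i < N; ++i) {
        total += values[i];
    }
    return total;
}

static_assert(sum(expected_rows) == 60, "evaluated during compilation");
```

Compared with std::vector, the array needs no heap allocation and lets the fixture itself be constexpr.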

What does this implement or fix?

Any other comments?

Checklist

Checklist for code changes...

Have you updated the relevant docstrings, documentation and copyright notice?
Is this contribution tested against all ArcticDB's features?
Do all exceptions introduced raise appropriate error messages?
Are API changes highlighted in the PR description?
Is the PR labelled as enhancement or bug so it appears in autogenerated release notes?


alexowens90 · 2024-12-20T16:31:12Z

cpp/arcticdb/version/version_core.cpp

    auto versioned_item = VersionedItem(to_atom(std::move(version_key)));
-    ARCTICDB_DEBUG(log::version(), "updated stream_id: {} , version_id: {}", stream_id, update_info.next_version_id_);
+    ARCTICDB_DEBUG(log::version(), "updated stream_id: {} , version_id: {}", frame->desc.id(), update_info.next_version_id_);


Is this set on the input frame? If not, can use the stream ID from the version_key

alexowens90 · 2024-12-20T16:34:33Z

cpp/arcticdb/pipeline/index_segment_reader.cpp

@@ -20,8 +20,13 @@ using namespace arcticdb::proto::descriptors;
 namespace arcticdb::pipelines::index {

 IndexSegmentReader get_index_reader(const AtomKey &prev_index, const std::shared_ptr<Store> &store) {
-    auto [key, seg] = store->read_sync(prev_index);
-    return index::IndexSegmentReader{std::move(seg)};
+    return async_get_index_reader(prev_index, store).get();


I think we should leave this implementation using read_sync, so that the scheduling overhead can be avoided if necessary
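One way to read this suggestion is to keep two separate entry points, so that the synchronous one never schedules a task. The sketch below uses assumed names (`FakeStore`, `IndexReader`) rather than ArcticDB's types, and `std::async` as a stand-in for handing work to an executor.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <string>
#include <utility>

// Hypothetical in-memory store; not ArcticDB's Store.
struct FakeStore {
    std::string read_sync(int key) const { return "index-" + std::to_string(key); }
};

// Hypothetical reader wrapping whatever the store returned.
struct IndexReader {
    std::string payload;
};

// Synchronous path: reads directly on the calling thread, no task is scheduled.
inline IndexReader get_index_reader(int key, const std::shared_ptr<FakeStore>& store) {
    assert(store != nullptr);
    return IndexReader{store->read_sync(key)};
}

// Asynchronous path: the read runs as a separate task and the caller gets a future.
// The shared_ptr is moved into the closure so the store outlives the task.
inline std::future<IndexReader> async_get_index_reader(int key, std::shared_ptr<FakeStore> store) {
    assert(store != nullptr);
    return std::async(std::launch::async, [key, store = std::move(store)] {
        return IndexReader{store->read_sync(key)};
    });
}
```

Callers that need the result immediately use `get_index_reader`; callers building a chain use the async variant. Both produce the same reader.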

alexowens90 · 2024-12-20T16:36:56Z

cpp/arcticdb/pipeline/read_pipeline.hpp

@@ -61,8 +61,7 @@ void foreach_active_bit(const util::BitSet &bs, C &&visitor) {
    }
 }

-template<typename ContainerType>
-std::vector<SliceAndKey> filter_index(const ContainerType &container, std::optional<CombinedQuery<ContainerType>> &&query) {
+inline std::vector<SliceAndKey> filter_index(const index::IndexSegmentReader& container, std::optional<CombinedQuery<index::IndexSegmentReader>> &&query) {


I assume the template parameter always resolved to IndexSegmentReader? Maybe rename container to match as well now?

alexowens90 · 2024-12-20T16:39:15Z

cpp/arcticdb/version/version_core.cpp

-                        std::back_inserter(unaffected_keys));
-
-    util::check(affected_keys.size() + unaffected_keys.size() == index_segment_reader.size(), "Unaffected vs affected keys split was inconsistent {} + {} != {}",
+    return index::async_get_index_reader(*(update_info.previous_index_key_), store).thenValue([=](index::IndexSegmentReader&& index_segment_reader) {


Why are we capturing everything by copy?

The whole implementation is inside the continuation attached to async_get_index_reader, and the input params need to be propagated into it. I used [=] as a shorthand. In theory, options could be moved, but it consists only of PODs, so moving wouldn't gain anything.

alexowens90 · 2024-12-20T16:43:48Z

cpp/arcticdb/version/version_core.cpp

+            frame,
+            get_slicing_policy(options, *frame),
+            IndexPartialKey{stream_id, update_info.next_version_id_}, store
+        ).thenValue([


slice_and_write finishes on the IO executor, but I think we want to be on CPU for the next task

alexowens90 · 2024-12-20T16:51:38Z

cpp/arcticdb/pipeline/write_frame.cpp

-        std::move(output)
-    );
-    return SliceAndKey{std::move(new_slice), std::get<AtomKey>(std::move(fut_key).get())};
+    return store->read(existing.key()).thenValue([=](std::pair<VariantKey, SegmentInMemory>&& key_segment) {


Capture by copy?

Could also be a thenValueInline

Same as #2087 (comment): the implementation is in the lambda and it uses all of the variables, so I need to pass them to the future. I can't capture by reference, as they would be destroyed once this is put in the queue and the function returns.
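The lifetime argument can be sketched with the standard library alone. Here `Task`, `make_task` and `run_all` are hypothetical names, and a vector of `std::function` stands in for an executor queue that runs work after the enqueuing function has returned.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical deferred task type standing in for a queued continuation.
using Task = std::function<std::string()>;

// Builds a task that runs later. The closure owns its inputs, so it stays valid
// after make_task returns. A by-reference capture ([&]) would instead refer to
// parameters that are destroyed on return, leaving dangling references.
inline Task make_task(std::string stream_id, int version) {
    // Init-capture moves the string into the closure; version is copied.
    return [stream_id = std::move(stream_id), version] {
        return stream_id + "@v" + std::to_string(version);
    };
}

// Runs the queued tasks in order, long after their creators have returned.
inline std::vector<std::string> run_all(const std::vector<Task>& queue) {
    std::vector<std::string> results;
    results.reserve(queue.size());
    for (const Task& task : queue) {
        assert(task);  // calling an empty std::function would throw
        results.push_back(task());
    }
    return results;
}
```

Naming each capture explicitly, as in `make_task`, gives the same safety as `[=]` while making it visible which values are copied and which are moved.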

alexowens90 · 2024-12-20T16:53:41Z

cpp/arcticdb/pipeline/write_frame.cpp

+        const RowRange affected_row_range = partial_rewrite_row_range(segment, index_range, affected_part);
+        const int64_t num_rows = affected_row_range.end() - affected_row_range.start();
+        if (num_rows <= 0) {
+            return folly::Future<std::optional<SliceAndKey>>{std::nullopt};


I don't think this needs folly::Future; if you specify the return type of the lambda, you can probably just return std::nullopt.
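The language rule behind this suggestion can be shown with std::optional alone. In the sketch below, `RowRange` and `row_count` are hypothetical; whether the same trick works when the declared return type is a folly future depends on that type's converting constructors, which is not checked here.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical half-open row range [start, end).
struct RowRange {
    std::int64_t start;
    std::int64_t end;
};

// Without the trailing return type, the two return statements would deduce
// different types (std::nullopt_t and std::int64_t) and the lambda would not compile.
constexpr auto row_count = [](const RowRange& range) -> std::optional<std::int64_t> {
    const std::int64_t num_rows = range.end - range.start;
    if (num_rows <= 0) {
        return std::nullopt;  // converted to an empty optional by the declared return type
    }
    return num_rows;
};
```

With the return type declared once, each branch returns the most natural expression and the conversion happens implicitly.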

alexowens90 · 2024-12-20T16:58:40Z

cpp/arcticdb/version/version_core.cpp

+    IndexRange original;
+};
+
+folly::Future<AtomKey> async_update_impl(


This method's a bit of a monster now; if there's a clean way to break it up a bit, it should get more readable.

I couldn't think of a way to split things up logically, rather than pulling pieces out just for the sake of splitting. I'll give it another try; now that everything compiles and all tests are passing, there will be fewer unknowns.

vasil-pashov added 15 commits December 12, 2024 14:24

Fix compilation errors

f583136

Add async_update_impl and update_impl functions

1242013

Async version of rewrite_partial_segment

05b489f

Make reading of keys parallel in update

6c0d5c2

WIP async_update

bc55784

Async update compiling and tests passing

6443c08

Change chaining structure

646cc1d

Use thenValueInline in async_get_index_reader

229d597

Use const ref for is_timeseries_index

7043683

Use then value inline for filtering existing slices

d87193e

Replace variable with fn call

3a9cec2

Merge branch 'master' into vasil.pashov/batch_read

27cca12

Fix off-by-one error in computing index intersection with the update …

083f4be

…rage Test refactor: * Using std::array instead of std::vector * Place const and constexpr

Fix warning in release due to debug macro

c459c73

Fix compilation errors

05e7965

vasil-pashov marked this pull request as ready for review December 20, 2024 14:54

vasil-pashov requested review from alexowens90, willdealtry and poodlewars as code owners December 20, 2024 14:54

alexowens90 requested changes Dec 20, 2024
