Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Work on WASM support, part 1/2: Remove the asynchronous code in indexer. #1312

Closed
wants to merge 1 commit into from

Conversation

StyMaar
Copy link
Contributor

@StyMaar StyMaar commented Mar 15, 2022

Warning: This PR is far from a mergeable state, it's intended as a PoC and a support for technical discussions.

Motivations:

In order to port Tantivy on wasm (including indexing), there are two main issues in the current Tantivy implementation:

  • the fact that segment_updater.rs currently uses asynchronous code, which rely on futures::executor to work. This is what's addressed in this pull-request.
  • The work being done in parallel in several thread in index_writer.rs (not the subject of this PR)

What this PR does:

In this PR, I removed all asynchronous code and made the code synchronous instead. Which at first glance should not affect the behavior in any way since the current code then uses futures::executor::block_on on top of the asynchronous code.

But, when discussing with @fmassot about it, he questioned whether this would alter the logic:

Interesting, I did not thought about replacing async code by sync code for segment update operations. While reading the code, I discovered something important brought by the segment updater thread: it is in charge of processing all the segment updates operation one by one.
Cf.

// The segment update runner is in charge of processing all
// of the `SegmentUpdate`s.
//
// All this processing happens on a single thread
// consuming a common queue.
//

One operation can do several things. Let's see the different operations:

  • schedule_add_segment : it's only adding a segment to the segment manager and considering merge options (it may trigger a task that will be executed in merge dedicated threads).
  • schedule_commit: this operation is trickier, it is doing delete purge, segment commit, save of metas, garbage collect.
  • schedule_garbage_collect: garbage collect
  • end_merge: this operation seems complex too

Using sync instead of async changes this logic.

I'm not entirely sure I understands what he meant, or that he understood what I had in mind, so that's why I'm creating this pull-request, so we can discuss about it more concretely.

Note: What must be done before merging could be considered:

  • re-add the liveness check of the segment_updater (which has now been removed in all places that used to call schedule_task)
  • ensure that this PR doesn't break the logic
  • Decide what to do with tantivy::PreparedCommit::commit_async: should we wrap the now-synchronous implementation in a async function (by spawning a thread) or just drop it? What do we do in WASM ?
  • fix the unit tests (currently the tests haven't been updated to reflect the changes, then they don't even compile at this point)
  • rename a bunch of things (there are still variables names *_future even if they don't hold futures anymore.
  • remove unneeded dependencies and fix other warnings

@fmassot
Copy link
Contributor

fmassot commented Mar 15, 2022

Thanks @StyMaar for your PR.

I think that my explanation was not very clear :/ and more importantly I did not understand that you wanted to remove the merger thread pool too!

Rationale without segment updater thread and with merger thread pool

Just to clarify what I tried to explain to you in my previous answer (but failed).

The segment updater thread is ensuring that each operation is executed one by one. Removing the segment updater thread raises then the question: can 2 operations be executed at the same time without interfering with each other?

The code is complex and it's hard to know what can happen. Here is a theoretical example that can lead to unexpected results:

  • let's say we have segments 1,2,3
  • a merge of segments 1,2,3 starts in a merger thread and will produce a new merged segment 4
  • the indexing worker is indexing documents and will produce a segment 5
  • Now we have a concurrency issue:

The save_metas is doing several things: it reads the committed segments and then save the index metas:

pub fn save_metas(
&self,
opstamp: Opstamp,
commit_message: Option<String>,
) -> crate::Result<()> {
if self.is_alive() {
let index = &self.index;
let directory = index.directory();
let mut commited_segment_metas = self.segment_manager.committed_segment_metas();
// We sort segment_readers by number of documents.
// This is an heuristic to make multithreading more efficient.
//
// This is not done at the searcher level because I had a strange
// use case in which I was dealing with a large static index,
// dispatched over 5 SSD drives.
//
// A `UnionDirectory` makes it possible to read from these
// 5 different drives and creates a meta.json on the fly.
// In order to optimize the throughput, it creates a lasagna of segments
// from the different drives.
//
// Segment 1 from disk 1, Segment 1 from disk 2, etc.
commited_segment_metas.sort_by_key(|segment_meta| -(segment_meta.max_doc() as i32));
let index_meta = IndexMeta {
index_settings: index.settings().clone(),
segments: commited_segment_metas,
schema: index.schema(),
opstamp,
payload: commit_message,
};
// TODO add context to the error.
save_metas(&index_meta, directory.box_clone().borrow_mut())?;
self.store_meta(&index_meta);
}

I don't see how the code guarantees safe read/write of metas and ensure that at the end we have an index metas with only segments 4 and 5.
And it will be even worse if we have several indexing workers and several merger threads.

I looked only at save_metas but the code is far too complex for me to ensure that there will be no other concurrency issues.

Rationale without segment updater thread and merger thread pool

I see 2 problems:

  • I don't think we want to execute indexing and merge on the same thread, merging will stop indexing for a while and in many use cases we don't want this to happen
  • In the case of one indexing worker, every operation will be executed one after the other so there will be no issues. With several indexing workers, I think this is still not safe as we will have concurrent calls of save_metas for example.

@fulmicoton
Copy link
Collaborator

Closing as this PR is way too naive.

Good news however: Incidentally, I am removing all of the async stuff in #1315 (in a correct way).

@fulmicoton fulmicoton closed this Mar 18, 2022
@StyMaar
Copy link
Contributor Author

StyMaar commented Mar 18, 2022

Of course it's naive: I wrote this pull request explicitly as a way to identify exactly how such a change would break tantivy, in order to reach a design that actually works.

I appreciate that you're working on an alternative. Unfortunately it won't work in WASM as it is, but maybe we can find a way to make it work with your new design.

@fmassot
Copy link
Contributor

fmassot commented Mar 18, 2022

@StyMaar #1315 is really nice as it introduces Rayon for segment updater thread and merger threads. If you want to test wasm with Rayon, I think you can easily use Rayon for the indexing threads as well, with this change, tantivy will be based only on Rayon for threads.

@fulmicoton
Copy link
Collaborator

fulmicoton commented Mar 18, 2022

@StyMaar @fmassot #1315 is merged and covers all of this part 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants