Work on WASM support, part 1/2: Remove the asynchronous code in indexer. #1312

StyMaar · 2022-03-15T15:15:28Z

Warning: This PR is far from a mergeable state, it's intended as a PoC and a support for technical discussions.

Motivations:

In order to port Tantivy on wasm (including indexing), there are two main issues in the current Tantivy implementation:

the fact that segment_updater.rs currently uses asynchronous code, which rely on futures::executor to work. This is what's addressed in this pull-request.
The work being done in parallel in several thread in index_writer.rs (not the subject of this PR)

What this PR does:

In this PR, I removed all asynchronous code and made the code synchronous instead. Which at first glance should not affect the behavior in any way since the current code then uses futures::executor::block_on on top of the asynchronous code.

But, when discussing with @fmassot about it, he questioned whether this would alter the logic:

Interesting, I did not thought about replacing async code by sync code for segment update operations. While reading the code, I discovered something important brought by the segment updater thread: it is in charge of processing all the segment updates operation one by one.
Cf.

tantivy/src/indexer/segment_updater.rs

Lines 88 to 93 in 2069e3e

// The segment update runner is in charge of processing all

// of the `SegmentUpdate`s.

//

// All this processing happens on a single thread

// consuming a common queue.

//

One operation can do several things. Let's see the different operations:

schedule_add_segment : it's only adding a segment to the segment manager and considering merge options (it may trigger a task that will be executed in merge dedicated threads).

schedule_commit: this operation is trickier, it is doing delete purge, segment commit, save of metas, garbage collect.

schedule_garbage_collect: garbage collect

end_merge: this operation seems complex too

Using sync instead of async changes this logic.

I'm not entirely sure I understands what he meant, or that he understood what I had in mind, so that's why I'm creating this pull-request, so we can discuss about it more concretely.

Note: What must be done before merging could be considered:

re-add the liveness check of the segment_updater (which has now been removed in all places that used to call schedule_task)
ensure that this PR doesn't break the logic
Decide what to do with tantivy::PreparedCommit::commit_async: should we wrap the now-synchronous implementation in a async function (by spawning a thread) or just drop it? What do we do in WASM ?
fix the unit tests (currently the tests haven't been updated to reflect the changes, then they don't even compile at this point)
rename a bunch of things (there are still variables names *_future even if they don't hold futures anymore.
remove unneeded dependencies and fix other warnings

(cargo check passes, test are broken)

fmassot · 2022-03-15T20:38:58Z

Thanks @StyMaar for your PR.

I think that my explanation was not very clear :/ and more importantly I did not understand that you wanted to remove the merger thread pool too!

Rationale without segment updater thread and with merger thread pool

Just to clarify what I tried to explain to you in my previous answer (but failed).

The segment updater thread is ensuring that each operation is executed one by one. Removing the segment updater thread raises then the question: can 2 operations be executed at the same time without interfering with each other?

The code is complex and it's hard to know what can happen. Here is a theoretical example that can lead to unexpected results:

let's say we have segments 1,2,3
a merge of segments 1,2,3 starts in a merger thread and will produce a new merged segment 4
the indexing worker is indexing documents and will produce a segment 5
Now we have a concurrency issue:
- at the end of the merge, save_metas is called in the merger thread instead of the segment updater thread as we remove it:
  
  tantivy/src/indexer/segment_updater.rs
  
  Line 655 in 2069e3e
  
  .save_metas(previous_metas.opstamp, previous_metas.payload.clone())?;
- the schedule_commit will also call save_metas:
  
  tantivy/src/indexer/segment_updater.rs
  
  Line 480 in 2069e3e
  
  segment_updater.save_metas(opstamp, payload)?;
  
  in the index writer thread

The save_metas is doing several things: it reads the committed segments and then save the index metas:

tantivy/src/indexer/segment_updater.rs

Lines 413 to 447 in 2069e3e

    
           pub fn save_metas( 
        
               &self, 
        
               opstamp: Opstamp, 
        
               commit_message: Option<String>, 
        
           ) -> crate::Result<()> { 
        
               if self.is_alive() { 
        
                   let index = &self.index; 
        
                   let directory = index.directory(); 
        
                   let mut commited_segment_metas = self.segment_manager.committed_segment_metas(); 
        
                   // We sort segment_readers by number of documents. 
        
                   // This is an heuristic to make multithreading more efficient. 
        
                   // 
        
                   // This is not done at the searcher level because I had a strange 
        
                   // use case in which I was dealing with a large static index, 
        
                   // dispatched over 5 SSD drives. 
        
                   // 
        
                   // A `UnionDirectory` makes it possible to read from these 
        
                   // 5 different drives and creates a meta.json on the fly. 
        
                   // In order to optimize the throughput, it creates a lasagna of segments 
        
                   // from the different drives. 
        
                   // 
        
                   // Segment 1 from disk 1, Segment 1 from disk 2, etc. 
        
                   commited_segment_metas.sort_by_key(|segment_meta| -(segment_meta.max_doc() as i32)); 
        
                   let index_meta = IndexMeta { 
        
                       index_settings: index.settings().clone(), 
        
                       segments: commited_segment_metas, 
        
                       schema: index.schema(), 
        
                       opstamp, 
        
                       payload: commit_message, 
        
                   }; 
        
                   // TODO add context to the error. 
        
                   save_metas(&index_meta, directory.box_clone().borrow_mut())?; 
        
                   self.store_meta(&index_meta); 
        
               }

I don't see how the code guarantees safe read/write of metas and ensure that at the end we have an index metas with only segments 4 and 5.
And it will be even worse if we have several indexing workers and several merger threads.

I looked only at save_metas but the code is far too complex for me to ensure that there will be no other concurrency issues.

Rationale without segment updater thread and merger thread pool

I see 2 problems:

I don't think we want to execute indexing and merge on the same thread, merging will stop indexing for a while and in many use cases we don't want this to happen
In the case of one indexing worker, every operation will be executed one after the other so there will be no issues. With several indexing workers, I think this is still not safe as we will have concurrent calls of save_metas for example.

fulmicoton · 2022-03-18T05:09:27Z

Closing as this PR is way too naive.

Good news however: Incidentally, I am removing all of the async stuff in #1315 (in a correct way).

StyMaar · 2022-03-18T13:12:29Z

Of course it's naive: I wrote this pull request explicitly as a way to identify exactly how such a change would break tantivy, in order to reach a design that actually works.

I appreciate that you're working on an alternative. Unfortunately it won't work in WASM as it is, but maybe we can find a way to make it work with your new design.

fmassot · 2022-03-18T14:26:56Z

@StyMaar #1315 is really nice as it introduces Rayon for segment updater thread and merger threads. If you want to test wasm with Rayon, I think you can easily use Rayon for the indexing threads as well, with this change, tantivy will be based only on Rayon for threads.

fulmicoton · 2022-03-18T14:34:43Z

@StyMaar @fmassot #1315 is merged and covers all of this part 1.

Going full-sync: removed the need for futures::executor thread pools

338d584

(cargo check passes, test are broken)

fulmicoton closed this Mar 18, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Work on WASM support, part 1/2: Remove the asynchronous code in indexer. #1312

Work on WASM support, part 1/2: Remove the asynchronous code in indexer. #1312

StyMaar commented Mar 15, 2022 •

edited

Loading

fmassot commented Mar 15, 2022 •

edited

Loading

fulmicoton commented Mar 18, 2022

StyMaar commented Mar 18, 2022

fmassot commented Mar 18, 2022

fulmicoton commented Mar 18, 2022 •

edited

Loading

	// The segment update runner is in charge of processing all
	// of the `SegmentUpdate`s.
	//
	// All this processing happens on a single thread
	// consuming a common queue.
	//

Work on WASM support, part 1/2: Remove the asynchronous code in indexer. #1312

Work on WASM support, part 1/2: Remove the asynchronous code in indexer. #1312

Conversation

StyMaar commented Mar 15, 2022 • edited Loading

Motivations:

What this PR does:

Note: What must be done before merging could be considered:

fmassot commented Mar 15, 2022 • edited Loading

Rationale without segment updater thread and with merger thread pool

Rationale without segment updater thread and merger thread pool

fulmicoton commented Mar 18, 2022

StyMaar commented Mar 18, 2022

fmassot commented Mar 18, 2022

fulmicoton commented Mar 18, 2022 • edited Loading

StyMaar commented Mar 15, 2022 •

edited

Loading

fmassot commented Mar 15, 2022 •

edited

Loading

fulmicoton commented Mar 18, 2022 •

edited

Loading