Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances fully lazy. #1756

Merged
merged 1 commit into from
Jan 6, 2023
Merged

Make construction of LevenshteinAutomatonBuilder for FuzzyTermQuery instances fully lazy. #1756

merged 1 commit into from
Jan 6, 2023

Conversation

adamreichold
Copy link
Collaborator

No description provided.

@codecov-commenter
Copy link

codecov-commenter commented Jan 3, 2023

Codecov Report

Merging #1756 (b288f2b) into main (b22f966) will decrease coverage by 0.01%.
The diff coverage is 86.20%.

❗ Current head b288f2b differs from pull request most recent head 789b5e6. Consider uploading reports for the commit 789b5e6 to get more accurate results

@@            Coverage Diff             @@
##             main    #1756      +/-   ##
==========================================
- Coverage   94.13%   94.12%   -0.02%     
==========================================
  Files         267      267              
  Lines       50900    50903       +3     
==========================================
- Hits        47917    47910       -7     
- Misses       2983     2993      +10     
Impacted Files Coverage Δ
src/query/fuzzy_query.rs 90.47% <73.33%> (-0.59%) ⬇️
src/directory/ram_directory.rs 90.50% <100.00%> (ø)
src/directory/watch_event_router.rs 95.79% <100.00%> (ø)
src/indexer/json_term_writer.rs 99.79% <100.00%> (ø)
src/indexer/segment_updater.rs 94.40% <100.00%> (-1.05%) ⬇️
src/postings/compression/mod.rs 100.00% <100.00%> (ø)
src/query/boolean_query/boolean_query.rs 92.66% <100.00%> (ø)
src/query/boolean_query/boolean_weight.rs 93.43% <100.00%> (ø)
src/query/query_parser/logical_ast.rs 80.00% <100.00%> (ø)
src/schema/text_options.rs 100.00% <100.00%> (ø)
... and 7 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Comment on lines 115 to 118
static BUILDER: Lazy<RwLock<FxHashMap<(u8, bool), LevenshteinAutomatonBuilder>>> =
Lazy::new(Default::default);

loop {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want a RwLock here? Although this makes the logic lazy, it adds additional overhead to every read we try to create the scorer, and I believe the standard library RwLocks still have some performance issues plaguing them when being accessed concurrently by several threads even when reading only, and this is potentially a very hot path.

Secondly, this removes the cap on the maximum edit distance per term, which although isn't bad in itself, does open the potential for people to pick a number too high, 3 tends to be the realistic maximum you'd use, although it depends on the word length.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really want a RwLock here? Although this makes the logic lazy, it adds additional overhead to every read we try to create the scorer, and I believe the standard library RwLocks still have some performance issues plaguing them when being accessed concurrently by several threads even when reading only, and this is potentially a very hot path.

Yes, I am somewhat unhappy with the RwLock too, but some form of synchronization will have to be necessary if laziness is desired.

Alternatives I see are:

  • Using Mutex which should be faster but would prevent concurrent reads after the relevant builders have been initialized. Alternatively, Arc<LevenshteinAutomatonBuilder> could be cached and cloned under the mutex so that usage can proceed without it.
  • But if we do go for the additional indirection of Arc<LevenshteinAutomatonBuilder>, we might also use ArcSwap instead of RwLock (already a dependency) and always publish a new (larger) hash table when another LevenshteinAutomatonBuilder had to be initialized containing copies of all the previous Arc<LevenshteinAutomatonBuilder> instances. (LevenshteinAutomatonBuilder itself is not clone, but I guess that could also be changed upstream to avoid the additional Arc indirection in this case.)

Secondly, this removes the cap on the maximum edit distance per term, which although isn't bad in itself, does open the potential for people to pick a number too high, 3 tends to be the realistic maximum you'd use, although it depends on the word length.

The limitation seems artificial IMHO and related to the eager construction of the cached LevenshteinAutomatonBuilder instances which is why I removed it after making things lazy.

I have also read about 3 being a realistic limit, but I think this would better be served by a soft limit, e.g. advice given in the documentation of the new and new_prefix methods.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opened quickwit-oss/levenshtein-automata#13 to enable cloning of the builders.

Copy link
Collaborator

@fulmicoton fulmicoton Jan 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a hashmap (or arrays) with once_cell as value? That would remove the RwLock , the loop, and the possible expensive multiple levenshtein buider building upon race condition.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could have a hashmap (or arrays) with once_cell as value?

We could lazily initialize only the entries and eagerly construct the data structure itself, but if we keep the static limit on the distance, then the current approach - lazily constructing all builders at once - seems to most reasonable to me and I would just remove the TODO.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pushed an implementation of that approach, i.e. using [[OnceCell<LevenshteinAutomatonBuilder>; 2]; 3].

Comment on lines +128 to +129
.get(self.transposition_cost_one as usize)
.unwrap()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.get(self.transposition_cost_one as usize)
.unwrap()
[self.transposition_cost_one as usize]

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would not that be the same?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would certainly be the same, but it formats very unfavorably:

let automaton_builder = AUTOMATON_BUILDER
    .get(self.distance as usize)
    .ok_or_else(|| {
        InvalidArgument(format!(
            "Levenshtein distance of {} is not allowed. Choose a value less than {}",
            self.distance,
            AUTOMATON_BUILDER.len()
        ))
    })?[self.transposition_cost_one as usize]
    .get_or_init(|| {
        LevenshteinAutomatonBuilder::new(self.distance, self.transposition_cost_one)
    });

which is why I went for this somewhat verbose way of doing it. But I am not really invested in this.

Which one do you prefer after seeing the formatting, [..] or get(..).unwrap()?

@fulmicoton fulmicoton merged commit 1afa5bf into quickwit-oss:main Jan 6, 2023
@adamreichold adamreichold deleted the lazy-levenshtein-builder branch January 6, 2023 07:58
This was referenced Jan 13, 2023
Hodkinson pushed a commit to Hodkinson/tantivy that referenced this pull request Jan 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants