Word prefix pair proximity docids indexation refactor #587

loiclec · 2022-07-14T11:59:24Z

Pull Request

What does this PR do?

Refactor the code of WordPrefixPairProximityDocIds to make it much faster, fix a bug, and add a unit test.

Why is it faster?

Because we avoid using a sorter to insert the (word1, prefix, proximity) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations:

reusing allocations
using a prefix trie instead of an array of prefixes to get all the prefixes of a word
inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist.

What bug was fixed?

When reindexing, the new_prefix_fst_words prefixes may look like:

["ant",  "axo", "bor"]

which we group by first letter:

[["ant", "axo"], ["bor"]]

Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with word2.starts_with(subarray_prefixes[0]), but "axolotl".starts_with("ant") is false, and thus we wrongly think that there are no prefixes in new_prefix_fst_words that are prefixes of axolotl.

StrStrU8Codec

I had to change the encoding of StrStrU8Codec to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change.

Requests when reviewing this PR

I have explained what the code does in the module documentation of word_pair_proximity_prefix_docids. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not.

I also have a couple questions regarding the code itself:

Should we clean up and factor out the PrefixTrieNode code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified.
I wrote a function called write_into_lmdb_database_without_merging. (1) Are we okay with such a function existing? (2) Should it be in grenad_helpers instead?

Benchmark Results

We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%.

group                                                                     indexing_main_ce90fc62                  indexing_word-prefix-pair-proximity-docids-refactor_cbad2023
-----                                                                     ----------------------                  ------------------------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1893.0±233.03µs        ? ?/sec    1.01  1921.2±260.79µs        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.05      9.4±3.51ms        ? ?/sec     1.00      9.0±2.14ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.22    18.3±11.42ms        ? ?/sec     1.00     15.0±5.79ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.00     41.4±4.20ms        ? ?/sec     1.28    53.0±13.97ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.00   285.6±18.12ms        ? ?/sec     1.03   293.1±16.09ms        ? ?/sec
indexing/Indexing geo_point                                               1.03      60.8±0.45s        ? ?/sec     1.00      58.8±0.68s        ? ?/sec
indexing/Indexing movies in three batches                                 1.14      16.5±0.30s        ? ?/sec     1.00      14.5±0.24s        ? ?/sec
indexing/Indexing movies with default settings                            1.11      13.7±0.07s        ? ?/sec     1.00      12.3±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.6±0.11s        ? ?/sec     1.00       9.6±0.15s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.11       9.4±0.15s        ? ?/sec     1.00       8.5±0.10s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.18      66.2±0.39s        ? ?/sec     1.00      56.0±0.67s        ? ?/sec
indexing/Indexing songs with default settings                             1.07      58.7±1.26s        ? ?/sec     1.00      54.7±1.71s        ? ?/sec
indexing/Indexing songs without any facets                                1.08      53.1±0.88s        ? ?/sec     1.00      49.3±1.43s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.08      57.7±1.33s        ? ?/sec     1.00      53.3±0.98s        ? ?/sec
indexing/Indexing wiki                                                    1.06   1051.1±21.46s        ? ?/sec     1.00    989.6±24.55s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.20    1184.8±8.93s        ? ?/sec     1.00     989.7±7.06s        ? ?/sec
indexing/Reindexing geo_point                                             1.04      67.5±0.75s        ? ?/sec     1.00      64.9±0.32s        ? ?/sec
indexing/Reindexing movies with default settings                          1.12      13.9±0.17s        ? ?/sec     1.00      12.4±0.13s        ? ?/sec
indexing/Reindexing songs with default settings                           1.05      60.6±0.84s        ? ?/sec     1.00      57.5±0.99s        ? ?/sec
indexing/Reindexing wiki                                                  1.07   1725.0±17.92s        ? ?/sec     1.00    1611.4±9.90s        ? ?/sec

ManyTheFish

Nice work, I like the code structure.
However, This change is really huge, and, I'd like to have the feedback of someone else like kerollmops before merging this.

milli/src/update/word_prefix_pair_proximity_docids.rs

infos/src/main.rs

New snapshot (yes, it's wrong as well, it will get fixed later): --- source: milli/src/update/word_prefix_pair_proximity_docids.rs --- 5 a 1 [101, ] 5 a 2 [101, ] 5 am 1 [101, ] 5 b 4 [101, ] 5 be 4 [101, ] am a 3 [101, ] amazing a 1 [100, ] amazing a 2 [100, ] amazing a 3 [100, ] amazing an 1 [100, ] amazing an 2 [100, ] amazing b 2 [100, ] amazing be 2 [100, ] an a 1 [100, ] an a 2 [100, 202, ] an am 1 [100, ] an b 3 [100, ] an be 3 [100, ] and a 2 [100, ] and a 3 [100, ] and a 4 [100, ] and b 1 [100, ] and be 1 [100, ] �d\0 0 [100, 202, ] an an 2 [100, ] and am 2 [100, ] and an 3 [100, ] at a 2 [100, 101, ] at a 3 [100, ] at am 2 [100, 101, ] at an 1 [100, 202, ] at an 3 [100, ] at b 3 [101, ] at b 4 [100, ] at be 3 [101, ] at be 4 [100, ] beautiful a 2 [100, ] beautiful a 3 [100, ] beautiful a 4 [100, ] beautiful am 3 [100, ] beautiful an 2 [100, ] beautiful an 4 [100, ] bell a 2 [101, ] bell a 4 [101, ] bell am 4 [101, ] extraordinary a 2 [202, ] extraordinary a 3 [202, ] extraordinary an 2 [202, ] house a 4 [100, 202, ] house a 4 [100, ] house am 4 [100, ] house an 3 [100, 202, ] house b 2 [100, ] house be 2 [100, ] rings a 1 [101, ] rings a 3 [101, ] rings am 3 [101, ] rings b 2 [101, ] rings be 2 [101, ] the a 3 [101, ] the b 1 [101, ] the be 1 [101, ]

New full snapshot: --- source: milli/src/update/word_prefix_pair_proximity_docids.rs --- 5 a 1 [101, ] 5 a 2 [101, ] 5 am 1 [101, ] 5 b 4 [101, ] 5 be 4 [101, ] am a 3 [101, ] amazing a 1 [100, ] amazing a 2 [100, ] amazing a 3 [100, ] amazing an 1 [100, ] amazing an 2 [100, ] amazing b 2 [100, ] amazing be 2 [100, ] an a 1 [100, ] an a 2 [100, 202, ] an am 1 [100, ] an an 2 [100, ] an b 3 [100, ] an be 3 [100, ] and a 2 [100, ] and a 3 [100, ] and a 4 [100, ] and am 2 [100, ] and an 3 [100, ] and b 1 [100, ] and be 1 [100, ] at a 1 [100, 202, ] at a 2 [100, 101, ] at a 3 [100, ] at am 2 [100, 101, ] at an 1 [100, 202, ] at an 3 [100, ] at b 3 [101, ] at b 4 [100, ] at be 3 [101, ] at be 4 [100, ] beautiful a 2 [100, ] beautiful a 3 [100, ] beautiful a 4 [100, ] beautiful am 3 [100, ] beautiful an 2 [100, ] beautiful an 4 [100, ] bell a 2 [101, ] bell a 4 [101, ] bell am 4 [101, ] extraordinary a 2 [202, ] extraordinary a 3 [202, ] extraordinary an 2 [202, ] house a 3 [100, 202, ] house a 4 [100, 202, ] house am 4 [100, ] house an 3 [100, 202, ] house b 2 [100, ] house be 2 [100, ] rings a 1 [101, ] rings a 3 [101, ] rings am 3 [101, ] rings b 2 [101, ] rings be 2 [101, ] the a 3 [101, ] the b 1 [101, ] the be 1 [101, ]

Kerollmops · 2022-08-17T12:01:40Z

milli/src/update/word_prefix_pair_proximity_docids.rs

+#[derive(Default)]
+struct PrefixAndProximityBatch {
+    word1: Vec<u8>,
+    batch: Vec<(Vec<u8>, Vec<Cow<'static, [u8]>>)>,
+}


I am thinking about a potential issue with this data structure: could you prove to me that it doesn't consume to much memory? I mean, if you tell me that the tree we not grow much as it is flush every time the word1 changes it is OK.

It is flushed every time the word1 changes and every time the first letter of word2 changes, so it never gets big.

Kerollmops · 2022-08-17T12:03:06Z

milli/src/update/word_prefix_pair_proximity_docids.rs

+struct PrefixTrieNode {
+    children: Vec<(PrefixTrieNode, u8)>,
+    is_end_node: bool,
+}


How much memory will this tree consume in normal use cases? I mean, will it be flush fast enough to not trigger a memory issue.

It takes a completely negligible amount of memory and never gets flushed. Reasons:

it contains only the prefixes corresponding to more than 50 different words, which are rare

the maximum length of the prefixes is very short (typically 2 bytes only)

Kerollmops

After those explanations, I would say that it is good to be merged as it is much faster and doesn't consume too much memory.
bors merge

587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec # Pull Request ## What does this PR do? Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test. ## Why is it faster? Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 1. reusing allocations 2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word 3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. ## What bug was fixed? When reindexing, the `new_prefix_fst_words` prefixes may look like: ``` ["ant", "axo", "bor"] ``` which we group by first letter: ``` [["ant", "axo"], ["bor"]] ``` Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`. ## StrStrU8Codec I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change. ## Requests when reviewing this PR I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. I also have a couple questions regarding the code itself: - Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified. - I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead? ## Benchmark Results We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. ``` group indexing_main_ce90fc62 indexing_word-prefix-pair-proximity-docids-refactor_cbad2023 ----- ---------------------- ------------------------------------------------------------ indexing/-geo-delete-facetedNumber-facetedGeo-searchable- 1.00 1893.0±233.03µs ? ?/sec 1.01 1921.2±260.79µs ? ?/sec indexing/-movies-delete-facetedString-facetedNumber-searchable- 1.05 9.4±3.51ms ? ?/sec 1.00 9.0±2.14ms ? ?/sec indexing/-movies-delete-facetedString-facetedNumber-searchable-nested- 1.22 18.3±11.42ms ? ?/sec 1.00 15.0±5.79ms ? ?/sec indexing/-songs-delete-facetedString-facetedNumber-searchable- 1.00 41.4±4.20ms ? ?/sec 1.28 53.0±13.97ms ? ?/sec indexing/-wiki-delete-searchable- 1.00 285.6±18.12ms ? ?/sec 1.03 293.1±16.09ms ? ?/sec indexing/Indexing geo_point 1.03 60.8±0.45s ? ?/sec 1.00 58.8±0.68s ? ?/sec indexing/Indexing movies in three batches 1.14 16.5±0.30s ? ?/sec 1.00 14.5±0.24s ? ?/sec indexing/Indexing movies with default settings 1.11 13.7±0.07s ? ?/sec 1.00 12.3±0.28s ? ?/sec indexing/Indexing nested movies with default settings 1.10 10.6±0.11s ? ?/sec 1.00 9.6±0.15s ? ?/sec indexing/Indexing nested movies without any facets 1.11 9.4±0.15s ? ?/sec 1.00 8.5±0.10s ? ?/sec indexing/Indexing songs in three batches with default settings 1.18 66.2±0.39s ? ?/sec 1.00 56.0±0.67s ? ?/sec indexing/Indexing songs with default settings 1.07 58.7±1.26s ? ?/sec 1.00 54.7±1.71s ? ?/sec indexing/Indexing songs without any facets 1.08 53.1±0.88s ? ?/sec 1.00 49.3±1.43s ? ?/sec indexing/Indexing songs without faceted numbers 1.08 57.7±1.33s ? ?/sec 1.00 53.3±0.98s ? ?/sec indexing/Indexing wiki 1.06 1051.1±21.46s ? ?/sec 1.00 989.6±24.55s ? ?/sec indexing/Indexing wiki in three batches 1.20 1184.8±8.93s ? ?/sec 1.00 989.7±7.06s ? ?/sec indexing/Reindexing geo_point 1.04 67.5±0.75s ? ?/sec 1.00 64.9±0.32s ? ?/sec indexing/Reindexing movies with default settings 1.12 13.9±0.17s ? ?/sec 1.00 12.4±0.13s ? ?/sec indexing/Reindexing songs with default settings 1.05 60.6±0.84s ? ?/sec 1.00 57.5±0.99s ? ?/sec indexing/Reindexing wiki 1.07 1725.0±17.92s ? ?/sec 1.00 1611.4±9.90s ? ?/sec ``` Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>

bors · 2022-08-17T13:08:44Z

This PR was included in a batch that successfully built, but then failed to merge into main. It will not be retried.

Additional information:

{"message":"1 review requesting changes and 1 approving review by reviewers with write access.","documentation_url":"https://docs.github.com/articles/about-protected-branches"}

Tout il est bon

Kerollmops

bors merge

bors · 2022-08-17T14:22:57Z

Build succeeded:

loiclec added no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption labels Jul 14, 2022

loiclec requested a review from ManyTheFish July 18, 2022 10:13

loiclec force-pushed the word-prefix-pair-proximity-docids-refactor branch from 6135a4c to 01046a6 Compare July 19, 2022 04:54

loiclec marked this pull request as ready for review July 19, 2022 08:23

ManyTheFish previously requested changes Jul 20, 2022

View reviewed changes

milli/src/update/word_prefix_pair_proximity_docids.rs Outdated Show resolved Hide resolved

milli/src/update/word_prefix_pair_proximity_docids.rs Show resolved Hide resolved

milli/src/update/word_prefix_pair_proximity_docids.rs Show resolved Hide resolved

ManyTheFish reviewed Jul 20, 2022

View reviewed changes

milli/src/update/word_prefix_pair_proximity_docids.rs Show resolved Hide resolved

loiclec mentioned this pull request Jan 18, 2023

Write snapshot tests meilisearch/meilisearch#3360

Open

Kerollmops suggested changes Aug 9, 2022

View reviewed changes

Loïc Lecrenier added 16 commits August 17, 2022 11:59

Refactor word prefix pair proximity indexation

3065931

Refactor word prefix pair proximity indexation further

86807ca

Add tests for WordPrefixPairProximityDocIds

d350114

Optimise WordPrefixPairProximityDocIds merge operation

044356d

Simplify and document WordPrefixPairProximityDocIds::execute

2209216

Move content of readme for WordPrefixPairProximityDocids into the code

ea4a967

Add more comments to WordPrefixPairProximityDocids::execute

06f3fd8

Add newlines in documentation of word_prefix_pair_proximity_docids

34c991e

Run cargo fmt

f6f8f54

Don't run block code in doc tests of word_pair_proximity_docids

7309111

Remove cached Allocations struct from wpppd indexing

1bc4788

Add some documentation to PrefixTrieNode

405555b

Remove commented-out function

4f9edf1

cargo fmt

78d9f06

loiclec force-pushed the word-prefix-pair-proximity-docids-refactor branch from 877c6c5 to 78d9f06 Compare August 17, 2022 10:25

Apply review suggestions

9325276

Kerollmops approved these changes Aug 17, 2022

View reviewed changes

Kerollmops suggested changes Aug 17, 2022

View reviewed changes

Kerollmops approved these changes Aug 17, 2022

View reviewed changes

Kerollmops requested a review from ManyTheFish August 17, 2022 14:05

Kerollmops approved these changes Aug 17, 2022

View reviewed changes

bors bot merged commit 087da56 into main Aug 17, 2022

bors bot deleted the word-prefix-pair-proximity-docids-refactor branch August 17, 2022 14:22

curquiza mentioned this pull request Aug 24, 2022

Bad performance or crash during indexation meilisearch/meilisearch#2132

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Word prefix pair proximity docids indexation refactor #587

Word prefix pair proximity docids indexation refactor #587

loiclec commented Jul 14, 2022 •

edited

Loading

ManyTheFish left a comment •

edited

Loading

Kerollmops Aug 17, 2022

loiclec Aug 17, 2022

Kerollmops Aug 17, 2022

loiclec Aug 17, 2022

Kerollmops left a comment

bors bot commented Aug 17, 2022

Kerollmops left a comment

bors bot commented Aug 17, 2022

Word prefix pair proximity docids indexation refactor #587

Word prefix pair proximity docids indexation refactor #587

Conversation

loiclec commented Jul 14, 2022 • edited Loading

Pull Request

What does this PR do?

Why is it faster?

What bug was fixed?

StrStrU8Codec

Requests when reviewing this PR

Benchmark Results

ManyTheFish left a comment • edited Loading

Choose a reason for hiding this comment

Kerollmops Aug 17, 2022

Choose a reason for hiding this comment

loiclec Aug 17, 2022

Choose a reason for hiding this comment

Kerollmops Aug 17, 2022

Choose a reason for hiding this comment

loiclec Aug 17, 2022

Choose a reason for hiding this comment

Kerollmops left a comment

Choose a reason for hiding this comment

bors bot commented Aug 17, 2022

Kerollmops left a comment

Choose a reason for hiding this comment

bors bot commented Aug 17, 2022

loiclec commented Jul 14, 2022 •

edited

Loading

ManyTheFish left a comment •

edited

Loading