Merge #587

587: Word prefix pair proximity docids indexation refactor r=Kerollmops a=loiclec # Pull Request ## What does this PR do? Refactor the code of `WordPrefixPairProximityDocIds` to make it much faster, fix a bug, and add a unit test. ## Why is it faster? Because we avoid using a sorter to insert the (`word1`, `prefix`, `proximity`) keys and their associated bitmaps, and thus we don't have to sort a potentially very big set of data. I have also added a couple of other optimisations: 1. reusing allocations 2. using a prefix trie instead of an array of prefixes to get all the prefixes of a word 3. inserting directly into the database instead of putting the data in an intermediary grenad when possible. Also avoid checking for pre-existing values in the database when we know for certain that they do not exist. ## What bug was fixed? When reindexing, the `new_prefix_fst_words` prefixes may look like: ``` ["ant", "axo", "bor"] ``` which we group by first letter: ``` [["ant", "axo"], ["bor"]] ``` Later in the code, if we have the word2 "axolotl", we try to find which subarray of prefixes contains its prefixes. This check is done with `word2.starts_with(subarray_prefixes[0])`, but `"axolotl".starts_with("ant")` is false, and thus we wrongly think that there are no prefixes in `new_prefix_fst_words` that are prefixes of `axolotl`. ## StrStrU8Codec I had to change the encoding of `StrStrU8Codec` to make the second string null-terminated as well. I don't think this should be a problem, but I may have missed some nuances about the impacts of this change. ## Requests when reviewing this PR I have explained what the code does in the module documentation of `word_pair_proximity_prefix_docids`. It would be nice if someone could read it and give their opinion on whether it is a clear explanation or not. I also have a couple questions regarding the code itself: - Should we clean up and factor out the `PrefixTrieNode` code to try and make broader use of it outside this module? For now, the prefixes undergo a few transformations: from FST, to array, to prefix trie. It seems like it could be simplified. - I wrote a function called `write_into_lmdb_database_without_merging`. (1) Are we okay with such a function existing? (2) Should it be in `grenad_helpers` instead? ## Benchmark Results We reduce the time it takes to index about 8% in most cases, but it varies between -3% and -20%. ``` group indexing_main_ce90fc62 indexing_word-prefix-pair-proximity-docids-refactor_cbad2023 ----- ---------------------- ------------------------------------------------------------ indexing/-geo-delete-facetedNumber-facetedGeo-searchable- 1.00 1893.0±233.03µs ? ?/sec 1.01 1921.2±260.79µs ? ?/sec indexing/-movies-delete-facetedString-facetedNumber-searchable- 1.05 9.4±3.51ms ? ?/sec 1.00 9.0±2.14ms ? ?/sec indexing/-movies-delete-facetedString-facetedNumber-searchable-nested- 1.22 18.3±11.42ms ? ?/sec 1.00 15.0±5.79ms ? ?/sec indexing/-songs-delete-facetedString-facetedNumber-searchable- 1.00 41.4±4.20ms ? ?/sec 1.28 53.0±13.97ms ? ?/sec indexing/-wiki-delete-searchable- 1.00 285.6±18.12ms ? ?/sec 1.03 293.1±16.09ms ? ?/sec indexing/Indexing geo_point 1.03 60.8±0.45s ? ?/sec 1.00 58.8±0.68s ? ?/sec indexing/Indexing movies in three batches 1.14 16.5±0.30s ? ?/sec 1.00 14.5±0.24s ? ?/sec indexing/Indexing movies with default settings 1.11 13.7±0.07s ? ?/sec 1.00 12.3±0.28s ? ?/sec indexing/Indexing nested movies with default settings 1.10 10.6±0.11s ? ?/sec 1.00 9.6±0.15s ? ?/sec indexing/Indexing nested movies without any facets 1.11 9.4±0.15s ? ?/sec 1.00 8.5±0.10s ? ?/sec indexing/Indexing songs in three batches with default settings 1.18 66.2±0.39s ? ?/sec 1.00 56.0±0.67s ? ?/sec indexing/Indexing songs with default settings 1.07 58.7±1.26s ? ?/sec 1.00 54.7±1.71s ? ?/sec indexing/Indexing songs without any facets 1.08 53.1±0.88s ? ?/sec 1.00 49.3±1.43s ? ?/sec indexing/Indexing songs without faceted numbers 1.08 57.7±1.33s ? ?/sec 1.00 53.3±0.98s ? ?/sec indexing/Indexing wiki 1.06 1051.1±21.46s ? ?/sec 1.00 989.6±24.55s ? ?/sec indexing/Indexing wiki in three batches 1.20 1184.8±8.93s ? ?/sec 1.00 989.7±7.06s ? ?/sec indexing/Reindexing geo_point 1.04 67.5±0.75s ? ?/sec 1.00 64.9±0.32s ? ?/sec indexing/Reindexing movies with default settings 1.12 13.9±0.17s ? ?/sec 1.00 12.4±0.13s ? ?/sec indexing/Reindexing songs with default settings 1.05 60.6±0.84s ? ?/sec 1.00 57.5±0.99s ? ?/sec indexing/Reindexing wiki 1.07 1725.0±17.92s ? ?/sec 1.00 1611.4±9.90s ? ?/sec ``` Co-authored-by: Loïc Lecrenier <loic@meilisearch.com>
meilisearch · Aug 17, 2022 · 4a24bfc · 4a24bfc
2 parents 39869be + 9325276
commit 4a24bfc
Show file tree

Hide file tree

Showing 8 changed files with 807 additions and 193 deletions.
diff --git a/infos/src/main.rs b/infos/src/main.rs
@@ -1152,6 +1152,7 @@ fn word_pair_proximities_docids(
     prefix.extend_from_slice(word1.as_bytes());
     prefix.push(0);
     prefix.extend_from_slice(word2.as_bytes());
+    prefix.push(0);
 
     let db = index.word_pair_proximity_docids.as_polymorph();
     let iter = db.prefix_iter::<_, ByteSlice, RoaringBitmapCodec>(rtxn, &prefix)?;

diff --git a/milli/src/heed_codec/mod.rs b/milli/src/heed_codec/mod.rs
@@ -15,4 +15,4 @@ pub use self::roaring_bitmap_length::{
     BoRoaringBitmapLenCodec, CboRoaringBitmapLenCodec, RoaringBitmapLenCodec,
 };
 pub use self::str_beu32_codec::StrBEU32Codec;
-pub use self::str_str_u8_codec::StrStrU8Codec;
+pub use self::str_str_u8_codec::{StrStrU8Codec, UncheckedStrStrU8Codec};
diff --git a/milli/src/heed_codec/str_str_u8_codec.rs b/milli/src/heed_codec/str_str_u8_codec.rs
@@ -9,9 +9,11 @@ impl<'a> heed::BytesDecode<'a> for StrStrU8Codec {
     fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
         let (n, bytes) = bytes.split_last()?;
         let s1_end = bytes.iter().position(|b| *b == 0)?;
-        let (s1_bytes, s2_bytes) = bytes.split_at(s1_end);
+        let (s1_bytes, rest) = bytes.split_at(s1_end);
+        let rest = &rest[1..];
         let s1 = str::from_utf8(s1_bytes).ok()?;
-        let s2 = str::from_utf8(&s2_bytes[1..]).ok()?;
+        let (_, s2_bytes) = rest.split_last()?;
+        let s2 = str::from_utf8(s2_bytes).ok()?;
         Some((s1, s2, *n))
     }
 }
@@ -24,6 +26,35 @@ impl<'a> heed::BytesEncode<'a> for StrStrU8Codec {
         bytes.extend_from_slice(s1.as_bytes());
         bytes.push(0);
         bytes.extend_from_slice(s2.as_bytes());
+        bytes.push(0);
+        bytes.push(*n);
+        Some(Cow::Owned(bytes))
+    }
+}
+pub struct UncheckedStrStrU8Codec;
+
+impl<'a> heed::BytesDecode<'a> for UncheckedStrStrU8Codec {
+    type DItem = (&'a [u8], &'a [u8], u8);
+
+    fn bytes_decode(bytes: &'a [u8]) -> Option<Self::DItem> {
+        let (n, bytes) = bytes.split_last()?;
+        let s1_end = bytes.iter().position(|b| *b == 0)?;
+        let (s1_bytes, rest) = bytes.split_at(s1_end);
+        let rest = &rest[1..];
+        let (_, s2_bytes) = rest.split_last()?;
+        Some((s1_bytes, s2_bytes, *n))
+    }
+}
+
+impl<'a> heed::BytesEncode<'a> for UncheckedStrStrU8Codec {
+    type EItem = (&'a [u8], &'a [u8], u8);
+
+    fn bytes_encode((s1, s2, n): &Self::EItem) -> Option<Cow<[u8]>> {
+        let mut bytes = Vec::with_capacity(s1.len() + s2.len() + 1 + 1);
+        bytes.extend_from_slice(s1);
+        bytes.push(0);
+        bytes.extend_from_slice(s2);
+        bytes.push(0);
         bytes.push(*n);
         Some(Cow::Owned(bytes))
     }

diff --git a/milli/src/lib.rs b/milli/src/lib.rs
@@ -37,7 +37,7 @@ pub use self::fields_ids_map::FieldsIdsMap;
 pub use self::heed_codec::{
     BEU32StrCodec, BoRoaringBitmapCodec, BoRoaringBitmapLenCodec, CboRoaringBitmapCodec,
     CboRoaringBitmapLenCodec, FieldIdWordCountCodec, ObkvCodec, RoaringBitmapCodec,
-    RoaringBitmapLenCodec, StrBEU32Codec, StrStrU8Codec,
+    RoaringBitmapLenCodec, StrBEU32Codec, StrStrU8Codec, UncheckedStrStrU8Codec,
 };
 pub use self::index::Index;
 pub use self::search::{

diff --git a/milli/src/update/index_documents/extract/extract_word_pair_proximity_docids.rs b/milli/src/update/index_documents/extract/extract_word_pair_proximity_docids.rs
@@ -142,6 +142,7 @@ fn document_word_positions_into_sorter<'b>(
         key_buffer.extend_from_slice(w1.as_bytes());
         key_buffer.push(0);
         key_buffer.extend_from_slice(w2.as_bytes());
+        key_buffer.push(0);
         key_buffer.push(prox as u8);
 
         word_pair_proximity_docids_sorter.insert(&key_buffer, &document_id.to_ne_bytes())?;

diff --git a/...x_pair_proximity_docids.rs/test_update/update/word_prefix_pair_proximity_docids.hash.snap b/...x_pair_proximity_docids.rs/test_update/update/word_prefix_pair_proximity_docids.hash.snap
@@ -0,0 +1,4 @@
+---
+source: milli/src/update/word_prefix_pair_proximity_docids.rs
+---
+5ed4bf83317b10962a55ade353427bdd
diff --git a/...prefix_pair_proximity_docids.rs/test_update/update/word_prefix_pair_proximity_docids.snap b/...prefix_pair_proximity_docids.rs/test_update/update/word_prefix_pair_proximity_docids.snap