Skip to content
This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Optimise facets indexing #590

Merged
merged 4 commits into from
Aug 17, 2022
Merged

Optimise facets indexing #590

merged 4 commits into from
Aug 17, 2022

Conversation

loiclec
Copy link
Contributor

@loiclec loiclec commented Jul 20, 2022

Pull Request

What does this PR do?

Fixes #589

Notes

I added documentation for the whole module which attempts to explain the shape of the databases and their purpose. However, I realise there is already some documentation about this, so I am not sure if we want to keep it.

Benchmarks

We get a ~1.15x speed up on the geo_point benchmark.

group                                                                     indexing_main_57042355                  indexing_optimise-facets-indexation_5728619a
-----                                                                     ----------------------                  --------------------------------------------
indexing/-geo-delete-facetedNumber-facetedGeo-searchable-                 1.00  1862.7±294.45µs        ? ?/sec    1.58      2.9±1.32ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-           1.11      8.9±2.44ms        ? ?/sec     1.00      8.0±1.42ms        ? ?/sec
indexing/-movies-delete-facetedString-facetedNumber-searchable-nested-    1.00     12.8±3.32ms        ? ?/sec     1.32     16.9±6.98ms        ? ?/sec
indexing/-songs-delete-facetedString-facetedNumber-searchable-            1.09     43.8±4.78ms        ? ?/sec     1.00     40.3±3.79ms        ? ?/sec
indexing/-wiki-delete-searchable-                                         1.08   287.4±28.72ms        ? ?/sec     1.00    264.9±9.46ms        ? ?/sec
indexing/Indexing geo_point                                               1.14      61.2±0.39s        ? ?/sec     1.00      53.8±0.57s        ? ?/sec
indexing/Indexing movies in three batches                                 1.00      16.6±0.12s        ? ?/sec     1.00      16.5±0.10s        ? ?/sec
indexing/Indexing movies with default settings                            1.00      14.1±0.30s        ? ?/sec     1.00      14.0±0.28s        ? ?/sec
indexing/Indexing nested movies with default settings                     1.10      10.9±0.50s        ? ?/sec     1.00      10.0±0.10s        ? ?/sec
indexing/Indexing nested movies without any facets                        1.01       9.6±0.23s        ? ?/sec     1.00       9.5±0.06s        ? ?/sec
indexing/Indexing songs in three batches with default settings            1.07      66.3±0.55s        ? ?/sec     1.00      61.8±0.63s        ? ?/sec
indexing/Indexing songs with default settings                             1.03      58.8±0.82s        ? ?/sec     1.00      57.1±1.22s        ? ?/sec
indexing/Indexing songs without any facets                                1.00      53.6±1.09s        ? ?/sec     1.01      54.0±0.58s        ? ?/sec
indexing/Indexing songs without faceted numbers                           1.02      58.0±1.29s        ? ?/sec     1.00      57.1±1.43s        ? ?/sec
indexing/Indexing wiki                                                    1.00   1064.1±21.20s        ? ?/sec     1.00   1068.0±20.49s        ? ?/sec
indexing/Indexing wiki in three batches                                   1.00    1182.5±9.62s        ? ?/sec     1.01   1191.2±10.96s        ? ?/sec
indexing/Reindexing geo_point                                             1.12      68.0±0.21s        ? ?/sec     1.00      60.5±0.82s        ? ?/sec
indexing/Reindexing movies with default settings                          1.01      14.1±0.21s        ? ?/sec     1.00      14.0±0.26s        ? ?/sec
indexing/Reindexing songs with default settings                           1.04      61.6±0.57s        ? ?/sec     1.00      59.2±0.87s        ? ?/sec
indexing/Reindexing wiki                                                  1.00   1734.0±11.38s        ? ?/sec     1.01   1746.6±22.48s        ? ?/sec

@loiclec loiclec added indexing Related to the documents/settings indexing algorithms. no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption labels Jul 20, 2022
@loiclec loiclec force-pushed the optimise-facets-indexation branch from a2c8bdf to 5728619 Compare July 20, 2022 07:50
@loiclec loiclec assigned loiclec and ManyTheFish and unassigned loiclec Jul 20, 2022
@loiclec loiclec marked this pull request as draft August 3, 2022 16:37
@loiclec
Copy link
Contributor Author

loiclec commented Aug 3, 2022

Note: I'm putting this PR back into draft status as I don't think the improvement is good enough.

The problem is that after adding just one document, we need to iterate over the whole database and perform a ton of bitmap operations, which is very slow. Ideally, the algorithm should be incremental and modify the database in-place instead of clearing it and rebuilding it from scratch. I am starting to think about what a proper solution will be and will update this PR later.

@loiclec
Copy link
Contributor Author

loiclec commented Aug 15, 2022

After looking a bit more into it, I realised there is a problem with the current structure of index.facet_id_string_docids. The elements of all levels > 0 have a pair of indices as key. These indices point to elements of level 0. For example:

┌───────┐   ┌───────────────────────────────────────────────────────────────────────────────┐
│  all  │   │                       [a, b, c, d, e, f, g, u, y, z]                          │
└───────┘   └───────────────────────────────────────────────────────────────────────────────┘

            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             0 – 3             │             4 – 7             │     8 – 9     │
│Level 2│   │                               │                               │               │
└───────┘   │        [a, b, d, f, z]        │        [c, d, e, f, g]        │    [u, y]     │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     0 – 1     │     2 – 3     │     4 – 5     │     6 – 7     │     8 – 9     │
│Level 1│   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
└───────┘   │ [a, b, d, z]  │   [a, b, f]   │   [c, d, g]   │    [e, f]     │    [u, y]     │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │ [a, b]│ [d, z]│ [b, f]│ [a, f]│ [c, d]│  [g]  │  [e]  │ [e, f]│  [y]  │  [u]  │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

As a result, adding an element at level 0 automatically invalidates all its parents and its parents' successors. That means the minimum complexity of adding an element is Ω(size_level_0), which is too high.

I think this could be solved by using strings directly instead of indices as keys, like this:

┌───────┐   ┌───────────────────────────────────────────────────────────────────────────────┐
│  all  │   │                         a, b, c, d, e, f, g, u, y , z                         │
└───────┘   └───────────────────────────────────────────────────────────────────────────────┘
                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │         "ab" – "bac"          │         "gaf" – "wow"         │ "woz" – "zz"  │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │  "ab" – "ac"  │ "ba" – "bac"  │ "gaf" – "gal" │"form" – "wow" │ "woz" – "zz"  │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

But as explained in search/facet/facet_string.rs, this would result in very long keys, and we would need to reduce the maximum allowed size of facet strings by half, to approximately 250 bytes (the max key size of LMDB is 511 bytes). If that is a big problem, we could consider only storing the start bound as a key maybe:

┌───────┐   ┌───────────────────────────────────────────────────────────────────────────────┐
│  all  │   │                         a, b, c, d, e, f, g, u, y , z                         │
└───────┘   └───────────────────────────────────────────────────────────────────────────────┘
                                                                                             
            ┌───────────────────────────────┬───────────────────────────────┬───────────────┐
┌───────┐   │             "ab"              │             "gaf"             │     "woz"     │
│Level 2│   │                               │                               │               │
└───────┘   │         a, b, d, f, z         │         c, d, e, f, g         │     u, y      │
            ├───────────────┬───────────────┼───────────────┬───────────────┼───────────────┤
┌───────┐   │     "ab"      │     "ba"      │     "gaf"     │    "form"     │     "woz"     │
│Level 1│   │               │               │               │               │               │
└───────┘   │  a, b, d, z   │    a, b, f    │    c, d, g    │     e, f      │     u, y      │
            ├───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┼───────┬───────┤
┌───────┐   │  "ab" │  "ac" │  "ba" │ "bac" │ "gaf" │ "gal" │ "form"│ "wow" │ "woz" │  "zz" │
│Level 0│   │  "AB" │ " Ac" │ "ba " │ "Bac" │ " GAF"│ "gal" │ "Form"│ " wow"│ "woz" │  "ZZ" │
└───────┘   │  a, b │  d, z │  b, f │  a, f │  c, d │   g   │   e   │  e, f │   y   │   u   │
            └───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┴───────┘

But I suspect it would make it difficult to use and/or create this structure.

@loiclec loiclec changed the title Optimise facets indexation Optimise facets indexing Aug 16, 2022
@loiclec
Copy link
Contributor Author

loiclec commented Aug 16, 2022

After thinking about it more, I think this PR should be reviewed as is for now, because:

  1. It still brings a fairly large improvement in indexing speed
  2. Adding code to incrementally update the database will be complicated and take a long time
  3. Even after we have the ability to incrementally update the DB, we will still need to use this bit of code to build the DB from scratch

milli/src/update/facets.rs Show resolved Hide resolved
milli/src/update/facets.rs Show resolved Hide resolved
@loiclec loiclec force-pushed the optimise-facets-indexation branch from 5728619 to 3968790 Compare August 17, 2022 10:43
@loiclec loiclec force-pushed the optimise-facets-indexation branch from 6e0fec6 to 6cc9757 Compare August 17, 2022 10:59
@loiclec loiclec marked this pull request as ready for review August 17, 2022 11:30
@loiclec loiclec requested a review from Kerollmops August 17, 2022 11:30
Copy link
Member

@Kerollmops Kerollmops left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much!
bors merge

@bors
Copy link
Contributor

bors bot commented Aug 17, 2022

@bors bors bot merged commit 39869be into main Aug 17, 2022
@bors bors bot deleted the optimise-facets-indexation branch August 17, 2022 12:01
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
indexing Related to the documents/settings indexing algorithms. no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Optimise indexing of facet databases
3 participants