
SSTableIndex could be a multi layer SSTable #1948

Open
fulmicoton opened this issue Mar 21, 2023 · 3 comments
Comments

@fulmicoton
Collaborator

fulmicoton commented Mar 21, 2023

Right now we use the SSTable format to store block "checkpoints" in a Vec and run a binary search over them.

An alternative could be to store several layers of SSTable:

  • Layer 1: like the current SSTableIndex. Checkpoint every N / B docs.
  • Layer 2: checkpoints into layer 1. N / B^2 entries.
  • Layer 3: checkpoints into layer 2. N / B^3 entries.
  • ...

Thanks to the geometric series, the overhead of that stack of layers is minor for B = 16, for instance.
We would transform the binary search into a coarse-to-fine linear search, with strong locality.
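For illustration, here is a minimal sketch of the layered scheme described above. All names, the `u64` checkpoint representation, and the in-memory `Vec` layout are assumptions for the sketch, not tantivy's actual SSTable format; the point is the coarse-to-fine descent that scans at most B contiguous entries per layer:

```rust
// Branching factor suggested in the issue.
const B: usize = 16;

/// Build layers bottom-up: layer 0 holds all checkpoints, and each upper
/// layer keeps every B-th entry of the layer below, until a layer is small
/// enough to scan directly.
fn build_layers(checkpoints: &[u64]) -> Vec<Vec<u64>> {
    let mut layers = vec![checkpoints.to_vec()];
    while layers.last().unwrap().len() > B {
        let next: Vec<u64> = layers.last().unwrap().iter().step_by(B).copied().collect();
        layers.push(next);
    }
    layers
}

/// Coarse-to-fine lookup: index of the last checkpoint <= key.
/// Each step narrows the search to B contiguous entries of the next layer,
/// so the scans have strong locality.
fn locate(layers: &[Vec<u64>], key: u64) -> usize {
    let mut idx = 0usize; // position within the current (coarser) layer
    for layer in layers.iter().rev() {
        let start = idx * B;
        let end = (start + B).min(layer.len());
        // Linear scan of at most B sorted entries.
        let count = layer[start..end].iter().take_while(|&&ck| ck <= key).count();
        idx = start + count.saturating_sub(1);
    }
    idx
}

fn main() {
    let checkpoints: Vec<u64> = (0..10_000).map(|i| i * 100).collect();
    let layers = build_layers(&checkpoints);
    // Upper layers hold at most n/B + n/B^2 + ... = n/(B-1) extra entries
    // (plus rounding), i.e. ~6.7% overhead for B = 16.
    let overhead: usize = layers[1..].iter().map(|l| l.len()).sum();
    assert!(overhead <= checkpoints.len() / (B - 1) + layers.len());
    assert_eq!(locate(&layers, 123_456), 1234);
    println!("layers: {}, overhead entries: {}", layers.len(), overhead);
}
```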

The main benefit would be to make opening a term dictionary very cheap, regardless of the number of blocks.

@trinity-1686a @PSeitz let me know if this makes sense. To know whether this is useful, we need to know
how expensive it is to open an sstable with 10 million terms today.
To accept such a change, we will also need a bench on ord_to_term (if it does not exist already).

@fulmicoton
Collaborator Author

Assigned to @trinity-1686a for the moment, but we don't even know if we want this.

@trinity-1686a
Contributor

In the flamegraph PSeitz linked in #1946 (comment), Dictionary::open is very fast (0.42% of samples), whereas Dictionary::ord_to_term accounts for 9.94% of samples. A naive approach could make ord_to_term N times slower (with N the number of layers).
On some other workloads (e.g. a simple TermQuery), Dictionary::open is probably most of the cost, so it could then make sense to do that kind of change (or maybe it doesn't).
When querying the index, the opened block should probably be cached to amortize that cost on workloads which query the index a lot (the columnar case).
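The caching idea above could look something like this minimal sketch. The `BlockCache` type, the `u64` offset key, and the decoded-block representation are all hypothetical, invented for illustration; the point is only that the decode cost is paid once per block and amortized across repeated lookups:

```rust
use std::collections::HashMap;

/// Hypothetical cache of decoded index blocks, keyed by byte offset.
/// Repeated lookups (the columnar case) reuse the decoded block instead of
/// re-opening and re-decoding it.
struct BlockCache {
    blocks: HashMap<u64, Vec<u64>>, // block offset -> decoded checkpoints
    hits: usize,
    misses: usize,
}

impl BlockCache {
    fn new() -> Self {
        BlockCache { blocks: HashMap::new(), hits: 0, misses: 0 }
    }

    /// Return the decoded block, invoking `decode` only on a cache miss.
    fn get_or_decode<F>(&mut self, offset: u64, decode: F) -> &Vec<u64>
    where
        F: FnOnce() -> Vec<u64>,
    {
        if self.blocks.contains_key(&offset) {
            self.hits += 1;
        } else {
            self.misses += 1;
            self.blocks.insert(offset, decode());
        }
        &self.blocks[&offset]
    }
}

fn main() {
    let mut cache = BlockCache::new();
    for _ in 0..3 {
        // The expensive decode closure runs only once for this offset.
        let block = cache.get_or_decode(4096, || (0..16).map(|i| i * 100).collect());
        assert_eq!(block.len(), 16);
    }
    assert_eq!((cache.hits, cache.misses), (2, 1));
    println!("hits: {}, misses: {}", cache.hits, cache.misses);
}
```

A real implementation would bound the cache size (e.g. LRU eviction) rather than growing the map without limit.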

@trinity-1686a
Contributor

Unassigning myself for now. While there is possibly something to gain for large segments, I believe there are probably more impactful changes to explore.
