Slow queries when index entries deleted #311

Closed
martinsumner opened this issue Mar 25, 2020 · 4 comments

martinsumner commented Mar 25, 2020

If there is a large leveled store, and a lot of index entries are added and then deleted, the time taken to complete an index query will be proportional to the number of removed entries, not the number of active entries.

This is because, in leveled, tombstones in the ledger are retained as the ledger is compacted, until the tombstone reaches the basement. Only when it reaches the basement is the tombstone reaped, as it is then clear that there are no further objects left for it to erase.

So if we add a {Key1, V1}
then alter by adding {Key1, V2}
then delete with {Key1, tomb}

The tomb must be maintained in the store even after it has merged through {Key1, V2}, so as not to resurrect {Key1, V1}. Only when a tomb is merged into the basement level do we know it has no more victims to kill, and so the tomb can be erased, i.e. the tomb will not be written to the basement level.
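As a minimal sketch of that rule (in Python rather than leveled's Erlang, with a made-up `merge_level` helper), the tomb has to survive intermediate merges and can only be dropped when the merge writes the basement:

```python
# Illustrative only: a toy two-run merge, not leveled's ledger compaction.
TOMB = object()  # sentinel standing in for {Key1, tomb}

def merge_level(newer, older, is_basement):
    """Merge two {key: value} runs, newer taking precedence.

    Tombstones are carried forward unless this merge writes the basement,
    where nothing older can remain beneath them."""
    merged = {}
    for key in sorted(set(newer) | set(older)):
        value = newer.get(key, older.get(key))
        if value is TOMB and is_basement:
            continue  # reap: no more victims to kill
        merged[key] = value
    return merged

upper = {"Key1": TOMB}   # the delete
mid = {"Key1": "V2"}     # the later value
lower = {"Key1": "V1"}   # the original value, still in a deeper level

step1 = merge_level(upper, mid, is_basement=False)
assert step1["Key1"] is TOMB       # tomb retained, or V1 would be resurrected
step2 = merge_level(step1, lower, is_basement=True)
assert "Key1" not in step2         # tomb reaped: not written to the basement
```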

Certain workloads may result in very infrequent merges to the basement, and hence in large files full of tombstones. Queries must repeatedly scan these files only to discover tombstones that contribute nothing to the result, so a query may take a long time to return even when there are very few actual results.
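To make that cost concrete, here is a rough sketch (hypothetical Python, not leveled's query path) of why the scan time tracks the number of tombstones rather than the size of the result:

```python
# Illustrative only: a flat sorted list stands in for the SST files scanned
# during an index query.
TOMB = object()

def index_range_query(entries, start, end):
    """entries: sorted (key, value) pairs; values may be the TOMB sentinel."""
    results, scanned = [], 0
    for key, value in entries:
        if start <= key <= end:
            scanned += 1                 # every tombstone still has to be read
            if value is not TOMB:
                results.append(key)
    return results, scanned

# 100,000 deleted index entries and a single live one.
entries = [(f"idx{i:06d}", TOMB) for i in range(100_000)] + [("idx100001", "live")]
hits, scanned = index_range_query(entries, "idx000000", "idx999999")
print(len(hits), scanned)  # 1 result, but 100,001 entries scanned
```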

@martinsumner

Within basho leveldb this situation could also arise, but it is more than likely resolved via grooming compactions, which proactively push files with large numbers of tombstones towards the basement.

https://github.com/basho/leveldb/wiki/Mv-aggressive-delete


martinsumner commented Mar 25, 2020

There may be an additional problem related to the switching of files. When a file in Level Basement + 1 is picked to be merged, if it doesn't overlap with any file in Level Basement, then it will be switched to the Basement rather than undergoing a merge.

This switching process would mean that tombstones are not reaped (unless a file in the future is merged from Basement + 1 which does overlap).
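A simplified sketch of that decision (hypothetical Python, with key ranges as (low, high) pairs, not leveled's actual manifest handling):

```python
# Illustrative only: decide whether a Basement + 1 file is merged or switched.
def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def push_down(candidate, basement_files):
    if not any(overlaps(candidate, f) for f in basement_files):
        return "switch"   # moved down whole: any tombstones it holds survive
    return "merge"        # rewritten against overlapping files: tombstones reaped

basement = [("a", "f"), ("m", "r")]
print(push_down(("g", "k"), basement))  # switch: no overlap, tombs kept
print(push_down(("e", "h"), basement))  # merge: overlaps ("a", "f")
```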

This is a problem that we have not seen in production, and one that would require some peculiar circumstances to create.

The more general problem of tombstones at Basement + 1 has been observed in production.


martinsumner commented Mar 26, 2020

There are a number of alternatives here:

  1. Treat index keys differently, as they don't have modifications - and drop them on the first merge with an older key (as they have no value).

  2. Force a proportion of compactions to be grooming compactions (ones that favour the SST file with the most tombstones).

  3. Add additional grooming compactions (potentially triggered only when the clerk has had no work for a period).

Option 1 may encounter issues with TTL and index entries. The TTL is the value, so potentially you could have updated index entries where the TTL changes. This is not part of the Riak implementation, but it exists as a theoretical possibility in non-Riak implementations. It is also a more involved code change than the alternatives.

Option 2 will make things better without necessarily solving the problem.

Option 3 may lead to fluctuations in performance. With Option 3 the shape of the LSM tree would change out of hours (e.g. there would be gaps above the basement). This means that as load then increases, initial performance would improve, as new merges into the top of the tree would not force a cascade while there is space in the tree. This is an improvement, but also a disadvantage, as the supportable load may be less predictable and hence harder to measure.
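For Option 2, the selection change amounts to something like the following sketch (hypothetical Python with invented field names, not leveled's clerk code): a configured proportion of compactions pick the file carrying the most tombstones instead of the usual candidate.

```python
# Illustrative only: grooming vs regular compaction candidate selection.
import random

def choose_candidate(files, grooming_ratio=0.25):
    """files: list of dicts with 'name' and 'tombstone_count' (invented fields)."""
    if random.random() < grooming_ratio:
        # grooming compaction: favour the SST file with the most tombstones
        return max(files, key=lambda f: f["tombstone_count"])
    return files[0]  # placeholder for the normal selection policy

files = [
    {"name": "sst_1", "tombstone_count": 12},
    {"name": "sst_2", "tombstone_count": 40_000},
    {"name": "sst_3", "tombstone_count": 310},
]
print(choose_candidate(files)["name"])
```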

@martinsumner

Option 2 progressed for now
