
Choose a better threshold for the soft deletion #570

Closed
irevoire opened this issue Jun 28, 2022 · 4 comments · Fixed by #607
Labels
enhancement: New feature or request
performance: Related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption

Comments

@irevoire
Member

Once #557 is merged, the engine will wait until 10k documents have been soft-deleted before running a real deletion.
Since this number is so small it should not cause any issue, but the bigger this number gets, the bigger the perf improvement. BUT if this number gets too big, we're going to cancel indexing tasks saying there is no space left on the device when in reality a lot of documents are simply waiting to be deleted.

Thus I would like to think of a better threshold. Currently, my favourite idea is to calculate an approximate document size (the size of the DB divided by the number of documents + soft-deleted documents) and, if the soft-deleted documents reach 10% of the total space we can use, delete them for real.
Also, maybe this could be configurable. Maybe instead of 10% we should say something like « 100MB » (but in this case it should really be configurable).
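
As a rough illustration of that idea, here is a minimal sketch in Rust. All names and the 10% ratio are made up for the example; this is not the milli code.

```rust
// Hypothetical sketch of the proposed threshold; names are illustrative.
const SOFT_DELETED_RATIO: f64 = 0.10; // could be made configurable

fn should_run_real_deletion(
    db_size: u64,      // size of the DB on disk
    total_space: u64,  // total space we are allowed to use
    documents: u64,    // number of live documents
    soft_deleted: u64, // number of soft-deleted documents
) -> bool {
    let all_documents = documents + soft_deleted;
    if all_documents == 0 {
        return false;
    }
    // Approximate the size of one document, then of all soft-deleted ones.
    let avg_document_size = db_size as f64 / all_documents as f64;
    let soft_deleted_size = avg_document_size * soft_deleted as f64;

    soft_deleted_size >= total_space as f64 * SOFT_DELETED_RATIO
}
```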

@irevoire irevoire added the enhancement New feature or request label Jun 28, 2022
@curquiza curquiza added the performance Related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption label Jul 5, 2022
@irevoire
Member Author

irevoire commented Jul 11, 2022

Currently, we can't use the size of the disk (or estimate the size each document takes) because we don't have this information.
The only information we have is the size of the database on disk, including all the freed pages, and if we use that, we're going to allow more and more soft-deleted documents for the same number of indexed documents every time we re-index.

To make this work, we would need to add a new feature in heed that multiplies the size of a memory page by the number of leaf pages. Maybe by the number of leaf pages + branch pages + overflow pages?
I can use the mdb_env_stat function to aggregate my results over all databases.

This doesn’t work.
There is also an MDB_envinfo structure in lmdb, but it’s useless as well.
Neither structure changes when we index a file, which means they’re not measuring the data we need.
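
For reference, a minimal sketch of the computation that was being attempted, assuming `lmdb-sys`-style raw bindings are available: used space ≈ page size × (branch + leaf + overflow pages). Note that `mdb_env_stat` only describes the environment's main, unnamed database, which is likely why these numbers barely moved during indexing.

```rust
use std::mem;

use lmdb_sys as ffi; // assumed raw LMDB bindings (lmdb-sys style)

/// Estimate the space used as reported by mdb_env_stat:
/// page size × (branch pages + leaf pages + overflow pages).
unsafe fn used_bytes(env: *mut ffi::MDB_env) -> Result<usize, i32> {
    let mut stat: ffi::MDB_stat = mem::zeroed();
    let rc = ffi::mdb_env_stat(env, &mut stat);
    if rc != 0 {
        return Err(rc);
    }
    let pages = stat.ms_branch_pages + stat.ms_leaf_pages + stat.ms_overflow_pages;
    Ok(stat.ms_psize as usize * pages)
}
```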

@irevoire
Member Author

irevoire commented Aug 4, 2022

The last solution I can think of is to use the MDB_env structure ourselves to compute the number of free pages x the size of a page.
That would be a cool solution if only the structure were not private.

Thus there are three options from here:

  1. I missed something. @Kerollmops, I tried to ping @hyc on Twitter but it seems like he didn’t see my tweet. It would be cool if you could try to contact him as well. Here is what I sent him: https://twitter.com/irevoire/status/1554770698475905024
  2. We fork lmdb and introduce a function that computes the free space left for us.
    -> And if @hyc is ok with this idea, we try to merge that into the main branch.
  3. We compute the size directly from heed with a lot of very unsafe code. That should work as long as we don’t update lmdb 😩

@hyc

hyc commented Aug 4, 2022

Sounds like the best thing to do is to migrate the freespace code from mdb_stat.c into the library. But I don't see how that is actually a useful figure for the problem you're trying to solve here.
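
For context, the freespace code in mdb_stat.c walks LMDB's internal freelist database (DBI 0) and sums the page counts stored there; free bytes are then free pages × page size. A rough transposition to Rust over `lmdb-sys`-style raw bindings (an assumption, not the heed code) could look like this:

```rust
use std::{mem, ptr};

use lmdb_sys as ffi; // assumed raw LMDB bindings (lmdb-sys style)

/// Sum the free pages recorded in LMDB's freelist database (DBI 0) and
/// multiply by the page size; roughly what `mdb_stat -f` does.
unsafe fn free_bytes(env: *mut ffi::MDB_env) -> Result<usize, i32> {
    let mut stat: ffi::MDB_stat = mem::zeroed();
    let rc = ffi::mdb_env_stat(env, &mut stat);
    if rc != 0 {
        return Err(rc);
    }

    let mut txn: *mut ffi::MDB_txn = ptr::null_mut();
    let rc = ffi::mdb_txn_begin(env, ptr::null_mut(), ffi::MDB_RDONLY, &mut txn);
    if rc != 0 {
        return Err(rc);
    }

    let mut cursor: *mut ffi::MDB_cursor = ptr::null_mut();
    let rc = ffi::mdb_cursor_open(txn, 0 /* the freelist DBI */, &mut cursor);
    if rc != 0 {
        ffi::mdb_txn_abort(txn);
        return Err(rc);
    }

    let mut free_pages = 0usize;
    let mut key: ffi::MDB_val = mem::zeroed();
    let mut data: ffi::MDB_val = mem::zeroed();
    while ffi::mdb_cursor_get(cursor, &mut key, &mut data, ffi::MDB_NEXT) == 0 {
        // Each freelist value is a page-number list whose first word is the
        // number of entries that follow.
        let idl = data.mv_data as *const usize;
        free_pages += *idl;
    }

    ffi::mdb_cursor_close(cursor);
    ffi::mdb_txn_abort(txn);
    Ok(free_pages * stat.ms_psize as usize)
}
```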

@irevoire
Member Author

irevoire commented Aug 5, 2022

Oh, nice to see you here @hyc!

> But I don't see how that is actually a useful figure for the problem you're trying to solve here.

Well, basically, we want to get an idea of how much space we’re really using, to avoid freeing space when we don’t need to (= there is still a lot of available space).
But since we can’t rely on the size used by lmdb on disk, I was trying to recompute the size myself 😅

> Sounds like the best thing to do is to migrate the freespace code from mdb_stat.c into the library.

Is this something doable for you? Or should we include the code in our library?
I totally missed this file when I did my tests; I’ll look into it Tuesday!

bors bot added a commit that referenced this issue Aug 17, 2022
607: Better threshold r=Kerollmops a=irevoire

# Pull Request

## What does this PR do?
Fixes #570 

This PR tries to improve the threshold used to trigger the real deletion of documents.
The deletion is now triggered in two cases:
- 10% of the total available space is used by soft-deleted documents
- 90% of the total available space is used.

In this context, « total available space » means the `map_size` of lmdb.
And the size used by the soft-deleted documents is actually an estimation. We can't determine precisely the size used by one document, thus what we do is: take the total space used, divide it by the number of documents + soft-deleted documents to estimate the size of one average document, then multiply the size of one average document by the number of soft-deleted documents.

--------

<img width="808" alt="image" src="https://user-images.githubusercontent.com/7032172/185083075-92cf379e-8ae1-4bfc-9ca6-93b54e6ab4e9.png">

Here we can see that we end up with a ~10GB drift between the space used by the soft-deleted documents and the real space used by the documents.
Personally I don’t think that's a big issue because once the red line reaches 90GB everything will be freed, but now you know.

If you have an idea on how to improve this estimation I would love to hear it.
It looks like the difference is linear, so maybe we could simply multiply the current estimation by two?

Co-authored-by: Irevoire <tamo@meilisearch.com>
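
In code terms, the two triggers described in the PR boil down to something like the following sketch. Names are illustrative, not the actual milli code: `used_size` stands for the estimated space currently used by the database and `map_size` for lmdb's configured map size.

```rust
// Illustrative sketch of the two triggers described in the PR.
fn must_run_real_deletion(
    map_size: u64,  // lmdb's configured map size, i.e. « total available space »
    used_size: u64, // estimated space currently used by the database
    documents: u64,
    soft_deleted: u64,
) -> bool {
    let all_documents = documents + soft_deleted;
    if all_documents == 0 {
        return false;
    }
    // Estimate the size of one average document, then of all soft-deleted ones.
    let avg_document_size = used_size as f64 / all_documents as f64;
    let soft_deleted_size = avg_document_size * soft_deleted as f64;

    // Case 1: soft-deleted documents use ~10% of the total available space.
    // Case 2: 90% of the total available space is already used.
    soft_deleted_size >= 0.10 * map_size as f64
        || used_size as f64 >= 0.90 * map_size as f64
}
```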
bors bot closed this as completed in 79094bc on Aug 17, 2022