
Choose a better threshold for the soft deletion #570

Closed
irevoire opened this issue Jun 28, 2022 · 4 comments · Fixed by #607
Labels
enhancement: New feature or request
performance: Related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption

Comments

@irevoire
Member

Once #557 is merged, the engine will wait until 10k documents have been soft-deleted before running a real deletion.
Since this number is so small it should not cause any issue, but the bigger this number gets, the bigger the perf improvement. BUT if this number gets too big, we're going to cancel indexing tasks saying there is no space left on the device when in reality a lot of documents are simply waiting to be deleted.

Thus I would like to think of a better threshold. Currently, my favourite idea is to calculate an approximate document size (the size of the DB divided by the number of documents + soft-deleted documents) and, if the soft-deleted documents reach 10% of the total space we can use, delete them for real.
Also, maybe this could be configurable. Maybe instead of 10% we should say something like « 100MB » (but in this case it should really be configurable).
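
As a rough illustration of that idea, here is a minimal sketch in Rust. All names and the 10% ratio are made up for the example; this is not the milli code.

```rust
// Hypothetical sketch of the proposed threshold; names are illustrative.
const SOFT_DELETED_RATIO: f64 = 0.10; // could be made configurable

fn should_run_real_deletion(
    db_size: u64,      // size of the DB on disk
    total_space: u64,  // total space we are allowed to use
    documents: u64,    // number of live documents
    soft_deleted: u64, // number of soft-deleted documents
) -> bool {
    let all_documents = documents + soft_deleted;
    if all_documents == 0 {
        return false;
    }
    // Approximate the size of one document, then of all soft-deleted ones.
    let avg_document_size = db_size as f64 / all_documents as f64;
    let soft_deleted_size = avg_document_size * soft_deleted as f64;

    soft_deleted_size >= total_space as f64 * SOFT_DELETED_RATIO
}
```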

@irevoire irevoire added the enhancement New feature or request label Jun 28, 2022
@curquiza curquiza added the performance Related to the performance in terms of search/indexation speed or RAM/CPU/Disk consumption label Jul 5, 2022
@irevoire
Member Author

irevoire commented Jul 11, 2022

Currently, we can't use the size of the disk (or estimate the size each document takes) because we don't have this information.
The only information we have is the size of the database on disk, including all the freed pages, and if we use that, we're going to allow more and more soft-deleted documents for the same number of indexed documents every time we re-index.

To make this work, we would need to add a new feature in heed that multiplies the size of a memory page by the number of leaf pages. Maybe by the number of leaf pages + branch pages + overflow pages?
I can use the mdb_env_stat function to aggregate my results over all databases.

This doesn’t work.
There is also an MDB_envinfo structure in lmdb, but it’s useless as well.
Neither structure changes when we index a file, which means they’re not measuring the data we need.
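
For reference, a minimal sketch of the computation that was being attempted, assuming `lmdb-sys`-style raw bindings are available: used space ≈ page size × (branch + leaf + overflow pages). Note that `mdb_env_stat` only describes the environment's main, unnamed database, which is likely why these numbers barely moved during indexing.

```rust
use std::mem;

use lmdb_sys as ffi; // assumed raw LMDB bindings (lmdb-sys style)

/// Estimate the space used as reported by mdb_env_stat:
/// page size × (branch pages + leaf pages + overflow pages).
unsafe fn used_bytes(env: *mut ffi::MDB_env) -> Result<usize, i32> {
    let mut stat: ffi::MDB_stat = mem::zeroed();
    let rc = ffi::mdb_env_stat(env, &mut stat);
    if rc != 0 {
        return Err(rc);
    }
    let pages = stat.ms_branch_pages + stat.ms_leaf_pages + stat.ms_overflow_pages;
    Ok(stat.ms_psize as usize * pages)
}
```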

@irevoire
Member Author

irevoire commented Aug 4, 2022

The last solution I can think of is to use the MDB_env structure ourselves to compute the number of free pages x the size of a page.
That would be a cool solution if only the structure were not private.

Thus there are three options from here:

  1. I missed something. @Kerollmops, I tried to ping @hyc on Twitter but it seems like he didn’t see my tweet. It would be cool if you could try to contact him as well. Here is what I sent him: https://twitter.com/irevoire/status/1554770698475905024
  2. We fork lmdb and introduce a function that computes the free space left for us.
    -> And if @hyc is ok with this idea, we try to merge that into the main branch.
  3. We compute the size directly from heed with a lot of very unsafe code. That should work as long as we don’t update lmdb 😩

@hyc

hyc commented Aug 4, 2022

Sounds like the best thing to do is to migrate the freespace code from mdb_stat.c into the library. But I don't see how that is actually a useful figure for the problem you're trying to solve here.
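
For context, the freespace code in mdb_stat.c walks LMDB's internal freelist database (DBI 0) and sums the page counts stored there; free bytes are then free pages × page size. A rough transposition to Rust over `lmdb-sys`-style raw bindings (an assumption, not the heed code) could look like this:

```rust
use std::{mem, ptr};

use lmdb_sys as ffi; // assumed raw LMDB bindings (lmdb-sys style)

/// Sum the free pages recorded in LMDB's freelist database (DBI 0) and
/// multiply by the page size; roughly what `mdb_stat -f` does.
unsafe fn free_bytes(env: *mut ffi::MDB_env) -> Result<usize, i32> {
    let mut stat: ffi::MDB_stat = mem::zeroed();
    let rc = ffi::mdb_env_stat(env, &mut stat);
    if rc != 0 {
        return Err(rc);
    }

    let mut txn: *mut ffi::MDB_txn = ptr::null_mut();
    let rc = ffi::mdb_txn_begin(env, ptr::null_mut(), ffi::MDB_RDONLY, &mut txn);
    if rc != 0 {
        return Err(rc);
    }

    let mut cursor: *mut ffi::MDB_cursor = ptr::null_mut();
    let rc = ffi::mdb_cursor_open(txn, 0 /* the freelist DBI */, &mut cursor);
    if rc != 0 {
        ffi::mdb_txn_abort(txn);
        return Err(rc);
    }

    let mut free_pages = 0usize;
    let mut key: ffi::MDB_val = mem::zeroed();
    let mut data: ffi::MDB_val = mem::zeroed();
    while ffi::mdb_cursor_get(cursor, &mut key, &mut data, ffi::MDB_NEXT) == 0 {
        // Each freelist value is a page-number list whose first word is the
        // number of entries that follow.
        let idl = data.mv_data as *const usize;
        free_pages += *idl;
    }

    ffi::mdb_cursor_close(cursor);
    ffi::mdb_txn_abort(txn);
    Ok(free_pages * stat.ms_psize as usize)
}
```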

@irevoire
Member Author

irevoire commented Aug 5, 2022

Oh, nice to see you here @hyc!

> But I don't see how that is actually a useful figure for the problem you're trying to solve here.

Well, basically, we want to get an idea of how much space we’re really using, to avoid freeing space when we don’t need to (= there is still a lot of available space).
But since we can’t rely on the size used by lmdb on disk, I was trying to recompute the size myself 😅

> Sounds like the best thing to do is to migrate the freespace code from mdb_stat.c into the library.

Is this something doable for you? Or should we include the code in our library?
I totally missed this file when I did my tests; I’ll look into it Tuesday!

bors bot added a commit that referenced this issue Aug 17, 2022
607: Better threshold r=Kerollmops a=irevoire

# Pull Request

## What does this PR do?
Fixes #570 

This PR tries to improve the threshold used to trigger the real deletion of documents.
The deletion is now triggered in two cases:
- 10% of the total available space is used by soft-deleted documents
- 90% of the total available space is used.

In this context, « total available space » means the `map_size` of lmdb.
And the size used by the soft-deleted documents is actually an estimation. We can't determine precisely the size used by one document, thus what we do is: take the total space used, divide it by the number of documents + soft-deleted documents to estimate the size of one average document, then multiply the size of one average document by the number of soft-deleted documents.

--------

<img width="808" alt="image" src="https://user-images.githubusercontent.com/7032172/185083075-92cf379e-8ae1-4bfc-9ca6-93b54e6ab4e9.png">

Here we can see that we end up with a ~10GB drift between the space used by the soft-deleted documents and the real space used by the documents.
Personally I don’t think that's a big issue because once the red line reaches 90GB everything will be freed, but now you know.

If you have an idea on how to improve this estimation I would love to hear it.
It looks like the difference is linear, so maybe we could simply multiply the current estimation by two?

Co-authored-by: Irevoire <tamo@meilisearch.com>
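
In code terms, the two triggers described in the PR boil down to something like the following sketch. Names are illustrative, not the actual milli code: `used_size` stands for the estimated space currently used by the database and `map_size` for lmdb's configured map size.

```rust
// Illustrative sketch of the two triggers described in the PR.
fn must_run_real_deletion(
    map_size: u64,  // lmdb's configured map size, i.e. « total available space »
    used_size: u64, // estimated space currently used by the database
    documents: u64,
    soft_deleted: u64,
) -> bool {
    let all_documents = documents + soft_deleted;
    if all_documents == 0 {
        return false;
    }
    // Estimate the size of one average document, then of all soft-deleted ones.
    let avg_document_size = used_size as f64 / all_documents as f64;
    let soft_deleted_size = avg_document_size * soft_deleted as f64;

    // Case 1: soft-deleted documents use ~10% of the total available space.
    // Case 2: 90% of the total available space is already used.
    soft_deleted_size >= 0.10 * map_size as f64
        || used_size as f64 >= 0.90 * map_size as f64
}
```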
bors bot closed this as completed in 79094bc on Aug 17, 2022