Choose a better threshold for the soft deletion #570
Currently, we can't use the size of the disk (or estimate the size each document takes) because we don't have this information. To make this work, we would need to add a new feature in heed that multiplies the size of a memory page by the number of leaf pages. Or maybe by the number of leaf pages + branch pages + overflow pages? This doesn't work.
The last solution I can think of is to use the `MDB_env` structure ourselves to compute the number of free pages × the size of a page. Thus there are three options from here:
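For what it's worth, here is a minimal Rust sketch of the « pages × page size » idea. The `DbStat` struct and its values are hypothetical stand-ins for the fields of LMDB's `MDB_stat` (page size plus leaf/branch/overflow page counts); heed would need to expose those values for this to work for real:

```rust
// A minimal sketch, assuming a hypothetical `DbStat` struct mirroring the
// fields of LMDB's `MDB_stat`; heed would have to expose these values.
struct DbStat {
    page_size: u64,      // size of a memory page in bytes
    leaf_pages: u64,     // number of leaf pages
    branch_pages: u64,   // number of branch pages
    overflow_pages: u64, // number of overflow pages
}

impl DbStat {
    // Rough estimate of the space a database really occupies:
    // every used page (leaf + branch + overflow) times the page size.
    fn used_bytes(&self) -> u64 {
        (self.leaf_pages + self.branch_pages + self.overflow_pages) * self.page_size
    }
}

fn main() {
    let stat = DbStat {
        page_size: 4096,
        leaf_pages: 12_000,
        branch_pages: 150,
        overflow_pages: 300,
    };
    println!("estimated used space: {} bytes", stat.used_bytes());
}
```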
Sounds like the best thing to do is to migrate the freespace code from mdb_stat.c into the library. But I don't see how that is actually a useful figure for the problem you're trying to solve here.
Oh, nice to see you here @hyc!
Well, basically, we want to get an idea of how much space we're really using so we can avoid freeing space when we don't need to (i.e. when there is still a lot of available space).
Is this something doable for you? Or should we include the code in our library?
607: Better threshold r=Kerollmops a=irevoire

# Pull Request

## What does this PR do?
Fixes #570

This PR tries to improve the threshold used to trigger the real deletion of documents. The deletion is now triggered in two cases:
- 10% of the total available space is used by soft-deleted documents
- 90% of the total available space is used

In this context, « total available space » means the `map_size` of LMDB.

The size used by the soft-deleted documents is actually an estimation. We can't determine precisely the size used by one document, so what we do is take the total space used, divide it by the number of documents + soft-deleted documents to estimate the size of one average document, and then multiply the size of one average document by the number of soft-deleted documents.

--------

<img width="808" alt="image" src="https://user-images.githubusercontent.com/7032172/185083075-92cf379e-8ae1-4bfc-9ca6-93b54e6ab4e9.png">

Here we can see a ~10GB drift at the end between the estimated space used by the soft-deleted documents and the real space used by the documents. Personally, I don't think that's a big issue, because once the red line reaches 90GB everything will be freed, but now you know. If you have an idea on how to improve this estimation, I would love to hear it. It looks like the difference is linear, so maybe we could simply multiply the current estimation by two?

Co-authored-by: Irevoire <tamo@meilisearch.com>
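To make the two conditions concrete, here is a hedged Rust sketch of the trigger logic as the PR describes it. The function name and parameters are illustrative, not the actual milli API; the inputs are assumed to be already known (`map_size` from the LMDB environment, the rest from the index metadata):

```rust
// A sketch of the trigger logic described above, not the actual milli code.
fn should_run_real_deletion(
    map_size: u64,     // total available space (the LMDB map_size)
    used_space: u64,   // space currently used by the whole database
    documents: u64,    // number of live documents
    soft_deleted: u64, // number of soft-deleted documents
) -> bool {
    let total_documents = documents + soft_deleted;
    if total_documents == 0 {
        return false;
    }

    // Estimate the size of one average document, then of all soft-deleted ones.
    let avg_document_size = used_space / total_documents;
    let soft_deleted_size = avg_document_size * soft_deleted;

    // Trigger the real deletion when the soft-deleted documents use more than
    // 10% of the map_size, or when the database uses more than 90% of it.
    soft_deleted_size > map_size / 10 || used_space > map_size / 100 * 90
}
```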
Once #557 is merged the engine will wait until 10k documents have been soft deleted before running a real deletion.
Since this number is so smol it should not cause any issue, but the bigger this number is, the bigger the perf improvement. BUT if this number gets too big, we're going to cancel indexing tasks saying there is no space left on the device when in reality there are a lot of documents waiting to be deleted.
Thus I would like to think of a better threshold. Currently, my favourite idea would be to calculate an approximate document size (the total size of the DB divided by the number of documents + soft-deleted documents) and, if the space used by the soft-deleted documents reaches 10% of the total space we can use, then we delete them for real.
Also, maybe this could be configurable. Maybe instead of 10% we should say something like « 100mb » (but in this case it should really be configurable).
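If it were made configurable, the option could look something like the hypothetical sketch below, which accepts either a fraction of the `map_size` or an absolute number of bytes; none of these names exist in milli today:

```rust
// A hypothetical sketch of a configurable threshold; not actual milli code.
enum SoftDeletionThreshold {
    /// Trigger once soft-deleted documents use this fraction of the map_size.
    Percentage(f64),
    /// Trigger once soft-deleted documents use this many bytes (e.g. 100 MB).
    Bytes(u64),
}

impl SoftDeletionThreshold {
    fn is_reached(&self, soft_deleted_size: u64, map_size: u64) -> bool {
        match self {
            SoftDeletionThreshold::Percentage(fraction) => {
                soft_deleted_size as f64 >= *fraction * map_size as f64
            }
            SoftDeletionThreshold::Bytes(limit) => soft_deleted_size >= *limit,
        }
    }
}
```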