Skip to content
This repository has been archived by the owner on Apr 4, 2023. It is now read-only.

Soft-deletion computation no longer depends on the mapsize #747

Merged
merged 6 commits into from
Dec 19, 2022

Conversation

dureuill
Copy link
Contributor

@dureuill dureuill commented Dec 15, 2022

Pull Request

Related issue

Related to meilisearch/meilisearch#3231: After removing --max-index-size, the mapsize will always be unrelated to the actual max size the user wants for their DB, so it doesn't make sense to use these values any longer.

This implements solution 2.3 from meilisearch/meilisearch#3231 (comment)

What does this PR do?

User-visible

  • Soft-deleted are no longer deleted when there is less than 10% of the mapsize available or when they take more than 10% of the mapsize
  • Instead, they are deleted when they are more soft deleted than regular documents, or when they take more than 1GiB disk space (estimated).

Implementation standpoint

  1. Adds a DeletionStrategy struct to replace the boolean disable_soft_deletion that we had up until now. This enum allows us to specify that we want "always hard", "always soft", or to use the dynamic soft-deletion strategy (default).
  2. Uses the current strategy when deleting documents, with the new heuristics being used in the DeletionStrategy::Dynamic variant.
  3. Updates the tests to use the appropriate DeletionStrategy whenever needed (one of AlwaysHard or AlwaysSoft depending on the test)

Note to reviewers: this PR is optimized for a commit-by-commit review.

PR checklist

Please check if your PR fulfills the following requirements:

  • Does this PR fix an existing issue, or have you listed the changes applied in the PR description (and why they are needed)?
  • Have you read the contributing guidelines?
  • Have you made sure that the title is accurate and descriptive of the changes?

Thank you so much for contributing to Meilisearch!

@dureuill dureuill added indexing Related to the documents/settings indexing algorithms. no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption labels Dec 15, 2022
@dureuill dureuill force-pushed the soft-deleted_computation_doesnt_use_max_size branch from 09952bc to 916c23e Compare December 19, 2022 09:10
@dureuill
Copy link
Contributor Author

dureuill commented Dec 19, 2022

Updated the PR:

  1. Added a DeletionStrategy struct to replace the boolean disable_soft_deletion that we had up until now. This enum allows us to specify that we want "always hard", "always soft", or to use the dynamic soft-deletion strategy (default).
  2. Use the current strategy when deleting documents, with the existing heuristics (then new heuristics) being used in the DeletionStrategy::Dynamic variant.
  3. Updated the tests to use the appropriate DeletionStrategy whenever needed (one of AlwaysHard or AlwaysSoft depending on the test)

These changes allow us not to modify the contents of the tests.

The PR description is up-to-date

@dureuill dureuill requested a review from loiclec December 19, 2022 10:56
irevoire
irevoire previously approved these changes Dec 19, 2022
Copy link
Member

@irevoire irevoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice thanks!

bors merge

milli/src/update/delete_documents.rs Outdated Show resolved Hide resolved
Comment on lines +190 to +210
let soft_deletion = match self.strategy {
DeletionStrategy::Dynamic => {
// decide to keep the soft deleted in the DB for now if they meet 2 criteria:
// 1. There is less than a fixed rate of 50% of soft-deleted to actual documents, *and*
// 2. Soft-deleted occupy an average of less than a fixed size on disk

let size_used = self.index.used_size()?;
let nb_documents = self.index.number_of_documents(self.wtxn)?;
let nb_soft_deleted = soft_deleted_docids.len();

(nb_soft_deleted < nb_documents) && {
const SOFT_DELETED_SIZE_BYTE_THRESHOLD: u64 = 1_073_741_824; // 1GiB

// nb_documents + nb_soft_deleted !=0 because if nb_documents is 0 we short-circuit earlier, and then we moved the documents to delete
// from the documents_docids to the soft_deleted_docids.
let estimated_document_size = size_used / (nb_documents + nb_soft_deleted);
let estimated_size_used_by_soft_deleted =
estimated_document_size * nb_soft_deleted;
estimated_size_used_by_soft_deleted < SOFT_DELETED_SIZE_BYTE_THRESHOLD
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, it's way easier to follow this way 🪨

@bors
Copy link
Contributor

bors bot commented Dec 19, 2022

Canceled.

Copy link
Member

@irevoire irevoire left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops I updated a comment

bors merge

@bors
Copy link
Contributor

bors bot commented Dec 19, 2022

Build succeeded:

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
indexing Related to the documents/settings indexing algorithms. no breaking The related changes are not breaking (DB nor API) performance Related to the performance in term of search/indexation speed or RAM/CPU/Disk consumption
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants