v1.0: Remove --max-index-size and --max-task-db configuration options #2077

Closed
4 tasks done
maryamsulemani97 opened this issue Jan 3, 2023 · 4 comments · Fixed by #2118

maryamsulemani97 (Contributor) commented Jan 3, 2023

Remove --max-index-size and --max-task-db configuration options

To do

Some may be missing

  • Update /learn/configuration/instance_options.md
  • Update learn/advanced/storage.md
  • Update reference/errors/error_codes.md
  • Update learn/advanced/known_limitations.md

Reference

dureuill (Contributor) commented Jan 10, 2023

Hey docs team 👋

The removal of these options introduces two limitations [1] that are "new" [2]:

  1. The number of indexes that can exist simultaneously in a Meilisearch DB becomes around 200 for Linux/macOS and around 20 for Windows. This is due to OS limits on the amount of virtual memory allocatable by a single process.
  2. The size of an index cannot grow beyond 500GiB.

Should we maybe document these limitations somewhere in the documentation?

EDIT: Oh, I'm seeing that Update learn/advanced/known_limitations.md is in the TODO list, so please disregard my message if this is already in the cards 🙏

Footnotes

  1. See this discussion for more context

  2. The tension between the maximum size of an index and the number of indexes has always existed, with a total maximum size of about 100TiB for the Unixes (Linux and macOS) and 10TiB for Windows. The v1 changes merely set these numbers in stone: since the max size of an index is now hardcoded to 500GiB, the resulting maximum number of indexes is about 200 for the Unixes and 20 for Windows (the arithmetic is sketched just below).
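
A rough back-of-envelope check of those figures (the ~100TiB and ~10TiB address-space sizes are the approximate values quoted in this thread, not exact OS constants):

```python
# Approximate address-space budget per OS family, divided by the
# hardcoded 500GiB per-index map size introduced in v1.
GIB = 2**30
TIB = 2**40

INDEX_MAP_SIZE = 500 * GIB

ADDRESS_SPACE = {
    "Linux/macOS": 100 * TIB,   # ~100TiB usable in practice
    "Windows": 10 * TIB,        # ~10TiB
}

for os_name, space in ADDRESS_SPACE.items():
    print(f"{os_name}: ~{space // INDEX_MAP_SIZE} indexes")
# Linux/macOS: ~204 indexes
# Windows: ~20 indexes
```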

maryamsulemani97 mentioned this issue Jan 10, 2023
guimachiavelli (Member) commented Jan 16, 2023

Hi @dureuill!

I'm working on updating these and realised I need some clarification on a few points.


Maximum number of indexes

Why do we say the maximum number of indexes in an instance is around 200 (for Unix/Unix-like systems)? Is it possible for, e.g., Linux machine A to have a maximum of 201 indexes while Linux machine B only supports 198?

More pragmatically, is the following statement correct?

"A single Meilisearch instance can have up to 200 indexes in Linux and macOS environments."


database_size_limit_reached

What will trigger this error? Reaching the maximum size for a single index? Reaching the maximum size for the task db? Trying to create more indexes than your system can support? All of those?

dureuill (Contributor) commented

Hi @guimachiavelli 👋

Maximum number of indexes

We're being imprecise about the exact number for the following reasons:

  1. the number was established empirically by running tests on an Archlinux system and a macOS system
  2. the number is tied to a very low-level configuration detail of the operating system: the size of the virtual memory address space available to a single process. This value is distinct from swap space, RAM amount, and available disk space, and I cannot point to where the precise value can be found (for example, on macOS, ulimit -v reports the "address space (kbytes)" limit as unlimited, yet I measured it to be around 100TB in practice with dichotomic tests); it may be necessary to read the kernel source code to find the precise value [1]. A rough way to query this limit from userspace is sketched after this list.
  3. The virtual memory address space is shared by the whole process. While indexes are the main consumers of the address space, any memory allocation that occurs during the lifecycle of the application takes from that shared pool. For example, Meilisearch allocates 2/3 of the machine's total RAM at startup [2]. On a machine with 8GB of RAM this takes about 5.33GB from the address space, which is insignificant next to the 500GB a single index takes, but on a machine with 128GB of RAM it takes about 85GB, almost 1/5 of an index, which can make the difference between having 201 or 200 indexes available.
  4. Address space fragmentation can result in the OS being unable to provide a contiguous 500GB region of virtual memory, even if the address space has enough free room in total, just not contiguously. This depends on the internal state of the OS allocator and the "history" of previous allocations, which is typically unique from one execution of Meilisearch to the next.
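
As an aside on point 2, a minimal sketch (Unix-only, using Python's standard resource module) of querying the per-process address-space limit; as noted above, it often reports "unlimited" even though a practical ceiling exists:

```python
import resource

# RLIMIT_AS is the per-process virtual address space limit. On many
# systems the soft/hard limits report "unlimited" even though a
# practical ceiling (e.g. ~128TiB of userspace on 64-bit Linux) applies.
soft, hard = resource.getrlimit(resource.RLIMIT_AS)

def fmt(limit):
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return f"{limit / 2**40:.1f} TiB"

print(f"RLIMIT_AS soft: {fmt(soft)}, hard: {fmt(hard)}")
```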

Due to these reasons, it is hard to give a firm limit on the number of indexes that can coexist in a Meilisearch instance. If we need a hard limit, it should be safe to pick a smaller number, e.g. 180, which would mean indexes take about 90TB from the address space, and it is unlikely that fragmentation and other allocations would entirely consume the remaining ~10TB; the headroom arithmetic is sketched below. This only works for the Unixes though, because the address space is much smaller on Windows, so fragmentation and other allocations can absolutely not be abstracted away there.
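
The headroom arithmetic behind that 180 figure, using the approximate numbers quoted in this thread (the ~100TB measured address space, and the hypothetical 128GB machine from point 3):

```python
# Rough headroom estimate for 180 indexes on a Unix-like system,
# using the approximate figures from this thread.
TB = 10**12
GB = 10**9

ADDRESS_SPACE = 100 * TB            # ~100TB measured usable address space
INDEX_MAP_SIZE = 500 * GB           # per-index virtual memory map
STARTUP_ALLOC = 128 * GB * 2 // 3   # 2/3 of RAM on a hypothetical 128GB machine

indexes = 180
used = indexes * INDEX_MAP_SIZE + STARTUP_ALLOC
print(f"used: {used / TB:.1f}TB, headroom: {(ADDRESS_SPACE - used) / TB:.1f}TB")
# used: 90.1TB, headroom: 9.9TB
```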

To summarize:

"A single Meilisearch instance can have up to 200 indexes in Linux and macOS environments."

is not a correct statement. A more conservative statement could be:

"A single Meilisearch instance can safely have up to 180 indexes in Linux and macOS environments. A greater number of indexes might also work without issue, or cause allocation failures depending on the runtime environment of the instance."

database_size_limit_reached

What will trigger this error?

Reaching the maximum size for a single index? ✅
Reaching the maximum size for the task db? ✅
Trying to create more indexes than your system can support? ❌

database_size_limit_reached is thrown when an underlying "database" reports that it has filled the virtual memory we allocated for it. A "database" here can refer to a single index, or to the task db.

Trying to create more indexes than your system can support will unfortunately not result in a clear user error: typically, the virtual memory allocation fails when documents are first sent to a freshly created index (the memory is not reserved before that point), reporting an OS-specific "allocation failure". On Windows, where the address space is much smaller, I have also observed unrelated allocations failing (such as further allocations needed to index documents).

I understand that the situation is subtle and not very user-friendly. The root cause is that we allocate upfront the whole address space an index might ever need, which forces us to choose an amount of virtual memory large enough to allow big indexes, but not so large that having multiple indexes becomes impossible.
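
To make the upfront reservation concrete, here is a minimal illustrative sketch (not Meilisearch or LMDB code) showing that a large file-backed memory map mostly consumes address space rather than RAM or disk:

```python
import mmap

# Illustration only (Unix-oriented): a file-backed memory map reserves
# virtual *address space* upfront, not RAM or disk.
MAP_SIZE = 500 * 2**30   # 500GiB, the hardcoded per-index map size in v1

with open("index.data", "w+b") as f:
    f.truncate(MAP_SIZE)                 # sparse file: no disk space used yet
    m = mmap.mmap(f.fileno(), MAP_SIZE)  # cheap, but 500GiB of the process
                                         # address space is now spoken for
    # Pages only consume RAM/disk when actually written. Repeating this
    # mapping ~200 times exhausts the ~100TiB address space on Linux/macOS.
    m.close()
```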

We're currently working on mitigations that would prevent such low-level system details from being exposed to the end-user (such as dynamically resizing the indexes so that they can start with a smaller virtual memory allocation, and closing unused indexes so that we don't have to keep all of them in the virtual memory space), but we didn't want to rush this for v1.

I hope my answer sheds some light on the current status; feel free to ask if you have further questions :-)

Footnotes

  1. This StackOverflow answer points to 128TiB of userspace virtual memory available to Linux programs. It is interesting that I measured less than that; I wonder if some of that virtual memory is used for other purposes.

  2. unless one provides the --max-indexing-memory option with a different value.

guimachiavelli (Member) commented

Thanks for the detailed answer, @dureuill! I think I have enough to move forward, but will soon request your review on the PR to make sure everything's accurate.

bors bot added a commit that referenced this issue Feb 6, 2023
2098: v1.0 r=guimachiavelli a=maryamsulemani97

Staging branch for v1.0.
Closes #2092, #2087, #2086, #2085, #2082, #2079, #2078, #2077, #2075, #2073, #2072, #2069, #2068, #2067, #2066, #2065

Co-authored-by: maryamsulemani97 <maryam@meilisearch.com>
Co-authored-by: gui machiavelli <hey@guimachiavelli.com>
Co-authored-by: Maryam <90181761+maryamsulemani97@users.noreply.github.com>
Co-authored-by: gui machiavelli <gui@meilisearch.com>
bors bot closed this as completed in c5dc37c Feb 6, 2023