OutOfMemory exception with low system memory usage #359
We're running with thousands of different indexes on a shared box, but we're using process pools instead of your async handling. We don't see memory growth in these long-lived processes (that access different indexes over time), but to be fair we have not yet upgraded to 0.22.

As for suggestions, this might be a good application of the Fil memory profiler: https://pythonspeed.com/fil/docs/index.html

In essence, you can run your application in a staging environment with a copy of your "real" index, do what you need to do to make memory grow, and then, when memory is high, kill the process. The memory profiler will dump a file showing where allocations were held. You get a flamegraph view.
It would also be good to document more specifics about your situation, i.e., package version, platform, etc.
There was an actual memory leak a few versions back, in PyO3: PyO3/pyo3#3285. So it is important to know which version you're using. There are also a few other memory leak issues on the PyO3 tracker, so you might have a look there too to see if anything might be affecting you.
@mskarlin is this still an issue for you? Without a reproducer there isn't much we can do, I think.
Hi @cjrh, yes it is still an issue (I work with @mskarlin on https://github.com/Future-House/paper-qa), sorry for the delayed response. I hit it with
I also seem to get it on
Using this answer, I run:
There is a series of successful calls before the failure. Please let me know how to investigate this; I am not sure what to look into next. One detail that may matter is that our index was built on a macOS machine and is being used on Ubuntu. Might that affect things?
Thanks for the detail. Unfortunately, I can't think of an obvious solution for this.
Are you able to run
Hmm, actually, I am slowly remembering something we saw in prod a while back. We also saw this error.

The files that share a UUID name make up a segment. So what we figured was happening might be that each segment's files are separately being mmap'ed, and eventually we just run out of virtual address space, causing the error.

So what we did, which worked, was we recreated the tantivy index with a much larger memory limit on the writer. Basically (handwavy explanation), the writer accumulates data in memory over multiple document additions, and when the specified memory limit is reached, it writes out that accumulated data as a segment. A higher memory limit means fewer segments. This resolved our problem, but we didn't do a deeper dive to see precisely why the original memory error was being raised in the first place. The virtual memory thing was just a hunch.

So perhaps the best option for you to try is to regenerate your problematic tantivy index with a much larger memory limit on the writer.
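For concreteness, here is a minimal sketch of what such a rebuild could look like. The schema, path, and the 1 GB figure are illustrative assumptions rather than values taken from this thread; the point is only that a bigger `heap_size` lets the writer buffer more documents before flushing a segment.

```python
import tantivy

# Illustrative schema; a real rebuild would use the schema the index was built with.
schema = tantivy.SchemaBuilder().add_text_field("body", stored=True).build()

# The target directory must already exist.
index = tantivy.Index(schema, path="/path/to/rebuilt-index")

# A larger writer heap means more documents are buffered in memory before a
# segment is flushed to disk, so the rebuilt index ends up with fewer,
# larger segments.
writer = index.writer(heap_size=1_000_000_000)  # ~1 GB

for text in ["document one", "document two"]:  # the real corpus goes here
    writer.add_document(tantivy.Document(body=text))

writer.commit()
```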
I just checked, we're currently using 1 GB as the heap setting for the writer. I am wondering whether I should set this as the default in this package. It might be better to default to a bigger value (avoiding the hard-to-debug OutOfMemory error) rather than the current default.
Thank you so much for the replies @cjrh, I am following what you're saying. For reference, my index is built from about 5.5k PDFs and comes to this many files:

```
> ls /path/to/index | wc -l
21001
> ls /path/to/index | head -n 8
000a60b7148d4e8285cd78737474a3da.fast
000a60b7148d4e8285cd78737474a3da.fieldnorm
000a60b7148d4e8285cd78737474a3da.idx
000a60b7148d4e8285cd78737474a3da.pos
000a60b7148d4e8285cd78737474a3da.store
000a60b7148d4e8285cd78737474a3da.term
000d73f244864638833d735c8a76ebd7.fast
000d73f244864638833d735c8a76ebd7.fieldnorm
```
Last night I rebuilt the index on the Ubuntu machine itself. What happens is:
So actually I don't think it's an "index across OSes" compatibility issue anymore.

This sounds quite reasonable, let's do this. Can you give me some rough guidance on how I can regenerate the index using a much larger memory limit? We instantiate our index:

```python
index: tantivy.Index
writer: tantivy.IndexWriter = index.writer()
```

Should I configure the writer differently here?
Firstly, @cjrh, I want to say thank you again for your help here; it is very much appreciated :)
I decided to be proactive here and took a guess at how to do this, making the below change and rebuilding the index:

```diff
  index: tantivy.Index
- writer: tantivy.IndexWriter = index.writer()
+ writer: tantivy.IndexWriter = index.writer(heap_size=1_000_000_000)
```

Then looking at the new index's size:

And that did not resolve the issue, I still get the OutOfMemory error:

How can I correctly adjust the writer's memory limit?
@jamesbraza I think from memory it is
This was the original issue that was raised for this: #118
Okay, thank you @wallies, I tried that just now (both positional-arg and keyword-arg variations of `heap_size`). Can you suggest more things to try?
So I am actually pretty stalled out due to this error, so I'm sharing a thought:
Perhaps there is some way tantivy
Do you think it could be
I don't know. But in my case, we're all on amd64.

So something is wrong here. Increasing the writer memory threshold should definitely result in fewer segments, unless there is some other reason your segments are small. How often are you calling commit?
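To illustrate why commit frequency matters, here is a hedged sketch (not the paper-qa code; the schema and document counts are made up). Each commit flushes whatever the writer has buffered into new segment files, so committing per document produces many tiny segments, while committing once at the end produces far fewer.

```python
import tempfile

import tantivy

schema = tantivy.SchemaBuilder().add_text_field("body", stored=True).build()

with tempfile.TemporaryDirectory() as path:
    index = tantivy.Index(schema, path)
    writer = index.writer()
    docs = [f"document number {i}" for i in range(5_000)]

    # Anti-pattern: committing after every document flushes a fresh segment
    # each time, which is how an index ends up with thousands of files:
    #
    #   for text in docs:
    #       writer.add_document(tantivy.Document(body=text))
    #       writer.commit()
    #
    # Committing once at the end lets the writer fill its heap before
    # flushing, so far fewer (and larger) segments are written:
    for text in docs:
        writer.add_document(tantivy.Document(body=text))
    writer.commit()
    writer.wait_merging_threads()
```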
Can you show a snippet of code with the loop where the index is being created (docs being added to a writer)?
Can you also show the output of
Here is a small test file I made to generate indexes, with a few of the settings configurable as command-line args:

```python
import argparse
import os
import random
import string
import tempfile

from tantivy import SchemaBuilder, Index, Document


def run(path, heap_size=15_000_000, doc_count=1000, string_length=10, num_text_fields=10) -> Index:
    text_fieldnames = [f"text{i}" for i in range(num_text_fields)]
    schema_b = SchemaBuilder().add_unsigned_field("order", fast=True)
    for fieldname in text_fieldnames:
        schema_b.add_text_field(fieldname, stored=True)
    schema = schema_b.build()

    index = Index(schema, path)
    writer = index.writer(heap_size)
    for i in range(doc_count):
        doc = Document()
        doc.add_unsigned("order", i)
        for fieldname in text_fieldnames:
            doc.add_text(fieldname, "".join(random.choices(string.ascii_lowercase, k=string_length)))
        writer.add_document(doc)
    writer.commit()
    writer.wait_merging_threads()
    index.reload()
    return index


def main(args):
    with tempfile.TemporaryDirectory() as path:
        index = run(path, args.heap, args.count, args.length, args.fields)
        # Print out some diagnostics
        items = os.listdir(path)
        print(len(items))
        print(items[:10])
        # Check that we can open the index
        index = Index.open(path)
        print(index)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Tantivy Python segment test")
    parser.add_argument("--heap", type=int, default=15_000_000, help="Heap size")
    parser.add_argument("--count", type=int, default=1000, help="Document count")
    parser.add_argument("--length", type=int, default=10, help="String length")
    parser.add_argument("--fields", type=int, default=10, help="Number of text fields")
    args = parser.parse_args()
    main(args)
```
Running this with larger settings than the defaults (Python 3.12, tantivy 0.22.0) takes a while, but produces a large index, nearly 10 GB. The important thing to note is that when it completes, there are 70 files in the index folder, not tens of thousands. During the indexing, the number of files goes up and down, because tantivy also merges segments together as part of the indexing process. This is me listing out the status of the tmp index folder a few times during the indexing:

So I definitely think that your problem is due to ending up with a very large number of files in your tantivy index. Perhaps you can run the above script on the mac and see whether you end up with a large number of files or not. And if not, compare what the script is doing to what our live code is doing.

If the same invocation of the script produces a large number of files on the mac but not on amd64, then that suggests there is a bug in the mac build somehow, and we can escalate to the experts in the tantivy project 😄
Once per input PDF; there are about 5.5k of them.
Yes, it's here: https://github.com/Future-House/paper-qa/blob/v5.3.0/paperqa/agents/search.py#L404-L481
Here is the Ubuntu server, which hits the OutOfMemory error:

Here is my Mac, which does not hit the error:
I just rebuilt the indexes, now using
It seems the trend does hold, but just not as dramatically as you stated:
Even with a 15 GB heap size, at inference time the error persists:
Please note that only a subset of the index files seem to lead to the OutOfMemory error.
I know this is a lot to take in, but can you please give some guidance on what to try next? Also, I want to restate that your efforts are very much appreciated 🙏
Do you think we should not be calling:

```python
writer.commit()
writer.wait_merging_threads()
index.reload()
```

We don't do
Okay I was talking in-person at FutureHouse and I want to highlight something to help make it clear:
With this new mental clarity, I decided to play around and discovered this (with an index built on the 5.5k papers using the default heap size):

```
In [1]: from paperqa.agents.search import SearchDocumentStorage, SearchIndex

In [2]: index_1 = await SearchIndex(index_name="normal-index").index

In [3]: index_2 = await SearchIndex(index_name="normal-index").index

In [4]: index_3 = await SearchIndex(index_name="normal-index").index

In [5]: index_4 = await SearchIndex(index_name="normal-index").index
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[5], line 1
----> 1 index_4 = await SearchIndex(index_name="normal-index").index

File /path/to/.venv/lib/python3.12/site-packages/paperqa/agents/search.py:176, in SearchIndex.index(self)
    174 index_path = await self.index_filename
    175 if await (index_path / "meta.json").exists():
--> 176     self._index = Index.open(path=str(index_path))  # type: ignore[assignment]
    177     print("success", self._index)
    178 else:
    179     # NOTE: this creates the above meta.json file

ValueError: Failed to open file for read: 'IoError { io_error: Os { code: 12, kind: OutOfMemory, message: "Cannot allocate memory" }, filepath: "/path/to/normal-index/index/9ac6cd3a75304ff794ef4377ae3ba1b2.store" }'
```

At this point, running

So, seeing this, can you help me figure out what is going wrong?
No
Yes, my strong suspicion is that you're hitting some kind of virtual memory limit from mmap calls being made for too many segments. So the physical memory in your machine doesn't matter.
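One way to sanity-check that hypothesis on Linux is to compare the process's number of memory mappings against the kernel's per-process limit. This is a hedged sketch (the PID is a placeholder): `mmap` fails with ENOMEM ("Cannot allocate memory") when `vm.max_map_count` is exceeded, which would surface as exactly this kind of OutOfMemory error even with plenty of free RAM.

```python
# Linux-only: count a process's memory mappings and compare against the
# kernel's per-process mapping limit (vm.max_map_count).
pid = 12345  # placeholder: the PID of the process that hits the error

with open(f"/proc/{pid}/maps") as f:
    mapped_regions = sum(1 for _ in f)  # one line per mapping

with open("/proc/sys/vm/max_map_count") as f:
    max_map_count = int(f.read())

print(f"mapped regions: {mapped_regions}, limit: {max_map_count}")
```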
These are still too many, I think. Either you should try to run the test script that I posted above and check whether you get a similar number of files as I do, or you will have to give me a script that I can easily run on my machine to verify that I get the large number of files that you see. If I can reproduce the issue on my computer, it is very likely I will figure out what is going on. But failing that, it's very difficult. A very simple reproducer that both you and I can run on our machines and compare results would be very helpful.

Out of interest, have you tried building the index on the Ubuntu server, and if so, how many segments (files) do you end up with in the index directory? Is it similar to what you end up with on the mac?

You must call `wait_merging_threads()`. The frequency of calling `commit()` also matters, since each commit can flush new segments to disk.
Allllright, coming from Future-House/paper-qa#628 (comment) on adding in use of `wait_merging_threads()`:

It turns out that when I re-ran the original code that led to the OutOfMemory error, it no longer occurs. I guess I am left wondering, how can tantivy-py make this easier for "the next dev"? Is there some way we can change the Python bindings to make this more seamless, or to make it obvious that `wait_merging_threads()` is needed?
For sure, the first port of call should be the documentation. The very first examples seen by a new user should include all the parts, including the merging threads. I can add this to the docs over the next few days for any that are missing.

In terms of changing the code to make it easier for users, one of the design goals we have is to mimic the tantivy (upstream crate) interface as much as possible, so that knowledge of how one works can be transferred over to the other. So I wouldn't want to change the existing calls themselves; for example, I wouldn't want to call `wait_merging_threads` inside the `commit` call.

However, we should be able to add new methods, perhaps in a new Python module namespace, that automate some of these typical state management steps. For example, we could expose a new context manager (which Rust doesn't have) that lets the user add docs, and then in the finalizer section of the context manager we can call `commit` and `wait_merging_threads`. This can be explained as a kind of Pythonic extension of the raw tantivy interface. I'll think about it more.

For now, I am happy to close this issue if you are, since the original issue is now solved.
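As a rough sketch of that idea (purely illustrative; `indexing_session` is a made-up name and this helper does not exist in tantivy-py today):

```python
from contextlib import contextmanager

import tantivy


@contextmanager
def indexing_session(index: tantivy.Index, heap_size: int = 1_000_000_000):
    """Yield a writer, then commit, wait for merges, and reload on exit."""
    writer = index.writer(heap_size)
    try:
        yield writer
    finally:
        # The finalizer performs the state management steps users often forget.
        writer.commit()
        writer.wait_merging_threads()
        index.reload()


# Usage sketch with an in-memory index (pass a path for an on-disk index):
schema = tantivy.SchemaBuilder().add_text_field("body", stored=True).build()
index = tantivy.Index(schema)
with indexing_session(index) as writer:
    writer.add_document(tantivy.Document(body="some document text"))
```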
We're struggling to diagnose an issue where we are instantiating several `Index` objects (built on the same shared directory) in asynchronous coroutines, and getting the following stack trace:

We have a relatively small index (~5 GB), and even with several instantiated `Index` objects, the system RAM usage is low (on a machine with 60 GB of memory, we're only seeing ~3 GB used). But when running with 4 concurrent coroutines we see the above issue.

In trying to reproduce, I made a tiny index with ~100 files and ran 100 simultaneous coroutines, each with their own `Index` object, and was not able to reproduce the error.

Do you have any suggestions on what to investigate next?