parallel_bulk does not seem to respect chunk_size + parallel_bulk memory leak? #1101
Comments
I found the problem with the chunk_size parameter not being respected - I had forgotten to cast the chunk_size input to an integer, so it was parsed as a string. Unfortunately I did not get an error about this. As for the memory issue: it works fine with streaming_bulk, which is stable at around 50 MB of memory usage. With parallel_bulk, however, memory keeps leaking/filling up.
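For anyone hitting the same silent failure, here is a minimal sketch of the fix, assuming the chunk size arrives as a command-line argument (the client URL, index name, and generator are placeholders, not from the original script):

```python
import argparse
from elasticsearch import Elasticsearch, helpers

parser = argparse.ArgumentParser()
# type=int guarantees chunk_size is an integer; in the report above a string
# like "500" was passed by mistake and produced no error.
parser.add_argument("--chunk-size", type=int, default=500)
args = parser.parse_args()

client = Elasticsearch("http://localhost:9200")  # placeholder URL

def generate_actions():
    # placeholder generator yielding index actions
    for i in range(10_000):
        yield {"_index": "my-index", "_source": {"value": i}}

for ok, info in helpers.parallel_bulk(
    client, generate_actions(), chunk_size=args.chunk_size
):
    if not ok:
        print(info)
```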
I have had similar memory issues with parallel_bulk, and those were also resolved by using streaming_bulk inside my own process pool.
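For reference, a minimal sketch of that pattern, assuming each worker process gets its own client and its own slice of the input (the URL, index name, and file-per-worker slicing are assumptions, not the commenter's actual code):

```python
from multiprocessing import Pool
from elasticsearch import Elasticsearch, helpers

_client = None

def init_worker():
    # Each worker process builds its own client; clients are not shared
    # across processes.
    global _client
    _client = Elasticsearch("http://localhost:9200")  # placeholder URL

def index_slice(path):
    # Stream one input file per worker; streaming_bulk sends one chunk at a
    # time, so memory stays bounded by chunk_size.
    def actions():
        with open(path) as f:
            for line in f:
                yield {"_index": "my-index", "_source": {"line": line.rstrip()}}

    ok_count = 0
    for ok, _ in helpers.streaming_bulk(_client, actions(), chunk_size=500):
        if ok:
            ok_count += 1
    return path, ok_count

if __name__ == "__main__":
    files = ["part-000.txt", "part-001.txt"]  # placeholder input slices
    with Pool(processes=2, initializer=init_worker) as pool:
        for path, ok_count in pool.imap_unordered(index_slice, files):
            print(path, ok_count)
```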
Same issue here with parallel_bulk: memory usage kept going up for all Python processes running parallel_bulk. Each process reached about 2 GB (I have 10 child processes) and the system ran out of memory. I had to switch back to streaming_bulk and increase the worker count. After that, each process only consumed 88 MB and stayed stable.
It does look like all the documents are loaded into memory before being distributed to workers. It would be better to use memory channels to distribute tasks.
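As an illustration of that idea (not how the helper is implemented today), here is a sketch that uses a bounded queue so the producer blocks instead of buffering everything; the worker count, chunk size, and generator are assumptions:

```python
import queue
import threading
from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")  # placeholder URL
tasks = queue.Queue(maxsize=8)  # backpressure: producer blocks when 8 chunks are queued
STOP = object()  # sentinel telling a worker to exit

def generate_actions():
    # placeholder generator yielding index actions one at a time
    for i in range(10_000):
        yield {"_index": "my-index", "_source": {"value": i}}

def worker():
    while True:
        chunk = tasks.get()
        if chunk is STOP:
            break
        helpers.bulk(client, chunk)  # one synchronous bulk request per chunk
        tasks.task_done()

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

chunk, size = [], 500
for action in generate_actions():
    chunk.append(action)
    if len(chunk) >= size:
        tasks.put(chunk)  # blocks once the queue is full, keeping memory bounded
        chunk = []
if chunk:
    tasks.put(chunk)
for _ in threads:
    tasks.put(STOP)
for t in threads:
    t.join()
```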
I have the same problem of memory filling up when using parallel_bulk. I see in the source code that parallel_bulk uses multiprocessing.pool.ThreadPool. In some experiments I found that the problem doesn't occur when using multiprocessing.Pool instead, so it might be a good idea to use multiprocessing.Pool in parallel_bulk, especially since ThreadPool is not documented.
Could you give a sample implementation of how you solved this issue? Using streaming_bulk instead of parallel_bulk doesn't seem to solve my memory issues. I am trying to index 60 million documents.
I had an issue with parallel_bulk leaking memory when called over and over again. I was hitting this on Python 3.11.0 but not on Python 3.11.1. I believe it was due to gh-99205, which was fixed in Python 3.11.1; creating a new ThreadPool on every parallel_bulk call was triggering that bug.
I am trying to parse files containing millions of lines, and I am using the helpers.parallel_bulk function for indexing data.
However, it seems that parallel_bulk does not respect the chunk_size parameter, and instead fills up my memory with all the data before it starts insertion.
Code excerpt (full script can be found here):
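(The excerpt itself is not reproduced here; the following is a minimal reconstruction of the pattern described below, with the client URL, index name, input file, and per-line document format as assumptions.)

```python
from elasticsearch import Elasticsearch, helpers

client = Elasticsearch("http://localhost:9200")  # assumption: local cluster

yields = 0  # counts how many actions the generator has produced

def generate_actions(path):
    global yields
    with open(path) as f:
        for line in f:
            yields += 1
            # assumption: one document per input line
            yield {"_index": "my-index", "_source": {"line": line.rstrip()}}

# parallel_bulk yields one (success, info) tuple per indexed action
for success, info in helpers.parallel_bulk(
    client, generate_actions("data.txt"), chunk_size=500, thread_count=4
):
    if success:
        print("indexed; yields so far:", yields)
    else:
        print("failed:", info)
```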
The "yields" variable counts all the yields done in the generator loop. If I specify a chunk_size of 500, it was of my understanding that the parallel_bulk function should start indexing once the chunk_size is reached? Instead it continues without inserting until all input is completely read (or at around 400.000 chunks). I have confirmed this by printing on success.
Perhaps I am missing something here, or is this expected behavior?