Memory leak #282

Closed
richardgarnier opened this issue Oct 12, 2020 · 16 comments

@richardgarnier

I have noticed that jobs are leaking on the worker side when a worker adds jobs to the queue it is itself processing.
For clarity, if the worker object is constructed with new Worker('oom-test') and tries to do new Queue('oom-test').add(...), these jobs will leak.

However, if the worker adds jobs to a different queue (e.g. new Queue('oom-test-one').add(...)), it no longer leaks.

In order to reproduce, see the example project attached.

  1. If not present, create a folder named '/tmp' at the root of the project (this is where the heap snapshots will be stored).
  2. Run npm install.
  3. In one terminal, run node master.js; in another, node worker.js. Inspect the snapshots using Chrome and verify that they are not leaking.
  4. In one terminal, run node master.js; in another, node worker-leak.js. By inspecting the snapshots in Chrome, you can see that the 1MB strings are never collected and have an increasing distance.

bullmq-leak.zip
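
For reference, a minimal sketch of the pattern described above (not the attached project itself; the queue name, job names, payload size and concurrency are illustrative, assuming the standard BullMQ Queue/Worker/QueueEvents API):

  import { Queue, QueueEvents, Worker } from 'bullmq';

  const queue = new Queue('oom-test');
  const queueEvents = new QueueEvents('oom-test');

  // The worker adds jobs to the *same* queue it is processing and waits for
  // their results from inside the processor.
  const worker = new Worker(
    'oom-test',
    async job => {
      if (job.name === 'child') {
        return 'x'.repeat(1024 * 1024); // ~1MB result that should become collectable
      }
      const child = await queue.add('child', {});
      // Waiting on the child's result inside the processor is where the
      // retained 1MB strings show up in the heap snapshots.
      return child.waitUntilFinished(queueEvents);
    },
    { concurrency: 10 }, // so the child job can run while the parent is waiting
  );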

@manast
Contributor

manast commented Oct 12, 2020

Thanks for the report; it is going to take some time to verify this. One thing that sounds suspicious is that you mention the leak is in the worker, but only when adding jobs. What if you only add jobs, does it not leak memory in that case? Also, regarding this line:

  const result = await job.waitUntilFinished(queueEvents);

Does it leak memory even with that line commented out? (In the past we had problems with that function. I did a quick analysis and it looks fine though.)

@richardgarnier
Author

Checking whether or not it leaks should be easy using the incremental snapshots. When opening them, we can see that the 1MB result strings are never collected and that their distance gradually increases in the promise reaction chain (see screenshot).
Also, I didn't check, but it might get collected once the longer job completes.

[screenshot: heap snapshot showing the 1MB result strings retained in a promise reaction chain]

One thing that sounds suspicious is that you mention the leak is in the worker, but only when adding jobs. What if you only add jobs, does it not leak memory in that case?

My wording was probably poor. I meant that it only leaks when the worker itself is adding jobs. If the master were to execute the same code, no leak would occur.
I don't know whether the job leaks in its entirety or only its result. I guess we could test by passing heavy arguments to the job and see what happens.

Does it leak memory even with that line commented out?

Commenting out the call to waitUntilFinished (and replacing it with something like .on('completed')) also fixes the leak.
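
For reference, a sketch of that workaround using a queue-level QueueEvents 'completed' listener (the queue name and result handling are illustrative):

  import { QueueEvents } from 'bullmq';

  const queueEvents = new QueueEvents('oom-test');

  // React to the queue-level 'completed' event instead of holding a
  // waitUntilFinished() promise for each job inside the processor.
  queueEvents.on('completed', ({ jobId, returnvalue }) => {
    console.log(`job ${jobId} completed`, typeof returnvalue);
  });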

@manast
Contributor

manast commented Oct 13, 2020

Commenting out the call to waitUntilFinished (and replacing it with something like .on('completed')) also fixes the leak.

OK, then the leak is in that function as I suspected. I will look into it more deeply to see how this can happen.

@manast
Contributor

manast commented Oct 13, 2020

I did a small fix here that I think could fix this issue: #284
I won't have time to verify it today though; I can check tomorrow if you cannot verify it yourself.

@Embraser01
Contributor

Seems like the problem isn't in waitUntilFinished. If I replace it (in worker.js) with await sleep(200), it still leaks (I also tried your fix @manast, but without success).

@Embraser01
Contributor

Note that the Node.js GC removes the leaked objects after the end of the job (here I took the heap snapshot 2s after the queue was drained).

[screenshot: heap snapshot taken after the queue drained, with the strings reclaimed]

@manast
Contributor

manast commented Oct 13, 2020

If it is removed by the GC then it is not a leak by definition.

@Embraser01
Contributor

I agree it's not a "leak" per se. But there is still a memory accumulation somewhere that shouldn't be there. I looked into the heap dump a little more and found out that Jobs are kept alive because they are referenced here:

[screenshot: heap snapshot retainer view showing where the Job objects are still referenced]

Between the start and the end of the test, Jobs are only created, none are reclaimed (I also forced a GC run to be sure), probably because of promises (the promises are not reclaimed either).
[screenshot: heap snapshot showing Job and Promise objects allocated but never reclaimed]

I don't really know why though.

@richardgarnier
Author

If it is removed by the GC then it is not a leak by definition.

In my case I had a worker that was acting as an aggregator.
It would split the work into multiple subtasks, get some of the results, batch them together and write the results somewhere (out of memory).
Because the memory of the first subtask is not reclaimed before the aggregator task ends, my process goes OOM.

@manast
Contributor

manast commented Oct 13, 2020

Is it possible to simplify the test further? For example, there is a jobOne and a jobAll, etc. If it can be reduced to the absolute minimum that reproduces the leak, it will be easier to pinpoint where it comes from.

@Embraser01
Contributor

So I think I found the reason behind this leak:

When listening for multipleResolves on the worker process, no events are emitted until the end of the job processing. At the end, a lot of events (more than 100) are emitted.

These events come from here:

[...processing.keys()].map(p => p.then(() => [p])),

The () => [p] part keeps a reference to the fulfilled promise, and therefore to the promise result (i.e. the Job), which is not collected until the end of the queue.
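
A standalone sketch of that retention with plain promises (no BullMQ; the names and the 1MB payload are illustrative):

  // Each `() => [p]` closure captures its own `p`, so every settled promise and
  // its ~1MB result stay reachable for as long as the race keeps its internal
  // references alive.
  async function demo(): Promise<string> {
    const bigJob = async () => 'x'.repeat(1024 * 1024);
    const jobs: Promise<string>[] = [bigJob(), bigJob(), new Promise<string>(() => {})]; // last one never settles

    const [completed] = await Promise.race(jobs.map(p => p.then(() => [p])));
    return completed; // the other settled results remain pinned by the losing closures
  }

  demo();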

I wrote a simple fix that uses indices to find the original promise instead. It seems to fix the issue:

--- a/src/classes/worker.ts	(revision 347a4551be8e8b4c3e222d011fef5eb9afcc4711)
+++ b/src/classes/worker.ts	(date 1602595125126)
@@ -134,9 +134,12 @@
        * Get the first promise that completes
        * Explanation https://stackoverflow.com/a/42898229/1848640
        */
-      const [completed] = await Promise.race(
-        [...processing.keys()].map(p => p.then(() => [p])),
+      const promises = [...processing.keys()];
+      const completedIdx = await Promise.race(
+        promises.map((p, idx) => p.then(() => idx)),
       );
+
+      const completed = promises[completedIdx];
 
       const token = processing.get(completed);
       processing.delete(completed);

@Embraser01
Contributor

I found this issue related to Promise.race, which can explain why the promises are not collected until the end: nodejs/node#17469

@manast
Contributor

manast commented Oct 13, 2020

Yeah, that may be it. Fortunately there is a safeRace implementation that does not have the leak. It seems it is not so trivial to just replace Promise.race with something else that does the same.

@Embraser01
Contributor

I think my fix will prevent most OOMs, as it keeps only a single number (with a small footprint) and doesn't require many changes. On the other hand, if you prefer fixing the real issue behind this, using a better Promise.race implementation like the one linked in the issue seems good!

@manast
Contributor

manast commented Oct 13, 2020

Looks great actually, and I think it is easier to understand than the previous implementation.

github-actions bot pushed a commit that referenced this issue Oct 20, 2020
…10-20)

### Bug Fixes

* **job:** remove listeners before resolving promise ([563ce92](563ce92))
* **worker:** continue processing if handleFailed fails. fixes [#286](#286) ([4ef1cbc](4ef1cbc))
* **worker:** fix memory leak on Promise.race ([#282](#282)) ([a78ab2b](a78ab2b))
* **worker:** setname on worker blocking connection ([#291](#291)) ([50a87fc](50a87fc))
* remove async for loop in child pool fixes [#229](#229) ([d77505e](d77505e))

### Features

* **sandbox:** kill child workers gracefully ([#243](#243)) ([4262837](4262837))
@richardgarnier
Author

Thank you, I confirmed that the fix in the PR solves the issue, so I am closing this.
