Memory leak #282

Closed
richardgarnier opened this issue Oct 12, 2020 · 16 comments

@richardgarnier

I have noticed that jobs are leaking on the worker side when a worker adds jobs to the queue it is itself processing.
For clarity, if the worker object is constructed with new Worker('oom-test') and tries to do new Queue('oom-test').add(...), these jobs will leak.

However, if the worker adds jobs to a different queue (e.g. new Queue('oom-test-one').add(...)), it no longer leaks.

In order to reproduce, see the example project attached.

  1. If not present, create a folder named '/tmp' at the root of the project (this is where the heap snapshots will be stored).
  2. Run npm install.
  3. In one terminal, run node master.js; in another, node worker.js. Inspect the snapshots using Chrome and verify that they are not leaking.
  4. In one terminal, run node master.js; in another, node worker-leak.js. By inspecting the snapshots in Chrome, you can see that the 1MB strings are never collected and have an increasing distance.

bullmq-leak.zip
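
For reference, a minimal sketch of the pattern described above (not the attached project itself; the queue name, job names, payload size and concurrency are illustrative, assuming the standard BullMQ Queue/Worker/QueueEvents API):

  import { Queue, QueueEvents, Worker } from 'bullmq';

  const queue = new Queue('oom-test');
  const queueEvents = new QueueEvents('oom-test');

  // The worker adds jobs to the *same* queue it is processing and waits for
  // their results from inside the processor.
  const worker = new Worker(
    'oom-test',
    async job => {
      if (job.name === 'child') {
        return 'x'.repeat(1024 * 1024); // ~1MB result that should become collectable
      }
      const child = await queue.add('child', {});
      // Waiting on the child's result inside the processor is where the
      // retained 1MB strings show up in the heap snapshots.
      return child.waitUntilFinished(queueEvents);
    },
    { concurrency: 10 }, // so the child job can run while the parent is waiting
  );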

@manast
Contributor

manast commented Oct 12, 2020

Thanks for the report; it is going to take some time to verify this. One thing that sounds suspicious is that you mention the leak is in the worker, but only when adding jobs. What if you only add jobs, does it not leak memory in that case? Also, regarding this line:

  const result = await job.waitUntilFinished(queueEvents);

Does it leak memory even with that line commented out? (In the past we had problems with that function. I did a quick analysis and it looks fine though.)

@richardgarnier
Author

Checking whether or not it leaks should be easy using the incremental snapshots. When opening them, we can see that the 1MB result strings are never collected and that their distance gradually increases in the promise reaction chain (see screenshot).
Also, I didn't check, but it might get collected once the longer job completes.

[screenshot: heap snapshot showing the 1MB result strings retained in a promise reaction chain]

One thing that sounds suspicious is that you mention the leak is in the worker, but only when adding jobs. What if you only add jobs, does it not leak memory in that case?

My wording was probably poor. I meant that it only leaks when the worker itself is adding jobs. If the master were to execute the same code, no leak would occur.
I don't know whether the job leaks in its entirety or only its result. I guess we could test by passing heavy arguments to the job and see what happens.

Does it leak memory even with that line commented out?

Commenting out the call to waitUntilFinished (and replacing it with something like .on('completed')) also fixes the leak.
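
For reference, a sketch of that workaround using a queue-level QueueEvents 'completed' listener (the queue name and result handling are illustrative):

  import { QueueEvents } from 'bullmq';

  const queueEvents = new QueueEvents('oom-test');

  // React to the queue-level 'completed' event instead of holding a
  // waitUntilFinished() promise for each job inside the processor.
  queueEvents.on('completed', ({ jobId, returnvalue }) => {
    console.log(`job ${jobId} completed`, typeof returnvalue);
  });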

@manast
Contributor

manast commented Oct 13, 2020

Commenting out the call to waitUntilFinished (and replacing it with something like .on('completed')) also fixes the leak.

OK, then the leak is in that function as I suspected. I will look into it more deeply to see how this can happen.

@manast
Contributor

manast commented Oct 13, 2020

I did a small fix here that I think could fix this issue: #284
I won't have time to verify it today though; I can check tomorrow if you cannot verify it yourself.

@Embraser01
Contributor

Seems like the problem isn't in waitUntilFinished. If I replace it (in worker.js) with await sleep(200), it still leaks (I also tried your fix @manast, but without success).

@Embraser01
Contributor

Note that the Node.js GC removes the leaked objects after the end of the job (here I took the heap snapshot 2s after the queue was drained).

[screenshot: heap snapshot taken after the queue drained, with the strings reclaimed]

@manast
Contributor

manast commented Oct 13, 2020

If it is removed by the GC then it is not a leak by definition.

@Embraser01
Contributor

I agree it's not a "leak" per se. But there is still a memory accumulation somewhere that shouldn't be there. I looked into the heap dump a little more and found out that Jobs are kept alive because they are referenced here:

[screenshot: heap snapshot retainer view showing where the Job objects are still referenced]

Between the start and the end of the test, Jobs are only created, none are reclaimed (I also forced a GC run to be sure), probably because of promises (the promises are not reclaimed either).
[screenshot: heap snapshot showing Job and Promise objects allocated but never reclaimed]

I don't really know why though.

@richardgarnier
Author

If it is removed by the GC then it is not a leak by definition.

In my case I had a worker that was acting as an aggregator.
It would split the work into multiple subtasks, get some of the results, batch them together and write the results somewhere (out of memory).
Because the memory of the first subtask is not reclaimed before the aggregator task ends, my process goes OOM.

@manast
Contributor

manast commented Oct 13, 2020

Is it possible to simplify the test further? For example, there is a jobOne and a jobAll, etc. If it can be reduced to the absolute minimum that reproduces the leak, it will be easier to pinpoint where it comes from.

@Embraser01
Contributor

So I think I found the reason behind this leak:

When listening for multipleResolves on the worker process, no events are emitted until the end of the job processing. At the end, a lot of events (more than 100) are emitted.

These events come from here:

[...processing.keys()].map(p => p.then(() => [p])),

The () => [p] part keeps a reference to the fulfilled promise, and therefore to the promise result (i.e. the Job), which is not collected until the end of the queue.
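
A standalone sketch of that retention with plain promises (no BullMQ; the names and the 1MB payload are illustrative):

  // Each `() => [p]` closure captures its own `p`, so every settled promise and
  // its ~1MB result stay reachable for as long as the race keeps its internal
  // references alive.
  async function demo(): Promise<string> {
    const bigJob = async () => 'x'.repeat(1024 * 1024);
    const jobs: Promise<string>[] = [bigJob(), bigJob(), new Promise<string>(() => {})]; // last one never settles

    const [completed] = await Promise.race(jobs.map(p => p.then(() => [p])));
    return completed; // the other settled results remain pinned by the losing closures
  }

  demo();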

I wrote a simple fix that uses indices to find the original promise instead. It seems to fix the issue:

--- a/src/classes/worker.ts	(revision 347a4551be8e8b4c3e222d011fef5eb9afcc4711)
+++ b/src/classes/worker.ts	(date 1602595125126)
@@ -134,9 +134,12 @@
        * Get the first promise that completes
        * Explanation https://stackoverflow.com/a/42898229/1848640
        */
-      const [completed] = await Promise.race(
-        [...processing.keys()].map(p => p.then(() => [p])),
+      const promises = [...processing.keys()];
+      const completedIdx = await Promise.race(
+        promises.map((p, idx) => p.then(() => idx)),
       );
+
+      const completed = promises[completedIdx];
 
       const token = processing.get(completed);
       processing.delete(completed);

@Embraser01
Contributor

I found this issue related to Promise.race, which can explain why the promises are not collected until the end: nodejs/node#17469

@manast
Contributor

manast commented Oct 13, 2020

Yeah, that may be it. Fortunately there is a safeRace implementation that does not have the leak. It seems it is not so trivial to just replace Promise.race with something else that does the same.

@Embraser01
Contributor

I think my fix will prevent most OOMs, as it keeps only a single number (with a small footprint) and doesn't require many changes. On the other hand, if you prefer fixing the real issue behind this, using a better Promise.race implementation like the one linked in the issue seems good!

@manast
Contributor

manast commented Oct 13, 2020

Looks great actually, and I think it is easier to understand than the previous implementation.

github-actions bot pushed a commit that referenced this issue Oct 20, 2020
…10-20)

### Bug Fixes

* **job:** remove listeners before resolving promise ([563ce92](563ce92))
* **worker:** continue processing if handleFailed fails. fixes [#286](#286) ([4ef1cbc](4ef1cbc))
* **worker:** fix memory leak on Promise.race ([#282](#282)) ([a78ab2b](a78ab2b))
* **worker:** setname on worker blocking connection ([#291](#291)) ([50a87fc](50a87fc))
* remove async for loop in child pool fixes [#229](#229) ([d77505e](d77505e))

### Features

* **sandbox:** kill child workers gracefully ([#243](#243)) ([4262837](4262837))
@richardgarnier
Author

Thank you, I confirmed that the fix in the PR solves the issue, so I am closing this.
