Memory leak introduced in v10.15.3 #26667
Comments
Is there any chance you could share a reproduction? |
I have a reproduction that works with our node-oracledb driver -- but if you don't have an Oracle Database handy that won't help much. If you do I can share the code that causes the issue; otherwise, it will take some time to develop a test case that demonstrates the issue! |
Depending on the complexity, looking at the code might help, because we know we’re looking for code that uses … /cc @mhdawson as the PR author |
The code for the driver can be found here. The code that is most likely the source of the issue can be found here. Specifically, it acquires the value of a reference (for the constructor), creates an instance and then calls napi_wrap() to associate a structure with the newly created object. |
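For readers following along, the pattern being described is roughly the following. This is a minimal sketch with hypothetical names and no error handling, not the actual node-oracledb code:

```c
#include <node_api.h>
#include <stdlib.h>

typedef struct { int placeholder; } njs_native_t;  /* hypothetical native struct */

/* Finalizer: expected to run once the wrapping JS object is collected. */
static void finalize_native(napi_env env, void* data, void* hint) {
  free(data);
}

/* constructor_ref is a napi_ref previously created for the constructor. */
static napi_value create_wrapped_instance(napi_env env, napi_ref constructor_ref) {
  napi_value constructor;
  napi_value instance;
  njs_native_t* native = calloc(1, sizeof(njs_native_t));

  /* 1. Acquire the value of the reference (the constructor). */
  napi_get_reference_value(env, constructor_ref, &constructor);

  /* 2. Create an instance. */
  napi_new_instance(env, constructor, 0, NULL, &instance);

  /* 3. Associate the native structure with the newly created object. */
  napi_wrap(env, instance, native, finalize_native, NULL, NULL);

  return instance;
}
```

If the wrapped objects (or the references created for them) are not released after garbage collection, memory grows with each iteration, which is the behaviour discussed below.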
I've attached a zip file containing a simple module that just creates an instance and wraps it:
This will create a file called stats.txt which contains a dump of the number of iterations processed as well as the RSS and heap size. |
@anthony-tuininga Are you saying that only occasionally there is a leak visible when running the program, or that you need many iterations within a single process in order to see the leak? Because I’m having a hard time seeing an actual leak here – at the very least, the finalizer callback always seems to be called for exactly 25000 objects each iteration (so, the expected amount), at least locally… |
What I meant was that the amount of memory lost is not sufficient to indicate that memory is being lost each iteration. Instead memory must be lost on some iterations only. Unfortunately, I ran both 10.15.2 and 10.15.3 with my simple module and discovered that both leak memory, albeit very slowly. I ran each of the loops 4 billion times and heap memory increased about 4 MB in both cases and RSS about 10 MB. So my example doesn't demonstrate the problem. :-( But there is a marked difference between the two versions when using the node-oracledb driver test case and the difference goes away if I revert the pull request I mentioned earlier. I'll see if I can find a better test case next week. |
Ok. I created a cut-down version of the node-oracledb module which has all of the Oracle bits ripped out. It demonstrates the problem. With Node.js 10.15.2 it stabilises around 350,000 iterations but with Node.js 10.15.3 it increases at around 100 bytes/iteration. Once unzipped and built, run with the following command:
|
@addaleax, with the new test case are you able to replicate the issue? |
This is what I see running your example on master:
|
Yes. If you run with Node.js 10.15.2 you will see that the RSS remains relatively stable.
|
I'll run with valgrind to see if anything obvious shows up. |
@mhdawson Yes, Finalizer callbacks run on the main thread. |
Yup. Finalizers need to run on the main thread. |
After 575000 iterations with NODE_PATH=. valgrind --leak-check=full node --expose-gc demo.js >valgrindresults 2>&1:
Num Iters,RSS,Heap Total,Heap Used,Ext
25000,301633536,42622976,2649888,9078
50000,313131008,40001536,2626792,9110
75000,341643264,41050112,2631344,9110
100000,357126144,41050112,2634272,9110
125000,368504832,40001536,2634272,9110
150000,382361600,40001536,2634272,9110
175000,412848128,40525824,2634408,9110
200000,428564480,41050112,2635248,9110
225000,438902784,40001536,2635264,9110
250000,451485696,40525824,2635264,9110
275000,478711808,41050112,2635264,9110
300000,490356736,40001536,2635264,9110
325000,505950208,40525824,2635264,9110
350000,517595136,40525824,2635264,9110
375000,555458560,40525824,2635264,9110
400000,567783424,40001536,2635264,9110
425000,582340608,40525824,2635264,9110
450000,608468992,41050112,2635264,9110
475000,619352064,40525824,2635264,9110
500000,636342272,40525824,2635264,9110
525000,648298496,40525824,2635264,9110
550000,672575488,40001536,2635264,9110
575000,687759360,40001536,2635264,9110
==24888== HEAP SUMMARY:
==24888==     in use at exit: 66,032,860 bytes in 612,932 blocks
==24888==   total heap usage: 14,117,119 allocs, 13,504,187 frees, 12,227,047,596 bytes allocated
.
.
.
==24888== LEAK SUMMARY:
==24888== definitely lost: 0 bytes in 0 blocks
==24888== indirectly lost: 0 bytes in 0 blocks
==24888== possibly lost: 47,237 bytes in 583 blocks
==24888== still reachable: 65,985,623 bytes in 612,349 blocks
==24888== suppressed: 0 bytes in 0 blocks
==24888== Reachable blocks (those to which a pointer was found) are not shown.
==24888== To see them, rerun with: --leak-check=full --show-leak-kinds=all
==24888==
==24888== For counts of detected and suppressed errors, rerun with: -v
==24888== ERROR SUMMARY: 1188705 errors from 442 contexts (suppressed: 0 from 0) |
Looking at the details in the valgrind results, nothing stands out, and it only reports … |
I'll try a run with --show-leak-kinds=all to see if anything interesting is shown for reachable blocks. |
I ran valgrind with Node.js 10.15.2 and 10.15.3 for 250,000 iterations using this script. Here are the results for the different versions: You'll note that the one for 10.15.3 shows leaks in napi_create_reference() whereas the one for 10.15.2 does not mention it at all. Hope this helps! |
With --show-leak-kinds=all I do see the following:
==29106== 24,803,016 bytes in 442,911 blocks are still reachable in loss record 1,986 of 1,987
==29106== at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==29106== by 0x94D451: napi_create_reference (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x98527C6: njsBaton_create (in /home/mhdawson/check2/issue_26667/build/Release/issue_26667.node)
==29106== by 0x98550C5: njsUtils_createBaton (in /home/mhdawson/check2/issue_26667/build/Release/issue_26667.node)
==29106== by 0x98543DF: njsSodaCollection_insertOneAndGet (in /home/mhdawson/check2/issue_26667/build/Release/issue_26667.node)
==29106== by 0x9464A4: v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x3F8B80EC72C0: ???
==29106== by 0x3F8B80EC7833: ???
==29106== by 0x3F8B80ED3D3A: ???
==29106== by 0x19C0128: ??? (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x19E7B0D: ??? (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x19A6B86: ??? (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106==
==29106== 29,665,888 bytes in 529,748 blocks are still reachable in loss record 1,987 of 1,987
==29106== at 0x4C2B0E0: operator new(unsigned long) (in /usr/lib/valgrind/vgpreload_memcheck-amd64-linux.so)
==29106== by 0x94D451: napi_create_reference (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x98527C6: njsBaton_create (in /home/mhdawson/check2/issue_26667/build/Release/issue_26667.node)
==29106== by 0x98550C5: njsUtils_createBaton (in /home/mhdawson/check2/issue_26667/build/Release/issue_26667.node)
==29106== by 0x98543DF: njsSodaCollection_insertOneAndGet (in /home/mhdawson/check2/issue_26667/build/Release/issue_26667.node)
==29106== by 0x9464A4: v8impl::(anonymous namespace)::FunctionCallbackWrapper::Invoke(v8::FunctionCallbackInfo<v8::Value> const&) (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x3F8B80EC72C0: ???
==29106== by 0x3F8B80EC7833: ???
==29106== by 0x3F8B80ECE51A: ???
==29106== by 0x19C0128: ??? (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x19E7B0D: ??? (in /home/mhdawson/newpull/land/node/out/Release/node)
==29106== by 0x19A6B86: ??? (in /home/mhdawson/newpull/land/node/out/Release/node) |
Staring at the code, I don't see how it would not be deleted either when Delete is called or when the finalizer runs. I guess the next step would be to try to identify whether the finalizers have been run or not. If there is an option to print information about what's in the finalization queue for V8, that would be helpful. |
Added a printf to show the number of References created versus the number of times the finalizer was called. It shows that the gap between References created and times a finalizer was called is slowly growing:
ReferenceCount: 120615, Finalize count: 100000
Processed 25000 iterations...
ReferenceCount: 245215, Finalize count: 200000
Processed 50000 iterations...
ReferenceCount: 363529, Finalize count: 300000
Processed 75000 iterations...
ReferenceCount: 481237, Finalize count: 400000
Processed 100000 iterations...
ReferenceCount: 610009, Finalize count: 500000
ReferenceCount: 726385, Finalize count: 600000
Processed 125000 iterations...
ReferenceCount: 844669, Finalize count: 700000
Processed 150000 iterations...
ReferenceCount: 962929, Finalize count: 800000
Processed 175000 iterations...
ReferenceCount: 1080861, Finalize count: 900000
Processed 200000 iterations...
ReferenceCount: 1210063, Finalize count: 1000000
ReferenceCount: 1326451, Finalize count: 1100000
Processed 225000 iterations...
ReferenceCount: 1444675, Finalize count: 1200000
Processed 250000 iterations...
ReferenceCount: 1562929, Finalize count: 1300000
Processed 275000 iterations...
ReferenceCount: 1681177, Finalize count: 1400000
Processed 300000 iterations...
ReferenceCount: 1810015, Finalize count: 1500000
ReferenceCount: 1926411, Finalize count: 1600000
Processed 325000 iterations...
ReferenceCount: 2044675, Finalize count: 1700000
Processed 350000 iterations...
ReferenceCount: 2163493, Finalize count: 1800000
Processed 375000 iterations...
ReferenceCount: 2281773, Finalize count: 1900000 |
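The actual counters above came from a printf added inside Node.js itself, but the same kind of counting can be sketched from the addon side. This is an illustrative sketch with hypothetical names, not the instrumentation used to produce the numbers above:

```c
#include <node_api.h>
#include <stdio.h>
#include <stdlib.h>

static long wrapped_count = 0;    /* objects handed to napi_wrap()        */
static long finalize_count = 0;   /* finalizer callbacks that actually ran */

static void counting_finalizer(napi_env env, void* data, void* hint) {
  free(data);
  finalize_count++;
  if (finalize_count % 100000 == 0)
    fprintf(stderr, "Wrapped: %ld, Finalize count: %ld\n",
            wrapped_count, finalize_count);
}

/* Wrap an object and count how many wraps were made; a steadily growing gap
   between the two counters would indicate finalizers that never run. */
static void wrap_and_count(napi_env env, napi_value obj, void* native) {
  wrapped_count++;
  napi_wrap(env, obj, native, counting_finalizer, NULL, NULL);
}
```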
From the instrumentation I added, it looks to me that despite this being called:
FinalizeCallback is not invoked some percentage of the time (maybe 25%). Using this check in Delete
instead of
Might be a good optimization and does remove the leak; however, it would likely just hide whatever the real problem is, and I could not easily reproduce the original problem that #24494 addressed in order to validate that it would be safe (the test case did not recreate it with the parts I'd backed out; maybe a fuller backout would replicate it). Unfortunately, I'm on vacation next week and I have a lot to get done tomorrow in advance of being away, so I won't be able to investigate further for a while. The real question is why FinalizeCallback is not being called after
is called, in some percentage of the calls. I think the code related to that is mostly in the Node.js Persistent and the V8 Persistent, as opposed to anything in N-API itself. EDIT: updated to clarify that it is only some percentage of the time that FinalizeCallback is not called. |
@anthony-tuininga, if you have time, checking …
It would at least confirm whether it's an optimization worth considering. |
@anthony-tuininga thanks for checking that out. Just back from vacation last week. I think I'll submit a PR with that optimization, as I think it would be good to reduce pressure on the gc. I think we'd still want to look at the root cause of the problem, as the optimization would otherwise just cover it up. |
You're welcome. And discovering the root cause of the problem would definitely be a good idea! Let me know if there is anything else I can do to help. |
If you have time to help instrument/debug at the Node.js/V8 level, that would help move things along faster. If that is the case, let me know and we can get together to work on some specific next steps. |
I suppose that depends on how much time is required, but I can spare a few hours if that will help. :-) |
PR for the optimization: #27085. It needs an update once I add it. @anthony-tuininga, I'm hoping you can help validate it. |
nodejs#24494 fixed a crash but resulted in increased stress on gc finalization. A leak was reported in nodejs#26667 which we are still investigating. As part of this investigation I realized we can optimize to reduce amount of deferred finalization. Regardless of the root cause of the leak this should be a good optimization. It also resolves the leak for the case being reported in nodejs#26667. The OP in 26667 has confirmed that he can still recreate the original problem that 24494 fixed and that the fix still works with this optimization
@anthony-tuininga it would be great if you could recheck with the updated content of the PR in #27085 on 10.x and your test case. The change should have the same result as what you tested, but it's good to be sure. |
Will do. |
I have verified that a copy of 10.15.3, patched with the optimization PR #27085, resolves the memory leak. |
@anthony-tuininga, I plan to land #27085 in master tomorrow. |
Great. Thanks! |
#24494 fixed a crash but resulted in increased stress on gc finalization. A leak was reported in #26667 which we are still investigating. As part of this investigation I realized we can optimize to reduce amount of deferred finalization. Regardless of the root cause of the leak this should be a good optimization. It also resolves the leak for the case being reported in #26667. The OP in 26667 has confirmed that he can still recreate the original problem that 24494 fixed and that the fix still works with this optimization PR-URL: #27085 Reviewed-By: Ben Noordhuis <info@bnoordhuis.nl> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>
Was this fixed in v10, or only in v12? |
I believe it landed in master before 12.x was cut, so 12.x has #27085 as well. |
But this is a regression in 10, right? What's the backporting policy? |
I am running into this issue, or something very closely related. I am using buffers with external memory. The fix does not seem to help – I am testing this on Node.js v12.10.0 on macOS. I have modified the test case to demonstrate this behaviour with … These are my results:
However, the finalizers are run if we occasionally allow the event loop to continue with …
I hope this helps! |
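For context, the kind of code in question looks roughly like the sketch below. It assumes the external memory is handed to Node.js via napi_create_external_buffer(); the names are made up and this is not the reporter's actual test case:

```c
#include <node_api.h>
#include <stdlib.h>

/* Finalizer that releases the externally allocated memory. */
static void free_external(napi_env env, void* data, void* hint) {
  free(data);
}

static napi_value make_external_buffer(napi_env env, napi_callback_info info) {
  const size_t size = 16 * 1024;
  void* data = malloc(size);
  napi_value buf;

  /* The buffer's memory is owned by the addon; it is only reclaimed when the
     finalizer runs, which (as explained in the next comment) is deferred. */
  napi_create_external_buffer(env, size, data, free_external, NULL, &buf);
  return buf;
}
```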
@rolftimmermans That's working as expected. Your code has a finalizer, and those are always deferred to the next event loop tick (see lines 38 to 59 in 064e111).
(Note that SetImmediate() call.)
|
@bnoordhuis Thanks for the quick response. Is there any way to release memory earlier, or is this just the way things are expected to work with external buffers? Seems like it should be possible to release the memory sooner... :/ |
@rolftimmermans It's by design. Finalizers can call into JS land. If n-api didn't defer the callback, it'd be pretty easy to create a cb→js→cb→js→etc. stack overflow. |
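To illustrate that point, here is a hedged sketch (hypothetical names) of a finalizer that calls back into JavaScript. If such a finalizer were invoked synchronously from inside garbage collection, each call into JS could drop more wrapped objects and trigger further finalizers, hence the deferral:

```c
#include <node_api.h>
#include <stdlib.h>

typedef struct {
  napi_ref on_destroy;  /* reference to a JS callback to notify on cleanup */
} wrapped_t;

static void notify_and_free(napi_env env, void* data, void* hint) {
  wrapped_t* w = data;
  napi_value cb;
  napi_value global;

  /* Finalizers may call back into JS; running them from a deferred tick
     rather than inside the GC pass avoids cb -> js -> cb recursion. */
  napi_get_reference_value(env, w->on_destroy, &cb);
  napi_get_global(env, &global);
  napi_call_function(env, global, cb, 0, NULL, NULL);

  napi_delete_reference(env, w->on_destroy);
  free(w);
}
```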
Seems the behaviour I saw only applies to … I'll open a new issue to address this. |
That's because it doesn't defer the callback (lines 305 to 325 in 064e111).
I'd say that's an oversight, possibly due to a misunderstanding of how the two GC passes work. |
I think this can be closed out as the issue was addressed, and 10.x is also now out of service. Please let me know if that was not the right thing to do. |
Our module, which has been migrated to N-API, is experiencing a memory leak in version 10.15.3 but not in version 10.15.2. I verified that reverting the changes made in this pull request eliminates the memory leak. Clearly that isn't the right solution, but hopefully it will point in the right direction!