This repository has been archived by the owner on Apr 22, 2023. It is now read-only.

tls: catch race condition to prevent memory leak #9064

Closed
wants to merge 1 commit

Conversation

wraithan

My co-worker @hayes and I spent the better part of last week hunting this bug. Today we presented it to @chrisdickinson and he helped us further debug the issue and assisted in writing the patch/test.

Digging into the issue: the node-newrelic module has had a very small leak for a long time. In the core dumps we'd find many copies of the custom certs that we use for outbound connections to New Relic servers. A bug was fixed in yakaa, one of our dependencies, which should have removed its ability to hold onto dead sockets, and I confirmed that it was indeed not keeping any references to those sockets.

Further digging revealed that the CryptoStream pair returned from tls.connect() was never being garbage collected, and that the Connection (from src/node_crypto) never freed its ssl_ property, nor was its destructor ever called. After spending many hours tracing all the states the duplex stream pair could be in and how this could happen, we found that the writable side of the CleartextStream was emitting 'finish' before either of the streams emitted 'end'. This led to _done not being called on the EncryptedStream, which means that the pair would be leaked and the underlying SSL object would never be freed.

We tried to create a "simple" test but could not reproduce against either the node tls server or the node https server. We could, however, reliably reproduce the issue by hitting https://google.com/, so we put the test in "internet" and hope that someone else can come up with a "simple" regression test.
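
For readers trying to follow the mechanics, here is a minimal sketch (not the test in this PR) of the kind of observation described above, assuming v0.10's legacy CryptoStream-based TLS; the host comes from the discussion, everything else is illustrative.

// Sketch (not the PR's test): observe that the cleartext side emits 'finish'
// before either side emits 'end' when talking to google.com on v0.10.
var tls = require('tls');

var socket = tls.connect({ host: 'google.com', port: 443 }, function() {
  socket.end('GET / HTTP/1.0\r\nHost: google.com\r\n\r\n');
});

socket.on('data', function() {});                    // keep the readable side flowing
socket.on('finish', function() { console.log('cleartext finish'); });
socket.on('end',    function() { console.log('cleartext end'); });
socket.on('close',  function() { console.log('socket close'); });
// In the leaky case described above, 'cleartext finish' appears before
// 'cleartext end', onCryptoStreamFinish sees an opposite side that has not
// ended yet, and the pair (and its underlying SSL object) is never torn down.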

r=@indutny

The CryptoStream pair could be ended in an order that would result in
not only the streams themselves being leaked, but the underlying ssl
objects as well. This manifests itself as a native heap leak and can
be reproduced against google.com.
@othiym23

👏 🐨 💥

That's some quality debugging; kudos to everyone involved!

agent: false
};

https.get(options, function(res) {
Member

Does this problem manifest itself when connecting to node.js server? If not - what is the difference?

Author

No. Without doing anything fancy, just connecting and sending data is not enough to reproduce this issue. We'll need to reconstruct what happens in the communication with google, likely down to timing and such.

@tjfontaine

@wraithan @chrisdickinson great work on this!

Is it that the state machine isn't properly transitioning through onCryptoStreamFinish and hitting https://github.com/joyent/node/blob/v0.10/lib/tls.js#L306-L311, as you indicate, because _ended is not set for the CleartextStream? Is it feasible to get it to transition through this path, as opposed to having to catch the state transition at another point?

@tjfontaine

So since the other side is still reading/readable, we're kicking off another read. This is not how I want to fix it, but really we need to come back through this path again:

   // Try to read just to get sure that we won't miss EOF
-  if (this._opposite.readable) this._opposite.read(0);
+  if (this._opposite.readable) {
+    debug('kicking off opposite read');
+    this._opposite.read(0);
+    var self = this;
+    process.nextTick(function() {
+      onCryptoStreamFinish.apply(self);
+    });
+  }

This is mostly just working by accident as well: really, the .read(0) is queuing a nextTick that will emit 'end', and then this nextTick just happens to fire after that.
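
For readers unfamiliar with the ordering being relied on here: process.nextTick callbacks run in FIFO order within a turn, which is what makes the extra nextTick above "happen" to fire after the 'end' emission. A trivial sketch:

// nextTick callbacks run in queue (FIFO) order within the same turn, which is
// what the "happens to fire after that" above depends on.
process.nextTick(function() { console.log('first: queued by read(0), will emit end'); });
process.nextTick(function() { console.log('second: re-run the finish handler'); });
// Prints "first: ..." then "second: ...".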

Maybe we could do something like:

-    if (this === this.pair.cleartext) this._opposite._done();
+    if (this === this.pair.cleartext) {
+      debug('calling other done');
+      this._opposite._done();
+    }
+  } else {
+    debug('opposite has not ended');
+    var self = this;
+    this._opposite.once('end', onCryptoStreamFinish);
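
To make the proposed ordering concrete, here is a self-contained toy model; it is not lib/tls.js, and all names are illustrative. It only mirrors the idea in the snippet above: if the opposite side has not ended yet, defer the shared teardown until its 'end' event instead of skipping _done().

// Toy model (not node's tls.js): release the pair only once one side has
// finished AND its opposite has ended, deferring via _opposite.once('end', ...)
// instead of silently skipping _done() and leaking.
var EventEmitter = require('events').EventEmitter;
var util = require('util');

function Side(name) {
  EventEmitter.call(this);
  this.name = name;
  this._ended = false;
  this._doneFlag = false;
}
util.inherits(Side, EventEmitter);

Side.prototype._done = function() {
  this._doneFlag = true;
  // In the real code, reaching _done on both sides is what lets the SSL
  // object be freed; never reaching it is the leak described in this thread.
  console.log(this.name + ' done');
};

function onFinish(side) {
  if (side._opposite._ended) {
    side._opposite._done();
    side._done();
  } else {
    // The race from this thread: 'finish' arrived before the opposite's 'end'.
    side._opposite.once('end', function() { onFinish(side); });
  }
}

var cleartext = new Side('cleartext');
var encrypted = new Side('encrypted');
cleartext._opposite = encrypted;
encrypted._opposite = cleartext;

onFinish(cleartext);        // cleartext writable side finished first
encrypted._ended = true;
encrypted.emit('end');      // deferred handler runs now; both sides get _done()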

@indutny
Member

indutny commented Jan 20, 2015

@tjfontaine I'd rather investigate the cause of the issue first...

@chrisdickinson

So, stream-wise: yes, @tjfontaine, you got it in one. The CleartextStream hits this check in onCryptoStreamFinish before ending, and thus misses the chance to _done the opposite side. The end state is that this code is hit by the EncryptedStream, leading us to onCryptoStreamEnd, which at present doesn't call _done on anything, and, since it is the last event, misses the final opportunity to _done both sides. This change was designed to address that case specifically: the opposite side has finished but not ended, and hasn't marked itself as "done", meaning that when it hit onCryptoStreamFinish it opted not to _done both streams. (Whew)

Re: reproducibility: we attempted to reproduce with a node tls server sending the exact same content as the google server. It was nearing the end of the day, and I think we were both a bit frazzled, so I'm not sure if we tried sending the appropriate data; that might be a good course of inquiry to rule out timing / TCP issues.
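
A hedged sketch of the local-reproduction idea described here, assuming one has captured the bytes the google server sends; the file names, port, and self-signed certificates are placeholders, not anything from this PR.

// Sketch: replay a captured response from a local node tls server, then point
// the failing client code at it to rule out timing / TCP differences.
var fs = require('fs');
var tls = require('tls');

var payload = fs.readFileSync('captured-google-response.bin');  // placeholder capture

var server = tls.createServer({
  key: fs.readFileSync('test-key.pem'),      // placeholder self-signed pair
  cert: fs.readFileSync('test-cert.pem')
}, function(socket) {
  socket.end(payload);                       // send the same bytes, then close our side
});

server.listen(8000, function() {
  var client = tls.connect({ port: 8000, rejectUnauthorized: false }, function() {
    client.end('GET / HTTP/1.0\r\n\r\n');    // mimic the client side of the failing case
  });
  client.resume();
  client.on('end', function() { server.close(); });
});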

@paglias

paglias commented Jun 26, 2015

Any update on this?

@Fishrock123

@indutny any idea if the fixes to io.js caught this?

@indutny
Member

indutny commented Jul 7, 2015

It doesn't really apply much to io.js

@Fishrock123

It doesn't really apply much to io.js

I talked to Fedor on IRC, and this could only possibly be an issue on io.js if you force it into some legacy TLS mode.

@bnetter

bnetter commented Jul 16, 2015

👍

@gireeshpunathil
Member

Can you please review the test case I developed and uploaded at the link below? I am able to see leaks in the https module without external links, and I believe this is a true recreation of the reported issue: 200+ Buffer objects holding 32 MB of memory, plus IncomingMessage, Redirect, BufferList, ClientRequest, and Multipart objects (5 of each leaked) in one iteration of the test. (I ran with v0.10 on Linux IA32.)
https://github.com/gireeshpunathil/nodev0.10httpsmemoryleak

One of the 10 MB Buffer objects is held through the retainer chain below:
[heap snapshot screenshot: retainer chain for the 10 MB Buffer]

while most of the small buffers are retained through:
[heap snapshot screenshot: retainer chain for the small buffers]

Thanks.
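
For anyone who wants to reproduce without cloning the linked repository, a minimal sketch of the same idea follows; the host, iteration count, and use of --expose-gc are illustrative assumptions, not taken from the linked test.

// Sketch of a leak-surfacing loop on v0.10: repeated https requests with no
// agent, RSS sampled after a forced GC (run with: node --expose-gc leak.js).
var https = require('https');

var iterations = 200;

function request(done) {
  https.get({ host: 'google.com', port: 443, agent: false }, function(res) {
    res.resume();                 // drain so the streams can finish
    res.on('end', done);
  }).on('error', done);
}

function run(i) {
  if (i === iterations) {
    if (global.gc) global.gc();   // force a sweep so only live objects remain
    console.log('rss after ' + iterations + ' requests: ' +
                Math.round(process.memoryUsage().rss / 1048576) + ' MB');
    return;
  }
  request(function() { run(i + 1); });
}

run(0);
// On an affected v0.10 build the reported RSS keeps climbing run over run,
// consistent with the retained Buffer/SSL objects in the snapshots above.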

@gireeshpunathil
Member

I made a standalone version of the previous one, detached from the request module, and reproduced the leak.
https://github.com/gireeshpunathil/nodev0.10httpsmemoryleak/tree/standalone

Two snapshots are attached, which are relevant to the above discussion.
[two heap snapshot screenshots]

@jasnell
Member

jasnell commented Aug 29, 2015

@indutny @Fishrock123 ... any updates on this one?

@Fishrock123

@jasnell Virtually doesn't exist in io.js / node 4.0.0

@jasnell
Member

jasnell commented Aug 29, 2015

Yep, that I know. Should we try to fix in v0.10 and v0.12? Do we know what change in nodejs/node fixed it?

@Fishrock123

@jasnell: @indutny's C++ TLS overhaul a few months ago (as far as I understand).

@jasnell
Member

jasnell commented Aug 29, 2015

@indutny .. what do you think then? I suspect fixing the root cause of the issue in v0.10 would be much more than we'd want to do here. So is this PR enough to bandaid over the issue or do we just not worry about this one and close?

@indutny
Member

indutny commented Aug 29, 2015

@jasnell I still have a very shallow understanding of this patch, because I don't have a proper test.

@jasnell
Member

jasnell commented Aug 29, 2015

Ok. I'm generally leaning towards closing this without further action. If
it's something that needs to be fixed, we can revisit.


@AndrewHuffman

@jasnell This is an issue that is quite prevalent throughout our system, and I'm certain others are having this problem as well. Given our number of dependencies, it's not trivial to move to newer versions of Node (we're on v0.12.*).

This New Relic forum post shows 5.2K views and 45 responses. Another post has 3.4K views and 27 replies, so it's certainly a well-known issue within the New Relic community.

@Fishrock123

@AndrewHuffman if you can nail down a test, that'd be your best bet if you want to try to get this into 0.12.

Really though, 0.12 is not so great in ways other than this, and you should really just upgrade to 4.0.0 ASAP.

@AndrewHuffman

@Fishrock123 Was what @gireeshpunathil provided here not that?

@AndrewHuffman

I count 17 dependencies that specify particular engine versions. As much as I'd like to upgrade ASAP, it's not trivial.

I'm just asking for some due diligence on this defect.

@ehacke

ehacke commented Sep 16, 2015

@AndrewHuffman I had huge problems with memory leaks on 0.12.x, very likely due to the https issue mentioned here and elsewhere. Our app would exhaust memory on the server (2-3 GB) within a day or two, and run at 90-100% CPU while doing it.

Switching to 4.0.0 allows the exact same application to run with 200-300mb of memory and 10-15% CPU.

My app is modest in complexity, but I have 46 dependencies and not a single one had an issue with the upgrade, regardless of the engine specified. I'd try to work towards using node 4.0.0 ASAP if I were you; it's significantly improved.

@AndrewHuffman

@ehacke I definitely will, just trying to mitigate our current risks. Glad to hear 4.0.0 solves many of these memory issues.

@jasnell
Member

jasnell commented Sep 16, 2015

Can definitely understand the pain. Unfortunately I'm going to be pretty tied up for the next two weeks but when I'm back from traveling I'll take another look to see what, if anything, can be done in v0.12.x on this. As @Fishrock123 mentioned, however, the best fix long term is to move to v4.x so I'd highly recommend starting the wheels turning on that.

@jasnell jasnell self-assigned this Sep 16, 2015

@AndrewHuffman

Thanks guys! I'm excited to move to 4.0.0 as soon as we can push our deps to upgrade ;)

@AndrewHuffman

@jasnell Any updates?

@ChALkeR
Member

ChALkeR commented Oct 12, 2015

many copies of our custom certs that we use for outbound connections

That looks like nodejs/node#1522, which is already fixed by nodejs/node#1529 (since v1.8.2).

@wraithan @gireeshpunathil @AndrewHuffman
Could you confirm if the bug is still valid, and if it is, report an issue at nodejs/node?

@wraithan
Author

@ChALkeR I'm unable to confirm the bug on newer versions; with Node 4.x we are seeing (for the first time since 0.8) an official node release that doesn't appear to leak memory when using SSL.

But 0.10 and 0.12 both leak memory super hard when you use custom certs, so much so that we just disabled our bundle on those. The 1.8.2 fix wouldn't be appropriate to backport directly, though, because TLS had been heavily rewritten by that point in the io.js fork, unless I'm mistaken.

My concern is from the position of a library author with users who are still on 0.10 and 0.12. Regardless of whether a newer major fixed it, we have users who currently only upgrade patch releases of 0.x due to the cost of upgrading. We will be making an effort to help educate our users and get them moved to 4.x, but as all library authors (especially those whose business is to make that library) will tell you, it is hard to get users to upgrade (even just to patch releases, and even when using scary words like "denial of service"!).

Should this be reported on nodejs/node if it is a 0.10 and 0.12 issue or is there another place to put things that are more LTS related?

@ChALkeR
Member

ChALkeR commented Dec 7, 2015

@wraithan As per nodejs/node#2813 (comment), I guess that the fixes are not going to be backported to the 0.12 branch.

@Trott Trott closed this Apr 22, 2023