tls: catch race condition to prevent memory leak #9064
Conversation
The CryptoStream pair could be ended in an order that results in not only the streams themselves being leaked, but the underlying SSL objects as well. This manifests itself as a native heap leak and can be reproduced against google.com.
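For readers without the test file handy, here is a minimal, hedged sketch of the kind of reproduction described; the host, request count, and reporting are assumptions, not the contents of the actual test:

```js
// Hypothetical reproduction sketch, not the test added by this PR.
// Run on v0.10 with `node --expose-gc repro.js`; agent: false forces a fresh
// connection (and thus a fresh CryptoStream pair) for every request.
var https = require('https');

var options = {
  host: 'google.com',
  port: 443,
  path: '/',
  agent: false
};

var remaining = 50;

function request() {
  https.get(options, function(res) {
    res.resume();                          // drain the body so 'end' fires
    res.on('end', function() {
      if (--remaining > 0) return request();
      if (global.gc) global.gc();          // collect anything actually unreachable
      console.log(process.memoryUsage());  // rss keeps climbing when the pairs leak
    });
  }).on('error', function(err) {
    console.error(err.stack);
  });
}

request();
```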
👏 🐨 💥 That's some quality debugging; kudos to everyone involved!
```js
  agent: false
};

https.get(options, function(res) {
```
Does this problem manifest itself when connecting to a node.js server? If not, what is the difference?
No. Just connecting and sending data, without doing anything fancy, is not enough to reproduce this issue. We'll need to reconstruct what happens in the communication with google, likely down to timing and such.
@wraithan @chrisdickinson great work on this! Is it that the state machine isn't properly transitioning through
So since the other side is still reading/readable, we're kicking off another read. This is not how I want to fix it, but really we need to come back through this path again:

```diff
   // Try to read just to get sure that we won't miss EOF
-  if (this._opposite.readable) this._opposite.read(0);
+  if (this._opposite.readable) {
+    debug('kicking off opposite read');
+    this._opposite.read(0);
+    var self = this;
+    process.nextTick(function() {
+      onCryptoStreamFinish.apply(self);
+    });
+  }
```

This mostly is just working by accident as well, really the

Maybe we could do something like:

```diff
-    if (this === this.pair.cleartext) this._opposite._done();
+    if (this === this.pair.cleartext) {
+      debug('calling other done');
+      this._opposite._done();
+    }
+  } else {
+    debug('opposite has not ended');
+    var self = this;
+    this._opposite.once('end', onCryptoStreamFinish);
```
@tjfontaine I'd rather investigate the cause of the issue first...
So, stream-wise: yes, @tjfontaine you got it in one with the ClearTextStream hitting this check in onCryptoStreamFinish before ending, and thus missing the chance to call `_done()`.

Re: reproducibility: we attempted to reproduce with a node tls server sending the exact same content as the google server. It was nearing the end of the day, and I think we were both a bit frazzled, so I'm not sure if we tried sending the appropriate data, so that might be a good course of inquiry to rule out timing issues / TCP issues.
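To make "this check" easier to see, here is a paraphrased sketch of the tail of onCryptoStreamFinish from 0.10's lib/tls.js, reconstructed from the diff context quoted above rather than copied verbatim from the source:

```js
// Paraphrased/simplified from the diff context above; not verbatim lib/tls.js.
function onCryptoStreamFinish() {
  // ...shutdown handling for the cleartext side elided...

  // Try to read just to be sure that we won't miss EOF.
  if (this._opposite.readable) this._opposite.read(0);

  // The race described above: the ClearTextStream's writable side can emit
  // 'finish' (landing us here) before either stream has emitted 'end', so
  // this._opposite._ended is still false, _done() is never reached, and the
  // pair plus its underlying SSL object are leaked.
  if (this._opposite._ended) {
    this._done();

    // No half-close: tear down the other side as well.
    if (this === this.pair.cleartext) this._opposite._done();
  }
}
```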
Any update on this?
@indutny any idea if the fixes to io.js caught this?
It doesn't really apply much to io.js
I talked to Fedor on IRC and this is at most a remote possibility on io.js, and only if you force it into some legacy TLS mode.
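For context, "legacy TLS mode" presumably refers to the old SecurePair/CryptoStream API that io.js kept around for compatibility after moving tls.connect() onto TLSSocket. A hedged sketch of that path, assuming tls.createSecurePair is what is meant:

```js
// Hedged illustration of the legacy SecurePair/CryptoStream path; the normal
// tls.connect()/https code paths in io.js and node >= 4 no longer go through it.
var net = require('net');
var tls = require('tls');

var tcp = net.connect(443, 'example.org', function() {
  var pair = tls.createSecurePair(null, false);  // cleartext/encrypted CryptoStream pair
  tcp.pipe(pair.encrypted);                      // ciphertext travels over the TCP socket
  pair.encrypted.pipe(tcp);

  pair.on('secure', function() {                 // handshake finished
    pair.cleartext.write('GET / HTTP/1.0\r\nHost: example.org\r\n\r\n');
  });
  pair.cleartext.on('data', function(chunk) {
    process.stdout.write(chunk);
  });
});
```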
👍
Can you please review the test case I developed and uploaded in this link? I am able to see leaks in the https module without external links and believe this is a true re-creation of the reported issue.

200+ Buffer objects holding 32 MB of memory, plus IncomingMessage, Redirect, BufferList, ClientRequest, Multipart objects; 5 each of these are leaked in one iteration of the test. (I ran with v0.10 on Linux IA32)

One of the 10MB Buffer objects is held through the retainer chain below:

while most of the small buffers are retained through:

Thanks.
I made a standalone version of the previous one, detached from the request module, and reproduced the leak. Two snapshots are attached, which are relevant to the above discussion.
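For anyone who wants to repeat that kind of measurement, here is a hedged sketch of the general approach; it is not the linked repository's code, and it assumes the third-party heapdump module plus --expose-gc:

```js
// Hypothetical measurement harness, not the standalone test linked above.
// Take one heap snapshot before and one after a batch of HTTPS requests; if
// the CryptoStream pairs (and the Buffers they retain) leak, the retained
// size keeps growing in the second snapshot no matter how often GC runs.
var https = require('https');
var heapdump = require('heapdump');  // npm install heapdump

function batch(n, done) {
  if (n === 0) return done();
  https.get({ host: 'google.com', port: 443, path: '/', agent: false }, function(res) {
    res.resume();
    res.on('end', function() { batch(n - 1, done); });
  }).on('error', done);
}

heapdump.writeSnapshot('before.heapsnapshot', function() {
  batch(100, function(err) {
    if (err) throw err;
    if (global.gc) global.gc();      // run with --expose-gc so GC can be forced
    heapdump.writeSnapshot('after.heapsnapshot', function() {
      console.log('snapshots written; diff the retainers of the leaked Buffers');
    });
  });
});
```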
@indutny @Fishrock123 ... any updates on this one?
@jasnell Virtually doesn't exist in io.js / node 4.0.0
Yep, that I know. Should we try to fix in v0.10 and v0.12? Do we know what change in nodejs/node fixed it?
@indutny .. what do you think then? I suspect fixing the root cause of the issue in v0.10 would be much more than we'd want to do here. So is this PR enough to bandaid over the issue or do we just not worry about this one and close?
@jasnell I still have a very shallow understanding of this patch, because it doesn't have a proper test.
Ok. I'm generally leaning towards closing this without further action.
@jasnell This is an issue that is quite prevalent throughout our system and I'm certain others are having this problem as well. Given our number of dependencies it's not trivial to move up versions of Node (we're on v0.12.*). This New Relic forum post shows 5.2K views & 45 responses. Another post has 3.4K views & 27 replies, so it's certainly a well known issue within the New Relic community.
@AndrewHuffman if you can nail down a test, that'd be your best bet if you want to try to get this into 0.12. Really though, 0.12 is not so great in ways other than this and you should really just upgrade to 4.0.0 asap.
@Fishrock123 Was what @gireeshpunathil provided here not that?
I count 17 dependencies that specify specific engine versions. As much as I'd like to upgrade ASAP it's not trivial. I'm just asking for some due diligence on this defect.
@AndrewHuffman I had huge problems with memory leaks on 0.12.x, very likely due to the https issue mentioned here and elsewhere. Our app would exhaust memory on the server (2-3gb) within a day or two, and run at 90-100% CPU while doing it. Switching to 4.0.0 allows the exact same application to run with 200-300mb of memory and 10-15% CPU. My app is modest in complexity, but I have 46 dependencies and not a single one had an issue with the upgrade, regardless of the engine specified. I'd try to work towards using node 4.0.0 ASAP if I were you, it's significantly improved.
@ehacke I definitely will, just trying to mitigate our current risks. Glad to hear 4.0.0 solves many of these memory issues.
Can definitely understand the pain. Unfortunately I'm going to be pretty tied up for the next two weeks but when I'm back from traveling I'll take another look to see what, if anything, can be done in v0.12.x on this. As @Fishrock123 mentioned, however, the best fix long term is to move to v4.x so I'd highly recommend starting the wheels turning on that.
@indutny could you take a look at @gireeshpunathil's https://github.com/gireeshpunathil/nodev0.10httpsmemoryleak/tree/standalone?
Thanks guys! I'm excited to move to 4.0.0 as soon as we can push our deps to upgrade ;)
@jasnell Any updates?
That looks like nodejs/node#1522, which is already fixed by nodejs/node#1529 (since v1.8.2). @wraithan @gireeshpunathil @AndrewHuffman
@ChALkeR I'm unable to confirm the bug on newer versions, and with Node 4.x we are seeing (for the first time since 0.8) an official node release that doesn't appear to leak memory when using SSL. But 0.10 and 0.12 both leak memory super hard when you use custom certs, so much so that we just disabled our bundle on those. The 1.8.2 fix wouldn't be appropriate to directly backport though, because TLS had been heavily rewritten by that point in the io.js fork, unless I'm mistaken.

My concern is from the position of a library author with users who are still on 0.10 and 0.12. Regardless of whether a newer major fixed it, we have users that only upgrade patch releases of 0.x currently due to the cost of upgrading. We will be making an effort to help educate our users and get them moved to 4.x, but as all library authors (especially those whose business is to make that library) will tell you, it is hard to get users to upgrade (even just patch releases, and even using scary words like denial of service sometimes!)

Should this be reported on nodejs/node if it is a 0.10 and 0.12 issue, or is there another place to put things that are more LTS related?
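To make the custom-cert angle concrete: on 0.10/0.12, each new connection made with certificate options builds its own credentials/SecureContext, so a leaked CryptoStream pair can keep copies of that material alive as well. A hedged sketch of the usage pattern (the endpoint and PEM file name are placeholders, not New Relic's actual setup):

```js
// Hypothetical example of the custom-cert pattern discussed above; the host
// and 'custom-ca.pem' are placeholders.
var fs = require('fs');
var https = require('https');

var options = {
  host: 'collector.example.com',
  port: 443,
  path: '/agent',
  ca: [fs.readFileSync('custom-ca.pem')],  // custom certificate bundle
  agent: false                             // no pooling: a fresh CryptoStream pair per request
};

https.get(options, function(res) {
  res.resume();
}).on('error', function(err) {
  console.error(err.stack);
});

// On 0.10/0.12, if the pair behind this request is never _done()'d, the
// credentials holding a copy of custom-ca.pem stay reachable as well.
```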
@wraithan As per nodejs/node#2813 (comment), I guess that the fixes are not going to be backported to the 0.12 branch.
My co-worker @hayes and I spent the better part of last week hunting this bug. Today we presented it to @chrisdickinson and he helped us further debug the issue and assisted in writing the patch/test.
Digging into the issue: the node-newrelic module has had a very small leak for a long time. In the core dumps we'd find many copies of our custom certs that we use for outbound connections to new relic servers. A bug was fixed in yakaa, one of our dependencies, which should have removed its ability to hold onto dead sockets, and I confirmed that it was indeed not keeping any references to those sockets.
Further digging revealed that the `CryptoStream` pair that is returned from `tls.connect()` was never being garbage collected, and the `Connection` (from src/node_crypto) was never freeing its `ssl_` property, nor was its destructor ever called. After spending many hours tracing all the states that the duplex stream pair could be in and how this could happen, we found that the writable side of the `ClearTextStream` was emitting `'finish'` before either of the streams emitted `'end'`. This led to `_done` not being called on the `EncryptedStream`, which means that the pair would be leaked and the underlying SSL object would never be freed.

We tried to create a "simple" test but could not reproduce against the node tls server nor the node https server. We could, however, reliably reproduce the issue hitting https://google.com/, so we put the test in "internet" but hope that someone else can make a "simple" regression test.
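A hedged way to watch that ordering on a 0.10 box; property names such as socket.pair come from the legacy lib/tls.js internals discussed in this thread, and this is illustrative only, not the regression test:

```js
// Hypothetical instrumentation for observing 'finish' vs. 'end' ordering on
// the legacy CryptoStream pair (0.10/0.12 only; TLSSocket has no .pair).
var tls = require('tls');

var socket = tls.connect({ host: 'google.com', port: 443 }, function() {
  socket.end('GET / HTTP/1.0\r\nHost: google.com\r\n\r\n');  // write, then finish our side
});

function trace(label, stream) {
  ['finish', 'end', 'close'].forEach(function(name) {
    stream.on(name, function() { console.log(label, name); });
  });
}

trace('cleartext', socket);               // tls.connect() returns the ClearTextStream here
trace('encrypted', socket.pair.encrypted);

socket.on('error', function(err) { console.error(err.stack); });
socket.resume();                          // keep data flowing so 'end' has a chance to fire
```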
r=@indutny