-
Notifications
You must be signed in to change notification settings - Fork 30k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ALPN callback function sometimes leads to segfault in node.js >= 18.13.0 #47207
Comments
And another one few second ago:
|
It's 2AM and I finally extracted core dump from kubernetes. Does not tell much because I need debug symbols for node 18.15.0
and
|
A |
|
ALPN callback
|
OpenSSL:
|
I can't be 100% sure because the data is buried deeper in the stack than I anticipated (isn't it always like that?) but it looks like the ALPN string from the ClientHello packet is zero-sized. SSL_select_next_proto() selects the first item from the client's string if there's no matching entry in the server's string but yeah, that won't work if the client's string is empty. Do you have the opportunity to try out a patch locally? Does this one-liner fix it? diff --git a/src/crypto/crypto_tls.cc b/src/crypto/crypto_tls.cc
index f14adec767a..fea3186c17e 100644
--- a/src/crypto/crypto_tls.cc
+++ b/src/crypto/crypto_tls.cc
@@ -225,6 +225,8 @@ int SelectALPNCallback(
const unsigned char* in,
unsigned int inlen,
void* arg) {
+ if (inlen == 0) return SSL_TLSEXT_ERR_ALERT_FATAL;
+
TLSWrap* w = static_cast<TLSWrap*>(arg);
const std::vector<unsigned char>& alpn_protos = w->alpn_protos_; |
No, but I'll try to patch openssl locally to try to reproduce this error on my dev end. If it is true it looks like a possibility DoS attack on node.js |
Found another one, tls client:
|
inlen equal to zero does not cause any issues, the problem is somewhere in out variable I think. Instead of prod I'm trying to reproduce it in this simple program: #include <iostream>
#include <openssl/ssl.h>
using namespace std;
int main() {
unsigned char *out;
unsigned char outlen;
const char *server = "\x08http/1.1";
const char *client = "\x08http/1.1";
cout << strlen(server) << endl;
int status = SSL_select_next_proto(&out, &outlen, (const unsigned char*)server, strlen(server), (const unsigned char*)client, strlen(client));
cout << status << endl;
} UPD: Setting |
What was it set to before? Which createServer method are you using (net/http/https/http2)? |
it is https.createServer(), default value is |
FWIW, I've not been able to reproduce the crash (so far at least.) I tried crafting a ClientHello with an empty ALPN extension record but I get back a handshake_failure alert and the connection is subsequently closed. No crash. |
I tried various combinations of invalid ALPN extensions last week and also wasn't able to reproduce a crash on the server side. Is there any chance you might be able to share a traffic dump of just the TLS handshake that causes the server to crash @gugu? |
It required me to deploy crashing version of the app to production. If I can get debug symbols of the version released on nodejs.org - I can get much more data. But if it is not possible - then I'll deploy. One more thing from the available debug symbols of libuv:
|
Just a random observation but this:
Means the buffer contains a handshake frame that's between 512 and 767 bytes big. That's about average for a ClientHello. (That \002 is the upper octet of a 16 bits big-endian value so at best it's 2*256 + 255 bytes big.) If you can get me those bytes, I can probably figure out what's wrong. |
I forgot about network byte order, now things looking much better. ALPN is completely ok. Here is what I've described so far:
my decoding journey continues |
Here is a full message. I tried to repeat it locally in a single thread in a loop 1M iterations, nothing fails. I'll try running in parallel |
Just a quick note but this:
Isn't right. 0017 0000 is the extended master secret extension, ff01 0001 00 is Renegotiation Indication with a (expected for a new connection) empty payload, and then 000a is EC Supported Groups. |
I've checked 3 dumps, all of them have perfectly valid ALPN record. My guess now it is some use after free in ALPN callback |
One more note: %rax points to incorrect memory address.
Fails always this line:
PS: I see |
@bnoordhuis yep, reverting to version before bf17f8d fixes the problem (18.12.0) |
It seems really unlikely to me that bf17f8d is the culprit but maybe it exposes a preexisting condition. Apropros that diff, can you explain why you think it makes a difference? |
It makes a difference according to my production as well, 18.12.0 has 0 core dumps since yesterday on 10 servers. Default ALPN produced core dump around once per hour, empty ALPN - once per few hours.
I think it can be either GC running in a wrong time or client closing connection immediately after sending client hello. My service is a link shortener and we have tons of short living requests |
But why specifically do you think retaining a reference to |
Okay, I can not prove 100% it is this line. Do you have config params you use to build nodejs.org/download/... builds? I'll try to create builder docker container and try to remove this line |
@bnoordhuis : My knowledge of openSSL internals is a bit rusty, but I believe the problem here was introduced with bf17f8d. In the new code, the TLSWrap pointer is stored as a fixed arg on the SSL_CTX object, which IIUC, is shared by multiple TLSWrap objects. (So, if the last one to update it goes away, bad things happen.) Like the other commenters, we have been unable to reproduce this under controlled circumstances, but it happened routinely when we tries to deploy node 18.17.1 in a production environment. |
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: nodejs#47207
Right, that sounds plausible. Can you see if #49635 fixes it for you? I tried writing a regression test but I'm not actually able to trigger the bug. |
Thank you so much for responding quickly. As I said, I've been unable to repro this under test conditions. But maybe now that we know what's happening, it may be possible to contrive a situation to reproduce it... In terms of a more typical regression test... I think it's possible to drive this machinery from the JS side of things without using any actual sockets/networking... it might be possible to do it that way, but honestly, I'm not familiar enough with the code to know how to do that. |
My test is basically: start two servers with different ALPN callback but sharing the same SecureContext through You'd expect to end up in the second one's ALPN callback but that doesn't happen (or vice versa.) |
I'm still a bit fuzzy on how all this works, but doesn't the callback arg in the SecureContext get updated whenever a connection comes in? |
Could be. I tried stress testing but that didn't change anything. |
I still haven't been able to reproduce the issue. After reading the code more carefully, I don't believe the SecureContext objects should be reused in our application. (It's a pretty basic websocket server, as far as this stuff goes.) |
@bnoordhuis : I can confirm that if I do just a basic server like:
And have 2 client connections do the following: Run this with gdb attached, and break on node::crypto::(anonymous namespace)::SelectALPNCallback. |
And IIUC, that agrees with the theory... the TLSWrap objects will share an SSL_CTX... it's the one associated with |
@LewisWolfgang that looks plausible, but it's probably too non-determinate to turn into a reliable regression test (not that I mind being proven wrong) |
Yeah. In an ideal world, objects (TLSWrap) associated with a "child" SSL object shouldn't be making changes to a "parent", shared SSL_CTX object. But as long as all the child objects are making the same change (just a pointer to the class callback), things are fine, and it was working this way for a long time. (Writing a pointer to a specific child instance into the parent has the bad consequences that one might expect...) |
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: nodejs#47207
We can confirm that this issue still exists on node v18.18.0 and node v20.8.0. We can also confirm that setting ALPNProtocols: [] does NOT resolve the issue. About 1 of 2M HTTP requests in a representative production environment crashes the node.js process. Our production workload includes HTTP connection from almost all active browser versions. We do not believe this are malicious attempts as the failure rate follows seasonal patterns so its just a function of how many connections are made overall. |
We can confirm that the fix resolves this crash in node 20.8.0. |
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: #47207 PR-URL: #49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: #47207 PR-URL: #49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: nodejs#47207 PR-URL: nodejs#49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: #47207 PR-URL: #49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
For posterity: the fix hasn't been released in v20.x yet. v20.10.0 is about to be released and contains the fix. |
@bnoordhuis are there plans to backport this to 18.x? |
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: nodejs#47207 PR-URL: nodejs#49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: nodejs/node#47207 PR-URL: nodejs/node#49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
Retrieve the TLSWrap from the SSL object, not SSL_CTX. A SSL_CTX object is the parent of zero or more SSL objects. TLSWrap is a wrapper around SSL, SecureContext around SSL_CTX. Node.js normally uses a SecureContext per TLSWrap but it is possible to use a SecureContext object more than once. It was therefore possible for an ALPN callback to use the wrong (possibly already freed) TLSWrap object. Having said that, while the bug is clear once you see it, I'm not able to trigger it (and hence no test, not for lack of trying.) None of the bug reporters were able to reliably reproduce it either so the stars probably need to align just right in order to hit it. Fixes: nodejs/node#47207 PR-URL: nodejs/node#49635 Reviewed-By: Luigi Pinca <luigipinca@gmail.com> Reviewed-By: Minwoo Jung <nodecorelab@gmail.com>
We are using v20.11.0. The error is still present. |
I confirm this bug is present with Node 18.18.2 and can be reproduced with 60% probability on the production system where there is quite a heavy load. I cannot reproduce it at all on the pre-production system which is almost identical. |
Version
18.15.0
Platform
Linux 2a53a1799e0b 5.15.0-67-generic #74-Ubuntu SMP Wed Feb 22 14:14:39 UTC 2023 x86_64 GNU/Linux
Subsystem
tls
What steps will reproduce the bug?
I see this from logs of the my node.js server. I did not find a way to reproduce yet, and need some help with that. I've attached stacktrace, but did not yet found ALPN header value, which causes this error. According to my investigations, function
SelectALPNCallback
callsSSL_select_next_proto
with NULL instead of correct pointer. Looks like some maliciously crafted ALPN header can lead to such errorHow often does it reproduce? Is there a required condition?
It is a rare case I capture from logs (around 1 req/million). I can add some code to get more information about the bug, but don't know what to do
What is the expected behavior? Why is that the expected behavior?
Do not produce segfault
What do you see instead?
Segfoult with stacktrace:
Additional information
No response
The text was updated successfully, but these errors were encountered: