test-worker-cleanexit-with-moduleload crashing on multiple platforms in CI #39036
Comments
Still occurring
also @nodejs/workers
Although I guess this isn't exactly a worker threads test. From Gireesh's commit message introducing the test:
I can replicate locally (on macOS) by just repeating the test many times:
tools/test.py --repeat=1024 test/parallel/test-worker-cleanexit-with-moduleload.js
=== release test-worker-cleanexit-with-moduleload ===
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
=== release test-worker-cleanexit-with-moduleload ===
Path: parallel/test-worker-cleanexit-with-moduleload
Command: out/Release/node /Users/trott/io.js/test/parallel/test-worker-cleanexit-with-moduleload.js
--- CRASHED (Signal: 11) ---
[08:10|% 100|+ 1019|- 5]: Done
$
Here's the information about the crashed thread from the diagnostic log:
Reproduced on a Linux bare metal machine with the latest. Here is my crash scenario:
(gdb) where
#0 0x00000000010ae5c3 in v8::internal::JSReceiver::GetCreationContext() ()
#1 0x0000000000d155c8 in v8::Object::GetCreationContext() ()
#2 0x0000000000b6cab3 in node::worker::MessagePort::OnMessage(node::worker::MessagePort::MessageProcessingMode) ()
#3 0x00000000015b37f6 in uv__async_io (loop=0x7f75d3ffea98,
w=<optimized out>, events=<optimized out>)
at ../deps/uv/src/unix/async.c:163
#4 0x00000000015c6576 in uv__io_poll (loop=loop@entry=0x7f75d3ffea98,
timeout=<optimized out>) at ../deps/uv/src/unix/linux-core.c:462
#5 0x00000000015b4128 in uv_run (loop=0x7f75d3ffea98, mode=UV_RUN_ONCE)
at ../deps/uv/src/unix/core.c:385
#6 0x0000000000aa9550 in node::Environment::CleanupHandles() ()
#7 0x0000000000ab67cb in node::Environment::RunCleanup() ()
#8 0x0000000000a668e9 in node::FreeEnvironment(node::Environment*) ()
#9 0x0000000000be8e91 in node::worker::Worker::Run() ()
#10 0x0000000000be93d8 in node::worker::Worker::StartThread(v8::FunctionCallbackInfo<v8::Value> const&)::{lambda(void*)#1}::_FUN(void*) ()
#11 0x00007f75f438814a in start_thread () from /lib64/libpthread.so.0
#12 0x00007f75f40b7dc3 in clone () from /lib64/libc.so.6
(gdb)
From this and the one which @Trott captured earlier, it is evident that we are shutting down the worker (`FreeEnvironment()` and `RunCleanup()` are on the stack) while a `MessagePort` message callback is still being dispatched.
@gireeshpunathil Can you figure out why it is crashing in `GetCreationContext()`?
That would be surprising, yes, but I’m not sure if we should really consider it illegal (unless the …):
node/deps/v8/src/objects/js-objects.cc, line 543 (at 82b44f4)
@addaleax - the dereferencing is what is crashing, because the …
sorry, I may be wrong here too; let me dig a little more.
(gdb) x/2i 0x10ae5c0
// JSReceiver receiver = *this;
0x10ae5c0 <v8::internal::JSReceiver::GetCreationContext>: mov rdx,QWORD PTR [rdi]
// Get the vft for the `receiver` object, crash
=> 0x10ae5c3 <v8::internal::JSReceiver::GetCreationContext+3>: mov rcx,QWORD PTR [rdx-0x1]
so it is not the receiver pointer (rdi) itself that is inaccessible:
(gdb) i r rdi
rdi 0x7fcae6bf5f48 140509431422792
(gdb) x/2x 0x7fcae6bf5f48
0x7fcae6bf5f48: 0x00070001 0x00050004
(gdb) x/w 0x0005000400070000
0x5000400070000: Cannot access memory at address 0x5000400070000
@addaleax - does it throw any hint?
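For readers following the disassembly: V8 stores heap-object pointers with a tag of 1 in the low bit, so reading an object's first field (its map) means dereferencing the tagged pointer minus one. A minimal sketch of what the two instructions do, using hypothetical names (an illustration, not V8 source):

#include <cstdint>

// kHeapObjectTag models V8's low-bit tagging of heap-object pointers.
constexpr uintptr_t kHeapObjectTag = 1;

uintptr_t LoadMapWord(const uintptr_t* receiver_slot) {
  uintptr_t tagged = *receiver_slot;  // mov rdx, QWORD PTR [rdi]
  const uintptr_t* fields =
      reinterpret_cast<const uintptr_t*>(tagged - kHeapObjectTag);
  return fields[0];                   // mov rcx, QWORD PTR [rdx-0x1]
}

The gdb session shows that the first load succeeds (rdi points to readable memory) but yields the garbage value 0x0005000400070000, so it is the second load, through that garbage value, that faults. In other words, the slot holding the tagged pointer is still accessible, but its contents look like freed or reused memory.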
@gireeshpunathil Well – it probably means that the JS object associated with the `MessagePort` is no longer valid, i.e. the memory we are reading it from has been freed or reused.
If you are referring to … – or are you thinking that this is not a worker-specific issue, but a wider issue with libuv itself?
I think this is expected.
I don’t understand this question, tbh. Which race? Which primitives?
Again, not sure what “such a check” or “first async callback” refer to, but for the second part, no, we should rely on the fact that we should not be receiving callbacks from libuv after having fully closed the handle, because if that wasn’t true that would be unsafe behavior on libuv’s end anyway.
Maybe? That’s what my question above was focused on finding out.
Let me explain my understanding of the issue. I am not claiming complete comprehension of it, only trying to make meaning out of the call stack at the failing site. In lines 635 to 636 (at 1bbe66f) we close the handles as part of the worker shutdown.
This makes sure any subsequent message handling will not progress, as we don't send anything if the handles are closed or closing (lines 625 to 628 at bcf73d6).
Then we run uv_run. There, we handle events, potentially on closed / closing handles (due to the asynchrony of the …). Does it sound reasonable to you?
@gireeshpunathil Yes – but it should actually be fine for a message to be received on a closing handle, as long as it’s not fully closed yet.
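For reference, here is a minimal, self-contained sketch of the handle life cycle being distinguished here, using plain libuv with hypothetical names (an illustration of libuv's documented behaviour, not Node code): between uv_close() and the close callback a handle is closing – uv_is_closing() reports it – and its memory must stay valid; only once the close callback has run is it fully closed and safe to free.

#include <assert.h>
#include <uv.h>

static void on_async(uv_async_t* handle) {
  (void) handle;  // message/event processing would happen here
}

static void on_close(uv_handle_t* handle) {
  (void) handle;  // fully closed: only from here on may the memory be released
}

int main() {
  uv_loop_t loop;
  uv_async_t async;
  uv_loop_init(&loop);
  uv_async_init(&loop, &async, on_async);

  assert(!uv_is_closing(reinterpret_cast<uv_handle_t*>(&async)));  // open

  uv_close(reinterpret_cast<uv_handle_t*>(&async), on_close);
  // Closing: uv_close() has been called, on_close has not run yet, and the
  // uv_async_t must remain valid throughout this window.
  assert(uv_is_closing(reinterpret_cast<uv_handle_t*>(&async)));

  uv_run(&loop, UV_RUN_DEFAULT);  // delivers on_close; the handle is now closed
  uv_loop_close(&loop);
  return 0;
}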
diff --git a/deps/uv/src/unix/async.c b/deps/uv/src/unix/async.c
index e1805c3237..24965a3c84 100644
--- a/deps/uv/src/unix/async.c
+++ b/deps/uv/src/unix/async.c
@@ -160,6 +160,7 @@ static void uv__async_io(uv_loop_t* loop, uv__io_t* w, unsigned int events) {
if (h->async_cb == NULL)
continue;
+ assert(!(h->flags & UV_HANDLE_CLOSED));
h->async_cb(h);
}
}
@addaleax - will the above assertion answer your question? If so, no is the answer, as I got core dumps again with the same crash context, passing the assertion. If not, let me know if you have another proposal that I can check.
ok, I even tried this:
diff --git a/deps/uv/src/unix/async.c b/deps/uv/src/unix/async.c
index e1805c3237..3f16b37666 100644
--- a/deps/uv/src/unix/async.c
+++ b/deps/uv/src/unix/async.c
@@ -160,6 +160,8 @@ static void uv__async_io(uv_loop_t* loop, uv__io_t* w, unsigned int events) {
if (h->async_cb == NULL)
continue;
+ assert(!uv_is_closing(h));
h->async_cb(h);
}
}
and still it does not hit the assertion, yet it crashes in the old way.
@gireeshpunathil Yeah, that does answer it – but it’s worrying. How hard is it to reproduce this for you? Would you be able to make the test crash while running under valgrind?
@addaleax - sure. usually it crashes like once in ten runs or so, so not hard at all. will try with valgrind.
@addaleax - I tried doing a bisect, but got inconclusive and contradictory results (mostly because the failure is not deterministic), so I am deeming it untrustworthy and not posting it here. I am taking another approach of unwinding the history of …
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we should not delete the `HandleWrap` outright, but instead close it and then delete it only once the libuv close callback has been called. Based on the valgrind output from the issue below, this has a good chance of fixing it. Fixes: nodejs#39036
@gireeshpunathil Thanks, the valgrind report looks very helpful. In particular, we can tell from the valgrind report that the problem is that a MessagePort object was deleted while it was being transferred to another MessagePort that was unable to deserialize the message as a whole (because the worker was shutting down, I presume). That should not be happening – the transferred MessagePort should be closed, not outright deleted. I’ve opened a potential fix based on this in #39441 (but haven’t confirmed that it actually fixes this issue here). Thank you for the helpful debugging!
@addaleax - that is very nice, thanks! So what you are saying is that a MessagePort should not be deleted outright if the transfer fails; what if the transfer succeeds? Does the new MessagePort created (the deserialized copy) make the old one safe to delete? What causes the stale objects to escape into libuv? I still have a gap in my understanding here; it would be great if you could fill it in.
👍
There’s no copy, that’s the point of transferring (as opposed to cloning). :) If the transfer succeeds, there’s a new `MessagePort` object on the receiving side that takes over the underlying channel.
I’m not sure what you mean by ‘escape’ – the receiving side creates a new `MessagePort` object of its own.
I guess it might help to draw out the timeline of events in the crashing case:
…
ok, thanks @addaleax for the detailed explanation, this is really useful!
When all strong `BaseObjectPtr`s to a `HandleWrap` are gone, we should not delete the `HandleWrap` outright, but instead close it and then delete it only once the libuv close callback has been called. Based on the valgrind output from the issue below, this has a good chance of fixing it. Fixes: #39036 PR-URL: #39441 Reviewed-By: Tobias Nießen <tniessen@tnie.de> Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com> Reviewed-By: Colin Ihrig <cjihrig@gmail.com> Reviewed-By: James M Snell <jasnell@gmail.com>
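A minimal sketch of the pattern that commit describes, with hypothetical names (an illustration of the idea, not the actual #39441 patch): instead of deleting a wrapper while libuv may still hold a pointer to its handle, request a close and delete the wrapper only from the close callback.

#include <uv.h>

// Hypothetical stand-in for a HandleWrap-like object that owns a libuv handle.
class PortWrap {
 public:
  explicit PortWrap(uv_loop_t* loop) {
    uv_async_init(loop, &async_, OnMessage);
    async_.data = this;
  }

  // What the bug amounted to: freeing the object – and the uv_async_t embedded
  // in it – while libuv may still dispatch a queued wakeup to it, so a later
  // async callback reads freed memory (the use-after-free seen under valgrind).
  void DeleteOutright() { delete this; }

  // The pattern from the commit message: close the handle first and free the
  // object only once libuv confirms the close; no further callbacks are
  // delivered for the handle after its close callback has run.
  void CloseAndDelete() {
    uv_close(reinterpret_cast<uv_handle_t*>(&async_), OnClose);
  }

 private:
  ~PortWrap() = default;

  static void OnMessage(uv_async_t* handle) {
    (void) handle;  // message processing; handle->data must still be alive
  }

  static void OnClose(uv_handle_t* handle) {
    delete static_cast<PortWrap*>(handle->data);
  }

  uv_async_t async_;
};

With teardown done via CloseAndDelete() and the loop run until the close callback has fired, a wakeup that is already queued can no longer be dispatched against freed memory.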
https://ci.nodejs.org/job/node-test-binary-arm-12+/11278/RUN_SUBSET=0,label=pi2-docker/console
https://ci.nodejs.org/job/node-test-commit-linux/41808/nodes=debian10-x64/console