Investigate flaky test-cli-node-options #25028
And again. Probably time to mark this as flaky. Will open a PR.
test-packetnet-ubuntu1604-arm64-2
https://ci.nodejs.org/job/node-test-commit-arm/20790/nodes=ubuntu1604-arm64/consoleText
not ok 175 parallel/test-cli-node-options
---
duration_ms: 0.986
severity: fail
exitcode: 1
stack: |-
assert.js:753
throw newErr;
^
AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
#FailureMessage Object: 0xffffcb5fc4d8
at ChildProcess.exithandler (child_process.js:294:12)
at ChildProcess.emit (events.js:189:13)
at maybeClose (internal/child_process.js:978:16)
at Socket.stream.socket.on (internal/child_process.js:396:11)
at Socket.emit (events.js:189:13)
at Pipe._handle.close (net.js:612:12)
... |
Refs: nodejs#25028
PR-URL: nodejs#25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
Easily recreated if you run this a hundred times on FreeBSD:
(gdb) where
#0 0x00000000015a8eb2 in v8::base::OS::Abort ()
#1 0x0000000000f79516 in v8::internal::PerfBasicLogger::PerfBasicLogger ()
#2 0x0000000000f7fb92 in v8::internal::Logger::SetUp ()
#3 0x0000000000f5b25c in v8::internal::Isolate::Init ()
#4 0x000000000117dd3f in v8::internal::Snapshot::Initialize ()
#5 0x0000000000ace03b in v8::Isolate::Initialize ()
#6 0x00000000008fba32 in node::NewIsolate ()
#7 0x00000000008fdc88 in node::Start ()
#8 0x00000000008fbfdb in node::Start ()
#9 0x00000000008ae095 in _start ()
#10 0x00000008022f0000 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)
This corresponds to: Lines 294 to 296 in a6f69eb
truss output showed this:
So this is basically a temporary file name collision, unlikely to happen in production. One of:
|
Am I reading this correctly: it's PID reuse that is causing the tmpfile name collision? Is it possible to configure FreeBSD to use a larger PID space so the reuse doesn't occur during our test runs? And does this point to a problem with the tempfile name generation? |
@sam-github - yes, this is an issue with PIDs in a small range being reused often. When those PIDs are composed into file names that are used by different users, permission issues arise and temp file creation fails. I don't know the admin command to change the PID range, and would not have permission to try it even if I did, though I guess widening the PID range by even one digit should resolve the collisions. |
on a side note: though I recreated this in |
Perhaps the test should use the env to set the temp directory to per-user location, like Though I'm a bit confused, are multiple users running the node tests at the same time? If there are multiple parallel test runs by the same user, permission problems shouldn't occur, though other conflicts could. |
I don't know; let us ask @nodejs/testing and @Trott |
for the first one (setting |
for this |
I believe FreeBSD uses random PIDs which increases the likelihood of reuse. I was hoping
if (!common.isWindows) {
  process.env.TMPDIR = '/dev/null';
  expect('--perf-basic-prof', 'B\n');
}
(Reading more carefully now, I see @gireeshpunathil already tried that too.) @nodejs/v8 Is it possible to have the |
It looks to be hardcoded :( Line 280 in b2f74f7
|
IIRC, the Linux perf tool (for which this file is generated) looks specifically for files with that name. IOW, the hardcoded pattern is probably necessary. Do we need to test |
Failure due to this should never happen if the test is always run with the same user. This might be best as a wontfix. We might be overthinking this. |
test-packetnet-ubuntu1604-arm64-2
assert.js:768
throw newErr;
^
AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
#FailureMessage Object: 0xffffc0d5e708
at ChildProcess.exithandler (child_process.js:295:12)
at ChildProcess.emit (events.js:193:13)
at maybeClose (internal/child_process.js:1000:16)
at Socket.<anonymous> (internal/child_process.js:405:11)
at Socket.emit (events.js:193:13)
at Pipe.<anonymous> (net.js:593:12) |
I see one of these two options as the way out:
I favor the second option, but would like to know what others think. |
https://ci.nodejs.org/job/node-test-commit-arm/23304/nodes=ubuntu1604-arm64/consoleFull
test-packetnet-ubuntu1604-arm64-2
00:07:41 not ok 192 parallel/test-cli-node-options # TODO : Fix flaky test
00:07:41 ---
00:07:41 duration_ms: 1.193
00:07:41 severity: flaky
00:07:41 exitcode: 1
00:07:41 stack: |-
00:07:41 assert.js:769
00:07:41 throw newErr;
00:07:41 ^
00:07:41
00:07:41 AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
00:07:41
00:07:41
00:07:41 #
00:07:41 # Fatal error in , line 0
00:07:41 # Check failed: (perf_output_handle_) != nullptr.
00:07:41 #
00:07:41 #
00:07:41 #
00:07:41 #FailureMessage Object: 0xffffe5e1b4d8
00:07:41 at ChildProcess.exithandler (child_process.js:298:12)
00:07:41 at ChildProcess.emit (events.js:194:13)
00:07:41 at maybeClose (internal/child_process.js:1000:16)
00:07:41 at Socket.<anonymous> (internal/child_process.js:405:11)
00:07:41 at Socket.emit (events.js:194:13)
00:07:41 at Pipe.<anonymous> (net.js:593:12)
00:07:41 ... |
The test launches more than 20 asynchronous Node.js processes. That seems like it may be too many for under-powered machines (like Raspberry Pi). I wonder if a sufficient fix is to either split it up into 2 or more separate test files that each launch only 4-10 processes asynchronously, or else simply alter the test to use |
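For illustration only, here is a minimal sketch of the "launch one process at a time" idea. It is not the actual test: `runSequentially` is a hypothetical helper, and `execFileSync` from `child_process` is just one way to make each child finish before the next one starts.

```javascript
'use strict';
const { execFileSync } = require('child_process');

// Hypothetical helper (not part of the real test): run each argument list
// in its own Node.js child process, strictly one after another, and
// collect what each child printed to stdout.
function runSequentially(argLists) {
  return argLists.map((args) =>
    execFileSync(process.execPath, args, { encoding: 'utf8' }));
}

// At most one child process exists at any moment.
const outputs = runSequentially([
  ['-e', 'console.log("A")'],
  ['-e', 'console.log("B")'],
]);
console.log(outputs);
```

The trade-off is a longer wall-clock time for the test, in exchange for never having more than one child alive at once.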
The test launches over 20 processes asynchronously. That may be too many for underpowered machines in CI with limited PID space. Fixes: nodejs#25028
https://ci.nodejs.org/job/node-test-commit-arm/23354/nodes=ubuntu1604-arm64/consoleFull
test-packetnet-ubuntu1604-arm64-2
00:06:58 not ok 187 parallel/test-cli-node-options # TODO : Fix flaky test
00:06:58 ---
00:06:58 duration_ms: 1.163
00:06:58 severity: flaky
00:06:58 exitcode: 1
00:06:58 stack: |-
00:06:58 assert.js:769
00:06:58 throw newErr;
00:06:58 ^
00:06:58
00:06:58 AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
00:06:58
00:06:58
00:06:58 #
00:06:58 # Fatal error in , line 0
00:06:58 # Check failed: (perf_output_handle_) != nullptr.
00:06:58 #
00:06:58 #
00:06:58 #
00:06:58 #FailureMessage Object: 0xffffc1535638
00:06:58 at ChildProcess.exithandler (child_process.js:298:12)
00:06:58 at ChildProcess.emit (events.js:194:13)
00:06:58 at maybeClose (internal/child_process.js:998:16)
00:06:58 at Socket.<anonymous> (internal/child_process.js:403:11)
00:06:58 at Socket.emit (events.js:194:13)
00:06:58 at Pipe.<anonymous> (net.js:593:12)
00:06:58 ... |
@Trott This seems like it may or may not be a V8 bug. Is there any machine on which this reproduces often enough that one could debug it? |
@addaleax test-packetnet-ubuntu1604-arm64-2 is the one it's happened on the last few times so that might be a good candidate. |
Parallelism isn't a problem on these machines, fwiw. @addaleax I've added you to root@147.75.74.174 (nodejs/build#1747). If we're seeing errors limited to a single machine then we might be running into hardware problems; we've been tracking these but have never been able to nail them down firmly (#23913). |
https://ci.nodejs.org/job/node-test-commit-freebsd/25352/nodes=freebsd11-x64/consoleFull
test-digitalocean-freebsd11-x64-1
00:28:53 not ok 242 parallel/test-cli-node-options
00:28:53 ---
00:28:53 duration_ms: 2.676
00:28:53 severity: fail
00:28:53 exitcode: 1
00:28:53 stack: |-
00:28:53 assert.js:769
00:28:53 throw newErr;
00:28:53 ^
00:28:53
00:28:53 AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/out/Release/node -e console.log("B")
00:28:53
00:28:53
00:28:53 #
00:28:53 # Fatal error in , line 0
00:28:53 # Check failed: (perf_output_handle_) != nullptr.
00:28:53 #
00:28:53 #
00:28:53 #
00:28:53 #FailureMessage Object: 0x7fffffffcfe0
00:28:53 at ChildProcess.exithandler (child_process.js:298:12)
00:28:53 at ChildProcess.emit (events.js:194:13)
00:28:53 at maybeClose (internal/child_process.js:998:16)
00:28:53 at Socket.<anonymous> (internal/child_process.js:403:11)
00:28:53 at Socket.emit (events.js:194:13)
00:28:53 at Pipe.<anonymous> (net.js:593:12)
00:28:53 ... |
I'm not saying that parallelism is necessarily the issue with these tests, but the history of our test issues on CI suggests that a high processor count is suggestive of parallelism being the problem, not the other way around. When there are tests that fail when run at the same time as lots of other tests, it tends to show up on machines in CI that have very high processor counts. Counterintuitive at first, but then when you start troubleshooting, it becomes more obvious: The processor count determines how many tests are run at the same time. If you are running 96 tests at once, you are far more likely to hit two tests that have a previously-unrecognized incompatibility. If you run 4 tests at once, this is far less likely. |
Fwiw, I’m currently trying to debug this on the machine itself, and the test also fails rather frequently when it’s being run as a standalone script, without other tests being run at the same time. |
In response to the "Fwiw", I'd say that's worth an awful lot. 😀 |
Oh, yeah, and the results from #26994 sure seem to rule out parallelism being the problem here too. Given that PR, I probably shouldn't have said anything at all. Sorry about the distraction! (EDIT: Although it seemed to feel like the right explanation with the PID re-use thing....) |
Yup, that’s it. All in all, the issue doesn’t appear to be super complicated: When running Node.js with Lines 283 to 299 in 5c2ee4e
If the file already exists and is not writable, the The host to which I have SSH access has a PID range from 0 to 100000, and currently has 52453 files of that format in (I’m deleting all of these files now on that host, just so that the test passes a bit more often. You can re-create the issue locally like this: https://gist.github.com/addaleax/e7d6db099ae194a3f56e473c9d4c49a6.) So:
/cc @nodejs/build @nodejs/testing @nodejs/v8 |
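To put rough numbers on why the test fails so often on that host, here is a back-of-the-envelope sketch. It rests on assumptions not stated in the thread: PIDs are treated as uniformly distributed over the range, every leftover file is treated as unwritable by the test user (so the result is an upper bound), and `collisionProbability` is a hypothetical helper name.

```javascript
'use strict';

// Chance that at least one of `launches` child processes is handed a PID
// whose map file already exists and cannot be opened for writing.
// Assumes PIDs are drawn uniformly from a space of `pidSpace` values.
function collisionProbability(staleFiles, pidSpace, launches) {
  const perLaunch = staleFiles / pidSpace;
  return 1 - Math.pow(1 - perLaunch, launches);
}

// Figures quoted above: 52453 leftover files, PID range 0 to 100000,
// and (assumed) a single launch that writes a map file.
const oneLaunch = collisionProbability(52453, 100000, 1);
console.log(oneLaunch.toFixed(3)); // prints "0.525"

// Same leftover count with a PID space ten times larger.
const widened = collisionProbability(52453, 1000000, 1);
console.log(widened.toFixed(3)); // prints "0.052"
```

Under these assumptions roughly every second run would hit a stale file, which is consistent with the test failing frequently even when run standalone.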
The only circumstance in which tests are run as root in our infra is if they are run manually. We run everything as 'iojs' and it never has sudo access. So I suspect this is a problem with our manual access processes. Maybe we need to make it clear that anyone given manual access needs to run tests as 'iojs' and that the root SSH access is mainly for convenience. Or maybe we need to give only 'iojs' SSH access most of the time instead? |
I think this can be closed because the running-tests-as-root thing isn't something that happens in regular CI runs, but is operator error? Or is that letting ourselves off the hook too easily? |
That was on test-digitalocean-freebsd11-x64-1. For some reason, I am unable to log on to that host to check for root-owned profiling files in |
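A check like that could be scripted roughly as follows. This is a sketch under assumptions: the `perf-<pid>.map` naming pattern is the convention the Linux perf tool looks for rather than something confirmed for every host, and `listPerfMaps`, `isWritable` and `unwritablePerfMaps` are hypothetical helper names. The demo runs against a throwaway directory so that it does not touch real files.

```javascript
'use strict';
const fs = require('fs');
const os = require('os');
const path = require('path');

// Assumed naming scheme for the profiling map files: perf-<pid>.map
const PERF_MAP_RE = /^perf-\d+\.map$/;

// Names of all files in `dir` that look like perf map files.
function listPerfMaps(dir) {
  return fs.readdirSync(dir).filter((name) => PERF_MAP_RE.test(name)).sort();
}

// True if the current user may open `file` for writing.
function isWritable(file) {
  try {
    fs.accessSync(file, fs.constants.W_OK);
    return true;
  } catch {
    return false;
  }
}

// Perf map files in `dir` that the current user cannot overwrite.
function unwritablePerfMaps(dir) {
  return listPerfMaps(dir).filter((name) => !isWritable(path.join(dir, name)));
}

// Demo on a scratch directory with one matching and one unrelated file.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), 'perfmap-demo-'));
fs.writeFileSync(path.join(demoDir, 'perf-123.map'), '');
fs.writeFileSync(path.join(demoDir, 'other.txt'), '');
const found = listPerfMaps(demoDir);
const blocked = unwritablePerfMaps(demoDir);
console.log(found, blocked);
fs.rmSync(demoDir, { recursive: true, force: true });
```

On a CI host one would call `unwritablePerfMaps` on the temp directory while logged in as the test user. Running it as root would report nothing, since root can write to any of these files.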
The test failure is not platform-specific and is the result of manual/human error. Some improvements may be possible, but there is nothing fundamentally unsound about the test insofar as when it fails in CI, there is a problem on the host that needs to be addressed and not an inherent issue with the test. Refs: nodejs#25028 (comment) Closes: nodejs#25028
test-cli-node-options has been failing a lot on arm lately in CI. I assume it's the bug reported in #21383 ("make test: use after free: parallel/test-cli-node-options").
Sample failure: https://ci.nodejs.org/job/node-test-commit-arm/20786/nodes=ubuntu1604-arm64/consoleText
Host: test-packetnet-ubuntu1604-arm64-2