Investigate flaky test-cli-node-options #25028

Closed
Trott opened this issue Dec 14, 2018 · 30 comments
Labels
arm Issues and PRs related to the ARM platform. c++ Issues and PRs that require attention from people who are familiar with C++. flaky-test Issues and PRs related to the tests with unstable failures on the CI.

Comments

@Trott
Member

Trott commented Dec 14, 2018

test-cli-node-options has been failing a lot on arm lately in CI. I assume it's the bug reported in #21383 ("make test: use after free: parallel/test-cli-node-options").

Sample failure: https://ci.nodejs.org/job/node-test-commit-arm/20786/nodes=ubuntu1604-arm64/consoleText

Host: test-packetnet-ubuntu1604-arm64-2

not ok 184 parallel/test-cli-node-options
  ---
  duration_ms: 1.266
  severity: fail
  exitcode: 1
  stack: |-
    assert.js:753
        throw newErr;
        ^
    
    AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
    
    
    #
    # Fatal error in , line 0
    # Check failed: (perf_output_handle_) != nullptr.
    #
    #
    #
    #FailureMessage Object: 0xffffe553e558
        at ChildProcess.exithandler (child_process.js:294:12)
        at ChildProcess.emit (events.js:189:13)
        at maybeClose (internal/child_process.js:978:16)
        at Socket.stream.socket.on (internal/child_process.js:396:11)
        at Socket.emit (events.js:189:13)
        at Pipe._handle.close (net.js:612:12)
  ...
@Trott Trott added c++ Issues and PRs that require attention from people who are familiar with C++. flaky-test Issues and PRs related to the tests with unstable failures on the CI. arm Issues and PRs related to the ARM platform. labels Dec 14, 2018
@Trott
Member Author

Trott commented Dec 14, 2018

And again. Probably time to mark this as flaky. Will open a PR.

test-packetnet-ubuntu1604-arm64-2

https://ci.nodejs.org/job/node-test-commit-arm/20790/nodes=ubuntu1604-arm64/consoleText

not ok 175 parallel/test-cli-node-options
  ---
  duration_ms: 0.986
  severity: fail
  exitcode: 1
  stack: |-
    assert.js:753
        throw newErr;
        ^
    
    AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
    
    
    #
    # Fatal error in , line 0
    # Check failed: (perf_output_handle_) != nullptr.
    #
    #
    #
    #FailureMessage Object: 0xffffcb5fc4d8
        at ChildProcess.exithandler (child_process.js:294:12)
        at ChildProcess.emit (events.js:189:13)
        at maybeClose (internal/child_process.js:978:16)
        at Socket.stream.socket.on (internal/child_process.js:396:11)
        at Socket.emit (events.js:189:13)
        at Pipe._handle.close (net.js:612:12)
  ...

Trott added a commit to Trott/io.js that referenced this issue Dec 14, 2018
Trott added a commit to Trott/io.js that referenced this issue Dec 14, 2018
Refs: nodejs#25028

PR-URL: nodejs#25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
BethGriggs pushed a commit that referenced this issue Dec 18, 2018
Refs: #25028

PR-URL: #25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
@gireeshpunathil
Member

Easily recreated if you run this a hundred times on FreeBSD (a small driver sketch follows at the end of this comment):

node --perf-basic-prof -e 'console.log("B")'

#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
(gdb) where
#0  0x00000000015a8eb2 in v8::base::OS::Abort ()
#1  0x0000000000f79516 in v8::internal::PerfBasicLogger::PerfBasicLogger ()
#2  0x0000000000f7fb92 in v8::internal::Logger::SetUp ()
#3  0x0000000000f5b25c in v8::internal::Isolate::Init ()
#4  0x000000000117dd3f in v8::internal::Snapshot::Initialize ()
#5  0x0000000000ace03b in v8::Isolate::Initialize ()
#6  0x00000000008fba32 in node::NewIsolate ()
#7  0x00000000008fdc88 in node::Start ()
#8  0x00000000008fbfdb in node::Start ()
#9  0x00000000008ae095 in _start ()
#10 0x00000008022f0000 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)      

this corresponds to:

node/deps/v8/src/log.cc

Lines 294 to 296 in a6f69eb

perf_output_handle_ =
    base::OS::FOpen(perf_dump_name.start(), base::OS::LogFileOpenMode);
CHECK_NOT_NULL(perf_output_handle_);

truss output showed this:

24134: open("/tmp/perf-24134.map",O_WRONLY|O_CREAT|O_TRUNC,0666) ERR#13 'Permission denied'


#
# Fatal error in , line 0
# 24134: write(2,"\n\n#\n# Fatal error in , line 0"...,32) = 32 (0x20)
Check failed: (perf_output_handle_) != nullptr.24134: write(2,"Check failed: (perf_output_handl"...,47) =
 47 (0x2f)

#
#
#
#FailureMessage Object: 0x7fffffffdc4024134: write(2,"\n#\n#\n#\n#FailureMessage Objec"...,45) = 45 (0x2d
)
$ l /tmp/perf-24134.map
-rw-r--r--  1 iojs  wheel  68110 May 14  2018 /tmp/perf-24134.map
$ whoami
freebsd
$ 

So this is basically a temp file name collision, unlikely to happen in production. One of the following would address it:

  • clean up the temp folder prior to runs (test / test harness)
  • have only one type of user run things
  • retry on failure such as EACCES (in PerfBasicLogger::PerfBasicLogger)
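
A minimal driver for the manual recreation described at the top of this comment, as a sketch: it repeats the command and stops at the first failure. It only triggers the crash on a host where a stale /tmp/perf-<pid>.map that the current user cannot overwrite already exists for a reused PID; on a clean /tmp it just runs to completion.

'use strict';
// Sketch: repeat `node --perf-basic-prof -e 'console.log("B")'` until a child
// process is assigned a PID whose /tmp/perf-<pid>.map is not writable by this
// user; V8 then aborts on CHECK_NOT_NULL(perf_output_handle_) and
// execFileSync throws.
const { execFileSync } = require('child_process');

for (let i = 0; i < 100; i++) {
  try {
    execFileSync(process.execPath,
                 ['--perf-basic-prof', '-e', 'console.log("B")'],
                 { stdio: 'ignore' });
  } catch (err) {
    console.error('crashed on iteration', i,
                  'status:', err.status, 'signal:', err.signal);
    break;
  }
}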

@sam-github
Contributor

Am I reading this correctly: it's PID reuse that is causing the temp file name collision? Is it possible to configure FreeBSD to use a larger PID space so the reuse doesn't occur during our test runs? And does this point to a problem with the temp file name generation?

@gireeshpunathil
Member

@sam-github - yes, this is an issue with PIDs in a small range that get reused often; when those PIDs are composed into file names used by different users, permission issues arise, leading to temp file creation failure.

I don't know the admin command to change the PID range, and I don't have permission to try it even if I did; though widening the PID range by even one digit should resolve the collision, I guess.

@gireeshpunathil
Member

On a side note: though I recreated this on FreeBSD and the issue is 100% the same as the original report, the original failure was reported on arm. So I am wondering whether fixing it at the host level would be the right thing or not.

@sam-github
Contributor

Perhaps the test should use the environment to set the temp directory to a per-user location, like $PWD/out/Release, to avoid the conflict? Perhaps in tools/test.py?

Though I'm a bit confused, are multiple users running the node tests at the same time? If there are multiple parallel test runs by the same user, permission problems shouldn't occur, though other conflicts could.

@gireeshpunathil
Member

I don't know; let us ask @nodejs/testing and @Trott

@gireeshpunathil
Member

I tried the first one (setting TMPDIR prior to the test), but it does not look like it is honored; the '/tmp' location seems to be assumed in the V8 code.

@gireeshpunathil
Member

For this FreeBSD failure instance: I use the freebsd username whereas the username for CI runs seems to be iojs, and that could have triggered this issue. However, given that the failure is exactly the same in CI runs as well, its root cause can be assumed to be the same; i.e., multiple usernames in CI that conflict when temp file names are composed.

@Trott
Member Author

Trott commented Jan 11, 2019

I believe FreeBSD uses random PIDs, which increases the likelihood of reuse.

I was hoping --perf-basic-prof could be instructed to put the .map file somewhere other than the default (system temp dir) location with the TMPDIR environment variable, but alas, that appears not to be the case. If it were, we could do something like this perhaps:

if (!common.isWindows) {
  process.env.TMPDIR = '/dev/null';
  expect('--perf-basic-prof', 'B\n');
}

(Reading more carefully now, I see @gireeshpunathil already tried that too.)

@nodejs/v8 Is it possible to have the .map file end up someplace else, or is it pretty much always going to end up in /tmp?

@richardlau
Member

It looks to be hardcoded :(

const char PerfBasicLogger::kFilenameFormatString[] = "/tmp/perf-%d.map";

@ofrobots
Contributor

IIRC the Linux perf tool (for which this file is generated) looks specifically for files with that name. IOW, the hardcoded pattern is probably necessary. Do we need to test --perf-basic-prof on FreeBSD?

@Trott
Member Author

Trott commented Jan 11, 2019

Do we need to test --perf-basic-prof on FreeBSD?

Failure due to this should never happen if the test is always run with the same user. This might be best as a wontfix. We might be overthinking this.

refack pushed a commit to refack/node that referenced this issue Jan 14, 2019
Refs: nodejs#25028

PR-URL: nodejs#25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
MylesBorins pushed a commit that referenced this issue Jan 29, 2019
Refs: #25028

PR-URL: #25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
codebytere pushed a commit that referenced this issue Jan 29, 2019
Refs: #25028

PR-URL: #25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
codebytere pushed a commit that referenced this issue Jan 29, 2019
Refs: #25028

PR-URL: #25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
rvagg pushed a commit that referenced this issue Feb 28, 2019
Refs: #25028

PR-URL: #25032
Reviewed-By: Gireesh Punathil <gpunathi@in.ibm.com>
Reviewed-By: Colin Ihrig <cjihrig@gmail.com>
Reviewed-By: Daijiro Wachi <daijiro.wachi@gmail.com>
@Trott
Member Author

Trott commented Mar 22, 2019

https://ci.nodejs.org/job/node-test-commit-arm/23063/nodes=ubuntu1604-arm64/testReport/(root)/test/parallel_test_cli_node_options_/

test-packetnet-ubuntu1604-arm64-2

assert.js:768
    throw newErr;
    ^

AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")


#
# Fatal error in , line 0
# Check failed: (perf_output_handle_) != nullptr.
#
#
#
#FailureMessage Object: 0xffffc0d5e708
    at ChildProcess.exithandler (child_process.js:295:12)
    at ChildProcess.emit (events.js:193:13)
    at maybeClose (internal/child_process.js:1000:16)
    at Socket.<anonymous> (internal/child_process.js:405:11)
    at Socket.emit (events.js:193:13)
    at Pipe.<anonymous> (net.js:593:12)

@gireeshpunathil
Member

I see one of these two options as the way out:

  • arm and FreeBSD systems have a much smaller kern.pid_max than Linux or AIX, which means this test is prone to PID collisions on those systems. Manually increase this value on all the systems: sysctl -w kern.pid_max=122880

  • aligning with what @Trott suggested in Investigate flaky test-cli-node-options #25028 (comment), skip --perf-basic-prof for arm, and preferably FreeBSD too (a sketch follows below).

I favor the second option, but would like to know what others think.
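
A minimal sketch of what the second option could look like inside test-cli-node-options.js, reusing the expect() helper and common.isWindows guard quoted earlier in this thread; the common.isFreeBSD helper and the arm check are assumptions about how the platform guard would be spelled:

// Sketch only: skip the --perf-basic-prof case on platforms where a small
// PID range makes /tmp/perf-<pid>.map collisions likely.
const prone = common.isFreeBSD || ['arm', 'arm64'].includes(process.arch);

if (!common.isWindows && !prone) {
  expect('--perf-basic-prof', 'B\n');
}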

@Trott
Member Author

Trott commented Mar 29, 2019

https://ci.nodejs.org/job/node-test-commit-arm/23304/nodes=ubuntu1604-arm64/consoleFull

test-packetnet-ubuntu1604-arm64-2

00:07:41 not ok 192 parallel/test-cli-node-options # TODO : Fix flaky test
00:07:41   ---
00:07:41   duration_ms: 1.193
00:07:41   severity: flaky
00:07:41   exitcode: 1
00:07:41   stack: |-
00:07:41     assert.js:769
00:07:41         throw newErr;
00:07:41         ^
00:07:41     
00:07:41     AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
00:07:41     
00:07:41     
00:07:41     #
00:07:41     # Fatal error in , line 0
00:07:41     # Check failed: (perf_output_handle_) != nullptr.
00:07:41     #
00:07:41     #
00:07:41     #
00:07:41     #FailureMessage Object: 0xffffe5e1b4d8
00:07:41         at ChildProcess.exithandler (child_process.js:298:12)
00:07:41         at ChildProcess.emit (events.js:194:13)
00:07:41         at maybeClose (internal/child_process.js:1000:16)
00:07:41         at Socket.<anonymous> (internal/child_process.js:405:11)
00:07:41         at Socket.emit (events.js:194:13)
00:07:41         at Pipe.<anonymous> (net.js:593:12)
00:07:41   ...

@Trott
Member Author

Trott commented Mar 29, 2019

The test launches more than 20 asynchronous Node.js processes. That seems like it may be too many for underpowered machines (like Raspberry Pi). I wonder if a sufficient fix is to either split it up into two or more separate test files that each launch only 4-10 processes asynchronously, or else simply alter the test to use execFileSync(). That second approach might make the test simpler, and would of course still test NODE_OPTIONS. Opening a PR soon....
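
A rough sketch of the execFileSync() alternative; the expectSync() helper below is hypothetical, and the real test exercises many more options via NODE_OPTIONS than the single flag shown here:

'use strict';
const assert = require('assert');
const { execFileSync } = require('child_process');

// Hypothetical synchronous variant of the test's helper: run node with the
// option in NODE_OPTIONS and compare the child's stdout with the expected
// output. Failures surface immediately instead of via async callbacks.
function expectSync(opt, expected) {
  const env = { ...process.env, NODE_OPTIONS: opt };
  const out = execFileSync(process.execPath, ['-e', 'console.log("B")'],
                           { env, encoding: 'utf8' });
  assert.strictEqual(out, expected);
}

expectSync('--perf-basic-prof', 'B\n');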

Trott added a commit to Trott/io.js that referenced this issue Mar 29, 2019
The test launches over 20 processes asynchronously. That may be too many
for underpowered machines in CI with limited PID space.

Fixes: nodejs#25028
@Trott
Member Author

Trott commented Mar 31, 2019

https://ci.nodejs.org/job/node-test-commit-arm/23354/nodes=ubuntu1604-arm64/consoleFull

test-packetnet-ubuntu1604-arm64-2

00:06:58 not ok 187 parallel/test-cli-node-options # TODO : Fix flaky test
00:06:58   ---
00:06:58   duration_ms: 1.163
00:06:58   severity: flaky
00:06:58   exitcode: 1
00:06:58   stack: |-
00:06:58     assert.js:769
00:06:58         throw newErr;
00:06:58         ^
00:06:58     
00:06:58     AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /home/iojs/build/workspace/node-test-commit-arm/nodes/ubuntu1604-arm64/out/Release/node -e console.log("B")
00:06:58     
00:06:58     
00:06:58     #
00:06:58     # Fatal error in , line 0
00:06:58     # Check failed: (perf_output_handle_) != nullptr.
00:06:58     #
00:06:58     #
00:06:58     #
00:06:58     #FailureMessage Object: 0xffffc1535638
00:06:58         at ChildProcess.exithandler (child_process.js:298:12)
00:06:58         at ChildProcess.emit (events.js:194:13)
00:06:58         at maybeClose (internal/child_process.js:998:16)
00:06:58         at Socket.<anonymous> (internal/child_process.js:403:11)
00:06:58         at Socket.emit (events.js:194:13)
00:06:58         at Pipe.<anonymous> (net.js:593:12)
00:06:58   ...

@addaleax
Member

@Trott This seems like it may or may not be a V8 bug. Is there any machine on which this reproduces sufficiently often that one could debug it?

@Trott
Member Author

Trott commented Mar 31, 2019

@addaleax test-packetnet-ubuntu1604-arm64-2 is the one it's happened on the last few times so that might be a good candidate.

@rvagg
Member

rvagg commented Apr 1, 2019

$ ssh test-packetnet-ubuntu1604-arm64-2 cat /proc/cpuinfo | grep ^processor | wc -l
96

parallelism isn't a problem on these machines fwiw

@addaleax I've added you to root@147.75.74.174 nodejs/build#1747

If we're seeing errors limited to a single machine, then we might be running into hardware problems; we've been tracking these but have never been able to nail it down firmly: #23913

@Trott
Member Author

Trott commented Apr 3, 2019

https://ci.nodejs.org/job/node-test-commit-freebsd/25352/nodes=freebsd11-x64/consoleFull

test-digitalocean-freebsd11-x64-1

00:28:53 not ok 242 parallel/test-cli-node-options
00:28:53   ---
00:28:53   duration_ms: 2.676
00:28:53   severity: fail
00:28:53   exitcode: 1
00:28:53   stack: |-
00:28:53     assert.js:769
00:28:53         throw newErr;
00:28:53         ^
00:28:53     
00:28:53     AssertionError [ERR_ASSERTION]: ifError got unwanted exception: Command failed: /usr/home/iojs/build/workspace/node-test-commit-freebsd/nodes/freebsd11-x64/out/Release/node -e console.log("B")
00:28:53     
00:28:53     
00:28:53     #
00:28:53     # Fatal error in , line 0
00:28:53     # Check failed: (perf_output_handle_) != nullptr.
00:28:53     #
00:28:53     #
00:28:53     #
00:28:53     #FailureMessage Object: 0x7fffffffcfe0
00:28:53         at ChildProcess.exithandler (child_process.js:298:12)
00:28:53         at ChildProcess.emit (events.js:194:13)
00:28:53         at maybeClose (internal/child_process.js:998:16)
00:28:53         at Socket.<anonymous> (internal/child_process.js:403:11)
00:28:53         at Socket.emit (events.js:194:13)
00:28:53         at Pipe.<anonymous> (net.js:593:12)
00:28:53   ...

@Trott
Member Author

Trott commented Apr 3, 2019

parallelism isn't a problem on these machines fwiw

I'm not saying that parallelism is necessarily the issue with these tests, but the history of our test issues in CI suggests that a high processor count points toward parallelism being the problem, not away from it.

When there are tests that fail when run at the same time as lots of other tests, it tends to show up on machines in CI that have very high processor counts. Counterintuitive at first, but then when you start troubleshooting, it becomes more obvious: The processor count determines how many tests are run at the same time. If you are running 96 tests at once, you are far more likely to hit two tests that have a previously-unrecognized incompatibility. If you run 4 tests at once, this is far less likely.

@addaleax
Member

addaleax commented Apr 3, 2019

Fwiw, I’m currently trying to debug this on the machine itself, and the test also fails rather frequently when it’s being run as a standalone script, without other tests being run at the same time.

@Trott
Member Author

Trott commented Apr 3, 2019

Fwiw, I’m currently trying to debug this on the machine itself, and the test also fails rather frequently when it’s being run as a standalone script, without other tests being run at the same time.

In response to the "Fwiw", I'd say that's worth an awful lot. 😀

@Trott
Member Author

Trott commented Apr 3, 2019

Oh, yeah, and the results from #26994 sure seem to rule out parallelism being the problem here too. Given that PR, I probably shouldn't have said anything at all. Sorry about the distraction! (EDIT: Although it did seem like the right explanation, with the PID re-use thing....)

@addaleax
Member

addaleax commented Apr 3, 2019

Although it seemed to feel like the right explanation with the PID re-use thing

Yup, that’s it. All in all, the issue doesn’t appear to be super complicated:

When running Node.js with --perf-basic-prof, like this test does (via NODE_OPTIONS, but that's not relevant), it will try to create a file named /tmp/perf-${pid}.map.

node/deps/v8/src/log.cc

Lines 283 to 299 in 5c2ee4e

const char PerfBasicLogger::kFilenameFormatString[] = "/tmp/perf-%d.map";
// Extra space for the PID in the filename
const int PerfBasicLogger::kFilenameBufferPadding = 16;

PerfBasicLogger::PerfBasicLogger(Isolate* isolate)
    : CodeEventLogger(isolate), perf_output_handle_(nullptr) {
  // Open the perf JIT dump file.
  int bufferSize = sizeof(kFilenameFormatString) + kFilenameBufferPadding;
  ScopedVector<char> perf_dump_name(bufferSize);
  int size = SNPrintF(perf_dump_name,
                      kFilenameFormatString,
                      base::OS::GetCurrentProcessId());
  CHECK_NE(size, -1);
  perf_output_handle_ =
      base::OS::FOpen(perf_dump_name.start(), base::OS::LogFileOpenMode);
  CHECK_NOT_NULL(perf_output_handle_);

If the file already exists and is not writable, the CHECK on line 299 crashes the process, causing these failures. If the file does not exist, it is created and never cleaned up.

The host to which I have SSH access has a PID range from 0 to 100000, and currently has 52453 files of that format in /tmp/, of which 50016 are owned by root; meaning, it will fail about 50% of the time, because tests running under the iojs user don't have permission to overwrite those files.

(I’m deleting all of these files now on that host, just so that the test passes a bit more often. You can re-create the issue locally like this: https://gist.github.com/addaleax/e7d6db099ae194a3f56e473c9d4c49a6.)

So:

  • Partly, this is a build WG problem, because sometimes tests run as root and sometimes as iojs, leading to permissions problems.
  • Partly, this is a problem with our test files themselves; we don’t clean up these files after they have been created (a cleanup sketch follows at the end of this comment).
  • Partly, this is a V8 problem, because hard crashing when the perf log file isn’t writable and not providing any more information isn’t a great solution.

/cc @nodejs/build @nodejs/testing @nodejs/v8
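
A sketch of the cleanup mentioned in the second bullet, assuming it would run after the test's child processes have exited (whether from the test file or the harness is an open question); it only removes map files the current user is allowed to delete:

'use strict';
// Sketch: delete the /tmp/perf-<pid>.map files this user's runs left behind,
// so that later runs (possibly under a different user) do not hit an
// unwritable file for a reused PID.
const fs = require('fs');
const path = require('path');

for (const name of fs.readdirSync('/tmp')) {
  if (!/^perf-\d+\.map$/.test(name)) continue;
  try {
    fs.unlinkSync(path.join('/tmp', name));
  } catch (err) {
    // EACCES/EPERM: owned by another user (sticky /tmp) -- leave it alone.
  }
}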

@rvagg
Member

rvagg commented Apr 3, 2019

The only circumstance in which tests are run as root in our infra is if they are run manually. We run everything as 'iojs' and it never has sudo access. So I suspect this is a problem with our manual access processes; maybe we need to make it clear that anyone given manual access needs to run tests as 'iojs', and that root SSH access is mainly for convenience. Or maybe we should give only 'iojs' SSH access most of the time instead?

@Trott
Member Author

Trott commented Apr 19, 2019

I think this can be closed because the running-tests-as-root thing isn't something that happens in regular CI runs, but is operator error? Or is that letting ourselves off the hook too easily?

@Trott
Member Author

Trott commented Apr 19, 2019

ncu-ci walk commit only finds one recent example of this error: https://ci.nodejs.org/job/node-test-commit-freebsd/nodes=freebsd11-x64/25647/testReport/(root)/test/parallel_test_cli_node_options/

That was on test-digitalocean-freebsd11-x64-1. For some reason, I am unable to log on to that host to check for root-owned profiling files in /tmp.

Trott added a commit to Trott/io.js that referenced this issue Apr 19, 2019
The test failure is not platform-specific and is the result of
manual/human error. Some improvements may be possible, but there is
nothing fundamentally unsound about the test insofar as when it fails in
CI, there is a problem on the host that needs to be addressed and not an
inherent issue with the test.

Refs: nodejs#25028 (comment)
Closes: nodejs#25028
@Trott Trott closed this as completed in e37eee2 Apr 21, 2019