Flaky parallel/test-tls-buffersize #18998

Closed
BridgeAR opened this issue Feb 26, 2018 · 23 comments
Labels
flaky-test - Issues and PRs related to the tests with unstable failures on the CI.
test - Issues and PRs related to the tests.
tls - Issues and PRs related to the tls subsystem.

Comments

@BridgeAR
Member

https://ci.nodejs.org/job/node-test-commit-linux/16566/nodes=debian8-x86/console

not ok 1585 parallel/test-tls-buffersize
  ---
  duration_ms: 0.213
  severity: fail
  stack: |-
@BridgeAR added the test and flaky-test labels on Feb 26, 2018
@maclover7 added the tls label on Feb 26, 2018
@apapirovski
Member

I'm not sure there's much point anymore in creating separate issues for all of these debian8-x86, ubuntu1404 and ubuntu1204 failures without a stack trace. They seem to affect literally all of the tests, so we might eventually end up with an issue for every single test that exists on each of those platforms.

/cc @nodejs/build It would be nice if someone could look into this. The lack of a stack trace or signal code is puzzling.

@gireeshpunathil
Member

Agreed on the lack of hints/artifacts associated with these failure types - they don't tell us anything other than that the test failed. My debugging method in those cases has always been to run the test a thousand times locally in the hope of a recreate, then reduce the code for as long as the problem remains, etc. (time-consuming and often unsuccessful).

My question is: what type of failure is this - an assertion failure, an unforeseen exception, a crash, a bad exit caught by the Python driver, or a forced failure? Is it a tap.assert(false) or something?
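
A minimal sketch of that brute-force rerun approach, assuming the test can be invoked directly with the node binary from the build tree (the paths and iteration count are illustrative, not anything the project prescribes):

```python
#!/usr/bin/env python
# Brute-force rerun of a single test to try to catch an intermittent failure.
# The './node' path, the test file and the iteration count are illustrative.
import subprocess
import sys

TEST = 'test/parallel/test-tls-buffersize.js'
RUNS = 1000

for i in range(RUNS):
    proc = subprocess.Popen(['./node', TEST],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # A negative return code means the child was killed by that signal.
        print('run %d failed, exit code %r' % (i, proc.returncode))
        sys.stdout.write(out.decode(errors='replace'))
        sys.stderr.write(err.decode(errors='replace'))
        sys.exit(1)
print('no failure in %d runs' % RUNS)
```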

@rvagg
Member

rvagg commented Feb 27, 2018

@nodejs/build someone with test access should get into one of these machines straight after a failure and run the test manually to see what's up.

@apapirovski

They seem to affect literally all of the tests so we might eventually have an issue for every single test that exists on each of those platforms

Are you saying this is more than just that one test that fails like this?

@BridgeAR
Member Author

BridgeAR commented Feb 27, 2018

@rvagg check the test issues with the flaky label. Probably more than 50% of them are similar to this one.

@rvagg
Member

rvagg commented Feb 27, 2018

Are these tests resource-intensive in any way? Is it possible that they crash when there are zombie jobs running in the background that conflict with them in some fashion - maybe by sharing resources, or a zombie holding too much memory? I've been seeing a few make test-ci runs and their associated children left zombied on some Linux machines recently; not many, but enough that I could imagine it's connected.

@rvagg
Member

rvagg commented Feb 28, 2018

I've observed some similar failures on Docker containers today:

https://ci.nodejs.org/job/node-test-commit-linux/16729/nodes=alpine37-container-x64/ running on https://ci.nodejs.org/computer/test-softlayer-alpine37_container-x64-1 which runs on test-softlayer-ubuntu1604_docker-x64-1

not ok 1778 async-hooks/test-graph.intervals
  ---
  duration_ms: 0.240
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)

https://ci.nodejs.org/job/node-test-commit-linux-containered/2544/nodes=ubuntu1604_sharedlibs_openssl110_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2

not ok 1931 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max
  ---
  duration_ms: 21.243
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 9)

That same container had a different failure on its previous run that doesn't look the same but perhaps it's somehow related:

https://ci.nodejs.org/job/node-test-commit-linux-containered/2543/nodes=ubuntu1604_sharedlibs_fips20_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2

00:12:29.798 not ok 691 parallel/test-http-client-timeout-agent
00:12:29.798   ---
00:12:29.798   duration_ms: 0.824
00:12:29.798   severity: fail
00:12:29.798   stack: |-
00:12:29.798     res#0 data:0
00:12:29.798     res#0 end
00:12:29.798     res#2 data:2
00:12:29.799     res#2 end
00:12:29.799     res#4 data:4
00:12:29.799     res#4 end
00:12:29.799     res#6 data:6
00:12:29.799     res#6 end
00:12:29.799     res#8 data:8
00:12:29.799     res#8 end
00:12:29.799     res#10 data:10
00:12:29.799     res#10 end
00:12:29.799     res#12 data:12
00:12:29.799     res#12 end
00:12:29.800     res#14 data:14
00:12:29.800     res#14 end
00:12:29.800     res#16 data:16
00:12:29.800     res#16 end
00:12:29.800     res#18 data:18
00:12:29.800     res#18 end
00:12:29.800     res#20 data:20
00:12:29.800     res#20 end
00:12:29.800     res#22 data:22
00:12:29.800     res#22 end
00:12:29.801     res#24 data:24
00:12:29.801     res#24 end
00:12:29.801     res#26 data:26
00:12:29.801     res#26 end
00:12:29.801     res#28 data:28
00:12:29.801     res#28 end
00:12:29.801     req#1 timeout
00:12:29.801     req#3 timeout
00:12:29.801     req#5 timeout
00:12:29.801     req#7 timeout
00:12:29.802     req#9 timeout
00:12:29.802     req#11 timeout
00:12:29.802     req#13 timeout
00:12:29.802     req#15 timeout
00:12:29.802     req#17 timeout
00:12:29.802     req#19 timeout
00:12:29.802     req#21 timeout
00:12:29.802     req#23 timeout
00:12:29.802     req#25 timeout
00:12:29.803     req#27 timeout
00:12:29.803     req#29 timeout
00:12:29.803     req#0 timeout
00:12:29.803     req#28 close
00:12:29.803     req#26 close
00:12:29.803     req#24 close
00:12:29.803     req#22 close
00:12:29.803     req#20 close
00:12:29.803     req#18 close
00:12:29.803     req#16 close
00:12:29.804     req#14 close
00:12:29.804     req#12 close
00:12:29.804     req#10 close
00:12:29.804     req#8 close
00:12:29.804     req#6 close
00:12:29.804     req#4 close
00:12:29.804     req#2 close
00:12:29.804     req#0 close
00:12:29.804     req#29 error
00:12:29.804     req#29 close
00:12:29.804     req#27 error
00:12:29.805     req#27 close
00:12:29.805     req#25 error
00:12:29.805     req#25 close
00:12:29.805     req#23 error
00:12:29.805     req#23 close
00:12:29.805     req#21 error
00:12:29.805     req#21 close
00:12:29.805     req#19 error
00:12:29.805     req#19 close
00:12:29.805     req#17 error
00:12:29.806     req#17 close
00:12:29.806     req#15 error
00:12:29.806     req#15 close
00:12:29.806     req#13 error
00:12:29.806     req#13 close
00:12:29.806     req#11 error
00:12:29.806     req#11 close
00:12:29.806     req#9 error
00:12:29.806     req#9 close
00:12:29.806     req#7 error
00:12:29.807     req#7 close
00:12:29.807     req#5 error
00:12:29.807     req#5 close
00:12:29.807     req#3 error
00:12:29.807     req#3 close
00:12:29.807     req#1 error
00:12:29.808     req#1 close
00:12:29.808     done=31 sent=30
00:12:29.808     assert.js:74
00:12:29.808       throw new AssertionError(obj);
00:12:29.808       ^
00:12:29.808     
00:12:29.808     AssertionError [ERR_ASSERTION]: timeout on http request called too much
00:12:29.809         at process.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux-containered/nodes/ubuntu1604_sharedlibs_fips20_x64/test/parallel/test-http-client-timeout-agent.js:94:10)
00:12:29.810         at process.emit (events.js:136:15)

I couldn't find anything abnormal on these containers or the hosts running them.

@apapirovski
Member

apapirovski commented Mar 1, 2018

@rvagg I think the stringbytes failures are memory-overcommit related. I raised the issue before of potentially needing to turn overcommit off on the CI machines (at least the Alpine ones): those tests try to assess whether enough memory is available before running, but with memory overcommit they don't get a representative answer and proceed to crash.
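
For illustration (a hedged sketch, not the actual check those tests perform): with vm.overcommit_memory set to allow overcommit, the kernel reports and grants more memory than it can back, so a "do we have enough?" probe passes and the process is later killed by the OOM killer, which is consistent with the Signal: 9 crash above. The 4 GB figure below is an illustrative assumption.

```python
#!/usr/bin/env python
# Why a pre-flight "is there enough memory?" check is unreliable under
# overcommit. The 4 GB figure is an illustrative assumption, not the real
# threshold used by the stringbytes tests.

def overcommit_mode():
    # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
    with open('/proc/sys/vm/overcommit_memory') as f:
        return int(f.read().strip())

def mem_available_kb():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    return None

needed_kb = 4 * 1024 * 1024  # pretend the test needs ~4 GB
available = mem_available_kb()
print('overcommit mode:', overcommit_mode())
print('MemAvailable: %s kB' % available)
if available is not None and available >= needed_kb:
    # With overcommit enabled, a later allocation of this size may still
    # "succeed" and only fail when the pages are actually touched; the OOM
    # killer then sends SIGKILL (signal 9), matching the crash in the log.
    print('check passed - but the pages are not guaranteed to exist')
```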

The one referenced in this issue is very peculiar because it doesn't get killed by a signal, yet it also doesn't have a stack trace...

#18998
#18997
#18996
#18995
#18735
#18733
#18732
#18663
#18504
#18496

There are a variety of subsystems represented here and I don't really see much in common.

@rvagg
Member

rvagg commented Mar 5, 2018

@nodejs/collaborators we're going to need some more help getting to the bottom of this. Errors of this type are showing up pretty frequently, and not for one specific test. Just looking at the last four node-test-commit-linux failures shows what appear to be similar failures (4 failures out of the 6 at the top right now):

https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16867/nodes=debian8-x86/console
https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16866/nodes=debian8-x86/console

The test-http-client-timeout-agent failure mentioned above is also showing up regularly; I'm not sure whether it's related.

Build has been working to iron out the frustrating Jenkins-related failures, but these kinds of errors are now one of the major blockers for getting CI back to green--your help in getting us there would be appreciated!

@rvagg
Member

rvagg commented Mar 5, 2018

Sorry, I got those numbers wrong. There are 5 failures out of the last 7 builds: three of them have the weird no-output crashes and two have the test-http-client-timeout-agent failure on Alpine. But you don't have to go back much further to find test-http-client-timeout-agent failing on ARM64 as well. There's at least one bug being exposed on Linux here.

@joyeecheung
Member

joyeecheung commented Mar 7, 2018

@rvagg Is it possible to turn on ulimit -c <limit> on the problematic hosts and store the core dumps + the Node.js binary for a period of time (or up to n GB of space) so we can download and debug them?
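
For what it's worth, a wrapper could also raise the limit just for the test processes instead of changing the shell defaults; a minimal sketch, assuming the hard limit already allows it (the './node' path and test file are illustrative):

```python
#!/usr/bin/env python
# Raise the core-file size limit for this process and its children, then run
# a test so that a crash leaves a core dump behind.  The './node' path and
# the test file are illustrative assumptions.
import resource
import subprocess

# Equivalent to `ulimit -c unlimited`, but only for this process tree.
resource.setrlimit(resource.RLIMIT_CORE,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

proc = subprocess.run(['./node', 'test/parallel/test-tls-buffersize.js'])
print('exit code:', proc.returncode)  # negative => killed by a signal
```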

@rvagg
Member

rvagg commented Mar 8, 2018

@joyeecheung ok, done on the Debian 8 and Debian 9 hosts. It's a bit trickier on the Alpine ones, so I haven't bothered for now. The next step is to keep an eye out for these crashes.
I've only set the limit to infinity and added a reminder that we should undo it at some point. Is it actually possible to limit the total space taken up by core files rather than just the size of each file? Or to set a max age without scripting something in cron?
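
On the total-size/max-age question: if systemd-coredump ends up handling the dumps, its coredump.conf has MaxUse=/KeepFree= settings that cap total disk usage. Failing that, a small cleanup sketch like the one below could be run from the CI job itself; the directory, age and size caps are illustrative assumptions:

```python
#!/usr/bin/env python
# Prune core dumps by age and by total size. Directory, age and size caps are
# illustrative assumptions; adjust to wherever cores actually land.
import os
import time

CORE_DIR = '/var/lib/systemd/coredump'
MAX_AGE_SECONDS = 7 * 24 * 3600       # drop anything older than a week
MAX_TOTAL_BYTES = 10 * 1024 ** 3      # keep at most ~10 GB in total

now = time.time()
entries = []
for name in os.listdir(CORE_DIR):
    path = os.path.join(CORE_DIR, name)
    st = os.stat(path)
    if now - st.st_mtime > MAX_AGE_SECONDS:
        os.remove(path)
        continue
    entries.append((st.st_mtime, st.st_size, path))

# Delete oldest files first until the remaining total fits under the cap.
entries.sort()                         # oldest first
total = sum(size for _, size, _ in entries)
for _, size, path in entries:
    if total <= MAX_TOTAL_BYTES:
        break
    os.remove(path)
    total -= size
```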

@rvagg
Member

rvagg commented Mar 9, 2018

had one on test-digitalocean-debian8-x64-1 today https://ci.nodejs.org/job/node-test-commit-linux/16972/nodes=debian8-64/

not ok 2029 sequential/test-http-server-keep-alive-timeout-slow-server
  ---
  duration_ms: 0.212
  severity: fail
  stack: |-

No core though, at least I can't find one; perhaps one was created in the working directory but destroyed by the git clean before the next run. It's also possible that LimitCORE did nothing, since we may be using a too-old version of systemd on Debian 8 - the coredumpctl utilities aren't even available by default. So I've gone ahead and enabled jessie-backports and upgraded systemd and systemd-coredump on there. I believe coredumpctl list should now show any cores dumped on those machines. I'll keep an eye out, but I've never used the systemd coredump utilities before, so I'm flying a little blind here.

@joyeecheung
Member

@rvagg I believe that if the test crashes, the output would say severity: crashed followed by the signal the process crashed with? This looks like another test that fails without any output; there's probably something wrong in tools/test.py that fails to grab the output from the test.

@rvagg
Member

rvagg commented Mar 9, 2018

@joyeecheung ahhh, good point, I didn't realise that. test.py checks self.output.exit_code < 0 and self.output.exit_code != -signal.SIGABRT - why would it not treat abort() as a "crash" though?

Anyway, I can see core files stacking up - it looks like there are lots of them per test run on each of these servers; is that normal? I've never gotten into core dump analysis before, so I'm a bit out of my depth.
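
On the abort() question, a hedged reading of that quoted condition (a sketch, not the full test.py logic): Python reports a child killed by signal N as exit code -N, so the check flags any signal death as a crash except SIGABRT - presumably because some tests are expected to abort on purpose.

```python
import signal

def classify(exit_code):
    """Classify a child's exit code the way the quoted condition does.

    The subprocess layer reports "killed by signal N" as the negative
    return code -N, so -11 is SIGSEGV and -6 is SIGABRT.  The condition
    quoted above treats any signal death as a crash *except* SIGABRT.
    """
    if exit_code < 0 and exit_code != -signal.SIGABRT:
        return 'CRASHED (Signal: %d)' % -exit_code
    if exit_code != 0:
        return 'fail'
    return 'ok'

print(classify(-signal.SIGSEGV))  # CRASHED (Signal: 11) - like the log above
print(classify(-signal.SIGABRT))  # fail - abort() is not counted as a crash
print(classify(1))                # fail
```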

@rvagg
Member

rvagg commented Mar 9, 2018

Also, I've enabled timestamps on the console output of node-test-commit-linux so we can match up test failures with core files since systemd keeps them nicely stored with a timestamp.

@joyeecheung
Member

@rvagg I am able to get the core dumps from test-digitalocean-debian8-x64-1, but I don't have the accompanying Node.js binary that generated them... File names look like core.node.1000.567551fcf4104b21b9fb317a3ca0da3b.<pid>.<time?>; there seems to be a SHA in the file name, but I am unable to find the matching commit.

@rvagg
Member

rvagg commented Mar 11, 2018

Sorry @joyeecheung, I think they may not have been kept between runs; it looks like systemd isn't saving the binaries either. Perhaps we need to save the binary after each run as well, although that'll have to be done in Jenkins.

I'm out of action for the next 2 days sadly so I won't be able to help again until at least mid-week.

@BridgeAR
Member Author

BridgeAR commented Apr 3, 2018

It seems like we still have failures that do not produce any stack traces. @Trott, you got a couple of those recently.

@addaleax
Member

addaleax commented Apr 3, 2018

@BridgeAR @Trott Do you know which machines are affected most by this? This has been going on for a while now, and I think I’d like to request access from build to try and debug those…

@BridgeAR
Member Author

BridgeAR commented Apr 3, 2018

Refs: nodejs/build#1207

@Trott
Member

Trott commented Apr 3, 2018

@addaleax Not sure, but I'll try to take note. Here's another one from today on ppcle-ubuntu1404...

https://ci.nodejs.org/job/node-test-commit-plinux/16540/nodes=ppcle-ubuntu1404/console

not ok 1178 parallel/test-net-socket-timeout
  ---
  duration_ms: 0.510
  severity: fail
  stack: |-

@joyeecheung
Member

I am thinking node-report could be useful here, but I still think there's a bug in tools/test.py that fails to collect the stdout/stderr from failed tests.
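
To illustrate one way output can legitimately vanish (a hedged illustration, not a claim about what tools/test.py actually does): when a child's stdout is a pipe rather than a TTY it is block-buffered, so anything written but not yet flushed is lost if the process dies abruptly - the harness then records an empty stack: block even though the test did print something.

```python
#!/usr/bin/env python
# Illustration (not tools/test.py): output a child writes to a pipe is block-
# buffered, so it vanishes if the child dies before flushing.
import subprocess
import sys

child_code = r"""
import os, sys
sys.stdout.write('some diagnostic output that never gets flushed')
os._exit(1)   # die without flushing stdio buffers
"""

proc = subprocess.Popen([sys.executable, '-c', child_code],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print('exit code:', proc.returncode)   # 1
print('captured stdout: %r' % out)     # b'' - the write was lost
```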

@Trott
Member

Trott commented Apr 10, 2018

Closing this in favor of #19903, but feel free to re-open or comment if you think that's the wrong thing to do.
