Flaky parallel/test-tls-buffersize #18998

Closed
BridgeAR opened this issue Feb 26, 2018 · 23 comments
Labels
flaky-test - Issues and PRs related to the tests with unstable failures on the CI.
test - Issues and PRs related to the tests.
tls - Issues and PRs related to the tls subsystem.

Comments

@BridgeAR
Member

https://ci.nodejs.org/job/node-test-commit-linux/16566/nodes=debian8-x86/console

not ok 1585 parallel/test-tls-buffersize
  ---
  duration_ms: 0.213
  severity: fail
  stack: |-
@BridgeAR added the test and flaky-test labels on Feb 26, 2018
@maclover7 added the tls label on Feb 26, 2018
@apapirovski
Member

I'm not sure there's much point anymore in creating separate issues for all of these debian8-x86, ubuntu1404 and ubuntu1204 failures without a stack trace. They seem to affect literally all of the tests, so we might eventually end up with an issue for every single test that exists on each of those platforms.

/cc @nodejs/build It would be nice if someone could look into this. The lack of a stack trace or signal code is puzzling.

@gireeshpunathil
Member

Agreed on the lack of hints/artifacts associated with these failure types - they don't tell us anything other than that the test failed. My debugging method in those cases has always been to run the test a thousand times locally in the hope of a recreate, then reduce the code for as long as the problem remains, etc. (time-consuming and often unsuccessful).

My question is: what type of failure is this - an assertion failure, an unforeseen exception, a crash, a bad exit caught by the Python driver, or a forced failure? Is it a tap.assert(false) or something?
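
A minimal sketch of that brute-force rerun approach, assuming the test can be invoked directly with the node binary from the build tree (the paths and iteration count are illustrative, not anything the project prescribes):

```python
#!/usr/bin/env python
# Brute-force rerun of a single test to try to catch an intermittent failure.
# The './node' path, the test file and the iteration count are illustrative.
import subprocess
import sys

TEST = 'test/parallel/test-tls-buffersize.js'
RUNS = 1000

for i in range(RUNS):
    proc = subprocess.Popen(['./node', TEST],
                            stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    out, err = proc.communicate()
    if proc.returncode != 0:
        # A negative return code means the child was killed by that signal.
        print('run %d failed, exit code %r' % (i, proc.returncode))
        sys.stdout.write(out.decode(errors='replace'))
        sys.stderr.write(err.decode(errors='replace'))
        sys.exit(1)
print('no failure in %d runs' % RUNS)
```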

@rvagg
Member

rvagg commented Feb 27, 2018

@nodejs/build someone with test access should get into one of these machines straight after a failure and run the test manually to see what's up.

@apapirovski

They seem to affect literally all of the tests so we might eventually have an issue for every single test that exists on each of those platforms

Are you saying this is more than just that one test that fails like this?

@BridgeAR
Member Author

BridgeAR commented Feb 27, 2018

@rvagg check the test issues with the flaky label. Probably more than 50% of them are similar to this one.

@rvagg
Member

rvagg commented Feb 27, 2018

Are these tests resource-intensive in any way? Is it possible that they crash when there are zombie jobs running in the background that conflict with them in some fashion - maybe by sharing resources, or a zombie holding too much memory? I've been seeing a few make test-ci runs and their associated children left zombied on some Linux machines recently; not many, but enough that I could imagine it's connected.

@rvagg
Member

rvagg commented Feb 28, 2018

I've observed some similar failures on Docker containers today:

https://ci.nodejs.org/job/node-test-commit-linux/16729/nodes=alpine37-container-x64/ running on https://ci.nodejs.org/computer/test-softlayer-alpine37_container-x64-1 which runs on test-softlayer-ubuntu1604_docker-x64-1

not ok 1778 async-hooks/test-graph.intervals
  ---
  duration_ms: 0.240
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 11)

https://ci.nodejs.org/job/node-test-commit-linux-containered/2544/nodes=ubuntu1604_sharedlibs_openssl110_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2

not ok 1931 addons/stringbytes-external-exceed-max/test-stringbytes-external-exceed-max
  ---
  duration_ms: 21.243
  severity: crashed
  stack: |-
    oh no!
    exit code: CRASHED (Signal: 9)

That same container had a different failure on its previous run that doesn't look the same but perhaps it's somehow related:

https://ci.nodejs.org/job/node-test-commit-linux-containered/2543/nodes=ubuntu1604_sharedlibs_fips20_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2

00:12:29.798 not ok 691 parallel/test-http-client-timeout-agent
00:12:29.798   ---
00:12:29.798   duration_ms: 0.824
00:12:29.798   severity: fail
00:12:29.798   stack: |-
00:12:29.798     res#0 data:0
00:12:29.798     res#0 end
00:12:29.798     res#2 data:2
00:12:29.799     res#2 end
00:12:29.799     res#4 data:4
00:12:29.799     res#4 end
00:12:29.799     res#6 data:6
00:12:29.799     res#6 end
00:12:29.799     res#8 data:8
00:12:29.799     res#8 end
00:12:29.799     res#10 data:10
00:12:29.799     res#10 end
00:12:29.799     res#12 data:12
00:12:29.799     res#12 end
00:12:29.800     res#14 data:14
00:12:29.800     res#14 end
00:12:29.800     res#16 data:16
00:12:29.800     res#16 end
00:12:29.800     res#18 data:18
00:12:29.800     res#18 end
00:12:29.800     res#20 data:20
00:12:29.800     res#20 end
00:12:29.800     res#22 data:22
00:12:29.800     res#22 end
00:12:29.801     res#24 data:24
00:12:29.801     res#24 end
00:12:29.801     res#26 data:26
00:12:29.801     res#26 end
00:12:29.801     res#28 data:28
00:12:29.801     res#28 end
00:12:29.801     req#1 timeout
00:12:29.801     req#3 timeout
00:12:29.801     req#5 timeout
00:12:29.801     req#7 timeout
00:12:29.802     req#9 timeout
00:12:29.802     req#11 timeout
00:12:29.802     req#13 timeout
00:12:29.802     req#15 timeout
00:12:29.802     req#17 timeout
00:12:29.802     req#19 timeout
00:12:29.802     req#21 timeout
00:12:29.802     req#23 timeout
00:12:29.802     req#25 timeout
00:12:29.803     req#27 timeout
00:12:29.803     req#29 timeout
00:12:29.803     req#0 timeout
00:12:29.803     req#28 close
00:12:29.803     req#26 close
00:12:29.803     req#24 close
00:12:29.803     req#22 close
00:12:29.803     req#20 close
00:12:29.803     req#18 close
00:12:29.803     req#16 close
00:12:29.804     req#14 close
00:12:29.804     req#12 close
00:12:29.804     req#10 close
00:12:29.804     req#8 close
00:12:29.804     req#6 close
00:12:29.804     req#4 close
00:12:29.804     req#2 close
00:12:29.804     req#0 close
00:12:29.804     req#29 error
00:12:29.804     req#29 close
00:12:29.804     req#27 error
00:12:29.805     req#27 close
00:12:29.805     req#25 error
00:12:29.805     req#25 close
00:12:29.805     req#23 error
00:12:29.805     req#23 close
00:12:29.805     req#21 error
00:12:29.805     req#21 close
00:12:29.805     req#19 error
00:12:29.805     req#19 close
00:12:29.805     req#17 error
00:12:29.806     req#17 close
00:12:29.806     req#15 error
00:12:29.806     req#15 close
00:12:29.806     req#13 error
00:12:29.806     req#13 close
00:12:29.806     req#11 error
00:12:29.806     req#11 close
00:12:29.806     req#9 error
00:12:29.806     req#9 close
00:12:29.806     req#7 error
00:12:29.807     req#7 close
00:12:29.807     req#5 error
00:12:29.807     req#5 close
00:12:29.807     req#3 error
00:12:29.807     req#3 close
00:12:29.807     req#1 error
00:12:29.808     req#1 close
00:12:29.808     done=31 sent=30
00:12:29.808     assert.js:74
00:12:29.808       throw new AssertionError(obj);
00:12:29.808       ^
00:12:29.808     
00:12:29.808     AssertionError [ERR_ASSERTION]: timeout on http request called too much
00:12:29.809         at process.<anonymous> (/home/iojs/build/workspace/node-test-commit-linux-containered/nodes/ubuntu1604_sharedlibs_fips20_x64/test/parallel/test-http-client-timeout-agent.js:94:10)
00:12:29.810         at process.emit (events.js:136:15)

I couldn't find anything abnormal on these containers or the hosts running them.

@apapirovski
Member

apapirovski commented Mar 1, 2018

@rvagg I think the stringbytes failures are memory-overcommit related. I raised the issue before of potentially needing to turn overcommit off on the CI machines (at least the Alpine ones): those tests try to assess whether enough memory is available before running, but with memory overcommit they don't get a representative answer and proceed to crash.
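
For illustration (a hedged sketch, not the actual check those tests perform): with vm.overcommit_memory set to allow overcommit, the kernel reports and grants more memory than it can back, so a "do we have enough?" probe passes and the process is later killed by the OOM killer, which is consistent with the Signal: 9 crash above. The 4 GB figure below is an illustrative assumption.

```python
#!/usr/bin/env python
# Why a pre-flight "is there enough memory?" check is unreliable under
# overcommit. The 4 GB figure is an illustrative assumption, not the real
# threshold used by the stringbytes tests.

def overcommit_mode():
    # 0 = heuristic overcommit, 1 = always overcommit, 2 = strict accounting
    with open('/proc/sys/vm/overcommit_memory') as f:
        return int(f.read().strip())

def mem_available_kb():
    with open('/proc/meminfo') as f:
        for line in f:
            if line.startswith('MemAvailable:'):
                return int(line.split()[1])
    return None

needed_kb = 4 * 1024 * 1024  # pretend the test needs ~4 GB
available = mem_available_kb()
print('overcommit mode:', overcommit_mode())
print('MemAvailable: %s kB' % available)
if available is not None and available >= needed_kb:
    # With overcommit enabled, a later allocation of this size may still
    # "succeed" and only fail when the pages are actually touched; the OOM
    # killer then sends SIGKILL (signal 9), matching the crash in the log.
    print('check passed - but the pages are not guaranteed to exist')
```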

The one referenced in this issue is very peculiar because it doesn't get killed by a signal, yet it also doesn't have a stack trace...

#18998
#18997
#18996
#18995
#18735
#18733
#18732
#18663
#18504
#18496

There are a variety of subsystems represented here and I don't really see much in common.

@rvagg
Member

rvagg commented Mar 5, 2018

@nodejs/collaborators we're going to need some more help getting to the bottom of this. Errors of this type are showing up pretty frequently, and not for one specific test. Just looking at the last four node-test-commit-linux failures shows what appear to be similar failures (4 failures out of the 6 at the top right now):

https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16867/nodes=debian8-x86/console
https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console
https://ci.nodejs.org/job/node-test-commit-linux/16866/nodes=debian8-x86/console

The test-http-client-timeout-agent failure mentioned above is also showing up regularly; I'm not sure whether it's related.

Build has been working to iron out the frustrating Jenkins-related failures, but these kinds of errors are now one of the major blockers for getting CI back to green--your help in getting us there would be appreciated!

@rvagg
Member

rvagg commented Mar 5, 2018

Sorry, I got those numbers wrong. There are 5 failures out of the last 7 builds: three of them have the weird no-output crashes and two have the test-http-client-timeout-agent failure on Alpine. But you don't have to go back much further to find test-http-client-timeout-agent failing on ARM64 as well. There's at least one bug being exposed on Linux here.

@joyeecheung
Member

joyeecheung commented Mar 7, 2018

@rvagg Is it possible to turn on ulimit -c <limit> on the problematic hosts and store the core dumps + the Node.js binary for a period of time (or up to n GB of space) so we can download and debug them?
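
For what it's worth, a wrapper could also raise the limit just for the test processes instead of changing the shell defaults; a minimal sketch, assuming the hard limit already allows it (the './node' path and test file are illustrative):

```python
#!/usr/bin/env python
# Raise the core-file size limit for this process and its children, then run
# a test so that a crash leaves a core dump behind.  The './node' path and
# the test file are illustrative assumptions.
import resource
import subprocess

# Equivalent to `ulimit -c unlimited`, but only for this process tree.
resource.setrlimit(resource.RLIMIT_CORE,
                   (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

proc = subprocess.run(['./node', 'test/parallel/test-tls-buffersize.js'])
print('exit code:', proc.returncode)  # negative => killed by a signal
```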

@rvagg
Member

rvagg commented Mar 8, 2018

@joyeecheung ok, done on the Debian 8 and Debian 9 hosts. It's a bit trickier on the Alpine ones, so I haven't bothered for now. The next step is to keep an eye out for these crashes.
I've only set the limit to infinity and added a reminder that we should undo it at some point. Is it actually possible to limit the total space taken up by core files rather than just the size of each file? Or to set a max age without scripting something in cron?
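
On the total-size/max-age question: if systemd-coredump ends up handling the dumps, its coredump.conf has MaxUse=/KeepFree= settings that cap total disk usage. Failing that, a small cleanup sketch like the one below could be run from the CI job itself; the directory, age and size caps are illustrative assumptions:

```python
#!/usr/bin/env python
# Prune core dumps by age and by total size. Directory, age and size caps are
# illustrative assumptions; adjust to wherever cores actually land.
import os
import time

CORE_DIR = '/var/lib/systemd/coredump'
MAX_AGE_SECONDS = 7 * 24 * 3600       # drop anything older than a week
MAX_TOTAL_BYTES = 10 * 1024 ** 3      # keep at most ~10 GB in total

now = time.time()
entries = []
for name in os.listdir(CORE_DIR):
    path = os.path.join(CORE_DIR, name)
    st = os.stat(path)
    if now - st.st_mtime > MAX_AGE_SECONDS:
        os.remove(path)
        continue
    entries.append((st.st_mtime, st.st_size, path))

# Delete oldest files first until the remaining total fits under the cap.
entries.sort()                         # oldest first
total = sum(size for _, size, _ in entries)
for _, size, path in entries:
    if total <= MAX_TOTAL_BYTES:
        break
    os.remove(path)
    total -= size
```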

@rvagg
Member

rvagg commented Mar 9, 2018

had one on test-digitalocean-debian8-x64-1 today https://ci.nodejs.org/job/node-test-commit-linux/16972/nodes=debian8-64/

not ok 2029 sequential/test-http-server-keep-alive-timeout-slow-server
  ---
  duration_ms: 0.212
  severity: fail
  stack: |-

No core though, at least I can't find one; perhaps one was created in the working directory but destroyed by the git clean before the next run. It's also possible that LimitCORE did nothing, since we may be using a too-old version of systemd on Debian 8 - the coredumpctl utilities aren't even available by default. So I've gone ahead and enabled jessie-backports and upgraded systemd and systemd-coredump on there. I believe coredumpctl list should now show any cores dumped on those machines. I'll keep an eye out, but I've never used the systemd coredump utilities before, so I'm flying a little blind here.

@joyeecheung
Member

@rvagg I believe that if the test crashes, the output would say severity: crashed followed by the signal the process crashed with? This looks like another test that fails without any output; there's probably something wrong in tools/test.py that fails to grab the output from the test.

@rvagg
Member

rvagg commented Mar 9, 2018

@joyeecheung ahhh, good point, I didn't realise that. test.py checks self.output.exit_code < 0 and self.output.exit_code != -signal.SIGABRT - why would it not treat abort() as a "crash" though?

Anyway, I can see core files stacking up - it looks like there are lots of them per test run on each of these servers; is that normal? I've never gotten into core dump analysis before, so I'm a bit out of my depth.
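
On the abort() question, a hedged reading of that quoted condition (a sketch, not the full test.py logic): Python reports a child killed by signal N as exit code -N, so the check flags any signal death as a crash except SIGABRT - presumably because some tests are expected to abort on purpose.

```python
import signal

def classify(exit_code):
    """Classify a child's exit code the way the quoted condition does.

    The subprocess layer reports "killed by signal N" as the negative
    return code -N, so -11 is SIGSEGV and -6 is SIGABRT.  The condition
    quoted above treats any signal death as a crash *except* SIGABRT.
    """
    if exit_code < 0 and exit_code != -signal.SIGABRT:
        return 'CRASHED (Signal: %d)' % -exit_code
    if exit_code != 0:
        return 'fail'
    return 'ok'

print(classify(-signal.SIGSEGV))  # CRASHED (Signal: 11) - like the log above
print(classify(-signal.SIGABRT))  # fail - abort() is not counted as a crash
print(classify(1))                # fail
```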

@rvagg
Member

rvagg commented Mar 9, 2018

Also, I've enabled timestamps on the console output of node-test-commit-linux so we can match up test failures with core files since systemd keeps them nicely stored with a timestamp.

@joyeecheung
Member

@rvagg I am able to get the core dumps from test-digitalocean-debian8-x64-1, but I don't have the accompanying Node.js binary that generated them... File names look like core.node.1000.567551fcf4104b21b9fb317a3ca0da3b.<pid>.<time?>; there seems to be a SHA in the file name, but I am unable to find the matching commit.

@rvagg
Member

rvagg commented Mar 11, 2018

Sorry @joyeecheung, I think they may not have been kept between runs; it looks like systemd isn't saving the binaries either. Perhaps we need to save the binary after each run as well, although that'll have to be done in Jenkins.

I'm out of action for the next 2 days sadly so I won't be able to help again until at least mid-week.

@BridgeAR
Member Author

BridgeAR commented Apr 3, 2018

It seems like we still have failures that do not produce any stack traces. @Trott, you got a couple of those recently.

@addaleax
Member

addaleax commented Apr 3, 2018

@BridgeAR @Trott Do you know which machines are affected most by this? This has been going on for a while now, and I think I’d like to request access from build to try and debug those…

@BridgeAR
Member Author

BridgeAR commented Apr 3, 2018

Refs: nodejs/build#1207

@Trott
Member

Trott commented Apr 3, 2018

@addaleax Not sure, but I'll try to take note. Here's another one from today on ppcle-ubuntu1404...

https://ci.nodejs.org/job/node-test-commit-plinux/16540/nodes=ppcle-ubuntu1404/console

not ok 1178 parallel/test-net-socket-timeout
  ---
  duration_ms: 0.510
  severity: fail
  stack: |-

@joyeecheung
Member

I am thinking node-report could be useful here, but I still think there's a bug in tools/test.py that fails to collect the stdout/stderr from failed tests.
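
To illustrate one way output can legitimately vanish (a hedged illustration, not a claim about what tools/test.py actually does): when a child's stdout is a pipe rather than a TTY it is block-buffered, so anything written but not yet flushed is lost if the process dies abruptly - the harness then records an empty stack: block even though the test did print something.

```python
#!/usr/bin/env python
# Illustration (not tools/test.py): output a child writes to a pipe is block-
# buffered, so it vanishes if the child dies before flushing.
import subprocess
import sys

child_code = r"""
import os, sys
sys.stdout.write('some diagnostic output that never gets flushed')
os._exit(1)   # die without flushing stdio buffers
"""

proc = subprocess.Popen([sys.executable, '-c', child_code],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
out, err = proc.communicate()
print('exit code:', proc.returncode)   # 1
print('captured stdout: %r' % out)     # b'' - the write was lost
```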

@Trott
Member

Trott commented Apr 10, 2018

Closing this in favor of #19903, but feel free to re-open or comment if you think that's the wrong thing to do.
