Flaky parallel/test-tls-buffersize #18998
Comments
I'm not sure if there's much point anymore in creating separate issues for all of these debian8-x86, ubuntu1404 and ubuntu1204 failures without a stack trace. They seem to affect literally all of the tests, so we might eventually end up with an issue for every single test that exists on each of those platforms. /cc @nodejs/build It would be nice if someone could look into this. The lack of a stack trace or signal code is puzzling.
Agreed on the lack of hints / artifacts associated with these failure types - they don't help us in any way other than telling us that they failed. My debugging method in those cases has always been to run the test a thousand times locally hoping for a reproduction, reduce the code as long as the problem persists, etc. (time consuming and often unsuccessful). My question is: what type of failure is this - assertion failure / unforeseen exception / crash / bad exit caught by the python driver / forced failure? Is it a …
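For anyone who wants to try the brute-force approach described above, here is a minimal sketch of a local stress run. The paths are assumptions: a locally built ./node in the current directory and the test named in this issue's title; adjust both for your checkout.

```python
#!/usr/bin/env python3
# Run a single test many times and stop at the first failure, keeping its output.
import subprocess
import sys

NODE = "./node"                                # assumed path to the locally built binary
TEST = "test/parallel/test-tls-buffersize.js"  # the flaky test from this issue

for i in range(1000):
    proc = subprocess.run([NODE, TEST], capture_output=True, text=True)
    if proc.returncode != 0:
        print(f"failed on iteration {i} with exit code {proc.returncode}")
        print(proc.stdout)
        print(proc.stderr, file=sys.stderr)
        sys.exit(1)

print("no failure in 1000 iterations")
```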
@nodejs/build someone with test access should get into one of these machines straight after a failure and run the test manually to see what's up.
Are you saying that more than just that one test fails like this?
@rvagg check the …
Are these tests resource intensive in any way? Is it possible that they may crash if there are zombie jobs running in the background that conflict with them in some fashion - maybe sharing resources, or where a zombie is holding too much memory? I've been seeing a few …
I've observed some similar failures on Docker containers today: https://ci.nodejs.org/job/node-test-commit-linux/16729/nodes=alpine37-container-x64/ running on https://ci.nodejs.org/computer/test-softlayer-alpine37_container-x64-1 which runs on test-softlayer-ubuntu1604_docker-x64-1
https://ci.nodejs.org/job/node-test-commit-linux-containered/2544/nodes=ubuntu1604_sharedlibs_openssl110_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2
That same container had a different failure on its previous run that doesn't look the same but perhaps it's somehow related: https://ci.nodejs.org/job/node-test-commit-linux-containered/2543/nodes=ubuntu1604_sharedlibs_fips20_x64/ on https://ci.nodejs.org/computer/test-softlayer-ubuntu1604_sharedlibs_container-x64-2 which is on test-softlayer-ubuntu1604_docker-x64-2
I couldn't find anything abnormal on these containers or the hosts running them.
@rvagg I think the … The one referenced in this issue is very particular because it doesn't get killed by a signal code, but it also doesn't have a stack trace: #18998. There are a variety of subsystems represented here and I don't really see much in common.
@nodejs/collaborators we're going to need some more help on getting to the bottom of this. Errors of this type are showing up pretty frequently, and not for one specific test. Just looking at the last 4 node-test-commit-linux failures shows what look to be similar failures (4 failures out of the 6 at the top right now): https://ci.nodejs.org/job/node-test-commit-linux/16868/nodes=alpine35-container-x64/console The test-http-client-timeout-agent failure mentioned above is also showing up regularly; I'm not sure whether it's related or not. Build has been working to iron out the frustrating Jenkins-related failures, but these kinds of errors are now one of the major blockers for getting CI back to green, so your help in getting us there would be appreciated!
Sorry, I got those numbers wrong. There are 5 failures out of the last 7 builds: 3 of them have the weird no-output crashes, and 2 of them have the test-http-client-timeout-agent failure on Alpine. But you don't have to go back much further to find test-http-client-timeout-agent failing on ARM64 as well. There's at least one bug being exposed on Linux here.
@rvagg It is possible to turn on core dumps on those machines …
@joyeecheung ok, done on the Debian 8 and Debian 9 hosts. It's a bit trickier on the Alpine ones, so I haven't bothered for now. Next thing is to keep an eye out for these crashes.
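For context, a rough sketch of what "turning on core dumps" can look like for a test process. The actual host-level change made on the Debian machines isn't spelled out in this thread, so the snippet below is an illustration under assumptions (paths included), not what was done:

```python
#!/usr/bin/env python3
# Raise the core-file size soft limit (often 0 by default) up to the hard limit
# before launching the test binary, so a crash can leave a core behind.
import resource
import subprocess

soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
resource.setrlimit(resource.RLIMIT_CORE, (hard, hard))

subprocess.run(["./node", "test/parallel/test-tls-buffersize.js"])  # assumed paths
```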
Had one on test-digitalocean-debian8-x64-1 today: https://ci.nodejs.org/job/node-test-commit-linux/16972/nodes=debian8-64/
but no core, or at least I can't find one; perhaps one was made in the working directory but destroyed with the next run after a …
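One way to check whether cores would even land in the test's working directory on these hosts (a sketch, not something from the thread): if kernel.core_pattern starts with a pipe, crashes are handed to a handler such as systemd-coredump instead of being written next to the test.

```python
#!/usr/bin/env python3
# Print where the kernel sends core dumps on this host.
with open("/proc/sys/kernel/core_pattern") as f:
    pattern = f.read().strip()

print("core_pattern:", pattern)
if pattern.startswith("|"):
    print("cores are piped to a handler (e.g. systemd-coredump), "
          "not written into the working directory")
```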
@rvagg I believe if the test crashes then the output would be a …
@joyeecheung ahh, good point, I didn't realise that. Anyway, I can see corefiles stacking up - it looks like lots of them per test run on each of these servers; is that normal? I've never bothered getting into core dump analysis so I'm a bit out of my depth.
Also, I've enabled timestamps on the console output of node-test-commit-linux so we can match up test failures with core files, since systemd keeps them nicely stored with a timestamp.
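For matching those systemd-stored cores to a failed run and pulling a backtrace out of them, a minimal sketch, assuming systemd-coredump is what is storing the cores and the ./node binary from the failing run is still on the machine (both assumptions):

```python
#!/usr/bin/env python3
# List stored cores (each entry carries a timestamp to match against the
# Jenkins console timestamps), then get a backtrace from the newest node core.
import subprocess

subprocess.run(["coredumpctl", "list", "--no-pager"], check=True)

# Extract the most recent core for the "node" executable and hand it to gdb.
subprocess.run(["coredumpctl", "dump", "node", "--output=node.core"], check=True)
subprocess.run(["gdb", "-batch", "-ex", "bt full", "./node", "node.core"],
               check=True)
```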
@rvagg I am able to get the core dumps from …
Sorry @joyeecheung, I think maybe they've not been kept between runs, and it looks like systemd isn't saving the binaries either. Perhaps we need to save the binary after each run as well, although that'll have to be done in Jenkins. I'm out of action for the next 2 days sadly, so I won't be able to help again until at least mid-week.
It seems like we still have failures that do not bring up any stack traces. @Trott you got a couple of those recently.
Refs: nodejs/build#1207
@addaleax Not sure, but I'll try to take note. Here's another one from today on ppcle-ubuntu1404: https://ci.nodejs.org/job/node-test-commit-plinux/16540/nodes=ppcle-ubuntu1404/console

not ok 1178 parallel/test-net-socket-timeout
  ---
  duration_ms: 0.510
  severity: fail
  stack: |-
I am thinking maybe node-report could be useful, but I still think there's a bug in tools/test.py that fails to collect the stderr/stdout from failed tests.
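To illustrate what "collecting stderr/stdout from failed tests" has to cope with (this is not the actual tools/test.py logic, just a hedged sketch with assumed paths): whatever the child wrote needs to be read back even when it dies from a signal, otherwise the stack: field in TAP output like the one above stays empty.

```python
#!/usr/bin/env python3
# Capture a child's output even when it exits abnormally.
import signal
import subprocess

proc = subprocess.Popen(
    ["./node", "test/parallel/test-net-socket-timeout.js"],  # assumed paths
    stdout=subprocess.PIPE, stderr=subprocess.PIPE, text=True)
out, err = proc.communicate(timeout=120)

if proc.returncode < 0:
    # A negative return code means the child was killed by a signal.
    print("killed by", signal.Signals(-proc.returncode).name)
elif proc.returncode != 0:
    print("exited with code", proc.returncode)

# Whatever the child managed to write is still available here; if both are
# empty, the TAP "stack:" block ends up blank.
print(out)
print(err)
```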
Closing this in favor of #19903, but feel free to re-open or comment if you think that's the wrong thing to do.
https://ci.nodejs.org/job/node-test-commit-linux/16566/nodes=debian8-x86/console