aarch64 hang during test Create a non persistent cache #11177

Closed
andrew-m-leonard opened this issue Nov 13, 2020 · 6 comments

Comments

@andrew-m-leonard
Contributor

Testcases on various platforms hang periodically, see:
adoptium/infrastructure#1579 (comment)
adoptium/infrastructure#1579 (comment)

The console output:

01:18:00  Testing: Test 26: CMVC 168131 : Create a non persistent cache
01:18:00  Test start time: 2020/11/13 01:17:59 Coordinated Universal Time
01:18:00  Running command: "/home/jenkins/workspace/Test_openjdk11_j9_sanity.functional_aarch64_linux/openjdkbinary/j2sdk-image/bin/java"  -Xcompressedrefs -Xjit -Xgcpolicy:gencon  -Xshareclasses:name=ShareClassesCMLTests,nonpersistent -version
01:18:00  Time spent starting: 2 milliseconds
01:28:13  ***[TEST INFO 2020/11/13 01:27:59] ProcessKiller detected a timeout after 600000 milliseconds!***
01:28:13  INFO: getUnixPID() has failed indicating this is not a UNIX System.'Debug on timeout' is currently only supported on Linux.

From a brief look at the test framework code above, I believe the hang is here, since the wait has no timeout:
https://github.com/eclipse/openj9/blob/efdb86514d722cf83747d9d8badc449fe6121658/test/functional/cmdline_options_tester/src/Test.java#L227

The console output above shows that the test timed out and that the ProcessKiller logic then killed the process here:
https://github.com/eclipse/openj9/blob/efdb86514d722cf83747d9d8badc449fe6121658/test/functional/cmdline_options_tester/src/Test.java#L565
So I suspect the proc.waitFor() above is waiting for a process that has already terminated.

Could a timeout be added to this proc.waitFor(), please? That would at least alleviate the hang at Adopt.
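
For illustration, a bounded wait could look roughly like the sketch below. It is not taken from Test.java: the class name `BoundedWait`, the helper `waitForOrKill`, and the 5-second grace period are hypothetical, and the 600000 ms value simply mirrors the ProcessKiller timeout in the console output above. It relies only on `Process.waitFor(long, TimeUnit)` and `Process.destroyForcibly()`, both available since Java 8.

```java
import java.io.IOException;
import java.util.OptionalInt;
import java.util.concurrent.TimeUnit;

public class BoundedWait {
    /**
     * Waits up to the given timeout for proc to exit (hypothetical helper).
     * Returns the exit code if the process exited in time; otherwise forcibly
     * destroys the process and returns an empty result.
     */
    public static OptionalInt waitForOrKill(Process proc, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (proc.waitFor(timeout, unit)) {
            return OptionalInt.of(proc.exitValue());
        }
        proc.destroyForcibly();
        // Bounded wait for the kill to take effect, so this call cannot hang either.
        proc.waitFor(5, TimeUnit.SECONDS);
        return OptionalInt.empty();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch "java -version" from the JDK that is running this program.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        Process proc = new ProcessBuilder(javaBin, "-version").inheritIO().start();
        OptionalInt exit = waitForOrKill(proc, 600_000, TimeUnit.MILLISECONDS);
        System.out.println(exit.isPresent()
                ? "exit code " + exit.getAsInt()
                : "timed out, process killed");
    }
}
```

A bounded wait like this only stops the harness from blocking forever; it does not explain why the child process failed to exit.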

@pshipton
Member

The spec for waitFor() says "This method returns immediately if the process has already terminated.".

So there are two possibilities. Either the process has not terminated, in which case adding a timeout to waitFor() would just leave it running, or the process has terminated and waitFor() isn't returning, which would be a bug related to waitFor(). We need to look at the machine and see what's going on.
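
As a rough sketch of how the two cases might be told apart from inside a harness, the snippet below does a bounded wait and then checks `Process.isAlive()`. The class `WaitDiagnosis`, the `State` enum, and `classify` are hypothetical names used only for illustration; they are not part of the test framework. The third state is only a hint, since the process could exit between the two calls.

```java
import java.io.IOException;
import java.util.concurrent.TimeUnit;

public class WaitDiagnosis {
    public enum State {
        EXITED,                        // waitFor returned within the timeout
        STILL_RUNNING,                 // process is genuinely alive
        NOT_ALIVE_BUT_WAIT_TIMED_OUT   // would point at a waitFor() problem
    }

    /** Classifies the process after a bounded wait (hypothetical helper). */
    public static State classify(Process proc, long timeout, TimeUnit unit)
            throws InterruptedException {
        if (proc.waitFor(timeout, unit)) {
            return State.EXITED;
        }
        return proc.isAlive() ? State.STILL_RUNNING : State.NOT_ALIVE_BUT_WAIT_TIMED_OUT;
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch "java -version" from the JDK that is running this program.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        Process proc = new ProcessBuilder(javaBin, "-version").inheritIO().start();
        System.out.println(classify(proc, 60, TimeUnit.SECONDS));
    }
}
```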

@knn-k

@pshipton
Member

pshipton commented Nov 13, 2020

Also, we could fix getUnixPID(), so that the test framework would attempt to attach to the running process with gdb and get a stack trace.

Update: this was done in #11199
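
For reference, on Java 9 and later a child's PID can be read directly with `Process.pid()`. The sketch below is only an illustration of that API and makes no claim about how getUnixPID() was actually fixed; `PidLookup` and `pidOf` are hypothetical names.

```java
import java.io.IOException;

public class PidLookup {
    /**
     * Returns the native process ID of proc (hypothetical helper).
     * Process.pid() exists since Java 9 and may throw
     * UnsupportedOperationException on implementations that do not support it.
     */
    public static long pidOf(Process proc) {
        return proc.pid();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Launch "java -version" from the JDK that is running this program.
        String javaBin = System.getProperty("java.home") + "/bin/java";
        Process proc = new ProcessBuilder(javaBin, "-version").inheritIO().start();
        // A debugger could be attached to this PID, e.g. with "gdb -p <pid>".
        System.out.println("child pid: " + pidOf(proc));
        proc.waitFor();
    }
}
```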

@pshipton pshipton changed the title from "Adopt testcases periodically hanging due to cmdline_options_tester "hang"" to "aarch64 hang during test Create a non persistent cache" Nov 13, 2020
@0xdaryl
Contributor

0xdaryl commented Dec 7, 2020

Has the AArch64 problem been seen since the initial report (whose build information is no longer available)? @knn-k has tried but failed to reproduce it.

@pshipton : for my own edification, why is this labeled a blocker? I'd just like to understand the criteria.

@pshipton
Member

pshipton commented Dec 7, 2020

@0xdaryl I set it as a blocker because it was blocking the functional tests from completing at AdoptOpenJDK, potentially hiding other problems. It also consumes machine resources until the test times out, which may impact other testing. Feel free to adjust it as desired.

@pshipton
Member

Do we need this in the milestone plan? I'll remove it.

@pshipton
Member

This seems obsolete, closing.
