Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

jdk_time target fails on some arm32_linux nodes #3360

Closed
smlambert opened this issue Jan 29, 2024 · 20 comments
Closed

jdk_time target fails on some arm32_linux nodes #3360

smlambert opened this issue Jan 29, 2024 · 20 comments

Comments

@smlambert
Copy link
Contributor

Please set the title to indicate the test name and machine name where known.

As indicated by Jan 2024 CPU triage, in adoptium/aqa-tests#4982 (comment), the jdk_time test target was failing with timeouts on ci.adoptium.net arm32 nodes. It was able to pass on the temurin-compliance arm32 linux node.

To make it easy for the infrastructure team to repeat and diagnose, please
answer the following questions:

Any other details:
Suspect related to the findings in #2002 but not confirmed

@sxa
Copy link
Member

sxa commented Jan 30, 2024

Will be part of the top level #2662

@Haroon-Khel
Copy link
Contributor

Haroon-Khel commented Jan 31, 2024

Some jdk_time_2 tests failed on test-docker-ubuntu2004-armv7l-4 with the last jdk17 release
https://ci.adoptium.net/job/Test_openjdk17_hs_extended.openjdk_arm_linux/137/tapResults

Failed test cases: 
TEST: java/time/tck/java/time/format/TCKLocalizedOffsetIdPrinterParser.java
TEST: java/time/tck/java/time/serial/TCKZonedDateTimeSerialization.java
TEST: java/time/tck/java/time/format/TCKLocalizedPrinterParser.java
TEST: java/time/tck/java/time/serial/TCKOffsetDateTimeSerialization.java
TEST: java/time/tck/java/time/temporal/TCKChronoUnit.java
TEST: java/time/tck/java/time/serial/TCKLocalTimeSerialization.java
TEST: java/time/tck/java/time/serial/TCKZoneIdSerialization.java
TEST: java/time/tck/java/time/format/TCKInstantPrinterParser.java
TEST: java/time/tck/java/time/serial/TCKLocalDateSerialization.java
TEST: java/time/tck/java/time/serial/TCKYearSerialization.java
TEST: java/time/tck/java/time/format/TCKSignStyle.java
TEST: java/time/tck/java/time/format/TCKFormatStyle.java
TEST: java/time/tck/java/time/temporal/TCKChronoField.java
TEST: java/time/tck/java/time/format/TCKZoneIdPrinterParser.java
TEST: java/time/tck/java/time/format/TCKPadPrinterParser.java
TEST: java/time/tck/java/time/format/TCKTextStyle.java
TEST: java/time/tck/java/time/serial/TCKPeriodSerialization.java
TEST: java/time/tck/java/time/serial/TCKLocalDateTimeSerialization.java
TEST: java/time/nontestng/java/time/chrono/HijrahConfigTest.java
TEST: java/time/tck/java/time/format/TCKDateTimeParseResolver.java
TEST: java/time/tck/java/time/zone/serial/TCKZoneOffsetTransitionRuleSerialization.java
Test results: passed: 163; error: 23

However all of these tests passed when run individually on the machine itself

jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKLocalizedPrinterParser.java    
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKOffsetDateTimeSerialization.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/temporal/TCKChronoUnit.java               
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKLocalTimeSerialization.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKZoneIdSerialization.java   
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKInstantPrinterParser.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKLocalDateSerialization.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKYearSerialization.java     
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKSignStyle.java        
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKFormatStyle.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/temporal/TCKChronoField.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKZoneIdPrinterParser.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKPadPrinterParser.java   
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKTextStyle.java       
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKPeriodSerialization.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/serial/TCKLocalDateTimeSerialization.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/nontestng/java/time/chrono/HijrahConfigTest.java       
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/format/TCKDateTimeParseResolver.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork
jenkins@1a14546b3875:~/test$ $JAVA -jar jtreg/lib/jtreg.jar jdk17u/test/jdk/java/time/tck/java/time/zone/serial/TCKZoneOffsetTransitionRuleSerialization.java
Test results: passed: 1
Report written to /home/jenkins/test/JTreport/html/report.html
Results written to /home/jenkins/test/JTwork

using the latest jdk17 release

jenkins@1a14546b3875:~/test$ $JAVA -version 
openjdk version "17.0.10" 2024-01-16
OpenJDK Runtime Environment Temurin-17.0.10+7 (build 17.0.10+7)
OpenJDK Server VM Temurin-17.0.10+7 (build 17.0.10+7, mixed mode, sharing)

@Haroon-Khel
Copy link
Contributor

Looks like it times out when you run them in jdk_time together, but pass when running them individually.

https://ci.adoptium.net/job/Grinder/8698/console

16:17:12  
16:17:12  TEST RESULT: Error. Agent error: java.lang.Exception: Agent 52 timed out with a timeout of 960 seconds; check console log for any additional details
16:17:12  --------------------------------------------------
16:17:45  TEST: java/time/tck/java/time/serial/TCKZonedDateTimeSerialization.java
16:17:45  TEST JDK: /home/jenkins/workspace/Grinder/jdkbinary/j2sdk-image
16:17:45  
16:17:45  ACTION: build -- Error. Agent error: java.lang.Exception: Agent 53 timed out with a timeout of 960 seconds; check console log for any additional details
16:17:45  REASON: Named class compiled on demand
16:17:45  TIME:   2073.694 seconds

@Haroon-Khel
Copy link
Contributor

Running jdk_time_2 from my own branch, with JTREG_TIMEOUT_OPTION increased to 40 (I dont know how to do it from the grinder)
https://ci.adoptium.net/job/Grinder/8700/console

@Haroon-Khel
Copy link
Contributor

htop on the dockerhost machine while the above grinder is running

image

grinder 8700 has lost contact with test-docker-ubuntu2004-armv7l-4

17:27:15  XML output with verification to /home/jenkins/workspace/Grinder/aqa-tests/TKG/output_17067220302633/jdk_time_2/work
17:47:17  Cannot contact test-docker-ubuntu2004-armv7l-4: java.lang.InterruptedException

@Haroon-Khel
Copy link
Contributor

I think the jdk_time test target might be a bit much for the docker static containers.

The grinder has caused test-docker-ubuntu2004-armv7l-4 to disconnect from jenkins
https://ci.adoptium.net/computer/test%2Ddocker%2Dubuntu2004%2Darmv7l%2D4/log

SSH Connection failed with IOException: "The kexTimeout (65000 ms) expired.", retrying in 15 seconds. There are 1 more retries left.
The kexTimeout (65000 ms) expired.
ERROR: Connection is not established!
java.lang.IllegalStateException: Connection is not established!
	at com.trilead.ssh2.Connection.getRemainingAuthMethods(Connection.java:989)

@Haroon-Khel
Copy link
Contributor

Rerunning on a non docker container (odroid machine) https://ci.adoptium.net/job/Grinder/8702/console

@Haroon-Khel
Copy link
Contributor

Passed on the odroid machine. Also passed on test-docker-ubuntu2004-armv7l-2 https://ci.adoptium.net/job/Grinder/8701/console

@sxa
Copy link
Member

sxa commented Jan 31, 2024

Interesting ...

@Haroon-Khel Are they all new containers and did you cap the CPU/memory on them? From the Grinder logs the run on both containers as mentioned in #3360 (comment) and #3360 (comment) were both running with -concurrency:81 on the jtreg command line (CPU cores/2+1. Those machines have 160 physical cores) but most of the containers should be started with options that restrict the containers to less than that. I'm not quite sure why the two container Grinders didn't show the same behaviour, but the odroid system had concurrency:1 so were potentially safer. I'm slightly surprised that a time related test would give a large load, but maybe we need to restrict those tests. I would have expected similar results on the aarch64 systems though, which are run on the same host 🤷🏻

Noting that the calculation for currency is the one I was looking at adjusting in adoptium/aqa-tests#4792 (in case you want to see where it is and play with it) but I don't think that will affect this particular situation.

@Haroon-Khel
Copy link
Contributor

Ive updated my branch to run the test with concurrency:2
Rerunning on test-docker-ubuntu2004-armv7l-4 https://ci.adoptium.net/job/Grinder/8703/console

Are they all new containers and did you cap the CPU/memory on them?

Yes, the images are built with
docker build --cpu-period=100000 --cpu-quota=800000

And are run with
docker run --restart unless-stopped --cpus=4.0 --memory=6G

Bear in mind the jdk_time_2 tests passed on test-docker-ubuntu2004-armv7l-2, another static container but running on dockerhost-equinix-ubuntu2004-armv8-1 not dockerhost-equinix-ubuntu2204-armv8-1. Link to the grinder https://ci.adoptium.net/job/Grinder/8701/console -concurrency:81 is used

@Haroon-Khel
Copy link
Contributor

https://ci.adoptium.net/job/Grinder/8703/console passed. So we need to limit concurrency on the docker static containers, and eventhough its passes with -concurrency:81 on some, it shouldnt be using a value this high

@Haroon-Khel
Copy link
Contributor

adoptium/aqa-tests#4792 looks good, but I want to first look at why the test scripts are reporting 160 cores and not the set 4 (or whatever) cores of the docker containers

@sxa
Copy link
Member

sxa commented Feb 1, 2024

adoptium/aqa-tests#4792 looks good

Unfortunately I'm not yet convinced which is why it's still in draft, but I need to get back to it ;-)
It will be getting 160 cores from /proc/cpuinfo by default which is ok, but it ought to be reading the memory size from /sys/fs/cgroup/memory/memory.limit_in_bytes in the container (8589934592 bytes, converting to to Mb (8192 which goes into MEMORY_SIZE - this is the bit that will go wrong on the arm systems if my PR goes in) and then should set CONC (the value for concurrency) to 4 (Half the memory in Gb) as that's less than the number of cores (160)
All that logic is in https://github.com/adoptium/aqa-tests/blob/c27e9e649919f839444c638844e133a0d13edd04/openjdk/openjdk.mk#L52

@Haroon-Khel
Copy link
Contributor

Im having some luck with --cpuset-cpus="0-3" instead of --cpus=4.0 to assign cpus to containers, and with nproc to find the number of processors

Container started with --cpus=4.0

root@d0a2dceb584c:/# nproc
160

Container started with --cpuset-cpus="0-3"

root@91b440680d8b:/# nproc
4

So a solution is to change https://github.com/adoptium/aqa-tests/blob/c27e9e649919f839444c638844e133a0d13edd04/openjdk/openjdk.mk#L22 to use nproc to find the number of cpus and to restart our arm32 containers with --cpuset-cpus="0-3"

@Haroon-Khel
Copy link
Contributor

I've redeployed test-docker-ubuntu2004-armv7l-4 with --cpuset-cpus="0-3"

Rerunning jdk_time_2 from my branch on it, JTREG_TIMEOUT_OPTION is set back to normal. https://ci.adoptium.net/job/Grinder/8710/console

Concurrency should be 4/2 +1 =3 which shouldn't cause a timeout

@Haroon-Khel
Copy link
Contributor

Looks good -timeoutFactor:8 -xml:verify -concurrency:3

@Haroon-Khel Haroon-Khel self-assigned this Feb 2, 2024
@Haroon-Khel
Copy link
Contributor

Grinder passed 👍🏻

@Haroon-Khel
Copy link
Contributor

Passes for jdk11 too https://ci.adoptium.net/job/Grinder/8711/console

@Haroon-Khel
Copy link
Contributor

The remaining arm32 nodes on dockerhost-equinix-ubuntu2204-armv8-1 have been restarted with --cpuset-cpus="0-3"

@sxa
Copy link
Member

sxa commented Feb 5, 2024

Do you know what the difference in behaviour is between the two options i.e. is there likely to be any downside to this change? We should check whether this is happening elsewhere too e g. X64 and ppcle systems to see if they are the same (trying to use more cores than they should be at present)

Nice find!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
Status: Done
Development

No branches or pull requests

3 participants