LambdaLoadTest hang #10638
@gita-omr can someone take a look please?
A quick note: this test hangs on Windows as well, see #8214.
@andrewcraik fyi
Not sure the Windows problem is the same; we'd have to wait for it to recur and look at any diagnostic files produced (the existing information says none were produced). The Windows hang is a different mode and may be related to jdwp.
Ok, we’ll start on Power.
Ran grinder 30 times and it never failed.
An update: the job is unfortunately gone by now. I took a look at the artifacts and all javacore files are empty; the system core did not show anything. I guess let's wait until it fails again...
@gita-omr Not sure why you are seeing empty javacore files; I can see content in them.
Oh, maybe I ran out of disk space! Thanks.
Did it hang again on Windows? #8214 (comment)
Yes, although I suspect the Windows hang is a separate issue, related to the mode. It "regularly" hangs on Windows, but not that often.
Another one: https://ci.eclipse.org/openj9/job/Test_openjdk8_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/146
Here is some investigation from running on a local Windows 10 laptop.
openj9 non-compressed refs:
hotspot:
The LambdaLoadTest runs these JUnit tests on
If only one of the tests is included in the inventory then the test passes with a 100MB heap, except for the net.adoptopenjdk.test.streams.TestParallelStreamOperations classes. It is this test which causes the OOM. The test uses parallel streams to count the occurrences of various strings and letters in a test dataset of 130000 lines of text with a total of 8746601 characters. The tests include the constant
which is used to limit the number of instances of the test which can run concurrently (if there are already two instances running, the test exits immediately). So the normal case is that there are two instances running for the duration of the test (since the tests also take much longer to complete than any of the other tests in the workload).

Results of some test runs with different heap sizes show that there is an out-of-memory 'boundary' where the JVM may report out of memory or may just consume all available CPU for a long period. These tests were all run with no additional -X options:

openj9 compressed refs:
openj9 non-compressed refs:
hotspot:

The test currently has no -Xmx setting, so the amount of heap will vary according to the test machine selected at run time. The fact that the JVM appears to become extremely slow as the heap becomes exhausted looks like the most likely explanation for the tests appearing to hang or time out. An obvious fix is to set a specific 1GB heap size for these tests - I'll create a PR for that and do some extensive testing. That will 'fix' the test case failures - but does the extreme slowing of the JVM warrant further investigation?
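For context, the sketch below is a minimal, self-contained illustration of the kind of work this test performs: counting character occurrences over a large in-memory dataset with parallel streams. The class name and data are invented and are not taken from TestParallelStreamOperations; it only shows why two concurrent instances of a workload shaped like this keep a large live set on the heap near the out-of-memory boundary described above.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelCountSketch {
    public static void main(String[] args) {
        // Build a dataset of roughly the same order of magnitude as the
        // test's ~130000 lines of text (the contents here are synthetic).
        List<String> lines = IntStream.range(0, 130_000)
                .mapToObj(i -> "line " + i + " the quick brown fox jumps over the lazy dog")
                .collect(Collectors.toList());

        // Count occurrences of a target letter across all lines with a
        // parallel stream; the common fork/join pool splits the work.
        long count = lines.parallelStream()
                .flatMapToInt(String::chars)
                .filter(c -> c == 'e')
                .count();

        System.out.println("occurrences of 'e': " + count);
    }
}
```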
BTW, the LambdaLoadTests are not currently running as they should due to adoptium/aqa-systemtest#379 - the test run time is deemed to have expired as soon as the workload starts. But since this issue shows up with just two instances of the offending test case running, it is still seen anyway (though there may be some test runs which 'complete' before two instances of that test have been started).
Is it possible to get a verbose GC log for this case please? I think it might be an inefficient Excessive GC detection situation.
It turned out to be harder than I thought to catch a run which hangs, rather than completing successfully or going out of memory.
There is obviously a case where the Excessive GC condition is missed. We are going to work to fix it.
As suspected by @dmitripivkine, this is indeed a case of excessive GC, but for some reason it is not leading to the appropriate OOM exceptions. We have two criteria that have to be met to raise an OOM. Besides excessively doing GC, there should be no sign of progress (reclaimed memory). A theory is that the latter is not properly measured in this particular scenario. The scenario is that a Scavenge aborts and is followed by a percolate Global GC that satisfies the allocation, but in Tenure (not in the Allocate space, as it normally should be). Since the Allocate space is empty, subsequent allocations trigger an allocation failure right away and the process repeats. Example:
@jonoommen please try to reproduce this, and if the theory seems plausible, we'll work on a fix, where reclaimed memory would be measured from AF start to end, rather than Global GC start to end. -Xtgc:excessivegc logs might be useful |
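To make the two criteria and the proposed change easier to follow, here is a conceptual sketch. The method names and thresholds are invented for illustration and are not taken from the OpenJ9/OMR sources; it simply restates the reasoning above.

```java
/*
 * Conceptual sketch only: names and thresholds below are invented for
 * illustration and do not come from the OpenJ9/OMR code base.
 */
public class ExcessiveGcCheckSketch {

    // Both criteria must hold for the excessive-GC OOM to be raised.
    static boolean shouldRaiseOutOfMemory(double gcTimeFraction,
                                          long freeBeforeCycle,
                                          long freeAfterCycle,
                                          long heapSize) {
        boolean excessiveTime = gcTimeFraction > 0.95;            // illustrative threshold
        double reclaimedFraction =
                (double) (freeAfterCycle - freeBeforeCycle) / heapSize;
        boolean noProgress = reclaimedFraction < 0.03;            // illustrative threshold
        return excessiveTime && noProgress;
    }

    /*
     * The theory above: if freeBeforeCycle is sampled at the start of the
     * percolate Global GC (after the aborted Scavenge has already left the
     * Allocate space empty), the Global GC appears to reclaim memory into
     * Tenure, so noProgress never becomes true even though every allocation
     * fails again immediately. Measuring from allocation-failure start to
     * end instead would capture the lack of real progress.
     */
}
```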
FWIW, this is how I ran locally and added the various JVM command line options:

For Unix

to
From reproducing this and analyzing the data for various test scenarios, there were many findings, but here is a summary of what is most relevant. Firstly, we are dealing with 2 problems here:
The first issue can be resolved through GC changes; however, it is not the real reason for failure. The second problem is the more serious one. From heap dump analysis and scavenger tracing, it appears that we have an intermittent memory leak with this test case, although this is not conclusive. Here are some findings:

Passing Test Case Scavenger Tracing:
Failing Test Case Scavenger Tracing:
LT stderr {SCAV: tgcScavenger OBJECT HISTOGRAM} - Tenure Age is 10
LT stderr {SCAV: char[] 0 341 0 385 425 968 1887 4260 8926 19327 47220 0 0 0 0

And then later in the log file, before OOM is hit:

LT stderr {SCAV: tgcScavenger OBJECT HISTOGRAM} - Tenure Age is 10
LT stderr {SCAV: java/lang/Object[] 0 1340 1370 1408 1332 2091 22445 2859 3187 1720 2079 3961 2238 4317 212359 (holds net/adoptopenjdk/test/streams/support/Line objects - see dominator tree later)

For the passing case with the same heap size, nothing remotely similar is found in the scavenger tracing, and there is no excessive tenuring for any of these object types. The same is observed with -Xint as well.

Thread stack:

Here is a link to the line in TestParallelStreamOperations.java where the potential leak could be occurring: https://github.com/AdoptOpenJDK/openjdk-systemtest/blob/7df5b9a6be199e2acd977079562d3ae160c8c65d/openjdk.test.lambdasAndStreams/src/test.lambda/net/adoptopenjdk/test/streams/TestParallelStreamOperations.java#L300

Using a memory analyzer on the heap dump, here is some interesting info:
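As a side note on reproducing the heap dump analysis above: on an OpenJ9-based JDK the com.ibm.jvm.Dump API can be used to capture dumps programmatically around the suspect workload for inspection in a memory analyzer. The sketch below is only an illustration under that assumption (the workload method is a hypothetical stand-in); it is not how the dumps referenced in this comment were produced.

```java
/*
 * A small sketch, assuming an OpenJ9-based JDK where the com.ibm.jvm.Dump
 * API is available. It is illustrative only; the workload method is a
 * hypothetical stand-in for the parallel-stream test.
 */
public class CaptureDumpsSketch {
    public static void main(String[] args) {
        runSuspectWorkload();

        // Request a heap dump and a javacore from within the process;
        // files are written to the JVM's current dump location.
        com.ibm.jvm.Dump.HeapDump();
        com.ibm.jvm.Dump.JavaDump();
    }

    private static void runSuspectWorkload() {
        // hypothetical stand-in for the net.adoptopenjdk.test.streams work
    }
}
```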
https://ci.eclipse.org/openj9/job/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/125
LambdaLoadTest_OpenJ9_NonLinux_special_23
variation: Mode688
JVM_OPTIONS: -Xcompressedrefs -Xjit:count=0 -Xgcpolicy:gencon -Xaggressive -Xconcurrentlevel0
Javacores, core and other diagnostic files are in the artifact https://140-211-168-230-openstack.osuosl.org/artifactory/ci-eclipse-openj9/Test/Test_openjdk11_j9_special.system_ppc64_aix_Nightly_lambdaLoadTest/125/system_lambdaLoadTest_test_output.tar.gz
I see lots of this