Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Failing tests for JDK 21.0.2 #426

Closed
owaiskazi19 opened this issue Jan 20, 2024 · 12 comments · Fixed by #429
Closed

[BUG] Failing tests for JDK 21.0.2 #426

owaiskazi19 opened this issue Jan 20, 2024 · 12 comments · Fixed by #429
Assignees
Labels
bug Something isn't working v2.12.0

Comments

@owaiskazi19
Copy link
Member

owaiskazi19 commented Jan 20, 2024

What is the bug?

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.flowframework.rest.FlowFrameworkSecureRestApiIT.testCreateProvisionDeprovisionWorkflowWithFullAccess" -Dtests.seed=37D0F3650C59B32F -Dtests.security.manager=false -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-VE -Dtests.timezone=Indian/Kerguelen -Druntime.java=21
org.opensearch.flowframework.rest.FlowFrameworkSecureRestApiIT > testCreateProvisionDeprovisionWorkflowWithFullAccess FAILED
    java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
        at __randomizedtesting.SeedInfo.seed([37D0F3650C59B32F:BE62CDF6F65B5600]:0)
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:937)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:333)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:116)
        at org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:89)
        at org.opensearch.flowframework.TestHelpers.makeRequest(TestHelpers.java:78)
        at org.opensearch.flowframework.FlowFrameworkRestTestCase.deprovisionWorkflow(FlowFrameworkRestTestCase.java:351)
        at org.opensearch.flowframework.rest.FlowFrameworkSecureRestApiIT.testCreateProvisionDeprovisionWorkflowWithFullAccess(FlowFrameworkSecureRestApiIT.java:127)

        Caused by:
        java.net.SocketTimeoutException: 60.000 milliseconds timeout on connection http-outgoing-8 [ACTIVE]
            at org.apache.http.nio.protocol.HttpAsyncRequestExecutor.timeout(HttpAsyncRequestExecutor.java:387)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:98)
            at org.apache.http.impl.nio.client.InternalIODispatch.onTimeout(InternalIODispatch.java:40)
            at org.apache.http.impl.nio.reactor.AbstractIODispatch.timeout(AbstractIODispatch.java:175)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.sessionTimedOut(BaseIOReactor.java:261)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.timeoutCheck(AbstractIOReactor.java:506)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.validate(BaseIOReactor.java:211)
            at org.apache.http.impl.nio.reactor.AbstractIOReactor.execute(AbstractIOReactor.java:280)
            at org.apache.http.impl.nio.reactor.BaseIOReactor.execute(BaseIOReactor.java:104)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor$Worker.run(AbstractMultiworkerIOReactor.java:591)
            at java.base/java.lang.Thread.run(Thread.java:1583)

Tests are failing after #421 got merged.

See the instances for more logs:

  1. https://github.com/opensearch-project/flow-framework/actions/runs/7590930291/job/20678264507?pr=425
  2. https://github.com/opensearch-project/flow-framework/actions/runs/7590847338/job/20678051272

How can one reproduce the bug?

  1. Run security enabled integ tests on JDK 21. ./gradlew integTest -Dsecurity.enabled=true

What is the expected behavior?

Test should pass

What is your host/environment?

ubuntu-latest

Do you have any screenshots?

If applicable, add screenshots to help explain your problem.

Do you have any additional context?

Add any other context about the problem.

@owaiskazi19 owaiskazi19 added bug Something isn't working untriaged v2.12.0 labels Jan 20, 2024
@dbwiddis
Copy link
Member

Tests were failing on JDK21 before that PR was even submitted:
https://github.com/opensearch-project/flow-framework/actions/workflows/test_security.yml

See for example this run prior to the #421 PR:
https://github.com/opensearch-project/flow-framework/actions/runs/7589890744

First of the recent failures was on the initial PR for this bug fix:
https://github.com/opensearch-project/flow-framework/actions/runs/7555230555

But it passed later from that same PR:
https://github.com/opensearch-project/flow-framework/actions/runs/7561855053

And then started failng again for this PR that didn't even change the code:
https://github.com/opensearch-project/flow-framework/actions/runs/7577073358

So it's flaky but don't assume it's a recent change.

@dbwiddis
Copy link
Member

Have investigated multiple possibilities:

  • java security manager enablement for JDK 18+ (no effect). Also tested matrix and successful repeatedly on 17, 18, 19, 20 but fails on 21
  • suspected change in jdk.httpclient.keepalive and jdk.httpclient.keepalive.h2 from 20 mintues to 30 seconds in JDK21 but extending this timeout had no effect
  • various reports of this error occr when there's a dropped connection for any reason, it just times out

Logs appear to show the root cause is authentication:

? WARN ][o.o.d.FileBasedSeedHostsProvider] [integTest-0] expected, but did not find, a dynamic hosts list at [/__w/flow-framework/flow-framework/build/testclusters/integTest-0/config/unicast_hosts.txt]
? WARN ][o.o.s.a.BackendRegistry  ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55912
? WARN ][o.o.s.a.BackendRegistry  ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55914
? WARN ][o.o.s.a.BackendRegistry  ] [integTest-0] Authentication finally failed for null from 127.0.0.1:55922
<many more>

One big clue: tests went from all green to flaky to all failing, only on JDK21. Investigated JDK version on the failed runners:

  • JDK 21.0.1 always passes
  • JDK 21.0.2 always fails

Will try a different distro....

@dbwiddis
Copy link
Member

Ran a matrix with temurin and zulu, 21.0.1 and 21.0.2. Both distros passed on 21.0.1, both distros failed on 21.0.2.

Possible bug in JDK 21.0.2. Tagging @sormuras to see if he has any idea what this could be.

Near term fix: update security integ test matrix to specify 21.0.1 version instead of just 21. @owaiskazi19 can you include this in one of your open PRs?

@dbwiddis
Copy link
Member

Here's the 21.0.2 release notes: https://www.oracle.com/java/technologies/javase/21-0-2-relnotes.html

Some java.nio ones might be suspicious, but I'm way out of my depth here...

@AlanBateman
Copy link

AlanBateman commented Jan 20, 2024

@dbwiddis Would it be possible to provide instructions on how to reproduce this? It is just a matter of cloning this repo and running the tests? Asking so that folks working on the JDK can quickly see if this is a JDK bug or not. Maybe you could come to the OpenJDK net-dev mailing list with instructions so this potential regression can be diagnosed? It would also be useful to know which operating system and whether it duplicates with JDK 22 and JDK 23 EA builds.

@sormuras
Copy link

Skimmed the issue tracker for 21.0.2 for socket entries: https://bugs.openjdk.org/issues/?jql=project%20%3D%20JDK%20AND%20fixVersion%20%3D%2021.0.2%20AND%20text%20~%20%22socket%22 ... seems unrelated to me.

@dbwiddis
Copy link
Member

@dblock @reta @nknize Just an FYI, nothing actionable yet

If you start to see this pop up elsewhere, wanted you to be aware.

We're using performRequest() via Rest High Level Client as part of our integ tests, and something broke with the latest JDK patch version (released Jan 16, has made its way to GHA runners by now), only on the security-enabled tests, where those client requests are failing auth.

@dbwiddis dbwiddis self-assigned this Jan 20, 2024
@reta
Copy link
Contributor

reta commented Jan 20, 2024

@dblock @reta @nknize Just an FYI, nothing actionable yet

Thanks @dbwiddis , the issue was brought up and listed as risk here opensearch-project/OpenSearch#11906, we haven't done any updates yet in our codebase but you are very right, the GA may already use the latest version, we probably should do 2 things:

  1. Use 21.0.1 as the version (GA supports that, large change but not much we can do)
  2. Ban 21.0.2 from OpenSearch runtimes (we do have mechanism implemented, I will follow up with the issue)

@dbwiddis
Copy link
Member

dbwiddis commented Jan 20, 2024

@owaiskazi19 @joshpalis @amitgalitz @ohltyler @jackiehanyang FYI:

  • so far only seeing this with REST requests in integ tests but given above conversation may show up elsewhere we use SccalingExecutorBuilder (Change thread pool to a ScalingExecutorBuilder #421)
  • we should use 21.0.1 for all our github actions until this is resolved
  • AD Repo is going to have similar issues with their next GHA tests (CC: @kaituo)

JDK issue in question: https://bugs.openjdk.org/browse/JDK-8323659 (technically correct but not backwards compatible behavior)

@dbwiddis dbwiddis changed the title [BUG] Failing tests for JDK 21 [BUG] Failing tests for JDK 21.0.2 Jan 20, 2024
@AlanBateman
Copy link

@dbwiddis Attempts to duplicate this so far have failed. Would it be possible to come to the OpenJDK net-dev mailing list? If there is a regression in 21.0.2 then we'd like to track it down.

@dbwiddis
Copy link
Member

@AlanBateman we believe this is an impact of https://bugs.openjdk.org/browse/JDK-8323659

@dbwiddis
Copy link
Member

@AlanBateman Failure has occurred independently from this repo in another repo's integration tests: https://github.com/opensearch-project/skills/actions/runs/7617744945/job/20747440932?pr=140

The commonality in the logs is an entry

[2024-01-22T14:59:53,046][INFO ][o.o.i.NeuralSparseSearchToolIT] [testNeuralSparseSearchToolInFlowAgent] There are still tasks running after this test that might break subsequent tests [indices:data/read/search, indices:data/read/search[phase/query], indices:data/write/bulk, indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index, indices:data/write/update, indices:data/write/update[s]].

It occurs with both fixed and scaling thread pools when the queue is invoked, which is a LinkedTransferQueue() which matches https://bugs.openjdk.org/browse/JDK-8323659

So I'm fairly certain no further debugging is required. I'll leave it up to you to assess the impact and when to release the fix.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working v2.12.0
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants