[BUG] Fix org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedThroughputDegradationAndRejection #1361

Closed
reta opened this issue Oct 13, 2021 · 3 comments
Labels
bug Something isn't working

Comments

@reta
Collaborator

reta commented Oct 13, 2021

Describe the bug
The test is flaky and often fails; please see #1358.

To Reproduce

./gradlew ':server:test' --tests "org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedThroughputDegradationAndRejection" -Dtests.seed=8EC1ED900DC1D154 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1" -Druntime.java=15

Expected behavior
The test should pass.

Plugins
N/A

Screenshots
N/A

Host/Environment (please complete the following information):

Linux/Ubuntu

OpenSearch Build Hamster says Hello!
Gradle Version : 6.6.1
OS Info : Linux 5.11.0-37-generic (amd64)
Runtime JDK Version : 15 (OpenJDK)
Runtime java.home : /usr/lib/jvm/java-15.0.2-openjdk-amd64
Gradle JDK Version : 11 (JDK)
Gradle java.home : /usr/lib/jvm/java-11-openjdk-amd64
Random Testing Seed : 8EC1ED900DC1D154
In FIPS 140 mode : false

Additional context

org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests > testReplicaThreadedThroughputDegradationAndRejection FAILED
    java.lang.AssertionError: expected:<1> but was:<6>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:631)
        at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedThroughputDegradationAndRejection(ShardIndexingPressureConcurrentExecutionTests.java:506)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=121, name=Thread-104, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10900, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12823] OR [node_total_bytes=10900, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=109, name=Thread-92, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=11000, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12823] OR [node_total_bytes=11000, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=110, name=Thread-93, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10900, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12823] OR [node_total_bytes=10900, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=33, name=Thread-16, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10800, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12823] OR [node_total_bytes=10800, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=52, name=Thread-35, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10800, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12823] OR [node_total_bytes=10800, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)
org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests > testReplicaThreadedThroughputDegradationAndRejection FAILED
    java.lang.AssertionError: expected:<1> but was:<9>
        at org.junit.Assert.fail(Assert.java:88)
        at org.junit.Assert.failNotEquals(Assert.java:834)
        at org.junit.Assert.assertEquals(Assert.java:645)
        at org.junit.Assert.assertEquals(Assert.java:631)
        at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.testReplicaThreadedThroughputDegradationAndRejection(ShardIndexingPressureConcurrentExecutionTests.java:506)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=50, name=Thread-33, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10700, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12588] OR [node_total_bytes=10700, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=57, name=Thread-40, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10800, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=11058] OR [node_total_bytes=10800, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=119, name=Thread-102, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10700, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=11058] OR [node_total_bytes=10700, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=45, name=Thread-28, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10700, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=11058] OR [node_total_bytes=10700, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=105, name=Thread-88, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10900, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=11058] OR [node_total_bytes=10900, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=23, name=Thread-6, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=11000, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12705] OR [node_total_bytes=11000, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=62, name=Thread-45, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10800, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12705] OR [node_total_bytes=10800, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=43, name=Thread-26, state=RUNNABLE, group=TGRP-ShardIndexingPressureConcurrentExecutionTests]

        Caused by:
        org.opensearch.common.util.concurrent.OpenSearchRejectedExecutionException: rejected execution of replica operation [shard_detail=[IndexName][0], shard_total_bytes=10700, shard_operation_bytes=100, shard_max_coordinating_and_primary_bytes=10, shard_max_replica_bytes=12705] OR [node_total_bytes=10700, node_operation_bytes=100, node_max_coordinating_and_primary_bytes=10240, node_max_replica_bytes=15360]
            at __randomizedtesting.SeedInfo.seed([8EC1ED900DC1D154]:0)
            at org.opensearch.index.ShardIndexingPressure.rejectShardRequest(ShardIndexingPressure.java:286)
            at org.opensearch.index.ShardIndexingPressure.markReplicaOperationStarted(ShardIndexingPressure.java:179)
            at org.opensearch.index.ShardIndexingPressureConcurrentExecutionTests.lambda$fireConcurrentAndParallelRequestsForUniformThroughPut$13(ShardIndexingPressureConcurrentExecutionTests.java:834)

@reta
Collaborator Author

reta commented Oct 13, 2021

@getsaurabh02 could you please take a look? The test is certainly unstable, thank you.

getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 14, 2021
@getsaurabh02
Member

Thanks @reta. I verified that the flakiness is due to the `NUM_THREADS` value, which ranges from 100 to 120. For any value greater than 110, the requests breach the shard limits and hence face more rejections (more than 1). This breaks the assertion above.

I have been able to reproduce the failure consistently for values of `NUM_THREADS` greater than 110.

Since the test focuses on rejection due to throughput degradation, and not the node limit, I have updated the concurrency to be in the range of 80 to 100 in #1364, such that the shard limit is never reached.

final int NUM_THREADS = scaledRandomIntBetween(80, 100);
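
As a rough illustration of why the narrower range helps, the arithmetic can be sketched as below. The two constants are copied from the rejection messages in this issue (`shard_operation_bytes=100` and `node_max_coordinating_and_primary_bytes=10240`). The class name, the helper methods, and the assumption that every thread holds exactly one operation in flight at the same moment are hypothetical simplifications. The real decision in `ShardIndexingPressure` also depends on dynamically adjusted shard limits and throughput tracking, so this sketch does not reproduce the observed threshold of 110 threads; it only compares the ends of the old and new ranges against the 10 KB figure.

```java
public class InFlightBytesSketch {
    // Values copied from the rejection messages quoted in this issue.
    static final long OPERATION_BYTES = 100;      // shard_operation_bytes
    static final long NODE_LIMIT_BYTES = 10_240;  // node_max_coordinating_and_primary_bytes (10 KB)

    // Assumed worst case: every thread has one operation in flight at the same time.
    static long peakInFlightBytes(int numThreads) {
        return numThreads * OPERATION_BYTES;
    }

    // True when the assumed worst-case total does not exceed the 10 KB figure.
    static boolean staysUnderNodeLimit(int numThreads) {
        return peakInFlightBytes(numThreads) <= NODE_LIMIT_BYTES;
    }

    public static void main(String[] args) {
        // Lower and upper ends of the new range (80, 100) and the upper end of the old range (120).
        for (int threads : new int[] {80, 100, 120}) {
            System.out.printf(
                "threads=%d peakBytes=%d underLimit=%b%n",
                threads, peakInFlightBytes(threads), staysUnderNodeLimit(threads));
        }
    }
}
```

Under these assumptions, the whole 80 to 100 range stays at or below 10,000 bytes, while the top of the old 100 to 120 range reaches 12,000 bytes.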

adnapibar pushed a commit that referenced this issue Oct 14, 2021
Fixes flakiness for test testReplicaThreadedThroughputDegradationAndRejection.

Reduced the number of concurrently executing threads from the initial range (100-120) to a new range (80-100), as the previous range breached the node limit of 10 KB in every execution where the number of threads was greater than 110.

Signed-off-by: Saurabh Singh <sisurab@amazon.com>
getsaurabh02 added a commit to getsaurabh02/OpenSearch that referenced this issue Oct 20, 2021
dblock pushed a commit that referenced this issue Oct 20, 2021
Signed-off-by: Saurabh Singh <sisurab@amazon.com>

Co-authored-by: Saurabh Singh <sisurab@amazon.com>
@anasalkouz
Member

@reta It seems the fix has already been merged, so I am closing the ticket. Please feel free to re-open it if you are still facing issues.
