[BUG] [Segment Replication] Primary mode assertion failure on source #6740

Closed
dreamer-89 opened this issue Mar 17, 2023 · 2 comments
Labels
bug Something isn't working distributed framework

Comments

@dreamer-89
Member

While handling a get_segment_files request, the source shard updates the target's checkpoint info locally. This information is used to evaluate the segment replication backpressure metrics (introduced in #6563). Ideally, the source should not perform a round of segment replication if it is not the active primary.
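
A minimal sketch of the guard described above, written against toy classes rather than the real OpenSearch code. `ToyShard`, `handoff`, and this `startSegmentCopy` signature are hypothetical names chosen for illustration, and this is not necessarily how #6757 addresses the problem. It only shows the idea that a source which is no longer in primary mode rejects the request instead of updating the target's checkpoint.

```java
import java.util.HashMap;
import java.util.Map;

public class PrimaryModeGuardSketch {

    /** Toy stand-in for a shard copy; NOT org.opensearch.index.shard.IndexShard. */
    static final class ToyShard {
        private final boolean primary;
        private boolean primaryMode;
        // Highest checkpoint version known to be visible on each target (keyed by allocation id).
        private final Map<String, Long> visibleCheckpoints = new HashMap<>();

        ToyShard(boolean primary, boolean primaryMode) {
            this.primary = primary;
            this.primaryMode = primaryMode;
        }

        boolean isPrimaryMode() {
            return primary && primaryMode;
        }

        /** Simulates a relocation handoff: the shard is still routed as primary but leaves primary mode. */
        void handoff() {
            primaryMode = false;
        }

        /** Mirrors the precondition seen in the stack trace: only valid in primary mode. */
        void updateVisibleCheckpoint(String allocationId, long version) {
            if (!isPrimaryMode()) {
                throw new IllegalStateException("not a primary shard in primary mode");
            }
            visibleCheckpoints.merge(allocationId, version, Long::max);
        }

        Long visibleCheckpoint(String allocationId) {
            return visibleCheckpoints.get(allocationId);
        }
    }

    /**
     * Hypothetical request handler: returns true if the copy may start,
     * false if the source is not the active primary and the request is rejected.
     * A real implementation would also need to make the check and the update atomic
     * with respect to a concurrent handoff; this single-threaded toy ignores that.
     */
    static boolean startSegmentCopy(ToyShard source, String targetAllocationId, long checkpointVersion) {
        if (!source.isPrimaryMode()) {
            return false; // do not touch backpressure bookkeeping on a non-active primary
        }
        source.updateVisibleCheckpoint(targetAllocationId, checkpointVersion);
        return true;
    }

    public static void main(String[] args) {
        ToyShard shard = new ToyShard(true, true);
        System.out.println(startSegmentCopy(shard, "replica-1", 5L)); // true
        shard.handoff();
        System.out.println(startSegmentCopy(shard, "replica-1", 6L)); // false
        System.out.println(shard.visibleCheckpoint("replica-1"));     // 5
    }
}
```

In the toy run, the request that arrives after the handoff is rejected and the recorded checkpoint stays at 5, whereas calling `updateVisibleCheckpoint` directly at that point would throw, which is the analogue of the assertion failure reported here.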

The assertion failure surfaces on CI as a testSingleIndexShardAllocation test failure (stack trace below).
Gradle check runs:

  1. https://build.ci.opensearch.org/job/gradle-check/12589/consoleFull
  2. https://build.ci.opensearch.org/job/gradle-check/12591/consoleFull

Sample failure stack trace

REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationAllocationIT.testSingleIndexShardAllocation" -Dtests.seed=55B79163B5336542 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=fr -Dtests.timezone=Africa/Douala -Druntime.java=19

org.opensearch.indices.replication.SegmentReplicationAllocationIT > testSingleIndexShardAllocation FAILED
    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=1205, name=opensearch[node_t3][generic][T#2], state=RUNNABLE, group=TGRP-SegmentReplicationAllocationIT]
        at __randomizedtesting.SeedInfo.seed([55B79163B5336542:8FF4C02420EEBA2D]:0)

        Caused by:
        java.lang.AssertionError: shard [test][40], node[AL0e5Bp2QNCVUX_j7e3A-g], relocating [pZ2Exp_0T4GqoPAU4MSOHA], [P], s[RELOCATING], a[id=uyQi7yr_TiC1P_aZB_nzYQ, rId=a-I_vS4KQm6twzy8RSJvfQ], expected_shard_size[230] is not a primary shard in primary mode
            at __randomizedtesting.SeedInfo.seed([55B79163B5336542]:0)
            at org.opensearch.index.shard.IndexShard.assertPrimaryMode(IndexShard.java:2341)
            at org.opensearch.index.shard.IndexShard.updateVisibleCheckpointForShard(IndexShard.java:2728)
            at org.opensearch.indices.replication.OngoingSegmentReplications.startSegmentCopy(OngoingSegmentReplications.java:127)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:157)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:154)
            at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
            at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453)
            at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
            at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
            at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.****/java.lang.Thread.run(Thread.java:1589)
@dreamer-89
Member Author

dreamer-89 commented Mar 20, 2023

Segment replication related ITs are flaky due to this.

One example Gradle check run containing the failure: https://build.ci.opensearch.org/job/gradle-check/12664

  2> REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationAllocationIT.testSingleIndexShardAllocation" -Dtests.seed=FDBD9C941C57B009 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=ja-JP -Dtests.timezone=America/St_Lucia -Druntime.java=19
  2> com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=633, name=opensearch[node_t2][generic][T#7], state=RUNNABLE, group=TGRP-SegmentReplicationAllocationIT]
        at __randomizedtesting.SeedInfo.seed([FDBD9C941C57B009:27FECDD3898A6F66]:0)

        Caused by:
        java.lang.AssertionError: shard [test][49], node[Pdp2eDa_QrCDWtLXfqvV5Q], relocating [h2EFzzAOT9GRrxYvy3E4kQ], [P], s[RELOCATING], a[id=FtYFi3exQGWuwI7j2ywZeg, rId=-hMUX6grTj2dOAj8PTC3zw], expected_shard_size[230] is not a primary shard in primary mode
            at __randomizedtesting.SeedInfo.seed([FDBD9C941C57B009]:0)
            at org.opensearch.index.shard.IndexShard.assertPrimaryMode(IndexShard.java:2341)
            at org.opensearch.index.shard.IndexShard.updateVisibleCheckpointForShard(IndexShard.java:2728)
            at org.opensearch.indices.replication.OngoingSegmentReplications.startSegmentCopy(OngoingSegmentReplications.java:127)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:157)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:154)
            at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
            at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453)
            at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
            at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
            at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.****/java.lang.Thread.run(Thread.java:1589)
REPRODUCE WITH: ./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation" -Dtests.seed=FDBD9C941C57B009 -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=el-CY -Dtests.timezone=ECT -Druntime.java=19

org.opensearch.indices.replication.SegmentReplicationRelocationIT > testPrimaryRelocation FAILED
    java.lang.AssertionError: Expected search hits on node: node_t1 to be at least 204 but was: 102
        at org.junit.Assert.fail(Assert.java:89)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.lambda$waitForSearchableDocs$0(SegmentReplicationBaseIT.java:132)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1060)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:127)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:122)
        at org.opensearch.indices.replication.SegmentReplicationBaseIT.waitForSearchableDocs(SegmentReplicationBaseIT.java:139)
        at org.opensearch.indices.replication.SegmentReplicationRelocationIT.testPrimaryRelocation(SegmentReplicationRelocationIT.java:122)

    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=647, name=opensearch[node_t0][generic][T#1], state=RUNNABLE, group=TGRP-SegmentReplicationRelocationIT]

        Caused by:
        java.lang.AssertionError: shard [test-idx-1][0], node[WBJYgS0LQ0CkKWlqWITlBA], relocating [7omFIh5gT_aV2kvEP2gk4w], [P], s[RELOCATING], a[id=QK0lMG1aRRK16WhYWgf_oQ, rId=iBKmA7SeTZCe6D0H_6hMug], expected_shard_size[230] is not a primary shard in primary mode
            at __randomizedtesting.SeedInfo.seed([FDBD9C941C57B009]:0)
            at org.opensearch.index.shard.IndexShard.assertPrimaryMode(IndexShard.java:2341)
            at org.opensearch.index.shard.IndexShard.updateVisibleCheckpointForShard(IndexShard.java:2728)
            at org.opensearch.indices.replication.OngoingSegmentReplications.startSegmentCopy(OngoingSegmentReplications.java:127)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:157)
            at org.opensearch.indices.replication.SegmentReplicationSourceService$GetSegmentFilesRequestHandler.messageReceived(SegmentReplicationSourceService.java:154)
            at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106)
            at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453)
            at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806)
            at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52)
            at java.****/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
            at java.****/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
            at java.****/java.lang.Thread.run(Thread.java:1589)

@dreamer-89
Member Author

The flaky tests should be fixed by #6757; closing this one.

@github-project-automation github-project-automation bot moved this from Todo to Done in Segment Replication Mar 20, 2023