Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Flaky tests #90

Open
bowenlan-amzn opened this issue Jun 30, 2021 · 30 comments
Open

Flaky tests #90

bowenlan-amzn opened this issue Jun 30, 2021 · 30 comments
Labels
maintenance improves code quality, but not the product testing

Comments

@bowenlan-amzn
Copy link
Member

bowenlan-amzn commented Jun 30, 2021

Describe the bug

Note
Please first search in this page for the same failure before recording yours

This issue is to record flaky tests we are seeing. Most of time, your PR is not the reason of the test failure. So you can just record the flaky test failure here and we will dig into it later.

Report like this:

Refer to this comment #90 (comment)
You can download the log of failed run from the bottom of the summary of the workflow.

How to debug the flaky test

Flaky test is hard to understand at the first look, considering we may not have enough log. And it's also not easy to reproduce (race condition, environment dependent).

An efficient way to handle these:

  1. Go into the failed run, based on the test failures, you should be able to relate to the source code part.
  2. Download the log associated with the failed run, locating the logs in integTest.log when this test was run.
  3. If we don't have enough log to understand how it could happen, add more DEBUG log into the source code that would be enough for us to understand when this flaky test shows up next time. Change test log level to debugging like this. So we can see more info in the cluster log.
  4. You can also add log into the test code, so we can see more info in the test log.

Script to reproduce the flaky

This script works on Linux system

script ../flakyoutput.txt # use shell command script to record the terminal output into a file

for i in {1..10}; do
	echo "=== Run time $i ==="
	./gradlew clean && ./gradlew integTest --tests "*IndexManagementIndicesIT" --tests "*MetadataRegressionIT" --tests "*ActionRetryIT" -PnumNodes=3
	retVal=$?
	if [ $retVal -eq 1 ]; then
	    echo "Flaky test found in run time $i"
	    break
	fi
done

ctrl+d # end shell command script
@bowenlan-amzn

This comment was marked as outdated.

@downsrob
Copy link
Contributor

downsrob commented Oct 5, 2021

IndexManagementIndicesIT.test update management index mapping with new schema version

I saw this flakey failure on 3/5 multi node runs I did
Workflow link

org.opensearch.indexmanagement.IndexManagementIndicesIT > test update management index mapping with new schema version FAILED
    org.opensearch.client.ResponseException: method [DELETE], host [http://[::1]:39473], URI [/.ds-test-data-stream-000001], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"index [.ds-test-data-stream-000001] is the write index for data stream [test-data-stream] and cannot be deleted"}],"type":"illegal_argument_exception","reason":"index [.ds-test-data-stream-000001] is the write index for data stream [test-data-stream] and cannot be deleted"},"status":400}

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.IndexManagementIndicesIT.test update management index mapping with new schema version" -Dtests.seed=BA2DF08B4E74969 -Dtests.security.manager=false -Dtests.locale=es-US -Dtests.timezone=Pacific/Niue -Druntime.java=14
        at __randomizedtesting.SeedInfo.seed([BA2DF08B4E74969:E2E0F3C6EB23FCC8]:0)
        at org.opensearch.client.RestClient.convertResponse(RestClient.java:339)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:309)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:283)
        at org.opensearch.indexmanagement.ODFERestTestCase.wipeAllODFEIndices(ODFERestTestCase.kt:139)
        at org.opensearch.indexmanagement.IndexManagementIndicesIT.test update management index mapping with new schema version(IndexManagementIndicesIT.kt:80)

thalurur pushed a commit to thalurur/open-index-management that referenced this issue Oct 22, 2021
* Rename plugin helper file

* Update TransformIndices.tsx

* Backport commits from main (opensearch-project#90)

* Release 1.1.0.0 (opensearch-project#83)

* Release 1.1.0.0

Signed-off-by: bowenlan-amzn <bowenlan23@gmail.com>

* Correct copyright notices

* UI fixes for new ISM UI (opensearch-project#84)

* Removes X icon from action/transition flyout footer next to cancel

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds tooltips to edit/delete icon buttons on the draggable action/transition components

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Removes overlay for flyout so clicking outside doesn't close flyout and removes X close button on top right

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds edit/delete tooltips for state component and removes underline from hovering states

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Changes state flyout from X Close to Cancel

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds JSON editor for allocation action

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds isValid to action interface and implements in actions

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Correctly removes unused rollover keys from rollover object, handles uncontrolled inputs and NaN minDocs

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds error for state name already existing in policy

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes duplicated actions when editing as previously we had a new ID and matching logic didn't find an action

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes other action inputs that can return NaN and creates a no conditions option that is the default

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates snapshot

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Correctly show danger toast on update/create failures

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes rollup action nesting multiple ism_template keys

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes timeout/retry settings turning into uncontrolled inputs and NaN value

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Removes text transformation on inputs and updates isValid methods and passes to form labels

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes allocation and rollup adding UIAction properties to policy JSON on updates

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates release notes w/ new PR changes

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Wraps actions/transitions in states component and updates default rollup to include ism_rollup key

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes some small UI touchups/issues for new ISM UI (opensearch-project#85)

* Moves cancel secondary button next to primary on action/transition flyout

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Removes default ISM template and increases width of empty prompt

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds punctuation

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Adds back X icon next to cancel for state flyout

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Makes inputs in flyout full width, updates help text, and adds some spacing

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates transition default to not include conditions empty object, and on change transition to delete conditions when selecting none

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Moves edit button out of policy settings content panel and on to global page for view policy

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates release note

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes broken link

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Fixes small issues on new ISM UI (opensearch-project#88)

* Fixes small issues on new ISM UI

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates snapshots

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates release notes

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

* Updates workflows to trigger on 1.* branches

Signed-off-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>

Co-authored-by: Bowen Lan <62091230+bowenlan-amzn@users.noreply.github.com>

* Support data filter when viewing sample data

* Update TransformOptions.tsx

* Add TransformService.test.ts

* Draft of date histogram panel

* Add helper method and move import

* Split calendar and fixed interval to 2 panels

* Add rendering test

* Add close popover test

* Update cypress-workflow.yml

* Update links.yml

* Update unit-tests-workflow.yml

* Update TransformOptions.tsx

* Update tests

* Update DateHistogramPanel.test.tsx

* Make scripted metrics editor larger

* Expand code editor related panels

* Undo changes to workflow

Signed-off-by: Annie Lee <leeyun@amazon.com>

* Update TimeAggregation.tsx

* Update DefineTransforms.tsx

undo changes to style

* Clean up code

* Update snapshot

Signed-off-by: Annie Lee <leeyun@amazon.com>

* Refactor the order of checking if name is defined before checking for duplicate name

* Update CreateTransformForm.tsx

* Update DefineTransforms.tsx

Signed-off-by: Annie Lee <leeyun@amazon.com>

* Update DefineTransforms.tsx

Signed-off-by: Annie Lee <leeyun@amazon.com>

* Update DefineTransforms.tsx

* Update DefineTransforms.tsx

* Update DefineTransforms.tsx

Co-authored-by: Drew Baugher <46505179+dbbaughe@users.noreply.github.com>
Co-authored-by: Bowen Lan <62091230+bowenlan-amzn@users.noreply.github.com>
@downsrob
Copy link
Contributor

downsrob commented Nov 5, 2021

RollupInterceptorIT.test roll up search
RollupInterceptorIT.test bucket and sub aggregations have correct values
RollupRunnerIT.test metadata is created for rollup job when none exists
RollupRunnerIT.test no-op execution when a full window of time to rollup is not available

Workflow link

org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT > test roll up search FAILED
    java.lang.AssertionError: Rollup job doesn't have metadata set
        at __randomizedtesting.SeedInfo.seed([687045F74F56AAF3:C7C066BB95AE41F9]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertNotNull(Assert.java:713)
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT$test roll up search$1.invoke(RollupInterceptorIT.kt:91)
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT$test roll up search$1.invoke(RollupInterceptorIT.kt:51)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
        at org.opensearch.indexmanagement.rollup.interceptor.RollupInterceptorIT.test roll up search(RollupInterceptorIT.kt:89)

TransformRunnerIT.test transform with data filter

Workflow link

errors
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=A19D930A096062ED -Dtests.security.manager=false -Dtests.locale=ru -Dtests.timezone=America/Recife -Druntime.java=14
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform with data filter FAILED
    java.lang.AssertionError: Transform job doesn't have metadata set
        at __randomizedtesting.SeedInfo.seed([A19D930A096062ED:87A193E3B89B59E1]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertNotNull(Assert.java:713)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test transform with data filter$metadata$1.invoke(TransformRunnerIT.kt:108)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test transform with data filter$metadata$1.invoke(TransformRunnerIT.kt:34)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:106)

@downsrob
Copy link
Contributor

downsrob commented Nov 15, 2021

IndexStateManagementHistoryIT.test history shard settings

Workflow link

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT.test history shard settings" -Dtests.seed=394C4E8F4FAA2B7B -Dtests.security.manager=false -Dtests.locale=mt -Dtests.timezone=Europe/Vaduz -Druntime.java=14
org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT > test history shard settings FAILED
    java.lang.AssertionError: Wrong number of shards expected:<2> but was:<1>
        at __randomizedtesting.SeedInfo.seed([394C4E8F4FAA2B7B:CA84DAB66688400A]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT$test history shard settings$2.invoke(IndexStateManagementHistoryIT.kt:452)
        at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT$test history shard settings$2.invoke(IndexStateManagementHistoryIT.kt:50)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:140)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:134)
        at org.opensearch.indexmanagement.indexstatemanagement.action.IndexStateManagementHistoryIT.test history shard settings(IndexStateManagementHistoryIT.kt:447)

  1> [2023-01-10T21:09:43,896][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] before test
  1> [2023-01-10T21:09:43,900][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] initializing REST clients against [http://[::1]:40303, http://127.0.0.1:35475]
  1> [2023-01-10T21:10:07,142][INFO ][o.o.i.i.a.IndexStateManagementHistoryIT] [test history shard settings] after test

logs.zip

@downsrob downsrob mentioned this issue Dec 3, 2021
1 task
@downsrob
Copy link
Contributor

downsrob commented Apr 7, 2022

Time captured is 0 problem, if we see such problem, we can update the >0 to be >=0

 org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform FAILED
    java.lang.AssertionError: Didn't capture search time
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=49642B2182B7B4FC -Dtests.security.manager=false -Dtests.locale=hi-IN -Dtests.timezone=America/Indiana/Vincennes -Druntime.java=17
        at __randomizedtesting.SeedInfo.seed([49642B2182B7B4FC:BEAE0232E3E6D91]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform FAILED
    java.lang.AssertionError: Doesn't capture indexed time
        at __randomizedtesting.SeedInfo.seed([E76B690D887C5E11:A5E5A20F24F5877C]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:68)

Test code ref, we should use different index name for different tests.

We saw a number of different flaky test failures in jenkins runs on security disabled domains for the 2.0.1 release which all seem to be related to capturing search time. Search time is checked after index time in these integration tests, and indexed into the metadata document at the same time. Perhaps the search result is cached from another test which is making the search time 0 on single node domains.

Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=C156C07169688EB4 -Dtests.security.manager=false -Dtests.locale=th -Dtests.timezone=Asia/Dubai -Druntime.java=17
  2> java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([C156C07169688EB4:83D80B73C5E157D9]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=9D6F01C659F0D54 -Dtests.security.manager=false -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
  2> java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([9D6F01C659F0D54:2FEAF0F5D4643658]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:108)
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform" -Dtests.seed=9D6F01C659F0D54 -Dtests.security.manager=false -Dtests.locale=sr-Latn -Dtests.timezone=Etc/GMT+6 -Druntime.java=17
  2> java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([9D6F01C659F0D54:4B583B1EC916D439]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform(TransformRunnerIT.kt:69)
Suite: Test class org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test continuous delay exclusion period" -Dtests.seed=5BC1719B69F3D965 -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=ROK -Druntime.java=17
  2> java.lang.AssertionError: Did not spend time searching
        at __randomizedtesting.SeedInfo.seed([5BC1719B69F3D965:B377556363CB8BF6]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test continuous delay exclusion period(RollupRunnerIT.kt:730)

TransformRunnerIt.test transform with data filter

Added commit that comments the assertion:
#581

 java.lang.AssertionError: Didn't capture search time
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter" -Dtests.seed=D58AE6EFD6EC0421 -Dtests.security.manager=false -Dtests.locale=es-BO -Dtests.timezone=Asia/Qostanay -Druntime.java=17

org.opensearch.indexmanagement.transform.TransformRunnerIT > test transform with data filter FAILED
    java.lang.AssertionError: Didn't capture search time
        at __randomizedtesting.SeedInfo.seed([D58AE6EFD6EC0421:F3B6E60667173F2D]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test transform with data filter(TransformRunnerIT.kt:111)

@bowenlan-amzn bowenlan-amzn changed the title Flakey tests Flaky tests Apr 19, 2022
@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Apr 19, 2022

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT.test rollover min primary shard size" -Dtests.seed=7BB5E96EAB38AAE6 -Dtests.security.manager=false -Dtests.locale=cs-CZ -Dtests.timezone=America/Lower_Princes -Druntime.java=17
org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT > test rollover min primary shard size FAILED

    java.lang.AssertionError: Found multiple shards over 100kb

        at __randomizedtesting.SeedInfo.seed([7BB5E96EAB38AAE6:7EE92CD3306B0DB7]:0)
Suite: Test class org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT
        at org.junit.Assert.fail(Assert.java:89)
  2> Dub 18, 2022 7:37:30 ODP. org.opensearch.client.RestClient logResponse
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.opensearch.indexmanagement.indexstatemanagement.action.RolloverActionIT.test rollover min primary shard size(RolloverActionIT.kt:287)
  2> WARNING: request [PUT http://[::1]:42817/bwc_index-1] returned 1 warnings: [299 OpenSearch-2.0.0-SNAPSHOT-d94ccc2b76e1b6ea07481b30903a12038f5703fd "[index.opendistro.index_state_management.rollover_alias] setting was deprecated in OpenSearch and will be removed in a future release! See the breaking changes documentation for the next major version."]

Add more comments in this PR #611

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Apr 22, 2022

java.lang.NullPointerException

RestStopRollupActionIT.test stopping a failed rollup

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=25E38D5DE3AB3302 -Dtests.security.manager=false -Dtests.locale=en-MT -Dtests.timezone=Africa/Brazzaville -Druntime.java=11

Suite: Test class org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=25E38D5DE3AB3302 -Dtests.security.manager=false -Dtests.locale=en-MT -Dtests.timezone=Africa/Brazzaville -Druntime.java=11
  2> java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([25E38D5DE3AB3302:41F6E686D5846EF4]:0)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:125)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:123)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:123)

TransformRunnerIT.test continuous transform with wildcard indices

  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices" -Dtests.seed=4CB2963FF335C01A -Dtests.security.manager=false -Dtests.locale=zh-Hans-CN -Dtests.timezone=Etc/GMT+9 -Druntime.java=11
  2> java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([4CB2963FF335C01A:A6D458A846E559FD]:0)
        at java.base/java.time.Instant.compareTo(Instant.java:1254)
        at java.base/java.time.Instant.isAfter(Instant.java:1271)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transform with wildcard indices$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:764)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test continuous transform with wildcard indices$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:30)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:758)

https://github.com/opensearch-project/index-management/blob/1.x/src/test/kotlin/org/opensearch/indexmanagement/transform/TransformRunnerIT.kt#L764

--tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified" -Dtests.seed=5DCD6D7B8AAB3141 -Dtests.security.manager=false -Dtests.locale=es-MX -Dtests.timezone=Europe/Minsk -Druntime.java=11
 
java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([5DCD6D7B8AAB3141:6E4476B0B23793F7]:0)
        at java.base/java.time.Instant.compareTo(Instant.java:1254)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test no-op execution when no buckets have been modified$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:428)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT$test no-op execution when no buckets have been modified$secondIterationMetadata$1.invoke(TransformRunnerIT.kt:30)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified(TransformRunnerIT.kt:424)

@downsrob
Copy link
Contributor

Looking at some of the recent flaky failures in RestStopRollupActionIT, the flaky failures seemed to be caused by the metadata doc missing, but there weren't enough logs to tell what was going wrong. Next time there is a PR which fails due to this, we should add additional debug logs in the rollup runner and metadata writer. The rollup tests seem particularly flaky.

@downsrob
Copy link
Contributor

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT.test starting a failed rollup" -Dtests.see d=3BC75D2F7F1C189E -Dtests.security.manager=false -Dtests.locale=
zh-Hans-SG -Dtests.timezone=Pacific/Fakaofo -Druntime.java=14
  2> java.lang.AssertionError: MetadataID on rollup was null
        at __randomizedtesting.SeedInfo.seed([3BC75D2F7F1C189E:78637077DCF0C390]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.assertTrue(Assert.java:42)
        at org.junit.Assert.assertNotNull(Assert.java:713)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT$test starting a failed rollup$1.invoke(RestStartRollupActionIT.kt:109)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT$test starting a failed rollup$1.invoke(RestStartRollupActionIT.kt:107)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStartRollupActionIT.test starting a failed rollup(RestStartRollupActionIT.kt:107)

I had this fail on a multinode cluser and when looking into the logs I saw that it was because the ISM index moved nodes:
On node 2:

[2022-05-12T21:44:04,036][INFO ][o.o.j.s.JobSweeper       ] [integTest-2] Descheduling jobs of shard 3 index .opendistro-ism-config as the shard is removed from this node.
[2022-05-12T21:44:04,333][INFO ][o.o.j.s.JobScheduler     ] [integTest-2] Scheduling job id restart_failed_rollup for index .opendistro-ism-config .
[2022-05-12T21:44:04,469][INFO ][o.o.j.s.JobScheduler     ] [integTest-2] Descheduling jobId: restart_failed_rollup

On node 1:

[2022-05-12T21:44:04,543][INFO ][o.o.j.s.JobScheduler     ] [integTest-0] Scheduling job id restart_failed_rollup for index .opendistro-ism-config 
.... this final descheduling was because the ism index was deleted during cleanup
[2022-05-12T21:44:36,179][INFO ][o.o.j.s.JobScheduler     ] [integTest-0] Descheduling jobId: restart_failed_rollup

It seems that some of these multinode flaky failures could be related to opensearch-project/job-scheduler#173

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Jun 6, 2022

DeleteActionIT.test basic

There could be race condition to check history document after an index gets deleted.

https://github.com/opensearch-project/index-management/blob/main/src/test/kotlin/org/opensearch/indexmanagement/indexstatemanagement/action/DeleteActionIT.kt#L55

[2022-06-06T05:15:20,891][INFO ][o.o.i.i.ManagedIndexRunner] [integTest-0] Executing attempt_delete for deleteactionit_index_1
[2022-06-06T05:15:20,984][INFO ][o.o.j.s.JobSweeper       ] [integTest-0] Descheduling job p8ToeW8xSnaMY9Ju0bjRig on index .opendistro-ism-config
[2022-06-06T05:15:20,985][INFO ][o.o.j.s.JobScheduler     ] [integTest-0] Descheduling jobId: p8ToeW8xSnaMY9Ju0bjRig
[2022-06-06T05:15:20,994][INFO ][o.o.i.i.ManagedIndexRunner] [integTest-0] Finished executing attempt_delete for deleteactionit_index_1
[2022-06-06T05:15:21,004][INFO ][o.o.j.s.JobSweeper       ] [integTest-0] Indexing failed for job p8ToeW8xSnaMY9Ju0bjRig#metadata on index .opendistro-ism-config
[2022-06-06T05:15:21,010][ERROR][o.o.i.i.ManagedIndexRunner] [integTest-0] There was VersionConflictEngineException trying to update the metadata for deleteactionit_index_1. Message: [p8ToeW8xSnaMY9Ju0bjRig#metadata]: version conflict, required seqNo [5], primary term [1]. but no document was found
org.opensearch.index.engine.VersionConflictEngineException: [p8ToeW8xSnaMY9Ju0bjRig#metadata]: version conflict, required seqNo [5], primary term [1]. but no document was found
[2022-06-06T05:15:21,011][ERROR][o.o.i.i.ManagedIndexRunner] [integTest-0] Failed to update ManagedIndexMetaData after executing the Step : attempt_delete

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Sep 1, 2022

https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping rollup with metadata FAILED
    java.lang.AssertionError: Rollup is not STARTED expected:<started> but was:<finished>
        at __randomizedtesting.SeedInfo.seed([F0C92609FBF50346:BB84B25D20FFC0F2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:226)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:222)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping rollup with metadata(RestStopRollupActionIT.kt:222)

logs.zip


RestStartTransformActionIT.test starting a finished transform

Workflow link

error
2021-08-02T20:16:09.9041497Z REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT.test starting a finished transform" -Dtests.seed=DD2C384C67FC30BD -Dtests.security.manager=false -Dtests.locale=uk-UA -Dtests.timezone=America/Campo_Grande -Druntime.java=14
2021-08-02T20:16:09.9050653Z 
2021-08-02T20:16:09.9828220Z org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT > test starting a finished transform FAILED
2021-08-02T20:16:09.9831015Z     java.lang.AssertionError: Status should be finished expected:<finished> but was:<started>
2021-08-02T20:16:09.9832385Z         at __randomizedtesting.SeedInfo.seed([DD2C384C67FC30BD:87A478982B41BA5]:0)
2021-08-02T20:16:09.9833459Z         at org.junit.Assert.fail(Assert.java:89)
2021-08-02T20:16:09.9834423Z         at org.junit.Assert.failNotEquals(Assert.java:835)
2021-08-02T20:16:09.9835510Z         at org.junit.Assert.assertEquals(Assert.java:120)
2021-08-02T20:16:09.9838406Z         at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT$test starting a finished transform$2.invoke(RestStartTransformActionIT.kt:196)
2021-08-02T20:16:09.9845749Z         at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT$test starting a finished transform$2.invoke(RestStartTransformActionIT.kt:32)
2021-08-02T20:16:09.9854345Z         at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:126)
2021-08-02T20:16:09.9857268Z         at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:120)
2021-08-02T20:16:09.9860591Z         at org.opensearch.indexmanagement.transform.resthandler.RestStartTransformActionIT.test starting a finished transform(RestStartTransformActionIT.kt:193)
2021-08-02T20:16:17.8835315Z 

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Sep 9, 2022

test move metadata service
java.lang.AssertionError: expected:<Managed index's metadata is pending migration.> but was:<null>

Update: from the log, everything is working fine, but metadata move cannot find the metadata in the cluster state, I think this is more likely a core problem. And we will remove this test anyway soon.

org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT > test move metadata service FAILED
    java.lang.AssertionError: expected:<Managed index's metadata is pending migration.> but was:<null>
        at __randomizedtesting.SeedInfo.seed([EBD0D8B7D135C4E0:49A3DDF6C040E639]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT$test move metadata service$1.invoke(MetadataRegressionIT.kt:92)
        at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT$test move metadata service$1.invoke(MetadataRegressionIT.kt:91)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.indexstatemanagement.MetadataRegressionIT.test move metadata service(MetadataRegressionIT.kt:91)

 1> [2022-09-08T00:32:56,025][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] before test
  1> [2022-09-08T00:32:56,028][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: setting up test
  1> [2022-09-08T00:32:56,084][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: all set up test
  1> [2022-09-08T00:32:56,140][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.6] to [0.0]
  1> [2022-09-08T00:32:56,142][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.6] to [0.0]
  1> [2022-09-08T00:32:56,197][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.metadata_migration.status] from [0] to [-1]
  1> [2022-09-08T00:32:56,226][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.metadata_migration.status] from [-1] to [0]
  1> [2022-09-08T00:32:56,631][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] true
  1> [2022-09-08T00:32:56,637][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] check if metadata is saved in cluster state: {index_uuid=9VHQYzkVT2y8JVczuACWQQ, rolled_over=false, policy_id=metadataregressionit_testPolicyName_1, policy_seq_no=0, index=metadataregressionit_index_1, policy_primary_term=1, state={"name":"ReplicaCountState","start_time":1234}, retry_info={"failed":false,"consumed_retries":0}, policy_completed=false, info={"message":"Happy moving"}}
  1> [2022-09-08T00:33:16,765][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.0] to [0.6]
  1> [2022-09-08T00:33:16,766][INFO ][o.o.c.s.ClusterSettings  ] [external_0] updating [plugins.index_state_management.jitter] from [0.0] to [0.6]
  1> [2022-09-08T00:33:16,776][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: cleaning up after test
  1> [2022-09-08T00:33:16,896][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] [MetadataRegressionIT#test move metadata service]: cleaned up after test
  1> [2022-09-08T00:33:16,896][INFO ][o.o.i.i.MetadataRegressionIT] [test move metadata service] after test

logs.zip

@bowenlan-amzn
Copy link
Member Author

org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT > test explaining a snapshot management policy FAILED
    java.lang.AssertionError: Policy sequence numbers didn't match expected:<1> but was:<0>
        at __randomizedtesting.SeedInfo.seed([59FF6DB8C59564EE:B1933B2EA2BBF760]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT$test explaining a snapshot management policy$1.invoke(RestExplainSnapshotManagementIT.kt:55)
        at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT$test explaining a snapshot management policy$1.invoke(RestExplainSnapshotManagementIT.kt:35)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT.test explaining a snapshot management policy(RestExplainSnapshotManagementIT.kt:35)

[2022-09-22T23:45:03,302][INFO ][o.o.i.s.r.RestExplainSnapshotManagementIT] [test explaining a snapshot management policy] before test
[2022-09-22T23:46:13,587][INFO ][o.o.i.s.r.RestExplainSnapshotManagementIT] [test explaining a snapshot management policy] after test

logs.zip

@bowenlan-amzn bowenlan-amzn added maintenance improves code quality, but not the product and removed Severity-Minor Priority-Low labels Nov 5, 2022
@opensearch-project opensearch-project deleted a comment from dbbaughe Nov 10, 2022
@opensearch-project opensearch-project deleted a comment from dbbaughe Nov 10, 2022
@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Nov 10, 2022

RollupRunnerIT.test rollup action with alias as target_index successfully

[2022-11-10T22:06:04,534][INFO ][o.o.i.r.r.RollupRunnerIT ] [test rollup action with alias as target_index successfully] before test
[2022-11-10T22:06:08,891][INFO ][o.o.i.r.r.RollupRunnerIT ] [test rollup action with alias as target_index successfully] after test

org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT > test rollup action with alias as target_index successfully FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:42643/], URI [/_plugins/_rollup/jobs/runner_with_alias_as_target?if_seq_no=9&if_primary_term=1], status line [HTTP/1.1 409 Conflict]
    {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[runner_with_alias_as_target]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [16] and primary term [1]","index":".opendistro-ism-config","shard":"0","index_uuid":"o7nDgHBFQTWUBENE987S_Q"}],"type":"version_conflict_engine_exception","reason":"[runner_with_alias_as_target]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [16] and primary term [1]","index":".opendistro-ism-config","shard":"0","index_uuid":"o7nDgHBFQTWUBENE987S_Q"},"status":409}
        at __randomizedtesting.SeedInfo.seed([7332C3AEA733CC1E:570293340F7A6A8C]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:85)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:68)
        at app//org.opensearch.indexmanagement.rollup.runner.RollupRunnerIT.test rollup action with alias as target_index(RollupRunnerIT.kt:871)

Need to understand rollup job document update logic more deeply to see why the race condition can happen.

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Nov 21, 2022

Continue to dive deep on this issue: #90 (comment)

Some rollup job fails to run after 2s in multi-node tests, the related logs

587:[node 2] JobSweeper started listening to operations on index .opendistro-ism-config
603:[node 2] [.opendistro-ism-config] creating index
620:[node 0] JobSweeper started listening to operations on index .opendistro-ism-config
630:[node 0] Descheduling jobs of shard 0 index .opendistro-ism-config as the shard is removed from this node.
631:[node 2] Descheduling jobs of shard 0 index .opendistro-ism-config as the shard is removed from this node.
682:[node 2] JobSweeper started listening to operations on index .opendistro-ism-config
723:[node 0] Scheduling job id restart_failed_rollup for index .opendistro-ism-config
734:[      ] send the update rollup start time request
775:[node 0] Descheduling jobId: restart_failed_rollup
782:[node 2] Scheduling job id restart_failed_rollup for index .opendistro-ism-config

From the log, one reasonable hypothesis is the shard of config index on node 2 has not been updated to date (with the execute 2s later change)
Added the refresh parameter for update start time call, but it doesn't resolve the problem.

It's not clear why the descheduling on node 0 and scheduling on the other node 2 happens, does it relate to the update start time call? Need more log in JS to find out the reason of this.

@petardz
Copy link
Contributor

petardz commented Dec 12, 2022

https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping rollup with metadata FAILED
    java.lang.AssertionError: Rollup is not STARTED expected:<started> but was:<finished>
        at __randomizedtesting.SeedInfo.seed([F0C92609FBF50346:BB84B25D20FFC0F2]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:226)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping rollup with metadata$1.invoke(RestStopRollupActionIT.kt:222)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping rollup with metadata(RestStopRollupActionIT.kt:222)

logs.zip

Job in test is defined as continuous=false so it could finish before test catches that STARTED status change in metadata. We can just make it continuous=true

@petardz
Copy link
Contributor

petardz commented Dec 12, 2022

java.lang.NullPointerException

RestStopRollupActionIT.test stopping a failed rollup

When node disconnects(which happens a lot for some reason), job might stay blocked on API calls like _search for 30+ seconds(tested locally) and our waitFor { ... } in tests would timeout. I suggest we increase this default timeout for waitFor from 30sec to 90 or so

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Jan 3, 2023

RestDeleteRollupActionIT.test deleting a rollup

org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT > test deleting a rollup FAILED
    org.opensearch.client.ResponseException: method [PUT], host [https://localhost:9200], URI [/_plugins/_rollup/jobs/GjkLiHNdTP?refresh=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [rollup_metadata.continuous.next_window_end_time] cannot be changed from type [long] to [date]"}],"type":"illegal_argument_exception","reason":"mapper [rollup_metadata.continuous.next_window_end_time] cannot be changed from type [long] to [date]"},"status":400}
        at __randomizedtesting.SeedInfo.seed([19AD219FA033B002:35C712E4A8126070]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:85)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:68)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollupJson(RollupRestTestCase.kt:99)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollup(RollupRestTestCase.kt:77)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRollup$default(RollupRestTestCase.kt:71)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRandomRollup(RollupRestTestCase.kt:111)
        at app//org.opensearch.indexmanagement.rollup.RollupRestTestCase.createRandomRollup$default(RollupRestTestCase.kt:109)
        at app//org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT.test deleting a rollup(RestDeleteRollupActionIT.kt:21)
org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT > test deleting a rollup that doesn't exist in existing config index FAILED
    java.lang.AssertionError: expected:<NOT_FOUND> but was:<BAD_REQUEST>
        at __randomizedtesting.SeedInfo.seed([3FE7F518CA239133:249BA19E7AC77924]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:[120](https://github.com/opensearch-project/index-management/actions/runs/6115943950/job/16600369681#step:5:121))
        at org.junit.Assert.assertEquals(Assert.java:146)
        at org.opensearch.indexmanagement.rollup.resthandler.RestDeleteRollupActionIT.test deleting a rollup that doesn't exist in existing config index(RestDeleteRollupActionIT.kt:37)

logs.zip


The flaky reason for such is because the rollup job, even after the config index deleted, the running cannot be interrupted but continue to still operate (saving some doc into config index)
So the idea to fix this is by waiting for the rollup job to "truely" finish then do the next test operation.

  • for continuous mode
    test continuous rollup search
    How to fix: Can based on the processed documents number to decide whether this run is finished or the next start window, refer to

    // The test data does not have tpep_pickup_datetime going past "2019-01-01" so we can assume that
    // if the nextWindowStartTime is after 2019-01-02T00:00:00Z then all data has been rolled up
    assertTrue(
    "Rollup has not caught up yet, docs processed: ${rollupMetadata.stats.documentsProcessed}",
    rollupMetadata.continuous!!.nextWindowStartTime.isAfter(Instant.parse("2019-01-02T00:00:00Z"))
    )

  • for one-time mode, we wait for metadata status becomes FINISHED, but then the rollup job will disable itself, which will recreate the index and block the next job run

    if (!updatableJob.continuous) {
    if (listOf(RollupMetadata.Status.STOPPED, RollupMetadata.Status.FINISHED, RollupMetadata.Status.FAILED).contains(metadata.status)) {
    disableJob(updatableJob, metadata)
    }
    }

    How to fix: can based on whether the job is already disabled which is the last update call from one rollup jon
    For all the non continuous rollup job, should check like this
    assertEquals("Rollup is not finished", RollupMetadata.Status.FINISHED, rollupMetadata.status)
    assertTrue("Rollup is not disabled", !rollupJob.enabled)

@opensearch-project opensearch-project deleted a comment from downsrob Jan 10, 2023
@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Jan 11, 2023

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stop rollup when multiple shards configured for IM config index FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://[::1]:41461], URI [/_plugins/_rollup/jobs/multi_shard_stop/_stop], status line [HTTP/1.1 409 Conflict]
    {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[UhUCS4kBSAFK6pqWBC9G]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [10] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"lVNkVSDPRDqJg1pNZJMMow"}],"type":"version_conflict_engine_exception","reason":"[UhUCS4kBSAFK6pqWBC9G]: version conflict, required seqNo [9], primary term [1]. current document has seqNo [10] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"lVNkVSDPRDqJg1pNZJMMow"},"status":409}
        at __randomizedtesting.SeedInfo.seed([4C03DDBDB5DE427F:396CA6272AA3E4F]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:384)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:354)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:329)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:110)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:95)
        at app//org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index(RestStopRollupActionIT.kt:302)

  1> [2023-07-12T18:49:25,309][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] before test
  1> [2023-07-12T18:49:28,596][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] Update rollup start time to 1689180510596
  1> [2023-07-12T18:49:32,665][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] There are still tasks running after this test that might break subsequent tests [indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index].
  1> [2023-07-12T18:49:32,666][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] after test

logs.zip

@bowenlan-amzn
Copy link
Member Author

org.opensearch.indexmanagement.transform.TransformRunnerIT > test continuous transform with wildcard indices FAILED
    java.lang.AssertionError: Not the expected pages processed expected:<6> but was:<2>
        at __randomizedtesting.SeedInfo.seed([BBAD4A6DC3D51A99:51CB84FA7605837E]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:743)

For now only seen on 1.3.8

logs.zip

@bowenlan-amzn
Copy link
Member Author

Suite: Test class org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT
2> REPRODUCE WITH: gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT.test stopping a stopped Transform" -Dtests.seed=3AE2878216465C0D -Dtests.security.manager=false -Dtests.locale=ar-SA -Dtests.timezone=PLT -Druntime.java=17
2> java.lang.AssertionError: expected:<{acknowledged=true}> but was:<{acknowledged=false}>
at __randomizedtesting.SeedInfo.seed([3AE2878216465C0D:BB536F5FF53592B4]:0)
at org.junit.Assert.fail(Assert.java:89)
at org.junit.Assert.failNotEquals(Assert.java:835)
at org.junit.Assert.assertEquals(Assert.java:120)
at org.junit.Assert.assertEquals(Assert.java:146)
at org.opensearch.indexmanagement.transform.resthandler.RestStopTransformActionIT.test stopping a stopped Transform(RestStopTransformActionIT.kt:45)

@yuye-aws
Copy link
Member

yuye-aws commented May 12, 2023

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED
java.lang.NullPointerException
at __randomizedtesting.SeedInfo.seed([93A0B009A21A1943:F7B5DBD2943544B5]:0)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:131)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:129)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:119)
at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:112)
at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:129)
1> [2023-05-11T22:33:50,096][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a stopped rollup] before test
4_Run integration tests with multi node config.txt

@getsaurabh02
Copy link
Member

Assertion failure -

Suite: Test class org.opensearch.indexmanagement.transform.TransformRunnerIT
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices" -Dtests.seed=566DD28B92044ABE -Dtests.security.manager=false -Dtests.locale=de-CH -Dtests.timezone=Asia/Beirut -Druntime.java=8
  2> java.lang.AssertionError: Not the expected pages processed expected:<6> but was:<2>
        at __randomizedtesting.SeedInfo.seed([566DD28B92044ABE:BC0B1C1C27D4D359]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:647)
        at org.opensearch.indexmanagement.transform.TransformRunnerIT.test continuous transform with wildcard indices(TransformRunnerIT.kt:743)

#818 (comment)

@bowenlan-amzn
Copy link
Member Author

org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT > test explaining a snapshot management policy FAILED
    java.lang.AssertionError: Policy creation trigger time didn't match expected:<1693475220000> but was:<1693475160000>
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.snapshotmanagement.resthandler.RestExplainSnapshotManagementIT.test explaining a snapshot management policy" -Dtests.seed=2D65846124E40D95 -Dtests.security.manager=false -Dtests.locale=ar-IQ -Dtests.timezone=Europe/Rome -Druntime.java=17
        at __randomizedtesting.SeedInfo.seed([2D65846124E40D95:C509D2F743CA9E1B]:0)
        at org.junit.Assert.fail(Assert.java:89)
        at org.junit.Assert.failNotEquals(Assert.java:835)
        at org.junit.Assert.assertEquals(Assert.java:120)
        at 

1693475220000 and 1693475160000 has 1m difference

val expectedCreationTime = smPolicy.creation.schedule.getNextExecutionTime(now()).toEpochMilli()

This line should be move forward right after updateSMPolicyStartTime(smPolicy)

@bowenlan-amzn
Copy link
Member Author

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index" -Dtests.seed=4C04B53ACE4A7D12 -Dtests.security.manager=false -Dtests.locale=be-BY -Dtests.timezone=Africa/Blantyre -Druntime.java=17
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stop rollup when multiple shards configured for IM config index FAILED
    org.opensearch.client.ResponseException: method [POST], host [http://127.0.0.1:40201/], URI [/_plugins/_rollup/jobs/multi_shard_stop/_stop], status line [HTTP/1.1 409 Conflict]
    {"error":{"root_cause":[{"type":"version_conflict_engine_exception","reason":"[3SZraIoBu38AgWriQ-H6]: version conflict, required seqNo [2], primary term [1]. current document has seqNo [4] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"odLQixofSoy2BMmZcMD-jg"}],"type":"version_conflict_engine_exception","reason":"[3SZraIoBu38AgWriQ-H6]: version conflict, required seqNo [2], primary term [1]. current document has seqNo [4] and primary term [1]","index":".opendistro-ism-config","shard":"3","index_uuid":"odLQixofSoy2BMmZcMD-jg"},"status":409}
        at __randomizedtesting.SeedInfo.seed([4C04B53ACE4A7D12:391A2E5093E0122]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:[110](https://github.com/opensearch-project/index-management/actions/runs/6087082313/job/16529399791?pr=917#step:5:111))
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:95)
        at app//org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index(RestStopRollupActionIT.kt:302)

  1> [2023-09-06T04:56:09,984][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] before test

  1> [2023-09-06T04:56:09,987][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] initializing REST clients against [http://[::1]:34255, http://127.0.0.1:40201]/
Suite: Test class org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT
  1> [2023-09-06T04:56:12,771][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] Update rollup start time to 1693968914771
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stop rollup when multiple shards configured for IM config index" -Dtests.seed=4C04B53ACE4A7D12 -Dtests.security.manager=false -Dtests.locale=be-BY -Dtests.timezone=Africa/Blantyre -Druntime.java=17
  1> [2023-09-06T04:56:15,073][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stop rollup when multiple shards configured for IM config index] after test

logs.zip

@bowenlan-amzn
Copy link
Member Author

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup" -Dtests.seed=[60](https://github.com/opensearch-project/index-management/actions/runs/6087082323/job/16529957685?pr=917#step:4:61)268BB86926DC53 -Dtests.security.manager=false -Dtests.locale=es-PA -Dtests.timezone=Etc/GMT+6 -Druntime.java=17

org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED
    java.lang.NullPointerException
        at __randomizedtesting.SeedInfo.seed([60268BB86926DC53:433E0635F0981A5]:0)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:131)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT$test stopping a failed rollup$1.invoke(RestStopRollupActionIT.kt:129)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor(TestHelpers.kt:125)
        at org.opensearch.indexmanagement.TestHelpersKt.waitFor$default(TestHelpers.kt:113)
        at org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT.test stopping a failed rollup(RestStopRollupActionIT.kt:129)

  1> [2023-09-05T21:42:00,471][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] before test
  1> [2023-09-05T21:42:00,685][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] Update rollup start time to 1693971662685
  1> [2023-09-05T21:42:21,926][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] There are still tasks running after this test that might break subsequent tests [indices:admin/seq_no/retention_lease_sync[p], indices:data/write/bulk[s], indices:data/write/bulk[s][p], indices:data/write/index, internal:index/shard/recovery/start_recovery, retention_lease_sync].
  1> [2023-09-05T21:42:21,927][INFO ][o.o.i.r.r.RestStopRollupActionIT] [test stopping a failed rollup] after test

logs.zip

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Sep 6, 2023

org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT > test index not failed FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:36939/], URI [/_plugins/_ism/policies/nKcdTokvYV?refresh=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"},"status":400}
        at __randomizedtesting.SeedInfo.seed([277DE7DD2C0FC9DA:DCD1D3F9DB703133]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:385)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:355)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:330)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:86)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:69)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicyJson(IndexStateManagementRestTestCase.kt:132)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy(IndexStateManagementRestTestCase.kt:[107](https://github.com/opensearch-project/index-management/actions/runs/6092475260/job/16530644871?pr=919#step:5:108))
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy$default(IndexStateManagementRestTestCase.kt:101)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy(IndexStateManagementRestTestCase.kt:144)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy$default(IndexStateManagementRestTestCase.kt:142)
        at app//org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT.test index not failed(RestRetryFailedManagedIndexActionIT.kt:181)

  1> [2023-09-05T21:10:30,893][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] before test
  1> [2023-09-05T21:10:46,956][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] after test
  1> [2023-09-05T21:10:46,961][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index not failed] before test
  1> [2023-09-05T21:10:47,549][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index not failed] after test
  1> [2023-09-05T21:10:47,573][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index failed] before test
  1> [2023-09-05T21:10:54,254][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index failed] after test

logs (1).zip

org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT > test index has no metadata FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://[::1]:43057], URI [/_plugins/_ism/policies/VQXByzgvXd?refresh=true], status line [HTTP/1.1 400 Bad Request]
    {"error":{"root_cause":[{"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"}],"type":"illegal_argument_exception","reason":"mapper [managed_index_metadata.step.step_status] cannot be changed from type [text] to [keyword]"},"status":400}
        at __randomizedtesting.SeedInfo.seed([9D243BEA443D0FB7:81B369160383AE27]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:376)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:346)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:321)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest(TestHelpers.kt:86)
        at app//org.opensearch.indexmanagement.TestHelpersKt.makeRequest$default(TestHelpers.kt:69)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicyJson(IndexStateManagementRestTestCase.kt:132)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy(IndexStateManagementRestTestCase.kt:[107](https://github.com/opensearch-project/index-management/actions/runs/6116145222/job/16600899736#step:5:108))
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createPolicy$default(IndexStateManagementRestTestCase.kt:101)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy(IndexStateManagementRestTestCase.kt:144)
        at app//org.opensearch.indexmanagement.indexstatemanagement.IndexStateManagementRestTestCase.createRandomPolicy$default(IndexStateManagementRestTestCase.kt:142)
        at app//org.opensearch.indexmanagement.indexstatemanagement.resthandler.RestRetryFailedManagedIndexActionIT.test index has no metadata(RestRetryFailedManagedIndexActionIT.kt:156)

 1> [2023-09-07T20:58:23,281][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] before test
  1> [2023-09-07T20:58:38,825][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test reset action start time] after test
  1> [2023-09-07T20:58:38,832][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index has no metadata] before test
  1> [2023-09-07T20:58:38,980][INFO ][o.o.i.i.r.RestRetryFailedManagedIndexActionIT] [test index has no metadata] after test

logs.zip

@bowenlan-amzn
Copy link
Member Author

bowenlan-amzn commented Oct 4, 2023

There are situations multi node test running very long, 5-10x of normal run time. One such situation I checked is one node is stuck with problematic node lock.

This happens when running the test test allocate exclude, and when the index .opendistro-ism-config gets deleted at the end of the test

org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/home/runner/work/index-management/index-management/build/testclusters/integTest-1/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],creationTime=2023-10-04T17:06:54.440010251Z)

It never be able to join back the cluster also...

[2023-10-04T17:08:53,839][WARN ][o.o.e.NodeEnvironment    ] [integTest-1] lock assertion failed
java.nio.channels.ClosedByInterruptException: null
	at java.nio.channels.spi.AbstractInterruptibleChannel.end(AbstractInterruptibleChannel.java:199) ~[?:?]
	at sun.nio.ch.FileChannelImpl.endBlocking(FileChannelImpl.java:171) ~[?:?]
	at sun.nio.ch.FileChannelImpl.size(FileChannelImpl.java:395) ~[?:?]
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:169) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1121) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.availableShardPaths(NodeEnvironment.java:949) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.deleteShardDirectoryUnderLock(NodeEnvironment.java:580) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.deleteShardStore(IndicesService.java:1210) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService.onShardClose(IndexService.java:670) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:793) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.IndexService$StoreCloseListener.accept(IndexService.java:780) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.Store.closeInternal(Store.java:566) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.Store$1.closeInternal(Store.java:193) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:78) [opensearch-common-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.store.Store.decRef(Store.java:539) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1775) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1746) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.refresh(IndexShard.java:1303) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.lambda$buildRefreshListeners$44(IndexShard.java:4325) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.RefreshListeners.forceRefreshes(RefreshListeners.java:128) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.relocated(IndexShard.java:830) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$27(RecoverySourceHandler.java:850) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.CancellableThreads.executeIO(CancellableThreads.java:124) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.CancellableThreads.execute(CancellableThreads.java:104) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.recovery.RecoverySourceHandler.lambda$finalizeRecovery$28(RecoverySourceHandler.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$1.onResponse(ActionListener.java:82) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
	at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:79) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$4.onResponse(ActionListener.java:182) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.core.action.ActionListener$6.onResponse(ActionListener.java:301) [opensearch-core-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:181) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:70) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1493) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:394) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:388) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:849) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2023-10-04T17:08:53,844][WARN ][o.o.i.c.IndicesClusterStateService] [integTest-1] [[.opendistro-ism-config/svhpmccOR-ighAH7RXtFkA]] failed to complete pending deletion for index
org.apache.lucene.store.AlreadyClosedException: FileLock invalidated by an external force: NativeFSLock(path=/home/runner/work/index-management/index-management/build/testclusters/integTest-1/data/nodes/0/node.lock,impl=sun.nio.ch.FileLockImpl[0:9223372036854775807 exclusive invalid],creationTime=2023-10-04T17:06:54.440010251Z)
	at org.apache.lucene.store.NativeFSLockFactory$NativeFSLock.ensureValid(NativeFSLockFactory.java:165) ~[lucene-core-9.8.0.jar:9.8.0 d914b3722bd5b8ef31ccf7e8ddc638a87fd648db - 2023-09-21 21:57:47]
	at org.opensearch.env.NodeEnvironment.assertEnvIsLocked(NodeEnvironment.java:1121) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.indexPaths(NodeEnvironment.java:932) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.env.NodeEnvironment.deleteIndexDirectoryUnderLock(NodeEnvironment.java:656) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.processPendingDeletes(IndicesService.java:1488) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService$1.doRun(IndicesClusterStateService.java:388) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:908) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]

logs.zip

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance improves code quality, but not the product testing
Projects
None yet
Development

No branches or pull requests

5 participants