Flaky tests #90
I saw this flaky failure on 3/5 multi-node runs I did
Time captured is 0 problem: if we see this problem, we can update the
Test code ref: we should use a different index name for each test. We saw a number of different flaky test failures in Jenkins runs on security-disabled domains for the 2.0.1 release, all of which seem to be related to capturing search time. Search time is checked after index time in these integration tests, and both are indexed into the metadata document at the same time. Perhaps the search result is cached from another test, which makes the search time 0 on single-node domains.
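A minimal sketch of the "different index name for each test" idea, written in Java for illustration. The class `TestIndexNames` and the method `uniqueIndexName` are hypothetical names, not helpers from this repository; the only external fact relied on is that index names must be lowercase.

```java
import java.util.Locale;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper for illustration; not part of the project's test code.
final class TestIndexNames {
    // Per-JVM counter so two calls never return the same name.
    private static final AtomicLong COUNTER = new AtomicLong();

    private TestIndexNames() {}

    /** Builds a lowercase index name that is unique within this JVM. */
    static String uniqueIndexName(String testName) {
        // Lowercase and collapse anything that is not a letter or digit into "-".
        String slug = testName.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]+", "-");
        // Trim leading and trailing hyphens, since a name must not start with "-".
        slug = slug.replaceAll("^-+|-+$", "");
        if (slug.isEmpty()) {
            slug = "index";
        }
        return slug + "-" + COUNTER.incrementAndGet();
    }
}
```

Each test would then create and search its own index, so a result cached for another test's index could not be returned for it.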
TransformRunnerIT.test transform with data filter: added a commit that comments out the assertion:
Add more comments in this PR #611
--tests "org.opensearch.indexmanagement.transform.TransformRunnerIT.test no-op execution when no buckets have been modified"
Looking at some of the recent flaky failures in RestStopRollupActionIT, they seem to be caused by the metadata doc missing, but there weren't enough logs to tell what was going wrong. The next time a PR fails due to this, we should add additional debug logs in the rollup runner and metadata writer. The rollup tests seem particularly flaky.
I had this fail on a multi-node cluster, and when looking into the logs I saw that it was because the ISM index moved nodes:
On node 1:
It seems that some of these multi-node flaky failures could be related to opensearch-project/job-scheduler#173
There could be a race condition when checking the history document after an index gets deleted.
https://github.com/opensearch-project/index-management/runs/8145891847?check_suite_focus=true
test move metadata service. Update: from the log, everything is working fine, but the metadata move cannot find the metadata in the cluster state. I think this is more likely a core problem, and we will remove this test soon anyway.
RollupRunnerIT.test rollup action with alias as target_index successfully
Need to understand the rollup job document update logic more deeply to see why the race condition can happen.
Continuing to dive deep on this issue: #90 (comment). Some rollup jobs fail to run after 2s in multi-node tests; the related logs
From the log, one reasonable hypothesis is that the shard of the config index on node 2 has not been brought up to date (with the "execute 2s later" change). It's not clear why the descheduling on node 0 and the scheduling on node 2 happen; does it relate to the update start time call? We need more logging in JS to find the reason.
The job in the test is defined as continuous=false, so it could finish before the test catches the STARTED status change in the metadata. We can just make it continuous=true.
When a node disconnects (which happens a lot for some reason), a job might stay blocked on API calls like _search for 30+ seconds (tested locally), and our waitFor { ... } in tests would time out. I suggest we increase the default timeout for waitFor from 30 seconds to 90 or so.
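To make the timeout suggestion concrete, here is a rough Java sketch of a polling helper with a configurable timeout. It is not the project's actual `waitFor { ... }` (whose signature is not shown in this thread); the class name, the 100 ms poll interval, and the overloads are assumptions, and the 90-second default is simply the value proposed above.

```java
import java.time.Duration;
import java.util.function.BooleanSupplier;

// Hypothetical helper for illustration; not the project's actual waitFor.
final class WaitFor {
    // Assumed default, taken from the suggestion above (raised from 30s).
    static final Duration DEFAULT_TIMEOUT = Duration.ofSeconds(90);
    // Assumed poll interval; any small value works for this sketch.
    static final Duration POLL_INTERVAL = Duration.ofMillis(100);

    private WaitFor() {}

    /** Polls with the default timeout and poll interval. */
    static void waitFor(BooleanSupplier condition) {
        waitFor(DEFAULT_TIMEOUT, POLL_INTERVAL, condition);
    }

    /** Polls until the condition holds, failing once the timeout has elapsed. */
    static void waitFor(Duration timeout, Duration pollInterval, BooleanSupplier condition) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (!condition.getAsBoolean()) {
            // Overflow-safe comparison against the monotonic clock.
            if (System.nanoTime() - deadline >= 0) {
                throw new AssertionError("Condition not met within " + timeout);
            }
            try {
                Thread.sleep(pollInterval.toMillis());
            } catch (InterruptedException e) {
                // Restore the interrupt flag and fail the wait.
                Thread.currentThread().interrupt();
                throw new AssertionError("Interrupted while waiting", e);
            }
        }
    }
}
```

Keeping the timeout as a parameter means a test that is known to hit the 30+ second blocked call can opt into a longer wait without slowing down failures everywhere else.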
The reason for this flakiness is that the rollup job cannot be interrupted once it is running: even after the config index is deleted, it continues to operate (saving some docs into the config index).
For now only seen on 1.3.8
org.opensearch.indexmanagement.rollup.resthandler.RestStopRollupActionIT > test stopping a failed rollup FAILED
Assertion failure -
1693475220000 and 1693475160000 have a 1m difference. Line 48 in 1c9f232
This line should be moved forward, right after
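The two values above differ by exactly 60000 ms, and both fall on a minute boundary. One explanation that would fit (an assumption, since the referenced line is not shown here) is that the expected and actual timestamps are each truncated to the minute but captured at slightly different moments, so a minute rollover between the two captures produces a 1m gap. A small Java sketch with a hypothetical helper shows the effect:

```java
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical helper for illustration; the real test code is not shown in this thread.
final class MinuteBuckets {
    private MinuteBuckets() {}

    /** Truncates an instant to the start of its minute, in epoch milliseconds. */
    static long minuteBucketMillis(Instant t) {
        return t.truncatedTo(ChronoUnit.MINUTES).toEpochMilli();
    }

    /** Two captures 2 ms apart that straddle a minute boundary. */
    static long[] straddlingExample() {
        long before = minuteBucketMillis(Instant.ofEpochMilli(1693475219999L)); // 1693475160000
        long after = minuteBucketMillis(Instant.ofEpochMilli(1693475220001L));  // 1693475220000
        return new long[] {before, after, after - before};                      // difference 60000
    }
}
```

If that is what happens, capturing the timestamp once, as close as possible to the operation under test, and reusing it for the expected value avoids the rollover.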
There are situations where the multi-node tests run very long, 5-10x the normal run time. One such situation I checked was a node stuck with a problematic node lock. This happens when running the test
It is never able to rejoin the cluster either...
Describe the bug
This issue records the flaky tests we are seeing. Most of the time, your PR is not the cause of the test failure, so you can just record the flaky test failure here and we will dig into it later.
Report like this:
Refer to this comment #90 (comment)
You can download the log of failed run from the bottom of the summary of the workflow.
How to debug the flaky test
A flaky test is hard to understand at first look, since we may not have enough logs. It is also not easy to reproduce (race conditions, environment dependent).
An efficient way to handle these:
integTest.log
when this test was run.
Script to reproduce the flaky
This script works on Linux systems