
Proposed RFC Feature: Automatically Discover Intermittently Failing Tests #22

Closed
evanchia-ly-sdets opened this issue Jan 5, 2022 · 6 comments
Assignees
Labels
priority/minor Lowest priority. Work that may be scheduled

Comments

@evanchia-ly-sdets
Collaborator

evanchia-ly-sdets commented Jan 5, 2022

Summary:

It can be difficult for a contributor to know whether a test failure is related to their proposed change, or whether the failure is due to an inherently flaky test or unstable feature. To limit the pain from this ambiguity, the test infrastructure can automatically help identify the source of intermittent behavior.

What is the relevance of this feature?

Whenever a contributor encounters a test failure in AR, time is usually lost in one of two ways:

  • The contributor assumes the test is flaky, triggers a rerun with no changes, and only after multiple failures realizes the changelist contains a bug. This costs both the time and the compute resources of an additional AR run.
  • The contributor spends time asking test owners or general chat whether there is a known issue with the test, because they are unsure if the failure is related to their changelist. This causes minor disruptions as people take time to answer the question, and may also lead to people losing confidence in the testing pipeline.

Implementing this feature will allow contributors to determine whether a test failure is intermittent or deterministic. The product itself may also have an intermittent bug that the test is correctly catching. Either way, this feature will help generate data for addressing intermittently failing tests, whether the fault lies with the product or with the test itself.

Feature design description:

A test failure triggers an additional step in Jenkins that reruns all failed tests M times (proposing M = 10) and outputs a summary of the results. The results will indicate whether the test is failing deterministically or whether the failure is due to a flaky test. The rerun will not change the success output of the AR run; it is meant only to provide additional information. An agreed-upon timeout will also be set (proposing 5 minutes) to limit the amount of excess time spent in the pipeline.

For example, let's say a user runs Automated Review (AR) for their pull request and it results in a single test failure. The pipeline will trigger an additional rerun step and isolate the failed test. The failed test will be rerun 10 times, which can result in one of two scenarios:

  • 10/10 failures shows that the failure is deterministic and related to the pull request.
  • 0-9/10 failures shows that the test is intermittently failing and further investigation is required. Data is collected.

The user is then notified of this result.
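
A minimal sketch of this rerun-and-classify logic is below, in Python. Only the M = 10 rerun count and the 5-minute timeout come from the proposal; the rerun_failed_test helper and all other names are illustrative assumptions, not a committed implementation:

# Illustrative sketch only: rerun a failed test M times and classify the result.
import subprocess
import time

RERUNS = 10            # proposed M
TIMEOUT_SECONDS = 300  # proposed 5-minute budget for the rerun step

def rerun_failed_test(command: list[str]) -> bool:
    """Hypothetical helper: rerun one failed test, return True if it passes."""
    return subprocess.run(command).returncode == 0

def classify(command: list[str]) -> str:
    deadline = time.monotonic() + TIMEOUT_SECONDS
    runs = failures = 0
    for _ in range(RERUNS):
        if time.monotonic() > deadline:
            break  # stop triggering new reruns once the timeout is exceeded
        runs += 1
        if not rerun_failed_test(command):
            failures += 1
    if runs and failures == runs:
        return "deterministic"  # e.g. 10/10 failures: likely caused by the change
    return "intermittent"       # 0-9/10 failures: flaky, collect data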

Technical design description:

A new step will be added to the Jenkinsfile for test reruns. This step can be separated into two functions: pytest reruns and GoogleTest reruns. The pytest reruns are trivial because the feature already exists: run the command below, which uses the pytest cache to find the last test failures. The pytest cache may need to be preserved before starting the reruns, because successive pytest reruns overwrite the cache and may change the list of "previously failed tests".
The GoogleTest reruns will need to mimic this cache behavior and store the failed test commands in another location. AzTestRunner will need to be modified to write a "cache" file that contains all of the parameters required to rerun each failed test. The new Jenkins step will iterate through each failed test inside this "cache" and call AzTestRunner with the correct parameters. Once these tests finish, Jenkins will determine whether the test failures are intermittent and add this information to the notifications it already sends out.

Pytest

Rerunning failed tests with pytest is simple, as it already provides rerun functionality. It uses the contents of ~/o3de/.pytest_cache to track which tests failed:

# Pytest rerun command
~/o3de/python/python.cmd -m pytest --last-failed
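
Because successive reruns overwrite this cache (as noted above), one possible approach is to snapshot .pytest_cache before the rerun loop starts and restore it between iterations. This is a hedged sketch; the paths and the snapshot location are illustrative assumptions:

# Illustrative sketch only: snapshot the pytest cache before reruns overwrite it.
import shutil
from pathlib import Path

CACHE_DIR = Path.home() / "o3de" / ".pytest_cache"        # assumed cache location
SNAPSHOT_DIR = Path.home() / "o3de" / ".pytest_cache.ar"  # hypothetical snapshot

def preserve_pytest_cache() -> None:
    """Copy the cache aside so repeated --last-failed runs see the original list."""
    if SNAPSHOT_DIR.exists():
        shutil.rmtree(SNAPSHOT_DIR)
    shutil.copytree(CACHE_DIR, SNAPSHOT_DIR)

def restore_pytest_cache() -> None:
    """Restore the original failure list before each rerun iteration."""
    if CACHE_DIR.exists():
        shutil.rmtree(CACHE_DIR)
    shutil.copytree(SNAPSHOT_DIR, CACHE_DIR)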

GoogleTest

When a test fails, AzTestRunner will write out a copy/paste rerun command for the individual GoogleTest, and the rerun step will execute all of the saved test commands. AzTestRunner already compiles such a rerun command for its output, so this can probably be leveraged:

# Example output of test rerun command
/data/workspace/o3de/build/linux/bin/profile/AzTestRunner /data/workspace/o3de/build/linux/bin/profile/libAzCore.Tests.so AzRunUnitTests
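
The format of the GoogleTest "cache" file is not specified in this RFC. As one illustrative assumption, it could be a plain-text file with one saved rerun command per line, which the Jenkins step then replays and tallies:

# Illustrative sketch only: replay saved AzTestRunner commands from a "cache" file.
# The one-command-per-line format and file location are assumptions, not decided design.
import shlex
import subprocess
from pathlib import Path

CACHE_FILE = Path("failed_gtest_commands.txt")  # hypothetical cache location

def rerun_cached_gtests(reruns: int = 10) -> dict[str, int]:
    """Rerun each cached GoogleTest command and count its failures."""
    failures: dict[str, int] = {}
    for line in CACHE_FILE.read_text().splitlines():
        command = shlex.split(line)
        failures[line] = sum(
            subprocess.run(command).returncode != 0 for _ in range(reruns)
        )
    return failures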

What are the advantages of the feature?

Implementing this feature will improve the user experience for contributors who run into test failures. Knowing whether a test is intermittent or deterministic is useful information in itself, and it also enables future features for tracking intermittent-failure metrics, which can help fix flaky tests by relating test failure rates to changelists.

What are the disadvantages of the feature?

  • Increased cost from AR execution times and extra EC2 resources on test failures.
    • This is mitigated by setting a timeout on the reruns, which stops triggering new reruns once the timeout is exceeded but still finishes the rerun in progress. The worst case adds (T + N) extra time, where T is the timeout and N is the time it takes to run the entire suite (if all tests fail).
  • Only tests using the LyTestTools (pytest) and AzTestRunner (GoogleTest) frameworks will have rerun commands. If a test is executed in any other way, its framework will have to be extended to emit a rerun command.
  • A frequent flaky-failure case occurs when multiple tests run at the same time, whereas this feature reruns each test in isolation. That problem requires a considerably larger effort and is considered out of scope for this proposal.

How will this be implemented or integrated into the O3DE environment?

Covered in technical description.

Are there any alternatives to this feature?

Utilize CTest, which has existing failed-test retry functionality. This solution was not preferred because CTest cannot distinguish individual test cases within a module: it would rerun the entire module, which would increase the rerun time.

We could instead track the history of each test and, upon a failure, query the last N runs of that test. This would add only a small amount of time to the pipeline compared to rerunning the tests. However, historical data says nothing about the current change, so this approach misses the value of catching new intermittent behavior as it is introduced.

How will users learn this feature?

Since this feature will run automatically inside Automated Review, a simple message/email notifying O3DE contributors about the change should suffice. A section covering the reruns can also be added to the existing Automated Review documentation.

Are there any open questions?

The largest value of this feature is obtained only if contributors use the information as a tool to fix intermittently failing tests. The feature by itself only provides additional information; a complete solution to intermittently failing tests will require further features and processes.

Feedback Date

Discussion and decision should be made on 1/28/22

@Kadino Kadino transferred this issue from o3de/o3de Jan 5, 2022
@Kadino Kadino changed the title Intermittently Failing Test RFC Proposed RFC Feature: Automatically Discover Intermittently Failing Tests Jan 5, 2022
@jckand
Contributor

jckand commented Jan 28, 2022

  1. It would be great if we had some way to notify the actual test owners with this data, rather than only the person submitting an AR change, so that it becomes more actionable.
  2. The timeout may need to be adjusted if a large number of Python tests fail, due to Editor/AP spin-up time. May just require experimentation.

@robhaco

robhaco commented Jan 28, 2022

On the technical side, it will likely be necessary to create some utilities for managing/monitoring the pytest cache, due to what appear to be common issues where the cache needs to be invalidated/reset to get the expected results from the --last-failed argument.
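
For illustration, such a utility could be as small as the following sketch (the repo-root parameter is an assumption; deleting the cache directory has the same effect as running pytest once with its --cache-clear flag):

# Illustrative sketch only: invalidate a stale pytest cache before a rerun.
import shutil
from pathlib import Path

def reset_pytest_cache(repo_root: Path) -> None:
    """Delete .pytest_cache so --last-failed is not polluted by stale entries."""
    cache_dir = repo_root / ".pytest_cache"
    if cache_dir.exists():
        shutil.rmtree(cache_dir)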

@AMZN-Dk
Collaborator

AMZN-Dk commented Jan 28, 2022

Per the SIG Testing meeting on 1-28-2022, this RFC has been approved to move forward, pending the implementation of a test failure resolution policy per #26.

@jckand
Contributor

jckand commented Jan 28, 2022

> On the technical side, it will likely be necessary to create some utilities for managing/monitoring the pytest cache, due to what appear to be common issues where the cache needs to be invalidated/reset to get the expected results from the --last-failed argument.

Good callout. I believe each new feature "suite" that runs has its own pytest cache, so this may need to cache and check all of them.
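
If each suite does keep its own cache, the rerun step would need to aggregate failures across all of them. A hedged sketch, assuming the standard pytest cache layout (.pytest_cache/v/cache/lastfailed holds a JSON map of failed test node IDs) and an illustrative workspace glob:

# Illustrative sketch only: collect last-failed node IDs from every suite cache.
import json
from pathlib import Path

def collect_last_failed(workspace: Path) -> set[str]:
    """Union the failed test node IDs recorded by every per-suite pytest cache."""
    failed: set[str] = set()
    for lastfailed in workspace.glob("**/.pytest_cache/v/cache/lastfailed"):
        failed.update(json.loads(lastfailed.read_text()))
    return failed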

@Kadino Kadino added the triaged/needs-information An issue which needs clarification label Feb 18, 2022
@Kadino
Collaborator

Kadino commented Feb 18, 2022

Needs to be informed by the policy generated by #26

@AMZN-Dk AMZN-Dk added the needs-priority Indicates issue lacks a priority/foo label and requires one. label May 10, 2022
@Kadino Kadino added priority/minor Lowest priority. Work that may be scheduled and removed needs-priority Indicates issue lacks a priority/foo label and requires one. labels Jun 21, 2022
@evanchia-ly-sdets
Collaborator Author

Duplicate of #43

@evanchia-ly-sdets evanchia-ly-sdets marked this as a duplicate of #43 Aug 17, 2022
@evanchia-ly-sdets evanchia-ly-sdets closed this as not planned Aug 17, 2022
@Kadino Kadino removed the triaged/needs-information An issue which needs clarification label Aug 17, 2022