RFC: Intermittent Failure Intervention #43
Comments
All instances of "nightly" should be changed to "periodic" since jobs may not actually run nightly (may run twice a day or once a week).
I am supportive of this RFC, no major callouts or questions at this time.
The automated alarms proposed above are only looking at failures in Branch Update runs and Periodic runs, which are both on code already submitted to a branch. That other set of Automated Review failures (when trying to prepare a pull request to merge) may still be useful to record, but there will have to be future analysis of how to report/alarm based on them. It is true that the new code may have new unintentional failures, but they will not appear in these alarms. Unfortunately, this also means that new intermittent failures will not be prevented from merging into development. This RFC only covers catching them after they merge.
Based on discussion, SIG-Testing is ready to move forward with implementing this RFC and its dependencies.
This RFC LGTM, just a few comments:
This RFC does not propose new builds, nor does it modify the pipeline with anything other than metrics. Details on the metrics solution will be handled by a separate proposal. I would expect the other task you mention would define build costs @shirangj
I'd like to make sure we differentiate between intermittent failures caused by actual product bugs that occur intermittently, and intermittent failures caused by the automated test framework, environment, or suite itself having issues. These are pretty broad buckets, but they will help clarify the difference between bugs in O3DE and issues with the tests, which have different priorities and resolution paths.
@AMZN-Dk I do not see a way for a robot to reliably make this determination, though I think it could be worth a separate RFC to investigate doing so. It would be possible to track certain specific exceptions as being the fault of the test framework (or a bad test, product bug, or unknown), but humans will need to curate that imperfect/incomplete list of root causes and be careful not to attribute these incorrectly or too broadly. This will have relatively low accuracy, and likely cannot provide an accurate summary of where/why the myriad failure types originate. Another path could be to have humans manually enter root cause information as they investigate and resolve an issue, which is also likely to be significantly unreliable.
Migrating #43 to a pull request for review and refinement. Signed-off-by: Kadino <sweeneys@amazon.com>
Persist accepted RFC from #43 Signed-off-by: Kadino <sweeneys@amazon.com>
Migrating #43 to a pull request for review and refinement. Signed-off-by: Kadino <sweeneys@amazon.com> Signed-off-by: scspaldi <scspaldi@amazon.com>
While this RFC is "accepted" by SIG-Testing, implementing this proposal remains blocked on having requisite metrics publicly available: #62
RFC: Intermittent Failure Intervention
Summary
The Testing Special Interest Group (SIG-Testing) primarily serves a support and advisory role to other O3DE SIGs, to help them maintain their own tests. While this ownership model tends to function well, there are cases where instability in features owned by one SIG can interfere with the tests of all SIGs. This RFC proposes a runbook for SIG-Testing to follow during emergent cases where the intermittent failure rate approaches a critical level. It also proposes metrics with automated alarms that proactively prompt following this runbook on behalf of other SIGs, as well as improved automated failure warnings to all SIGs to reduce the need to manually follow the runbook.
Note: Investigation and intervention on behalf of other SIGs is currently outside of the stated responsibilities of SIG-Testing, and accepting this RFC would amend the charter.
What is the motivation for this suggestion?
Intermittent failures are a frustrating reality of complex software. O3DE SIGs already strive to deliver quality features, and do not intentionally merge new code that intermittently fails. And in many cases intermittently unsafe code is caught and fixed before it ships: during development, during code reviews, or by tests executed during Automated Review of pull requests. This RFC does not seek to change how code is developed, reviewed, or submitted. Regardless, some percentage of instability evades early detection and creates nondeterminism.
When a failure appears to be nondeterministic, it can initially pass the Automated Review pipeline only to later fail during verification of a future change. Since these failures can "disappear on rerun" and are easy to ignore, they tend to accumulate without being fixed. This debt of accumulated nondeterminism wastes time and hardware resources, and also frustrates contributors who investigate failures they cannot reproduce. While a policy exists to help contributors handle intermittent failures, its guidance has proven insufficient to prevent subtle issues from accumulating into a crisis. For example, documents such as this RFC are produced every 3-6 months when a pipeline stability crisis occurs. If this RFC is accepted, whenever the rate of intermittent failure rises above a threshold, an automated notification will prompt a SIG-Testing member to follow the runbook. Such interventions have regularly been necessary in the past, but had insufficient metrics, no automation, and no runbook.
To limit how often a human must manually follow this runbook, SIGs should also be automatically notified of failures well before a critical failure rate is reached. Existing autocut issues contain little information specific to the failure, and are not deduplicated based on this information. They can be improved to cut separate issues for different failure causes, and to combine this information with GitHub CODEOWNERS data to automatically select an appropriate SIG label when assigning new issues for investigation.
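As an illustration of the CODEOWNERS-based routing, here is a minimal sketch in Python; the file path, fallback label, prefix matching, and team-name parsing are simplifying assumptions for illustration, not the behavior of any existing O3DE automation.

```python
# Minimal sketch: resolve an owning SIG from the repository CODEOWNERS file so an
# autocut issue can be labeled. Prefix matching is a simplification of CODEOWNERS
# glob rules; the fallback label is an assumption.
from pathlib import Path

def find_owning_sig(failing_path: str, codeowners: str = ".github/CODEOWNERS") -> str:
    owner = "sig-testing"  # assumed fallback when no rule matches
    for line in Path(codeowners).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        pattern, *owners = line.split()
        # CODEOWNERS semantics: the last matching rule wins.
        if owners and failing_path.startswith(pattern.lstrip("/")):
            owner = owners[-1].lstrip("@").split("/")[-1]
    return owner

# Example: find_owning_sig("Gems/SomeGem/Code/Tests/test_foo.py") might return a
# team name that maps to a "sig-..." issue label.
```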
The intended outcome is:
Suggestion design description
Definitions
Automated Review (AR): the portion of the Continuous Integration Pipeline which gates merging code from pull requests into a shared branch such as Development or Stabilization. Test failures here include intentional rejections, where the system is functioning normally and rejecting bad code, as well as unintended intermittent failures. Due to this, AR metrics are not used in the automation proposed below, though they may still be a useful health metric.
Branch Update Run: These builds are post-submission health checks, executed against the current state of the shared branch. All failures here are unintentional intermittent failures, or are a sign of a merge error. If Branch Update runs are failing then Automated Review runs (of merging in a new change) should similarly be failing. This is the primary source of health metrics proposed for this RFC.
Periodic (Nightly) Builds: These are periodic health checks, which execute a broader and slower range of tests. Health metrics should also be reported from here.
Tolerable Failure Rates
Metrics on test failure rates are inherently imprecise and fuzzy measurements that attempt to establish statistical confidence. A single piece of code may have one extremely rare intermittent failure, or it may have multiple simultaneous patterns of intermittent failure, or it may eventually become consistently failing due to complex environmental factors. The following confidence bands are intended to simplify interpreting fuzzy data, starting from the most severe:
Within these categories, some already have obvious steps to follow. Consistently failing issues will continue to follow the GitHub issues workflow. Issues undetected by automated tests either prompt new automation to detect them, or may be safe to ignore. And for the purposes of this proposal, the "Detected" category is a threshold to not require additional action beyond continuing to auto-cut an issue to notify about the failure. The boundaries between the remaining categories are proposed as:
These thresholds are subjective and are sensitive to the scope of the product, its tests, and its pipeline environment. Due to subjectivity, the values may need to change as O3DE changes in scope. To better handle the broad scope, metrics are proposed at three aggregated levels. Each level acts as a filter with different sensitivity, catching what the previous one misses. The intent of these categories is to accurately identify a problem area when possible, but still detect when small problems accumulate into a widespread issue. To keep the definitions simple, the same confidence bands are proposed for the metrics categories. This is described below in the section "Metrics for Failure Rates".
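To make the "statistical confidence" framing concrete, here is a small illustrative calculation (not part of the proposal) showing how wide the uncertainty is on an observed failure rate over a window of 100 runs; the Wilson score interval used here is just one standard choice, and the example counts are hypothetical.

```python
import math

def wilson_interval(failures: int, runs: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% confidence interval for an observed failure rate."""
    if runs == 0:
        return (0.0, 1.0)
    p = failures / runs
    denom = 1 + z * z / runs
    center = (p + z * z / (2 * runs)) / denom
    margin = z * math.sqrt(p * (1 - p) / runs + z * z / (4 * runs * runs)) / denom
    return (max(0.0, center - margin), min(1.0, center + margin))

# 2 failures across the last 100 Branch Update Runs gives roughly a 0.6% - 7% band,
# which is why single observations should not trigger drastic action.
print(wilson_interval(2, 100))
```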
Autocut Issue Improvements
Existing autocut issues rarely result in action and pile up from failed Branch Update Runs. This suggests that these issues are not effectively tracked beyond the push-notifications sent as instant messages. The failure runbook below calls for advance notifications sent to SIGs, and suggests the existing autocut issues are the appropriate medium. To use them effectively, the following improvements are recommended:
Failure Runbook
When any pipeline failure first occurs, it will result in an autocut issue. This issue should be auto-assigned to the SIG designated by the Codeowners file for investigation. As an issue continues to reproduce, existing issues should be auto-commented on. And if failure rates rise above a threshold, a second issue gets cut to SIG-Testing to intervene by following a runbook. The full runbook is not defined here, only an outline of how it is used.
This runbook will document both Automated and Manual processes to reduce intermittent failure. The automated processes are documented to clarify the steps that have already been taken. When the initial automated portions are insufficient, the automation prompts SIG-Testing to take action in the manual portion of the runbook by auto-cutting an extra issue in GitHub. The following steps apply to Branch Update Runs and Periodic Builds, with the intent to keep Automated Review runs only seeing newly-introduced failures. (After RFC, this runbook should exist as its own document in the SIG-Testing repo)
Pipeline Automation:
When any failure is detected in a Branch Update or Periodic Build, an issue is updated or auto-cut to track it (this is already implemented today). If tests were executed, then test metrics will also be uploaded. It is expected that autocut issues which are due to intermittent behavior may be claimed and then closed due to no reproduction, or ignored due to low priority.
Issues Automation with Metrics:
When new test metrics are uploaded, any new failure should prompt querying recent failure metrics. Based on the query results, take the following actions:
- If an individual test's failure rate reaches its warning threshold, mark its existing autocut issue as priority/critical
- If an individual test's failure rate reaches its critical threshold, cut an issue to SIG-Testing with priority/critical to investigate rising failure rates
- If a test module's failure rate reaches its warning threshold, mark its existing autocut issue as priority/critical
- If a test module's failure rate reaches its critical threshold, cut an issue to SIG-Testing with priority/critical to investigate rising failure rates
- If the pipeline-wide failure rate reaches its critical threshold, cut an issue to SIG-Testing with priority/critical, to investigate rising failure rates with unclear origin
- If the pipeline-wide failure rate reaches its blocker threshold, set priority/blocker on the investigation ticket
Note: Warning threshold is currently undefined at the pipeline level, as it is more sensitive to failures.
Note: Creating a pipeline critical will always occur before module or individual test critical. However it may not get investigated before other more-specific critical issues are logged. (This may result in nearly always having an investigation open)
Note: Can result in a SIG-Testing investigation being prompted on the first new failure shortly after a prolonged failure.
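A rough sketch of how the actions above could be wired together is shown below; the threshold values, query window, and the shape of the returned action list are placeholders for illustration, not existing O3DE tooling or finalized thresholds.

```python
# Hypothetical thresholds; the RFC leaves exact values to later tuning.
WARNING, CRITICAL, BLOCKER = 0.01, 0.05, 0.25

def decide_actions(test_rate: float, module_rate: float, pipeline_rate: float) -> list[str]:
    """Return the GitHub actions implied by recent failure rates (illustrative only)."""
    actions = []
    if test_rate >= WARNING or module_rate >= WARNING:
        actions.append("label existing autocut issue priority/critical")
    if test_rate >= CRITICAL or module_rate >= CRITICAL:
        actions.append("cut SIG-Testing issue (priority/critical): rising failure rates")
    if pipeline_rate >= CRITICAL:
        actions.append("cut SIG-Testing issue (priority/critical): unclear origin")
    if pipeline_rate >= BLOCKER:
        actions.append("set priority/blocker on the investigation ticket")
    return actions

# Example: a test failing 6% of the time while the pipeline fails 30% of the time.
print(decide_actions(test_rate=0.06, module_rate=0.06, pipeline_rate=0.30))
```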
Metrics for Failure Rates
Three levels of test metrics are proposed, and each requires alarms that trigger automated actions in GitHub Issues.
Test failures per pipeline run
Contributors are most directly impacted by the aggregate test failure rate of an entire run of the Automated Review Pipeline. This involves running tests across all modules for multiple variant builds in parallel, and O3DE has grown to nearly 200 test modules. Certain modules run in parallel with one another, and certain failures may only occur when all modules execute. There are around 100,000 tests which currently execute on each pipeline run, and a single intermittent failure across any of these tests results in a failed run.
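As a worked example of how quickly small per-test failure rates compound at this scale, using the run size quoted above and a hypothetical uniform, independent per-test failure rate:

```python
# Probability that a full pipeline run fails, assuming ~100,000 test executions per
# run and a hypothetical one-in-a-million independent failure rate for every test.
N_TESTS = 100_000
PER_TEST_FAIL_RATE = 1e-6

pass_probability = (1 - PER_TEST_FAIL_RATE) ** N_TESTS
print(f"run passes: {pass_probability:.1%}, run fails: {1 - pass_probability:.1%}")
# -> roughly 90.5% pass / 9.5% fail, matching the ~1-in-10 figure used later in this RFC
```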
Failures per test module
Test modules often contain hundreds of individual tests. In a test module with a hundred tests, if every test had only a 1% independent failure rate then the module would be statistically expected to almost always fail. Additionally, certain tests may only fail when run with the rest of their module. Due to the finer granularity and scale, these metrics are less sensitive than those for the full pipeline. For instance, if each of the current ~200 modules had only a single 1/100 error rate, barely triggering a warning-level response for modules, then across the two variant test executions the pipeline would expect a nearly 100% failure rate with around 4 failed modules in every run (2 build variants * 200 modules * 1/100 fail rate = 4 failures per pipeline run). While an unhealthy state is possible with a low per-module failure rate, it should still be detected by the pipeline-wide metric above.
Failures per individual test
Individual tests must be highly stable. They are also the smallest, quickest data point to iterate on. With nearly 50,000 (and growing) tests across the Main and Smoke suites, even a one-in-a-million failure baked into each test could accumulate into severe pipeline-level failure rates (2 build variants * 50,000 tests * 1/1,000,000 fail rate = 1/10 runs fail). While this makes subtle issues in individual tests the least sensitive to accumulated failure, identifying a single problematic test is also the best-case scenario for debugging. And while these per-test metrics detect only specific issues, other complex systemic issues should be caught by the aggregate metrics. An unhealthy state is again possible with a low per-test failure rate, but should still be detected by investigations into either the module-wide or pipeline-wide metrics above.
Metrics Requirements
A metrics backend system needs to track historical test failure data in Branch Update Runs and Periodic Builds, which will be queried to alarm on the recent failure trends.
The following metrics need to be collected from all tests executed in every run:
The following needs to be collected from every Test Module run by CTest in a pipeline:
The following needs to be collected from every test Build Job execution:
The following needs to be collected from every Pipeline execution:
To ensure statistical accuracy, the metrics analysis should be conducted across the most recent 100 runs from within 1 week. This should provide a balance between recency and accuracy.
The heaviest of these metrics will be for individual tests in Branch Update Runs. Test name identifiers are often in excess of one hundred characters, and there are currently nearly 50,000 tests across the Smoke and Main suites. With around 12 branch update runs per day triggering two test-runs each, this can result in a sizeable amount of data. Periodic Builds currently execute a few hundred longer-running tests as often as twice per day, and would constitute less than 1% of the total data. Periodic builds may also change in frequency depending on the needs and scale of the O3DE project.
To reduce the volume of data, we can store only individual test failures and calculate passes based on total runs of a build job. This may result in builds that fail early (during machine setup, during the build, by being aborted, etc.) artificially inflating the test pass rate, since they would not create test metrics. Newly added tests would similarly start with an inflated pass rate. Further analysis on this exists below in the Appendix on Metrics Estimates.
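A minimal sketch of the reduction described above, assuming the backend stores one record per individual test failure plus a count of completed build-job runs; the record shape and example values are illustrative.

```python
# Reconstruct an approximate per-test failure rate when only failures are stored.
# Runs that aborted before executing tests still count in the denominator, which is
# the pass-rate inflation noted above; newly added tests start at an apparent 0% rate.
def estimated_failure_rate(test_name: str,
                           failure_records: list[dict],
                           total_job_runs: int) -> float:
    failures = sum(1 for record in failure_records if record.get("test") == test_name)
    return failures / max(total_job_runs, 1)

recent_failures = [{"test": "A", "run": 17}, {"test": "A", "run": 42}, {"test": "B", "run": 42}]
print(estimated_failure_rate("A", recent_failures, total_job_runs=100))  # 0.02
```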
Example intermittent failure scenario
An individual test "A" has already failed a few times within the previous week during Branch Update Runs, and encounters two failures during some of the Branch Update Runs today. The automation would take the following steps as new failures occur today:
- Mark the existing autocut issue as priority/critical and add a comment with the name of the failing test module and its failure rate
- Cut an issue with priority/critical assigned to SIG-Testing to investigate the increasing overall failure rate
- Set priority/critical again (re-added in case a user removed it) and add a comment with the name of the failing test and its failure rate
What are the advantages of the suggestion?
What are the disadvantages of the suggestion?
Are there any alternatives to this suggestion?
AR Test failures can be automatically retried, bypassing current user pain points by ignoring intermittent failures.
AR Test failures can be automatically retried, still failing if the initial test failed, but collecting additional testing metrics to display in AR.
Tests in Automated Review could unconditionally run multiple times, to find intermittent behavior within a single change.
Periodically run tests dozens to hundreds of times, collecting failure metrics separately from AR and Branch Updates.
Pipeline failure metrics alone could be recorded and published for other SIGs to act on how they see fit, without a failsafe process for SIG-Testing to follow.
Use this proposal, but set different stability standards for different test types. This could be between C++ and Python tests, between the Smoke and Main test suites, or another partition.
What is the strategy for adoption?
Approval of this RFC should involve input from all SIGs. Delivery of these changes may need coordination with SIG-Build.
Are there any open questions?
Appendix
Metrics Estimates
Below is a rough estimate of the volume of test metrics and their cost. Other SIGs may have additional metrics needs, which are not calculated here.
A. Jenkins Pipeline Metrics
There are currently a total of 29 stages across the seven parallel jobs in each of the Automated Review (Pull Requests) and Branch Indexing (Merge Consistency Checks) runs. There are approximately 40 pull request runs and 20 branch updates per day (across two active branches).
Periodic Builds have many more jobs, currently around 180 stages across 46 parallel jobs. It is difficult to estimate how this set of stages will change over time.
The daily Jenkins-metrics load factor of "Pipeline Run" and "Job-Stage Run" should be around 60x29 + 180x3 = 2280. If we remove Periodic Mac builds, this would be around 2200 Jenkins-level metrics per day. It is difficult to estimate how this set of pipelines and stages will change over time. However the top level metrics should stay a comparatively low volume.
B. Test Result Metrics
There are currently around 43,000 tests that run in each Automated Review and Branch Update test-job, and this is expected to slowly grow over time. One path to reducing the scope of test metrics is to bundle these metrics into only reporting on the module that contains sets of tests, for which there are currently 135 modules in Automated Review. This is a major tradeoff of data quality for a ~99.9% reduction in size, which should at least be paired with saving the raw number of pass, fail, error, and skipped tests.
Another way to reduce metrics is to not explicitly store test-pass data, and to add entries for only non-pass results. This has the negative effect of conflating (reconstructed) data on "pass" and "not run" and makes it unclear when a test becomes renamed or disabled, but otherwise stores explicit failure data with significantly reduced load. This would result in a variable load of metrics which increases as more tests fail per run. Currently around 1/10 of test runs encounter a test failure. When such failures occur, the current average number of failures is around 2.5.
The daily load factor for all test-metrics would be 43,000x2x60 + 2,000x4x3 = 5,184,000 test-level metrics per day. If only modules are reported, this would be 135x2x60 + 34x4x3 = 16,608 module-level metrics per day. If modules and test-failures are reported, this would be approximately 0.1x2.5x60 + 135x2x60 + 34x4x3 ~= 16,625 module-plus-failure metrics per day. Since this load is variable, it is rounded up to 20,000. This suggests that saving only failure data would be a ~99.6% reduction in storage.
C. Profiling Metrics
There are currently 1795 Micro-Benchmark metrics (across 10 modules), and 10 planned end-to-end benchmarks of workflows. Providing a metrics pipeline will encourage the current number of performance metrics to grow, as such profiling data otherwise has little utility. A wild estimate is that this will expand by 10x within a year.
These execute only in the three Periodic Builds, on each of Windows and Linux. This makes the daily profiling metrics load 1805x3x2 = 10,830, likely growing to ~110,000 daily within a year.
Estimated Total Metrics Load
Metrics systems commonly store metrics with dimensional values, grouping a "single metric" as multiple related values and not only as individual KVPs. Under this model, daily metrics would be around 5,200,000, which is heavily dominated by test-metrics. This reduces to around 35,000 (a greater than 99% reduction) if only test modules and failures are logged, and not all individual tests. If test metrics are allowed to naturally grow, this could reach 6,000,000 daily metrics within a year, or perhaps 150,000 if only test modules and failures are recorded. This would be around 42,000,000 vs 1,050,000 metrics per week, 180,000,000 vs 4,500,000 per month, and 2,200,000,000 vs 55,000,000 per year. These metrics would exist across four or five metrics types (Pipeline Run, Job-Stage Run, Test Result, Profiling Result) with Test-Module Result being important if the reduced load is selected.
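The daily, weekly, monthly, and yearly figures above can be reproduced from the per-day factors in sections A-C; this is a back-of-envelope check of the RFC's own numbers, not new data.

```python
# Back-of-envelope reproduction of the load estimates above (values from this RFC).
jenkins = 60 * 29 + 180 * 3                        # 2,280 pipeline/job-stage metrics per day
full_tests = 43_000 * 2 * 60 + 2_000 * 4 * 3       # 5,184,000 test-level metrics per day
reduced_tests = 20_000                             # modules + failures, rounded up
profiling = 1_805 * 3 * 2                          # 10,830 profiling metrics per day

full_daily = full_tests + jenkins + profiling      # ~5,200,000 per day
reduced_daily = reduced_tests + jenkins + profiling  # ~33,000, quoted above as ~35,000
print(full_daily, reduced_daily)

# Projected one-year growth figures used above: 6,000,000 vs 150,000 per day.
for label, daily in (("full", 6_000_000), ("reduced", 150_000)):
    print(label, daily * 7, daily * 30, daily * 365)  # per week / month / year
```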
Estimated Metrics Cost
While a metrics solution has not yet been selected, here is one off-the-shelf estimate:
AWS CloudWatch primarily charges per type of custom metric, as well as a small amount per API call that uploads metrics data. Each post request of custom metrics is limited to 20 gzip-packed metrics, for which we should see a nearly 1:1 ability to batch data from test-XMLs. Monthly CloudWatch cost estimates for metrics plus a dashboard and 100 alarms are $112 for full metrics (4 types with 10MM API calls) vs $14 for failure-only (5 types with 250k API calls). This is estimated across four or five metrics types (Pipeline Run, Job-Stage Run, Test Result, Profiling Result) with Test Module Result being added if the reduced load is selected.
However it is important to limit the types of custom metrics in CloudWatch (and likely any other backend as well). While it could make dashboard partitioning and alarm-writing easier, monthly costs would be extremely high if individual metric-types were all stored separately. For instance if every unique metric were accidentally stored with a unique key (10MM types), the monthly cost could be over $240,000!
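For context on the "over $240,000" figure, here is a rough check assuming AWS's published tiered pricing for custom metrics at the time of writing (approximately $0.30, $0.10, $0.05, and $0.02 per metric-month across tiers); treat the rates as an assumption and consult current AWS pricing before relying on them.

```python
# Rough monthly cost if every unique metric were stored as its own custom metric,
# under assumed CloudWatch tier prices (per metric-month). Not an official quote.
def monthly_custom_metric_cost(num_metrics: int) -> float:
    tiers = [(10_000, 0.30), (240_000, 0.10), (750_000, 0.05), (float("inf"), 0.02)]
    cost, remaining = 0.0, num_metrics
    for tier_size, price in tiers:
        in_tier = min(remaining, tier_size)
        cost += in_tier * price
        remaining -= in_tier
        if remaining <= 0:
            break
    return cost

print(monthly_custom_metric_cost(10_000_000))  # ~244,500 -> "over $240,000" per month
```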
While out of scope for this RFC, this is a critical dependency which must have its usage and access limited: #34