Retry tests #3229
Conversation
This is working reasonably well now.
Also, I abandoned "--dist=loadgroup" because of the various things hanging. I can't figure out what's causing it, but if we could turn that on for the Linux tests it would reduce their execution time considerably.
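For reference, a minimal sketch of the kind of pytest-xdist invocation being referred to; the worker count and test path below are illustrative, not the exact values from the workflows:

# Hypothetical example: distribute tests across 4 workers while keeping tests
# that share an @pytest.mark.xdist_group name on the same worker (pytest-xdist).
pytest -n 4 --dist=loadgroup tests/ignite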
Do you know why this one is still failing: https://github.com/pytorch/ignite/actions/runs/8878484399/job/24374218672?pr=3229 ?
Can you elaborate on this one?
After all tests completed, a "shutting down" message from Neptune was displayed, and that hung for another 10 minutes or so. Currently, when tests hang they get restarted by the GitHub action at the 25-minute mark. That means if a particular machine hangs twice, it is cancelled due to the 45-minute timeout. We could adjust it down to 20 minutes to improve the chances of success, but overall, figuring out the causes of the hanging is the ideal.
I think this is ready to go now. Overall it:
Test failures were too frequent to avoid using pytest's "--last-failed" functionality. This introduced some extra details, like needing to specify the pytest cache directory and trapping an exit code of 5 when a previous pytest invocation had no failures. To keep things maintainable, I moved these details into a function shared across all scripts.
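A minimal sketch of what such a shared function could look like; the function name, cache path, and test selection below are hypothetical, not the ones used in the PR:

#!/bin/bash
# Hypothetical helper: give each selection of tests its own cache directory so
# --last-failed state does not leak between invocations, and treat exit code 5
# ("no tests collected") as success when everything was deselected.
run_pytest_with_retry_support() {
    local cache_dir="$1"; shift
    local exit_code=0
    pytest -o cache_dir="${cache_dir}" --last-failed --last-failed-no-failures none "$@" \
        || exit_code=$?
    if [ "${exit_code}" -eq 5 ]; then
        echo "All tests deselected (previous run had no failures)"
        return 0
    fi
    return "${exit_code}"
}

# Example usage (illustrative marker and path):
# run_pytest_with_retry_support .pytest-cache-cpu -m "not distributed" tests/ignite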
This change:
- greatly speeds up reruns of tests, as only previously failed tests are rerun;
- defines a pytest cache dir for each pytest invocation to prevent interaction between different selections of tests;
- protects against an exit code of 5 when a previous pytest invocation had no failed tests, which results in all tests being deselected;
- uses eval to avoid issues with the -k and -m expansions.
Success! A significant issue with retrying the test suite was that the retry action sends a SIGTERM, which pytest completely ignores. The result was overlapping pytest sessions that caused a bit of a mess. I abandoned a failed strategy of trying to kill (SIGINT and then SIGKILL) the pytest process upon retrying; it is easier to modify how pytest interprets the SIGTERM (worth noting this modification in case it causes other undesirable behaviour, though). Regardless of how the test session is interrupted, one tricky detail was to include "unrun" tests in subsequent retries. These tests are not run because a session hangs and is interrupted: pytest appropriately classifies them as not failed, but for our purposes we do in fact want to include them in the next run of the test suite.

The overall PR summary:
Thanks a lot, John, for working on this PR all this time!
Now CI is passing :)
I left a few comments and questions; can you please check them?
# Run the command
if [ "$trap_deselected_exit_code" -eq "1" ]; then
    CUDA_VISIBLE_DEVICES="" eval "pytest ${pytest_args}" || { exit_code=$?; if [ "$exit_code" -eq ${last_failed_no_failures_code} ]; then echo "All tests deselected"; else exit $exit_code; fi; }
Why do we need the eval call here? Can't we make the call without eval?
I added an eval here because of some bugs I was running into, where things like "-k ''" end up as "-k" in the final command. The horrors of using eval are somewhat mitigated by the "-x" bash flag, so bugs in the command can be spotted more quickly. I think consistently using arrays for assembling commands in bash is a better alternative (see the sketch below), but I think using Python is the best long-term solution.
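A minimal sketch of the array-based alternative mentioned above; the variable names, marker expression, and paths are illustrative and not taken from the PR:

# Hypothetical sketch: build the pytest command as an array so that empty or
# multi-word arguments such as -k '' survive word splitting without eval.
pytest_cmd=(pytest -o cache_dir=.hypothetical_cache --last-failed)
keyword_expr=""                       # may legitimately be empty
if [ -n "${keyword_expr}" ]; then
    pytest_cmd+=(-k "${keyword_expr}")
fi
pytest_cmd+=(-m "not distributed" tests/ignite)

# Each array element is passed through as a single argument, empty strings included.
CUDA_VISIBLE_DEVICES="" "${pytest_cmd[@]}"

Because each array element stays a single word regardless of its contents, this avoids both eval and the disappearing "-k ''" problem.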
@@ -0,0 +1,102 @@
#!/bin/bash
I wonder whether it would be simpler to write this script in Python instead of bash for later maintenance, and how much effort that would take.
If you think this is feasible, we can do it in a follow-up PR.
I think it would be easy enough: largely just providing a simple CLI, setting environment variables, and assembling commands to run as a subprocess. A few hours, probably. Perhaps a few more to iron out issues and add some tests for it.
OK, sounds good, let's make a Python script instead of this bash script in a follow-up PR.
LGTM, thanks John!
@vfdev-5
This is an attempt to rerun some of the CI tests using the retry GitHub action from nick-fields. The objective is to rerun steps in the GitHub workflows that have failed due to flaky tests or tests that hang. Using the pytest argument --last-failed is helpful, as passed tests are not re-executed across reruns.
An edge case that is trickier to handle is when the first pytest invocation passes and the subsequent one fails. In some circumstances this will lead to all tests being executed for the first invocation. The pytest argument "--last-failed-no-failures none" can help here, though I have observed undesirable behaviour in some cases when using it (see the reproducer below). Sometimes it skips all tests when I would expect it (or at least hope for it) to run some tests. Additionally, if all tests are deselected it emits an exit code of 5 in bash. For now I catch and ignore this for the first invocation.
Bash script to set up reproducer
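(The collapsed reproducer script is not shown above. Purely to illustrate the behaviour being described, a hypothetical minimal version might look like the following; the file and test names are invented.)

#!/bin/bash
# Hypothetical reproducer sketch: two invocations sharing one pytest cache,
# showing how --last-failed-no-failures none can deselect everything.
mkdir -p /tmp/lastfailed_repro && cd /tmp/lastfailed_repro

cat > test_a.py <<'EOF'
def test_passes():
    assert True
EOF

cat > test_b.py <<'EOF'
def test_fails():
    assert False
EOF

# First selection passes, so no failures are recorded in the cache.
pytest --last-failed --last-failed-no-failures none test_a.py

# Second selection: with no cached failures, all tests are deselected
# (exit code 5) instead of test_b.py being run.
pytest --last-failed --last-failed-no-failures none test_b.py || echo "exit code: $?"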
That leaves one remaining problem for this approach to be usable (i.e. for this PR to be mergeable): currently some of the pytest invocations hang. When this happens, the tests that were not executed are automatically treated as having passed, which can result in most of the test suite being skipped completely. I haven't thought of a way to address this.
The simpler solution of removing these extra pytest arguments would work, but the timeout for the full job would need to be increased a good bit beyond what it currently is (45 mins).
EDIT: opting for the simpler solution for now. Perhaps using pytest-timeout could help with tests hanging. Previous work here
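(For context, a hedged sketch of how the pytest-timeout plugin is typically invoked; the timeout value and test path are arbitrary choices, not something specified in this PR.)

# Hypothetical example: fail any single test that runs longer than 5 minutes,
# rather than letting the whole session hang (requires the pytest-timeout plugin).
pytest --timeout=300 --timeout-method=thread tests/ignite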
EDIT 2: using separate pytest cache directories for different runs/executions of pytest on a particular CI machine allows the --last-failed functionality to be used effectively.