
Retry tests #3229

Merged: 13 commits merged into pytorch:master on May 23, 2024
Conversation

@leej3 (Collaborator) commented Apr 2, 2024

@vfdev-5

This is an attempt to rerun some of the CI tests using the retry GitHub Action from nick-fields. The objective is to rerun steps in the GitHub workflows that have failed due to flaky tests or tests that hang. The pytest argument --last-failed is helpful here because tests that already passed are not re-executed across reruns.
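A minimal sketch of the intended behaviour (the real workflows use the nick-fields retry action rather than a hand-rolled loop, and the attempt count below is illustrative):

#!/bin/bash
# Rerun the pytest step a few times; after the first attempt, --last-failed
# restricts the run to tests that failed in the previous attempt.
status=1
for attempt in 1 2 3; do
    extra_args=""
    if [ "${attempt}" -gt 1 ]; then
        extra_args="--last-failed"
    fi
    if pytest ${extra_args} tests/ignite; then
        status=0
        break
    fi
done
exit ${status}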

An edge case that is trickier to handle is when the first pytest invocation passes and a subsequent one fails: on a retry, --last-failed has no recorded failures for the first invocation, so in some circumstances all of its tests are executed again. The pytest argument "--last-failed-no-failures none" can help here, though I have observed undesirable behaviour with it in some cases (see reproducer below): sometimes it skips all tests when I would expect (or at least hope for) some tests to run. Additionally, if all tests are deselected, pytest emits an exit code of 5. For now I catch and ignore that exit code for the first invocation.

Bash script to set up reproducer
#!/bin/bash

# Clear any existing pytest cache
pytest --cache-clear || true

# Recreate the reproducer directory
rm -rf lf-behaviour
mkdir -p lf-behaviour

# Create the Python test file
cat << EOF > lf-behaviour/test_example.py
#!/usr/bin/env python3
import pytest

@pytest.fixture
def cache(pytestconfig):
    return pytestconfig.cache

def test_common_pass():
    assert True

def test_common_flaky_fail(cache):
    retrieved_cache = cache.get("common_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("common_flaky_fail", retrieved_cache + 1)
        assert False
    # Fails on the first three runs, then passes
    assert True

def test_unique1_pass(cache):
    assert True

def test_unique1_flaky_fail(cache):
    retrieved_cache = cache.get("unique1_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("unique1_flaky_fail", retrieved_cache + 1)
        assert False
    # Fails on the first three runs, then passes
    assert True

def test_unique2_pass():
    assert True

def test_unique2_fail():
    assert False

EOF

# Create the Python test file unique to the first run
cat << EOF > lf-behaviour/test_example_unique3.py
#!/usr/bin/env python3
import pytest

@pytest.fixture
def cache(pytestconfig):
    return pytestconfig.cache

def test_separately_unique3_pass(cache):
    assert True

def test_separately_unique3_fail(cache):
    retrieved_cache = cache.get("unique3_flaky_fail", 0)
    if not retrieved_cache > 2:
        cache.set("unique3_flaky_fail", retrieved_cache + 1)
        assert False
    # Fails on the first three runs, then passes
    assert True


EOF

# Create the Python test file unique to the second run
cat << EOF > lf-behaviour/test_example_unique4.py
#!/usr/bin/env python3
import pytest
def test_separately_unique4_pass():
    assert True

def test_separately_unique4_fail():
    assert False

EOF

# Create the shell script that runs each selection of tests as a separate pytest invocation
cat << EOF > lf-behaviour/run_unique_individually.sh
#!/bin/bash
set -ex

pytest \${EXTRA_PYTEST_ARGS} -k "unique1" test_example.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} test_example_unique3.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} -k "unique2" test_example.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} test_example_unique4.py || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
EOF

# Create the shell script that runs two overlapping selections of tests
cat << EOF > lf-behaviour/run_tests.sh
#!/bin/bash
set -ex

pytest \${EXTRA_PYTEST_ARGS} -k "test_common or unique1 or unique3" . || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
pytest \${EXTRA_PYTEST_ARGS} -k "test_common or unique2 or unique4" . || { exit_code=\$?; if [ "\$exit_code" -eq 5 ]; then echo "All tests deselected"; else exit \$exit_code; fi;}
EOF


cd lf-behaviour
bash run_unique_individually.sh

# Echo the commands to rerun the tests
echo "To run the tests, use the following commands (you can also add the pytest args --cache-clear and --last-failed-no-failures [all/none]):"
echo "cd lf-behaviour"
echo 'EXTRA_PYTEST_ARGS="--last-failed --last-failed-no-failures none" bash run_unique_individually.sh'

That leaves one remaining problem before this approach is usable (i.e. before this PR could be merged): currently some of the pytest invocations hang. When this happens, the tests that were never executed are treated as not having failed, so the next --last-failed run can skip most of the test suite completely. I haven't thought of a way to address this.

The simpler solution of removing these extra pytest arguments would work, but the timeout for the full job would need to be increased well beyond its current 45 minutes.

EDIT: opting for the simpler solution for now. Perhaps using pytest-timeout could help with the hanging tests. Previous work here
EDIT 2: giving each run/execution of pytest on a particular CI machine its own cache directory allows the --last-failed functionality to be used effectively.
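For illustration, a minimal sketch of the separate-cache-directory idea (the cache directory names and test selections below are illustrative, not the exact ones used on CI):

# Each invocation gets its own cache_dir, so --last-failed state recorded for
# one selection of tests cannot deselect or re-trigger tests in another selection.
pytest -o cache_dir=.cpu-tests-cache --last-failed -k "not distributed" tests/ignite
pytest -o cache_dir=.distrib-tests-cache --last-failed -k "distributed" tests/ignite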

@leej3 requested a review from vfdev-5 on April 2, 2024 15:20
@github-actions bot added the ci label on Apr 2, 2024
@leej3 force-pushed the retry-tests branch 4 times, most recently from fc938a9 to 855e6f1 on April 4, 2024 15:26
@leej3 mentioned this pull request on Apr 4, 2024
@leej3 changed the title from "WIP: Retry tests" to "Retry tests" on Apr 4, 2024
@leej3 force-pushed the retry-tests branch 2 times, most recently from 9238a9f to e5498ba on April 25, 2024 12:03
@github-actions bot added the module: metrics label on Apr 25, 2024
@leej3 changed the title from "Retry tests" to "[WIP] Retry tests" on Apr 25, 2024
@leej3 changed the title from "[WIP] Retry tests" to "Retry tests" on Apr 29, 2024
@leej3 (Collaborator, Author) commented Apr 29, 2024

This is working reasonably well now.

  • There seems to be some Neptune cleanup that is slowing things down; that might be worth debugging.
  • I believe there are other sources of hanging tests.
  • Distributed tests during the first pytest invocation of run_cpu_tests.sh are still hanging, but with the timeout of the retry GitHub Action they now get rerun successfully.
  • One potential issue I see locally is that tests that cause an error don't get rerun.

Also, I abandoned "--dist=loadgroup" because of the various hangs. I can't figure out what's causing them, but if we could turn that option on for the Linux tests it would reduce their execution time considerably.

@vfdev-5 (Collaborator) commented Apr 29, 2024

Do you know why this one is still failing: https://github.com/pytorch/ignite/actions/runs/8878484399/job/24374218672?pr=3229 ?

one potential issue that I see locally is that tests that cause an error don't get rerun.

Can you elaborate on this one?

@leej3 (Collaborator, Author) commented Apr 29, 2024

Do you know why this one is still failing

After all tests completed, a "shutting down" message from Neptune was displayed and the job hung for another 10 minutes or so.

Currently, when tests hang they get restarted by the GitHub Action at the 25-minute mark. That means if a particular machine hangs twice, the job is cancelled due to the 45-minute timeout. We could adjust the retry timeout down to 20 minutes to improve the chances of success, but overall, figuring out the causes of the hangs is the ideal.

Regarding the errors: if an invocation of pytest has failures (invalid assertions, etc.) and errors (failed fixture execution, etc.), it seems that the rerun picks up the failures but ignores the errors. It's likely a deal breaker... I haven't thought of a workaround yet. (EDIT: I think I was wrong about this; a more thorough investigation shows that errors do get rerun.)
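For anyone who wants to check this behaviour locally, a minimal reproducer in the style of the earlier script (illustrative only, not part of the PR):

# One test that errors (via a failing fixture) next to one that fails outright;
# run pytest once, then rerun with --last-failed to see whether the errored
# test is selected again.
cat << EOF > test_lf_errors.py
import pytest

@pytest.fixture
def broken_fixture():
    raise RuntimeError("fixture setup error")

def test_error(broken_fixture):
    assert True

def test_failure():
    assert False
EOF

pytest test_lf_errors.py || true
pytest --last-failed test_lf_errors.py || true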

@leej3 (Collaborator, Author) commented May 9, 2024

I think this is ready to go now. Overall it:

  • retries failed test runs
  • reruns only the tests from the pytest suite that previously failed
  • adjusts the timeouts for some of the tests/CI test steps to make things more robust
  • consolidates the core test functionality into a script shared across the pre-existing test scripts

Test failures were too frequent to avoid using pytest's "--last-failed" functionality. This introduced some extra details, such as needing to specify the pytest cache directory and trapping an exit code of 5 when a previous pytest invocation had no failures. To keep things maintainable I moved these details into a function shared across all scripts.
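For reference, a hedged sketch of the kind of helper described above (the function name, arguments, and cache directory are illustrative and not the exact interface of the shared script):

run_pytest_with_cache() {
    local cache_dir="$1"; shift
    # pytest exits with code 5 when no tests are collected, which is what
    # "--last-failed-no-failures none" produces when the previous invocation
    # had no failures; treat that case as a success.
    pytest -o cache_dir="${cache_dir}" --last-failed --last-failed-no-failures none "$@" \
        || { ec=$?; if [ "${ec}" -eq 5 ]; then echo "All tests deselected"; else return "${ec}"; fi; }
}

# Illustrative call:
run_pytest_with_cache .unit-tests-cache -k "not distributed" tests/ignite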

@leej3 force-pushed the retry-tests branch 2 times, most recently from 426d4a3 to f37e188 on May 13, 2024 11:40
@github-actions bot added the module: handlers label on May 13, 2024
@leej3 force-pushed the retry-tests branch 3 times, most recently from b9015f6 to 8cd5c65 on May 21, 2024 13:24
@leej3 changed the title from "Retry tests" to "[WIP] Retry tests" on May 21, 2024
leej3 and others added 8 commits May 22, 2024 11:08
greatly speeds up reruns of tests as only previously failed tests are
rerun.

define pytest cachedir for each pytest invocation to prevent interaction
between different selections of tests.

protect against an exit code of 5 when a previous pytest invocation
had no failed tests, which results in all tests being deselected.

use eval to avoid issues with the -k and -m expansions.
@leej3 force-pushed the retry-tests branch 2 times, most recently from 15d4c95 to 475be56 on May 22, 2024 15:06
@leej3 (Collaborator, Author) commented May 22, 2024

Success!

A significant issue with retrying the test suite was that the retry action sends a SIGTERM, which pytest completely ignores. The result was overlapping pytest sessions that caused a bit of a mess. I abandoned a failed strategy of trying to kill (SIGINT and then SIGKILL) the pytest process upon retrying. It is easier to modify how pytest interprets the SIGTERM (worth noting, though, in case this modification causes other undesirable behaviour).

Regardless of how the test session is interrupted, one tricky detail was including "unrun" tests in subsequent retries. These tests are not run because a session hangs and is interrupted: pytest appropriately classifies them as not failed, but for our purposes we do want to include them in the next --last-failed run. My solution was to add an option "--treat-unrun-as-failed" that classifies such tests as failed for subsequent runs.

The overall PR summary:

  • Failures on CI are now rerun up to 5 times. Tests that previously passed are not rerun.
  • Common functionality in the bash test scripts has been consolidated into tests/common-test-functionality.sh.
  • tests/ignite is passed to the pytest command. This pattern uses the correct conftest.py to configure the CLI.

@leej3 changed the title from "[WIP] Retry tests" to "Retry tests" on May 22, 2024
@vfdev-5 (Collaborator) left a comment

Thanks a lot, John, for working on this PR all this time!
Now CI is passing :)
I left a few comments and questions; can you please check them?

Review comments were left on:
.github/workflows/hvd-tests.yml
.github/workflows/pytorch-version-tests.yml
.github/workflows/tpu-tests.yml
.github/workflows/unit-tests.yml
tests/ignite/conftest.py

# Run the command
if [ "$trap_deselected_exit_code" -eq "1" ]; then
CUDA_VISIBLE_DEVICES="" eval "pytest ${pytest_args}" || { exit_code=$?; if [ "$exit_code" -eq ${last_failed_no_failures_code} ]; then echo "All tests deselected"; else exit $exit_code; fi; }
Collaborator

Why do we need the eval call here? Can't we make the call without eval?

@leej3 (Collaborator, Author) commented May 23, 2024

I added an eval here because of some bugs I was running into where things like "-k ''" ended up as "-k" in the final command. The horrors of using eval are somewhat mitigated by the bash "-x" flag, so bugs in the assembled command can be spotted more quickly. I think consistently using arrays for assembling commands in bash is a better alternative, but using Python is the best long-term solution.
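A sketch of the array-based alternative mentioned above (illustrative, not code from this PR):

# Assemble the command as an array: an empty -k expression contributes no words,
# so nothing collapses into a bare "-k" and no eval is required.
pytest_cmd=(pytest -o cache_dir=.pytest-cache --last-failed)
keyword_expr=""            # may legitimately be empty
if [ -n "${keyword_expr}" ]; then
    pytest_cmd+=(-k "${keyword_expr}")
fi
pytest_cmd+=(tests/ignite)
CUDA_VISIBLE_DEVICES="" "${pytest_cmd[@]}"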

@@ -0,0 +1,102 @@
#!/bin/bash
Collaborator

I wonder whether it would be simpler to write this script in Python rather than bash for later maintenance, and how much effort that would take?
If you think this is feasible we can do it in a follow-up PR.

Collaborator Author

I think it would be easy enough: largely just providing a simple CLI, setting environment variables, and assembling commands to run as subprocesses. A few hours, probably; perhaps a few more to iron out issues and add some tests for it.

Collaborator

OK, sounds good; let's make a Python script instead of this bash script in a follow-up PR.

@vfdev-5 (Collaborator) left a comment

LGTM, thanks John!

@leej3 enabled auto-merge (squash) May 23, 2024 09:44
@leej3 merged commit 8db318b into pytorch:master May 23, 2024
20 checks passed
Labels: ci, module: handlers, module: metrics
2 participants