add patches to fix or skip PyTorch 1.12.1 tests #16793

Merged

Conversation

@Flamefire (Contributor) commented Dec 5, 2022

(created using eb --new-pr)

This skips a test that times out and is the only remaining failure: #16484 (comment)

Specifically, it is distributed/_shard/sharded_tensor/test_sharded_tensor. The patch is already used in the PyTorch 1.11 and 1.12-foss-2021b ECs but was missed in this one. The failure seems to be related to A100 GPUs, so the patch is only required for the CUDA version. A sketch of what such a skip looks like follows the error output below.

Error looks like this:

======================================================================
ERROR: test_init_from_local_shards (__main__.TestShardedTensorFromLocalShards)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/tmp/easybuild-tmp/eb-grBnJU/tmpUx_nfk/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 466, in wrapper
    self._join_processes(fn)
  File "/tmp/easybuild-tmp/eb-grBnJU/tmpUx_nfk/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 689, in _join_processes
    self._check_return_codes(elapsed_time)
  File "/tmp/easybuild-tmp/eb-grBnJU/tmpUx_nfk/lib/python3.10/site-packages/torch/testing/_internal/common_distributed.py", line 739, in _check_return_codes
    raise RuntimeError(
RuntimeError: Process 0 terminated or timed out after 600.0712957382202 seconds

----------------------------------------------------------------------
Ran 58 tests in 1390.257s

FAILED (errors=1)
distributed/_shard/sharded_tensor/test_sharded_tensor failed!
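
For illustration, skipping a timing-out test like this typically comes down to decorating it with unittest.skip; here is a minimal, self-contained sketch using the names from the traceback above (the actual patch used in the easyconfigs may implement the skip differently):

import unittest

class TestShardedTensorFromLocalShards(unittest.TestCase):
    # Skipping the test outright avoids the 600 s timeout seen above; the real
    # patch may use a different decorator or drop the test entirely.
    @unittest.skip("times out on A100 nodes")
    def test_init_from_local_shards(self):
        self.fail("never runs when skipped")

if __name__ == "__main__":
    unittest.main()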

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8003 - Linux CentOS Linux 7 (Core), x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 3.10.4
See https://gist.github.com/f65bcf11bca67019b2a8c83e6d983c1d for a full test report.

@branfosj added this to the next release (4.7.0) milestone Dec 6, 2022
@branfosj previously approved these changes Dec 6, 2022
@boegel (Member) commented Dec 7, 2022

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3902.accelgor.os - Linux RHEL 8.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 520.61.05, Python 3.6.8
See https://gist.github.com/aaafbeaaf8fed89de051c339c5c64ea0 for a full test report.

@branfosj (Member) commented Dec 8, 2022

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0103u04a.bear.cluster - Linux RHEL 8.5, x86_64, Intel(R) Xeon(R) Gold 6330 CPU @ 2.00GHz (icelake), 2 x NVIDIA NVIDIA A30, 470.57.02, Python 3.6.8
See https://gist.github.com/e3872d709ce43962c43fa36ab6a9566f for a full test report.

@Flamefire (Contributor, Author)

@boegel I'd need the full log to say more about the failures. I started 3 jobs for this PR to see if any of them fail when run non-interactively.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8026 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/5d1bb9e933377b4f97b2d07fb7935314 for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8007 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/63ea8c88b4996f9e42e68cfc44b0ff9f for a full test report.

@Flamefire (Contributor, Author)

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8006 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/8f9ed002b91b65ee9b7fe660d2d0a890 for a full test report.

@Flamefire (Contributor, Author)

I added workarounds for the tests failing for @boegel (a sketch of how such patches are wired into the easyconfig follows this list):

  • distributed/test_c10d_gloo hits a timeout similar to one in a nearly identical test, so I skipped that subtest.
  • test_autograd is the same failure we already fixed in "skip flaky test in PyTorch 1.9.0" #16258 for 1.9.0. @boegel Would you mind porting that patch either from here or from 1.9 to the other ECs? I think adding it to the CUDA ECs is enough.
  • The 2 failing FSDP tests look like a failure I fixed in TensorFlow, where a destructor used pybind11::gil_scoped_release and was called during Python shutdown, resulting in an abort. I'm not sure, though, as I'd need more debug info, so I just disabled those tests for now.
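
For context, here is a minimal sketch of how such test-fix/skip patches are typically wired into an EasyBuild easyconfig; the patch file names below are placeholders, not necessarily the ones added in this PR:

# Illustrative excerpt of the PyTorch 1.12.1 easyconfig; only the patch-related
# lines are shown and the file names are placeholders.
patches = [
    'PyTorch-1.12.1_skip-test_round_robin-timeout.patch',    # placeholder: test_c10d_gloo timeout skip
    'PyTorch-1.12.1_fix-test_autograd.patch',                 # placeholder: ported from the 1.9.0 fix (#16258)
    'PyTorch-1.12.1_skip-failing-fsdp-tests.patch',           # placeholder: FSDP shutdown aborts
]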

@boegel changed the title from "Skip remaining PyTorch 1.12.1 test timeout" to "add patches to fix or skip PyTorch 1.12.1 tests" Jan 3, 2023
@boegel (Member) commented Jan 3, 2023

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3900.accelgor.os - Linux RHEL 8.6, x86_64, AMD EPYC 7413 24-Core Processor (zen3), 1 x NVIDIA NVIDIA A100-SXM4-80GB, 525.60.13, Python 3.6.8
See https://gist.github.com/b7717f1e3d5c38d34913aec04fd3a674 for a full test report.

@boegel (Member) left a comment

lgtm

@boegel (Member) commented Jan 3, 2023

Going in, thanks @Flamefire!

@boegel merged commit 1abda4d into easybuilders:develop Jan 3, 2023
@Flamefire deleted the 20221205162354_new_pr_PyTorch1121 branch January 3, 2023 22:03