{ai}[foss/2022b] PyTorch v2.1.0 #19087

Flamefire · 2023-10-26T12:29:45Z

(created using eb --new-pr)

Requires (with rebuild)

add patch to fix regression in GCC 12.x on AVX512 systems #19180
add patch for GCC 12.x to fix miscompiling C++ code causing double-free in case of exceptions #19253

And

allow Python version of Z3 to be used as a dependency #19354

branfosj · 2023-10-26T13:02:04Z

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0104u08b - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/9d7caece9d37340636fd6021cb1b71b8 for a full test report.

branfosj · 2023-10-26T17:18:23Z

@boegelbot please test @ jsc-zen2
CORE_CNT=16

boegelbot · 2023-10-26T17:20:14Z

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19087 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

exit code: 0
output:

Submitted batch job 3646

Test results coming soon (I hope)...

- notification for comment with ID 1781528056 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

branfosj · 2023-10-26T17:21:02Z

@boegelbot please test @ generoso
CORE_CNT=16

boegelbot · 2023-10-26T17:25:07Z

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=19087 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

exit code: 0
output:

Submitted batch job 12061

Test results coming soon (I hope)...

- notification for comment with ID 1781531861 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

branfosj · 2023-10-26T19:44:17Z

Test report by @branfosj
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
bear-pg0104u04a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/d60584840759cc44bf05e4ca8b7ab857 for a full test report.

boegelbot · 2023-10-27T00:04:37Z

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/9ef7f13c743fd77273b87951a2157012 for a full test report.

boegelbot · 2023-10-27T15:14:34Z

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
cnx4 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/8a570c4699e2331bca4147c8dd4631fd for a full test report.

Flamefire · 2023-10-27T20:21:45Z

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/b8944d8f8c566a4cce8df6ca08807cb1 for a full test report.

Flamefire · 2023-10-28T10:34:27Z

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusi8018 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/8901895237e7ff77024936d85107d047 for a full test report.

akesandgren · 2023-11-02T14:06:00Z

I see these failing (AMD EPYC 7313 16-Core Processor):

dynamo/test_dynamic_shapes 1/1 (1 failed, 2012 passed, 68 skipped, 31 xfailed, 2 rerun)
functorch/test_ops 1/1 (2 failed, 7137 passed, 2274 skipped, 359 xfailed, 4 rerun)
inductor/test_mkldnn_pattern_matcher 1/1 (1 failed, 21 passed, 3 skipped, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2073 passed, 617 skipped, 81 xfailed, 2 rerun)
distributed/elastic/multiprocessing/api_test 1/1 (1 failed, 59 passed, 2 rerun)
test_sparse_csr 1/1 (1 failed, 3997 passed, 671 skipped, 2 rerun)
dynamo/test_dynamic_shapes 1/1 (5452 passed, 135 skipped)

boegelbot · 2023-11-06T17:25:42Z

@Flamefire: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/6773491023
Output from first failing test suite run:

FAIL: test__parse_easyconfig_PyTorch-2.1.0-foss-2022b.eb (test.easyconfigs.easyconfigs.EasyConfigTest)
Test for easyconfig PyTorch-2.1.0-foss-2022b.eb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1609, in innertest
    template_easyconfig_test(self, spec_path)
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1460, in template_easyconfig_test
    self.assertTrue(os.path.isfile(patch_full), msg)
AssertionError: False is not true : Patch file /home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/p/PyTorch/PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is available for PyTorch-2.1.0-foss-2022b.eb

----------------------------------------------------------------------
Ran 18490 tests in 669.345s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.

akesandgren · 2023-11-07T13:18:32Z

@Flamefire That new PyTorch-2.0.1_disable-gcc12-warning.patch fails to apply on 2.1.0

VRehnberg · 2023-11-07T18:18:01Z

Test report by @VRehnberg
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
alvis-c1 - Linux Rocky Linux 8.8, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/VRehnberg/b8f7175187da78c20e8d0bc225fa10fd for a full test report.

VRehnberg · 2023-11-07T18:18:06Z

Test report by @VRehnberg
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
alvis-s1 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/VRehnberg/73d43009a2b8c126ff2bd777e8b16b2f for a full test report.

…1.7.0_disable-dev-shm-test.patch, PyTorch-1.11.1_skip-test_init_from_local_shards.patch, PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch, PyTorch-1.12.1_fix-TestTorch.test_to.patch, PyTorch-1.12.1_skip-test_round_robin.patch, PyTorch-1.13.1_fix-gcc-12-warning-in-fbgemm.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_skip-failing-singular-grad-test.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch, PyTorch-2.0.1_avoid-test_quantization-failures.patch, PyTorch-2.0.1_fix-skip-decorators.patch, PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_skip-failing-gradtest.patch, PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.1.0_skip-diff-test-on-ppc.patch

casparvl · 2023-11-11T03:15:26Z

Test report by @casparvl
FAILED
Build succeeded for 4 out of 5 (2 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/dec0cc8cb4150b51c97ae32df2b214ab for a full test report.

casparvl · 2023-11-11T09:54:12Z

Test failures:

dynamo/test_dynamic_shapes 1/1 failed!
inductor/test_mkldnn_pattern_matcher 1/1 failed!
test_proxy_tensor 1/1 failed!
distributed/elastic/multiprocessing/api_test 1/1 failed!
test_sparse_csr 1/1 failed!

All of those also failed for #19087 (comment)

casparvl · 2023-11-11T10:01:33Z

More detail on the failures:

dynamo/test_dynamic_shapes:

_____ DynamicShapesExportTests.test_predispatch_with_for_out_dtype_nested_dynamic_shapes ______
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/dynamo/test_export.py", line 3769, in test_predispatch_with_for_out_dtype_nested
    self.assertTrue(torch.allclose(m(x), gm(x)))
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1113, in __torch_dispatch__
    return func(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_ops.py", line 448, in __call__
    return self._op(*args, **kwargs or {})
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1487, in dispatch
    op_impl_out = op_impl(self, func, *args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 568, in data_dep
    raise DataDependentOutputException(func)
torch._subclasses.fake_tensor.DataDependentOutputException: aten.allclose.default

To execute this test, run the following from the base repo dir:
     python testing.py -k test_predispatch_with_for_out_dtype_nested_dynamic_shapes

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
------------------------------------ Captured stdout call -------------------------------------
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:1244: UserWarning: export(f, *args, **kwargs) is deprecated, use export(f)(*args, **kwargs) instead.  If you don't migrate, we may break your export call in the future if your user defined kwargs conflict with future kwargs added to export(f).
  warnings.warn(
------------------------------------ Captured stdout call -------------------------------------
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:1244: UserWarning: export(f, *args, **kwargs) is deprecated, use export(f)(*args, **kwargs) instead.  If you don't migrate, we may break your export call in the future if your user defined kwargs conflict with future kwargs added to export(f).
  warnings.warn(
------------------------------------ Captured stdout call -------------------------------------
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:1244: UserWarning: export(f, *args, **kwargs) is deprecated, use export(f)(*args, **kwargs) instead.  If you don't migrate, we may break your export call in the future if your user defined kwargs conflict with future kwargs added to export(f).
  warnings.warn(
=================================== short test summary info ===================================
FAILED [0.4992s] dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_predispatch_with_for_out_dtype_nested_dynamic_shapes
======== 1 failed, 2012 passed, 68 skipped, 31 xfailed, 2 rerun in 1569.77s (0:26:09) =========

inductor/test_mkldnn_pattern_matcher:

========================================== FAILURES ===========================================
_____________________________ TestPatternMatcher.test_linear_fp32 _____________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/inductor/test_mkldnn_pattern_matcher.py", line 254, in test_linear_fp32
    self._test_common(mod, (v,), matcher_count, matcher_nodes)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/inductor/test_mkldnn_pattern_matcher.py", line 131, in _test_common
    self.assertEqual(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3285, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!

Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0

To execute this test, run the following from the base repo dir:
     python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
------------------------------------ Captured stdout call -------------------------------------
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 1), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
------------------------------------ Captured stdout call -------------------------------------
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 1), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
------------------------------------ Captured stdout call -------------------------------------
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 1), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
=================================== short test summary info ===================================
FAILED [0.1498s] inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_linear_fp32
================ 1 failed, 21 passed, 3 skipped, 2 rerun in 208.43s (0:03:28) =================

test_proxy_tensor:

========================================== FAILURES ===========================================
______________________ TestSymbolicTracing.test_constant_specialization _______________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/test_proxy_tensor.py", line 1491, in test_constant_specialization
    tensor = make_fx(f, tracing_mode="symbolic")(torch.randn(10))
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/proxy_tensor.py", line 739, in wrapped
    shape_env = ShapeEnv()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/symbolic_shapes.py", line 2116, in __init__
    if _translation_validation_enabled():
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/symbolic_shapes.py", line 1483, in _translation_validation_enabled
    return translation_validation_enabled()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/validator.py", line 537, in translation_validation_enabled
    assert_z3_installed_if_tv_set()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/validator.py", line 546, in assert_z3_installed_if_tv_set
    assert _HAS_Z3 or not config.translation_validation, (
AssertionError: translation validation requires Z3 package. Please, either install z3-solver or disable translation validation.

To execute this test, run the following from the base repo dir:
     python test/test_proxy_tensor.py -k test_constant_specialization

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
=================================== short test summary info ===================================
FAILED [0.0007s] test_proxy_tensor.py::TestSymbolicTracing::test_constant_specialization - A...
======== 1 failed, 2073 passed, 617 skipped, 81 xfailed, 2 rerun in 910.33s (0:15:10) =========

distributed/elastic/multiprocessing/api_test:

========================================== FAILURES ===========================================
____________________________ StartProcessesNotCITest.test_wrap_bad ____________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/distributed/elastic/multiprocessing/api_test.py", line 678, in test_wrap_bad
    _wrap(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 369, in _wrap
    with stdout_cm, stderr_cm:
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/redirects.py", line 87, in redirect
    std_fd = python_std.fileno()
io.UnsupportedOperation: fileno

To execute this test, run the following from the base repo dir:
     python test/distributed/elastic/multiprocessing/api_test.py -k test_wrap_bad

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
------------------------------------ Captured stdout call -------------------------------------
hello stdout from 0
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/error_handler.py:51: UserWarning: Unable to enable fault handler. UnsupportedOperation: fileno
  warnings.warn(f"Unable to enable fault handler. {type(e).__name__}: {e}")
hello stderr from 0
------------------------------------ Captured stdout call -------------------------------------
hello stdout from 0
------------------------------------ Captured stderr call -------------------------------------
hello stderr from 0
------------------------------------ Captured stdout call -------------------------------------
hello stdout from 0
------------------------------------ Captured stderr call -------------------------------------
hello stderr from 0
=================================== short test summary info ===================================
FAILED [0.0022s] distributed/elastic/multiprocessing/api_test.py::StartProcessesNotCITest::test_wrap_bad

test_sparse_csr:

========================================== FAILURES ===========================================
__________________ TestSparseCompressedCPU.test_invalid_input_csr_large_cpu ___________________
RuntimeError: value cannot be converted to type int32 without overflow

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/test_sparse_csr.py", line 868, in test_invalid_input_csr_large
    with self.assertRaisesRegex(RuntimeError, '32-bit integer overflow in nnz'):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 239, in __exit__
    self._raiseFailure('"{}" does not match "{}"'.format(
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: "32-bit integer overflow in nnz" does not match "value cannot be converted to type int32 without overflow"

To execute this test, run the following from the base repo dir:
     python test/test_sparse_csr.py -k test_invalid_input_csr_large_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
=================================== short test summary info ===================================
FAILED [17.6295s] test_sparse_csr.py::TestSparseCompressedCPU::test_invalid_input_csr_large_cpu
============== 1 failed, 3997 passed, 671 skipped, 2 rerun in 318.02s (0:05:18) ===============

Flamefire · 2023-11-13T15:41:31Z

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
n1504 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/727b6ed8d7c59fa1f87a384a11ab6e21 for a full test report.

boegelbot · 2023-12-08T12:26:06Z

@Flamefire: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/7141071902
Output from first failing test suite run:

FAIL: test_dep_versions_per_toolchain_generation (test.easyconfigs.easyconfigs.EasyConfigTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 898, in test_dep_versions_per_toolchain_generation
    self.assertFalse(multi_dep_vars, error_msg)
AssertionError: ['Z3'] is not false : No multi-variant deps found for '^.*-(?P<tc_gen>20(1[89]|[2-9][0-9])[ab]).*\.eb$' easyconfigs:

found 2 variants of 'Z3' dependency in easyconfigs using '2022b' toolchain generation
* version: 4.12.2; versionsuffix:  as dep for {'leidenalg-0.10.1-foss-2022b.eb', 'python-igraph-0.10.6-foss-2022b.eb'}
* version: 4.12.2; versionsuffix: -Python-3.10.8 as dep for {'PyTorch-2.1.0-foss-2022b.eb'}


----------------------------------------------------------------------
Ran 18629 tests in 702.547s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.
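
The check that failed above flags a dependency that appears in more than one (version, versionsuffix) variant within one toolchain generation. A rough sketch of that idea follows; it is not the actual code in test/easyconfigs/easyconfigs.py. The regex and the three easyconfig entries are copied from the output above, while the function name and the data layout are assumptions for illustration.

```python
import re
from collections import defaultdict

# Regex copied from the test output above; it extracts the toolchain generation.
TC_GEN_REGEX = re.compile(r'^.*-(?P<tc_gen>20(1[89]|[2-9][0-9])[ab]).*\.eb$')

# (easyconfig file, dependency name, version, versionsuffix), as listed in the report.
DEPS = [
    ('leidenalg-0.10.1-foss-2022b.eb', 'Z3', '4.12.2', ''),
    ('python-igraph-0.10.6-foss-2022b.eb', 'Z3', '4.12.2', ''),
    ('PyTorch-2.1.0-foss-2022b.eb', 'Z3', '4.12.2', '-Python-3.10.8'),
]


def find_multi_variant_deps(deps):
    """Hypothetical helper: map (tc_gen, dep) to its variants when there are several."""
    variants = defaultdict(lambda: defaultdict(set))
    for ec_name, dep, version, suffix in deps:
        match = TC_GEN_REGEX.match(ec_name)
        if match is None:
            continue  # file name carries no recognisable toolchain generation
        variants[(match.group('tc_gen'), dep)][(version, suffix)].add(ec_name)
    # Keep only dependencies with more than one (version, versionsuffix) variant.
    return {key: dict(val) for key, val in variants.items() if len(val) > 1}


conflicts = find_multi_variant_deps(DEPS)
for (tc_gen, dep), found in conflicts.items():
    print(f"found {len(found)} variants of '{dep}' in '{tc_gen}' easyconfigs")
```

In this sketch the conflict disappears as soon as all three easyconfigs use the same versionsuffix for Z3.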

Flamefire · 2023-12-16T00:04:20Z

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
i8028 - Linux Rocky Linux 8.7, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/Flamefire/0ed6380ce40f2e5e8be1ce10b1fa2d09 for a full test report.

Flamefire · 2023-12-16T19:29:04Z

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
n1010 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/31409aac5e05f8de0ced205ecb45a873 for a full test report.

branfosj · 2023-12-17T09:36:11Z

@boegelbot please test @ generoso
CORE_CNT=16

boegelbot · 2023-12-17T09:40:13Z

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=19087 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

exit code: 0
output:

Submitted batch job 12426

Test results coming soon (I hope)...

- notification for comment with ID 1859087036 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

branfosj · 2023-12-17T09:43:07Z

@boegelbot please test @ jsc-zen2
CORE_CNT=16

boegelbot · 2023-12-17T09:45:11Z

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19087 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

exit code: 0
output:

Submitted batch job 3928

Test results coming soon (I hope)...

- notification for comment with ID 1859088543 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

branfosj · 2023-12-17T15:18:15Z

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0105u03a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/7cf747da977dc5f6103e08e8fc69d74a for a full test report.

boegelbot · 2023-12-17T18:13:52Z

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
cnx2 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/82a77e22c44ad74197256383c0855ffd for a full test report.

boegelbot · 2023-12-17T23:58:48Z

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/d4e6869b0e3863732220ab44a28506ce for a full test report.

akesandgren · 2023-12-18T12:06:14Z

Test report by @akesandgren
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/d8a69178a947e149a203664fbac1d8df for a full test report.

VRehnberg · 2023-12-18T13:31:35Z

Test report by @VRehnberg
FAILED
Failed during parsing of the easyconfigs, so no ecs were built (3 easyconfigs in total)
alvis-skylake-build - Linux Rocky Linux 8.8, x86_64, Intel Xeon Processor (Skylake, IBRS, no TSX), Python 3.6.8
See https://gist.github.com/VRehnberg/4907a58e7cba019a236948c0ac4cd6af for a full test report.

Missing OS deps

VRehnberg · 2023-12-19T21:25:34Z

Test report by @VRehnberg
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
alvis-cpu1 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, Python 3.6.8
See https://gist.github.com/VRehnberg/322cd2dc533d7e5b4cfa0aa1093b122f for a full test report.

Flamefire · 2023-12-20T11:33:45Z

@VRehnberg

distributed/test_c10d_common 1/1 (5452 passed, 135 skipped)
distributed/test_c10d_common 1/1 (1 unit test(s) failed)
distributed/test_c10d_gloo 1/1 (1 unit test(s) failed)
+ distributed/test_functional_api 1/1 (at easybuild/easyblocks/p/pytorch.py:431 in test_step)

That looks odd. I remember having seen the c10d failures too, but not recently. More confusingly, one suite is listed twice. Are you using the latest develop easyblock?

Also I don't know what has failed in test_functional_api as I haven't seen that before. Could you upload the log please?

VRehnberg · 2023-12-20T13:40:36Z

@Flamefire Ah, no, this picked up the default easyblocks for eb 4.8.2 instead.

Here are logs for a successful build and a failed build respectively if you're still interested
easybuild-PyTorch-2.1.0-20231219.023613.log.gz
easybuild-PyTorch-2.1.0-20231219.103033.ygVBf.log.gz

akesandgren · 2023-12-20T14:32:57Z

FYI, running this with CUDA enabled (and the necessary packages) results in the following failed tests:

test_jit 1/1 failed!
test_ops 1/1 failed!
test_optim 1/1 failed!
distributed/_tensor/test_dtensor_ops 1/1 failed!
distributed/fsdp/test_fsdp_flatten_params 1/1 failed!
nn/test_convolution 1/1 failed!
test_cpp_extensions_aot_ninja 1/1 failed!
test_cpp_extensions_aot_no_ninja 1/1 failed!
test_jit_legacy 1/1 failed!
test_jit_profiling 1/1 failed!
test_nn 1/1 failed!

So not that bad...

(Didn't have the "Skip flaky test in test_nn" fix in my version when doing this so that one might be fixed already)

Flamefire · 2023-12-20T14:56:07Z

(Didn't have the "Skip flaky test in test_nn" fix in my version when doing this so that one might be fixed already)

I'm wondering if that is fixed in 2.1.2. Does 2.1.0 fail for you too in test_nn? Can you try #19445 and check the log for that test too?

BTW: I have the CUDA versions prepared locally and am running them too (still waiting for the results though) but we need the CPU versions ready first in order to reduce the failures and address them individually. That's why I haven't uploaded them yet.

akesandgren · 2023-12-20T14:59:04Z

I've just restarted a test report for this PR. As you can see above, my previous build (prior to the test_nn fix) didn't see any problems.
I'll try to get a 2.1.2 build running later.

I just wanted to see how good/bad the CUDA version would be based on this PR, and it looks fairly good.

Flamefire · 2023-12-20T15:08:15Z

@Flamefire Ah, no, this picked up the default easyblocks for eb 4.8.2 instead.

Here are logs for a successful build and a failed build respectively if you're still interested easybuild-PyTorch-2.1.0-20231219.023613.log.gz easybuild-PyTorch-2.1.0-20231219.103033.ygVBf.log.gz

The test_c10d tests seem to always fail, not due to any real issue but because they are seemingly run in a Slurm environment:

The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute [...]

So don't run it inside a Slurm job environment; "escape" it either with ssh $SLURM_NODELIST (on a single-node allocation) or with:

# Unset every SLURM_* variable in the current shell
for i in $(env | grep ^SLURM_ | cut -f1 -d=); do
  unset "$i"
done

Both work for me.
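
When the tests are launched from Python rather than from an interactive shell, the same idea as the loop above could be sketched like this. The helper names are hypothetical and nothing here is part of EasyBuild; it only drops variables whose names start with SLURM_ from the environment handed to the child process.

```python
import os
import subprocess


def without_slurm_vars(env):
    """Return a copy of env with every SLURM_* variable removed."""
    return {key: value for key, value in env.items() if not key.startswith('SLURM_')}


def run_outside_slurm(cmd):
    """Run cmd (a list of arguments) with a Slurm-free copy of the current environment."""
    return subprocess.run(cmd, env=without_slurm_vars(os.environ), check=False)


# Small illustration on a made-up environment; the values are placeholders.
cleaned = without_slurm_vars({'SLURM_JOB_ID': '1', 'SLURM_NODELIST': 'node1', 'PATH': '/usr/bin'})
print(sorted(cleaned))
```

Unlike the shell loop, this leaves the calling process untouched and only affects the command started through run_outside_slurm.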

test_functional_api might be the same issue: test_find_or_create_pg fails after a timeout (300s). But it might also be a buggy test as I found that it was disabled in October on the PyTorch CI: pytorch/pytorch#107278

If it still fails outside the Slurm env I can add a patch to skip this test.

@akesandgren

I've just restarted a test report for this PR. As you can see above, my previous build (prior to the test_nn fix) didn't see any problems.

Once it is done, please also check the build log to see whether that test failed even though the test report is OK. We allow some tests to fail, so it might report SUCCESS even though that specific test failed.
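
One way to spot such a failure without reading the whole log is to scan for the per-suite summary lines. A minimal sketch, assuming the log contains lines in the forms quoted in this thread (for example "test_nn 1/1 failed!" or "test_sparse_csr 1/1 (1 failed, 3997 passed, 671 skipped, 2 rerun)"); the regex and function names are illustrative and not part of EasyBuild or PyTorch:

```python
import re

# Matches "<suite> <n>/<m> (<details>)" or "<suite> <n>/<m> failed!".
SUMMARY_RE = re.compile(r'^(?P<suite>\S+) (?P<shard>\d+/\d+) (?:\((?P<details>.*)\)|(?P<failed>failed!))$')


def parse_summary_line(line):
    """Return (suite, counts, failed) for a summary line, or None if it is not one."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    counts = {}
    failed = match.group('failed') is not None
    for part in (match.group('details') or '').split(', '):
        count = re.fullmatch(r'(\d+) (\w+)', part)
        if count:
            counts[count.group(2)] = int(count.group(1))
        elif 'failed' in part:
            failed = True  # free-form detail such as "1 unit test(s) failed"
    return match.group('suite'), counts, failed or counts.get('failed', 0) > 0


def failed_suites(lines):
    """List the suites whose summary line reports at least one failure."""
    parsed = (parse_summary_line(line) for line in lines)
    return [item[0] for item in parsed if item is not None and item[2]]


sample = [
    'test_nn 1/1 failed!',
    'dynamo/test_dynamic_shapes 1/1 (5452 passed, 135 skipped)',
    'test_sparse_csr 1/1 (1 failed, 3997 passed, 671 skipped, 2 rerun)',
]
print(failed_suites(sample))
```

Lines that do not follow either form are ignored, so the result is only as complete as the assumption about the log format.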

akesandgren · 2023-12-20T20:19:22Z

Test report by @akesandgren
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/c4cfaf764648a753db250189e7478eb2 for a full test report.

casparvl · 2023-12-23T02:11:19Z

Test report by @casparvl
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 535.104.12, Python 3.6.8
See https://gist.github.com/casparvl/48361cb9a86eaf4c69d163bbb48c7454 for a full test report.

Flamefire · 2023-12-27T10:08:24Z

Closing this after 2.1.2 in #19445 has been merged

Flamefire marked this pull request as draft October 26, 2023 12:33

boegel added the update label Oct 27, 2023

boegel added this to the 4.x milestone Oct 27, 2023

Flamefire added 2 commits November 9, 2023 11:10

Disable bogus warning and workaround GCC destructor bug

af59878

Flamefire force-pushed the 20231026142937_new_pr_PyTorch210 branch from ba37611 to af59878 Compare November 9, 2023 10:11

schiotz mentioned this pull request Nov 9, 2023

{tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... #19184

Closed

Flamefire added 2 commits November 17, 2023 12:57

Remove patch with workaround for bug fixed in GCCcore

3298b5d

Add patches and Z3 dependency to fix more test failures

1bad5c0

Flamefire added 3 commits December 15, 2023 11:09

Skip 1 test in functorch/test_ops

1bb3160

Skip 1 test in functorch/test_ops

c4d0d40

Fix typo in patch name

0e889c5

Skip flaky test in test_nn

2a3ef62

Flamefire mentioned this pull request Dec 19, 2023

{ai}[foss/2022b] PyTorch v2.1.2 #19445

Merged

Flamefire mentioned this pull request Dec 20, 2023

Honor dependency order when (re)building multiple ECs easybuilders/easybuild-framework#4404

Closed

Flamefire mentioned this pull request Dec 21, 2023

{ai}[foss/2022a] PyTorch v2.1.2 #19444

Merged

Flamefire closed this Dec 27, 2023

Flamefire deleted the 20231026142937_new_pr_PyTorch210 branch December 27, 2023 10:08
