{ai}[foss/2022b] PyTorch v2.1.0 #19087

@Flamefire Flamefire marked this pull request as draft October 26, 2023 12:33
@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0104u08b - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/9d7caece9d37340636fd6021cb1b71b8 for a full test report.

@boegelbot

This comment was marked as outdated.

@branfosj
Member

@boegelbot please test @ jsc-zen2
CORE_CNT=16

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19087 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3646

Test results coming soon (I hope)...

- notification for comment with ID 1781528056 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=19087 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12061

Test results coming soon (I hope)...

- notification for comment with ID 1781531861 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

Test report by @branfosj
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
bear-pg0104u04a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/d60584840759cc44bf05e4ca8b7ab857 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/9ef7f13c743fd77273b87951a2157012 for a full test report.

@boegel boegel added the update label Oct 27, 2023
@boegel boegel added this to the 4.x milestone Oct 27, 2023
@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
cnx4 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/8a570c4699e2331bca4147c8dd4631fd for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX, 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/Flamefire/b8944d8f8c566a4cce8df6ca08807cb1 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
taurusi8018 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor, 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/8901895237e7ff77024936d85107d047 for a full test report.

@akesandgren
Contributor

I see these failing (AMD EPYC 7313 16-Core Processor):

dynamo/test_dynamic_shapes 1/1 (1 failed, 2012 passed, 68 skipped, 31 xfailed, 2 rerun)
functorch/test_ops 1/1 (2 failed, 7137 passed, 2274 skipped, 359 xfailed, 4 rerun)
inductor/test_mkldnn_pattern_matcher 1/1 (1 failed, 21 passed, 3 skipped, 2 rerun)
test_proxy_tensor 1/1 (1 failed, 2073 passed, 617 skipped, 81 xfailed, 2 rerun)
distributed/elastic/multiprocessing/api_test 1/1 (1 failed, 59 passed, 2 rerun)
test_sparse_csr 1/1 (1 failed, 3997 passed, 671 skipped, 2 rerun)
dynamo/test_dynamic_shapes 1/1 (5452 passed, 135 skipped)
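To compare lists like this across machines, the per-suite counts can be pulled out mechanically. A minimal sketch, assuming the summary lines always have the shape shown in this comment; `parse_summary_line` is a hypothetical helper, not part of EasyBuild or PyTorch:

```python
import re

# Matches lines such as:
#   test_sparse_csr 1/1 (1 failed, 3997 passed, 671 skipped, 2 rerun)
SUMMARY_RE = re.compile(r"^(?P<suite>\S+) \d+/\d+ \((?P<counts>[^)]*)\)$")


def parse_summary_line(line):
    """Return (suite, {outcome: count}), or None if the line has another shape."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    counts = {}
    for part in match.group("counts").split(","):
        number, _, outcome = part.strip().partition(" ")
        if not number.isdigit():
            return None
        counts[outcome] = int(number)
    return match.group("suite"), counts


suite, counts = parse_summary_line(
    "functorch/test_ops 1/1 (2 failed, 7137 passed, 2274 skipped, 359 xfailed, 4 rerun)"
)
print(suite, counts.get("failed", 0))
```

Lines in any other format are returned as `None` rather than guessed at.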

@boegelbot
Collaborator

@Flamefire: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/6773491023
Output from first failing test suite run:

FAIL: test__parse_easyconfig_PyTorch-2.1.0-foss-2022b.eb (test.easyconfigs.easyconfigs.EasyConfigTest)
Test for easyconfig PyTorch-2.1.0-foss-2022b.eb
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1609, in innertest
    template_easyconfig_test(self, spec_path)
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 1460, in template_easyconfig_test
    self.assertTrue(os.path.isfile(patch_full), msg)
AssertionError: False is not true : Patch file /home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/easybuild/easyconfigs/p/PyTorch/PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is available for PyTorch-2.1.0-foss-2022b.eb

----------------------------------------------------------------------
Ran 18490 tests in 669.345s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.
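The assertion in the traceback boils down to an `os.path.isfile` check on every patch the easyconfig lists. A rough sketch for running the same kind of check locally before pushing, assuming all patches sit in one directory next to the easyconfig; `missing_patches` is a hypothetical helper and not the test suite's own code:

```python
import os
import tempfile


def missing_patches(easyconfig_dir, patch_names):
    """Return the patch names that are not regular files in easyconfig_dir."""
    return [
        name
        for name in patch_names
        if not os.path.isfile(os.path.join(easyconfig_dir, name))
    ]


# Demo in a throwaway directory: one patch exists, the other does not.
with tempfile.TemporaryDirectory() as tmpdir:
    open(os.path.join(tmpdir, "present.patch"), "w").close()
    print(missing_patches(tmpdir, ["present.patch", "absent.patch"]))
```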

@akesandgren
Contributor

@Flamefire That new PyTorch-2.0.1_disable-gcc12-warning.patch fails to apply on 2.1.0

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
alvis-c1 - Linux Rocky Linux 8.8, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/VRehnberg/b8f7175187da78c20e8d0bc225fa10fd for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
alvis-s1 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/VRehnberg/73d43009a2b8c126ff2bd777e8b16b2f for a full test report.

…1.7.0_disable-dev-shm-test.patch, PyTorch-1.11.1_skip-test_init_from_local_shards.patch, PyTorch-1.12.1_add-hypothesis-suppression.patch, PyTorch-1.12.1_fix-test_cpp_extensions_jit.patch, PyTorch-1.12.1_fix-TestTorch.test_to.patch, PyTorch-1.12.1_skip-test_round_robin.patch, PyTorch-1.13.1_fix-gcc-12-warning-in-fbgemm.patch, PyTorch-1.13.1_fix-protobuf-dependency.patch, PyTorch-1.13.1_fix-warning-in-test-cpp-api.patch, PyTorch-1.13.1_skip-failing-singular-grad-test.patch, PyTorch-1.13.1_skip-tests-without-fbgemm.patch, PyTorch-2.0.1_avoid-test_quantization-failures.patch, PyTorch-2.0.1_fix-skip-decorators.patch, PyTorch-2.0.1_fix-ub-in-inductor-codegen.patch, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_skip-failing-gradtest.patch, PyTorch-2.0.1_skip-test_shuffle_reproducibility.patch, PyTorch-2.0.1_skip-tests-skipped-in-subprocess.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.1.0_skip-diff-test-on-ppc.patch
@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 4 out of 5 (2 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/dec0cc8cb4150b51c97ae32df2b214ab for a full test report.

@casparvl
Contributor

Test failures:

dynamo/test_dynamic_shapes 1/1 failed!
inductor/test_mkldnn_pattern_matcher 1/1 failed!
test_proxy_tensor 1/1 failed!
distributed/elastic/multiprocessing/api_test 1/1 failed!
test_sparse_csr 1/1 failed!

All of those also failed for #19087 (comment)

@casparvl
Contributor

casparvl commented Nov 11, 2023

More detail on the failures:

dynamo/test_dynamic_shapes:
_____ DynamicShapesExportTests.test_predispatch_with_for_out_dtype_nested_dynamic_shapes ______
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/dynamo/test_export.py", line 3769, in test_predispatch_with_for_out_dtype_nested
    self.assertTrue(torch.allclose(m(x), gm(x)))
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1113, in __torch_dispatch__
    return func(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_ops.py", line 448, in __call__
    return self._op(*args, **kwargs or {})
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1250, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 1487, in dispatch
    op_impl_out = op_impl(self, func, *args, **kwargs)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_subclasses/fake_tensor.py", line 568, in data_dep
    raise DataDependentOutputException(func)
torch._subclasses.fake_tensor.DataDependentOutputException: aten.allclose.default

To execute this test, run the following from the base repo dir:
     python testing.py -k test_predispatch_with_for_out_dtype_nested_dynamic_shapes

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
------------------------------------ Captured stdout call -------------------------------------
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:1244: UserWarning: export(f, *args, **kwargs) is deprecated, use export(f)(*args, **kwargs) instead.  If you don't migrate, we may break your export call in the future if your user defined kwargs conflict with future kwargs added to export(f).
  warnings.warn(
------------------------------------ Captured stdout call -------------------------------------
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:1244: UserWarning: export(f, *args, **kwargs) is deprecated, use export(f)(*args, **kwargs) instead.  If you don't migrate, we may break your export call in the future if your user defined kwargs conflict with future kwargs added to export(f).
  warnings.warn(
------------------------------------ Captured stdout call -------------------------------------
inline_call []
stats [('calls_captured', 3), ('unique_graphs', 1)]
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/_dynamo/eval_frame.py:1244: UserWarning: export(f, *args, **kwargs) is deprecated, use export(f)(*args, **kwargs) instead.  If you don't migrate, we may break your export call in the future if your user defined kwargs conflict with future kwargs added to export(f).
  warnings.warn(
=================================== short test summary info ===================================
FAILED [0.4992s] dynamo/test_dynamic_shapes.py::DynamicShapesExportTests::test_predispatch_with_for_out_dtype_nested_dynamic_shapes
======== 1 failed, 2012 passed, 68 skipped, 31 xfailed, 2 rerun in 1569.77s (0:26:09) =========
inductor/test_mkldnn_pattern_matcher:
========================================== FAILURES ===========================================
_____________________________ TestPatternMatcher.test_linear_fp32 _____________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/inductor/test_mkldnn_pattern_matcher.py", line 254, in test_linear_fp32
    self._test_common(mod, (v,), matcher_count, matcher_nodes)
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/inductor/test_mkldnn_pattern_matcher.py", line 131, in _test_common
    self.assertEqual(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/testing/_internal/common_utils.py", line 3285, in assertEqual
    raise error_metas.pop()[0].to_error(
AssertionError: Scalars are not equal!

Expected 1 but got 0.
Absolute difference: 1
Relative difference: 1.0

To execute this test, run the following from the base repo dir:
     python test/inductor/test_mkldnn_pattern_matcher.py -k test_linear_fp32

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
------------------------------------ Captured stdout call -------------------------------------
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 1), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
------------------------------------ Captured stdout call -------------------------------------
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 1), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
------------------------------------ Captured stdout call -------------------------------------
frames [('total', 1), ('ok', 1)]
stats [('calls_captured', 1), ('unique_graphs', 1)]
aot_autograd [('total', 1), ('ok', 1)]
inductor []
=================================== short test summary info ===================================
FAILED [0.1498s] inductor/test_mkldnn_pattern_matcher.py::TestPatternMatcher::test_linear_fp32
================ 1 failed, 21 passed, 3 skipped, 2 rerun in 208.43s (0:03:28) =================
test_proxy_tensor:
========================================== FAILURES ===========================================
______________________ TestSymbolicTracing.test_constant_specialization _______________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/test_proxy_tensor.py", line 1491, in test_constant_specialization
    tensor = make_fx(f, tracing_mode="symbolic")(torch.randn(10))
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/proxy_tensor.py", line 739, in wrapped
    shape_env = ShapeEnv()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/symbolic_shapes.py", line 2116, in __init__
    if _translation_validation_enabled():
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/symbolic_shapes.py", line 1483, in _translation_validation_enabled
    return translation_validation_enabled()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/validator.py", line 537, in translation_validation_enabled
    assert_z3_installed_if_tv_set()
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/fx/experimental/validator.py", line 546, in assert_z3_installed_if_tv_set
    assert _HAS_Z3 or not config.translation_validation, (
AssertionError: translation validation requires Z3 package. Please, either install z3-solver or disable translation validation.

To execute this test, run the following from the base repo dir:
     python test/test_proxy_tensor.py -k test_constant_specialization

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
=================================== short test summary info ===================================
FAILED [0.0007s] test_proxy_tensor.py::TestSymbolicTracing::test_constant_specialization - A...
======== 1 failed, 2073 passed, 617 skipped, 81 xfailed, 2 rerun in 910.33s (0:15:10) =========
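The assertion above fires because translation validation is enabled while the Z3 bindings cannot be imported. A quick way to see which situation a given Python environment is in, sketched under the assumption that the `z3-solver` package named in the error message is the one providing the importable `z3` module; `has_module` is a hypothetical helper:

```python
import importlib.util


def has_module(name):
    """True if the import system can locate `name`, without importing it."""
    return importlib.util.find_spec(name) is not None


print("z3 importable:", has_module("z3"))
```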
distributed/elastic/multiprocessing/api_test:
========================================== FAILURES ===========================================
____________________________ StartProcessesNotCITest.test_wrap_bad ____________________________
Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/distributed/elastic/multiprocessing/api_test.py", line 678, in test_wrap_bad
    _wrap(
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/api.py", line 369, in _wrap
    with stdout_cm, stderr_cm:
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/contextlib.py", line 135, in __enter__
    return next(self.gen)
  File "/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/redirects.py", line 87, in redirect
    std_fd = python_std.fileno()
io.UnsupportedOperation: fileno

To execute this test, run the following from the base repo dir:
     python test/distributed/elastic/multiprocessing/api_test.py -k test_wrap_bad

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
------------------------------------ Captured stdout call -------------------------------------
hello stdout from 0
------------------------------------ Captured stderr call -------------------------------------
/scratch-nvme/1/casparl/ebtmpdir/eb-wnxpxipe/tmp_7vzcpji/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/error_handler.py:51: UserWarning: Unable to enable fault handler. UnsupportedOperation: fileno
  warnings.warn(f"Unable to enable fault handler. {type(e).__name__}: {e}")
hello stderr from 0
------------------------------------ Captured stdout call -------------------------------------
hello stdout from 0
------------------------------------ Captured stderr call -------------------------------------
hello stderr from 0
------------------------------------ Captured stdout call -------------------------------------
hello stdout from 0
------------------------------------ Captured stderr call -------------------------------------
hello stderr from 0
=================================== short test summary info ===================================
FAILED [0.0022s] distributed/elastic/multiprocessing/api_test.py::StartProcessesNotCITest::test_wrap_bad
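The `io.UnsupportedOperation: fileno` in this traceback is what Python raises when a stream has no OS-level file descriptor. One plausible reading (not confirmed in this thread) is that stdout/stderr were replaced by in-memory objects, as output capturing tends to do. A minimal sketch of that behaviour using only the standard library; `has_real_fd` is a hypothetical helper:

```python
import io
import tempfile


def has_real_fd(stream):
    """True if the stream is backed by an OS-level file descriptor."""
    try:
        stream.fileno()
    except (AttributeError, io.UnsupportedOperation):
        return False
    return True


# An in-memory text buffer has no descriptor; a temporary file does.
print(has_real_fd(io.StringIO()))
with tempfile.TemporaryFile() as real_file:
    print(has_real_fd(real_file))
```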
test_sparse_csr:
========================================== FAILURES ===========================================
__________________ TestSparseCompressedCPU.test_invalid_input_csr_large_cpu ___________________
RuntimeError: value cannot be converted to type int32 without overflow

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.0/foss-2022b/pytorch-v2.1.0/test/test_sparse_csr.py", line 868, in test_invalid_input_csr_large
    with self.assertRaisesRegex(RuntimeError, '32-bit integer overflow in nnz'):
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 239, in __exit__
    self._raiseFailure('"{}" does not match "{}"'.format(
  File "/scratch-nvme/1/casparl/generic/software/Python/3.10.8-GCCcore-12.2.0/lib/python3.10/unittest/case.py", line 163, in _raiseFailure
    raise self.test_case.failureException(msg)
AssertionError: "32-bit integer overflow in nnz" does not match "value cannot be converted to type int32 without overflow"

To execute this test, run the following from the base repo dir:
     python test/test_sparse_csr.py -k test_invalid_input_csr_large_cpu

This message can be suppressed by setting PYTORCH_PRINT_REPRO_ON_FAILURE=0
=================================== short test summary info ===================================
FAILED [17.6295s] test_sparse_csr.py::TestSparseCompressedCPU::test_invalid_input_csr_large_cpu
============== 1 failed, 3997 passed, 671 skipped, 2 rerun in 318.02s (0:05:18) ===============

@Flamefire
Contributor Author

Test report by @Flamefire
FAILED
Build succeeded for 1 out of 2 (2 easyconfigs in total)
n1504 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/727b6ed8d7c59fa1f87a384a11ab6e21 for a full test report.

@boegelbot
Collaborator

@Flamefire: Tests failed in GitHub Actions, see https://github.com/easybuilders/easybuild-easyconfigs/actions/runs/7141071902
Output from first failing test suite run:

FAIL: test_dep_versions_per_toolchain_generation (test.easyconfigs.easyconfigs.EasyConfigTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/home/runner/work/easybuild-easyconfigs/easybuild-easyconfigs/test/easyconfigs/easyconfigs.py", line 898, in test_dep_versions_per_toolchain_generation
    self.assertFalse(multi_dep_vars, error_msg)
AssertionError: ['Z3'] is not false : No multi-variant deps found for '^.*-(?P<tc_gen>20(1[89]|[2-9][0-9])[ab]).*\.eb$' easyconfigs:

found 2 variants of 'Z3' dependency in easyconfigs using '2022b' toolchain generation
* version: 4.12.2; versionsuffix:  as dep for {'leidenalg-0.10.1-foss-2022b.eb', 'python-igraph-0.10.6-foss-2022b.eb'}
* version: 4.12.2; versionsuffix: -Python-3.10.8 as dep for {'PyTorch-2.1.0-foss-2022b.eb'}


----------------------------------------------------------------------
Ran 18629 tests in 702.547s

FAILED (failures=1)
ERROR: Not all tests were successful

bleep, bloop, I'm just a bot (boegelbot v20200716.01)
Please talk to my owner @boegel if you notice me acting stupid,
or submit a pull request to https://github.com/boegel/boegelbot to fix the problem.
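The failing check complains that, within one toolchain generation, the same dependency is used with more than one (version, versionsuffix) combination. A rough sketch of that idea, using the three easyconfigs from the message as input; `find_multi_variant_deps` and its input format are hypothetical and not the actual test code:

```python
from collections import defaultdict


def find_multi_variant_deps(deps_by_easyconfig):
    """Map each dependency name to its variants if more than one is in use.

    deps_by_easyconfig: {easyconfig filename: [(name, version, versionsuffix), ...]}
    """
    variants = defaultdict(lambda: defaultdict(set))
    for easyconfig, deps in deps_by_easyconfig.items():
        for name, version, versionsuffix in deps:
            variants[name][(version, versionsuffix)].add(easyconfig)
    return {
        name: {variant: sorted(users) for variant, users in by_variant.items()}
        for name, by_variant in variants.items()
        if len(by_variant) > 1
    }


# Data taken from the failure message above.
deps = {
    "leidenalg-0.10.1-foss-2022b.eb": [("Z3", "4.12.2", "")],
    "python-igraph-0.10.6-foss-2022b.eb": [("Z3", "4.12.2", "")],
    "PyTorch-2.1.0-foss-2022b.eb": [("Z3", "4.12.2", "-Python-3.10.8")],
}
print(find_multi_variant_deps(deps))
```

Here `Z3` is flagged because the PyTorch easyconfig adds a `-Python-3.10.8` versionsuffix that the other two do not use.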

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
i8028 - Linux Rocky Linux 8.7, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.6.8
See https://gist.github.com/Flamefire/0ed6380ce40f2e5e8be1ce10b1fa2d09 for a full test report.

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
n1010 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/31409aac5e05f8de0ced205ecb45a873 for a full test report.

@branfosj
Member

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=19087 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12426

Test results coming soon (I hope)...

- notification for comment with ID 1859087036 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

@boegelbot please test @ jsc-zen2
CORE_CNT=16

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=19087 EB_ARGS= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_19087 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3928

Test results coming soon (I hope)...

- notification for comment with ID 1859088543 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member

Test report by @branfosj
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
bear-pg0105u03a - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/7cf747da977dc5f6103e08e8fc69d74a for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
cnx2 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/82a77e22c44ad74197256383c0855ffd for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/d4e6869b0e3863732220ab44a28506ce for a full test report.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/d8a69178a947e149a203664fbac1d8df for a full test report.

@VRehnberg
Contributor

VRehnberg commented Dec 18, 2023

Test report by @VRehnberg
FAILED
Failed during parsing of the easyconfigs, so no ecs were built (3 easyconfigs in total)
alvis-skylake-build - Linux Rocky Linux 8.8, x86_64, Intel Xeon Processor (Skylake, IBRS, no TSX), Python 3.6.8
See https://gist.github.com/VRehnberg/4907a58e7cba019a236948c0ac4cd6af for a full test report.


Missing OS deps

@VRehnberg
Contributor

Test report by @VRehnberg
FAILED
Build succeeded for 2 out of 3 (3 easyconfigs in total)
alvis-cpu1 - Linux Rocky Linux 8.8, x86_64, Intel(R) Xeon(R) Gold 6226R CPU @ 2.90GHz, Python 3.6.8
See https://gist.github.com/VRehnberg/322cd2dc533d7e5b4cfa0aa1093b122f for a full test report.

@Flamefire
Contributor Author

@VRehnberg

distributed/test_c10d_common 1/1 (5452 passed, 135 skipped)
distributed/test_c10d_common 1/1 (1 unit test(s) failed)
distributed/test_c10d_gloo 1/1 (1 unit test(s) failed)
+ distributed/test_functional_api 1/1 (at easybuild/easyblocks/p/pytorch.py:431 in test_step)

That looks odd. I remember having seen the c10d failures too, but not recently. More confusingly, one of them is listed twice. Are you using the latest develop easyblock?

Also I don't know what has failed in test_functional_api as I haven't seen that before. Could you upload the log please?

@VRehnberg
Contributor

@Flamefire Ah, no, this picked up the default easyblocks for eb 4.8.2 instead.

Here are the logs for a successful build and a failed build, respectively, if you're still interested:
easybuild-PyTorch-2.1.0-20231219.023613.log.gz
easybuild-PyTorch-2.1.0-20231219.103033.ygVBf.log.gz

@akesandgren
Contributor

FYI, running this with CUDA enabled (and the necessary packages) results in the following failed tests:

test_jit 1/1 failed!
test_ops 1/1 failed!
test_optim 1/1 failed!
distributed/_tensor/test_dtensor_ops 1/1 failed!
distributed/fsdp/test_fsdp_flatten_params 1/1 failed!
nn/test_convolution 1/1 failed!
test_cpp_extensions_aot_ninja 1/1 failed!
test_cpp_extensions_aot_no_ninja 1/1 failed!
test_jit_legacy 1/1 failed!
test_jit_profiling 1/1 failed!
test_nn 1/1 failed!

So not that bad...

(Didn't have the "Skip flaky test in test_nn" fix in my version when doing this so that one might be fixed already)

@Flamefire
Contributor Author

> (Didn't have the "Skip flaky test in test_nn" fix in my version when doing this so that one might be fixed already)

I'm wondering if that is fixed in 2.1.2. Does 2.1.0 fail for you too in test_nn? Can you try #19445 and check the log for that test too?

BTW: I have the CUDA versions prepared locally and am running them too (still waiting for the results though) but we need the CPU versions ready first in order to reduce the failures and address them individually. That's why I haven't uploaded them yet.

@akesandgren
Contributor

I've just restarted a test report for this PR. As you can see above, my previous build (prior to the test_nn fix) didn't see any problems.
I'll try to get a 2.1.2 build running later.

I just wanted to see how good/bad the CUDA version would be based on this PR, and it looks fairly good.

@Flamefire
Contributor Author

Flamefire commented Dec 20, 2023

> @Flamefire Ah, no, this picked up the default easyblocks for eb 4.8.2 instead.
>
> Here are logs for a successful build and a failed build respectively if you're still interested easybuild-PyTorch-2.1.0-20231219.023613.log.gz easybuild-PyTorch-2.1.0-20231219.103033.ygVBf.log.gz

The test_c10d tests seem to always fail, not because of any real issue but because they are seemingly run in a Slurm environment:

The application appears to have been direct launched using "srun",
but OMPI was not built with SLURM's PMI support and therefore cannot
execute [...]

So don't run it in a Slurm job environment; "escape" it either with ssh $SLURM_NODELIST (on a single-node allocation) or with:

# Unset every SLURM_* variable so the job no longer looks like a Slurm launch
for i in $(env | grep ^SLURM_ | cut -f1 -d=); do
  unset "$i"
done

Both work for me.
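If the tests are started from a Python wrapper instead of a shell, the same "escape" can be done by passing a filtered environment to the child process. A minimal sketch with a placeholder child command; `without_slurm_vars` is a hypothetical helper:

```python
import os
import subprocess
import sys


def without_slurm_vars(env):
    """Return a copy of env with every variable starting with SLURM_ removed."""
    return {key: value for key, value in env.items() if not key.startswith("SLURM_")}


clean_env = without_slurm_vars(os.environ)
# Placeholder child command: it only reports how many SLURM_* variables it sees.
child = "import os; print(sum(k.startswith('SLURM_') for k in os.environ))"
subprocess.run([sys.executable, "-c", child], env=clean_env, check=True)
```

The parent process keeps its own environment untouched; only the child runs without the Slurm variables.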

test_functional_api might be the same issue: test_find_or_create_pg fails after a timeout (300s). But it might also be a buggy test as I found that it was disabled in October on the PyTorch CI: pytorch/pytorch#107278

If it still fails outside the Slurm env I can add a patch to skip this test.

@akesandgren

> I've just restarted a test report for this PR. As you can see above my previous build (prior to the test_nn fix) didn't see any problems.

Once it is done, please also check the build log to see whether that test failed even though the test report is OK. We allow some tests to fail, so it might report SUCCESS even though that specific test failed.
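A small script can do that check on a saved log. This is only a sketch: it assumes the log contains lines in the `<suite> 1/1 failed!` form quoted earlier in this thread, which may not hold for every easyblock version, and `failed_suites` is a hypothetical helper:

```python
import re

# Matches lines such as "test_nn 1/1 failed!" as quoted in this thread.
FAILED_RE = re.compile(r"^(?P<suite>\S+) \d+/\d+ failed!$")


def failed_suites(log_text):
    """Return the suite names that appear on '<suite> N/M failed!' lines."""
    suites = []
    for line in log_text.splitlines():
        match = FAILED_RE.match(line.strip())
        if match:
            suites.append(match.group("suite"))
    return suites


sample = """\
test_optim 1/1 failed!
nn/test_convolution 1/1 failed!
test_nn 1/1 failed!
"""
print("test_nn" in failed_suites(sample))
```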

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 525.147.05, Python 3.10.12
See https://gist.github.com/akesandgren/c4cfaf764648a753db250189e7478eb2 for a full test report.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 3 out of 3 (3 easyconfigs in total)
gcn6.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz, 4 x NVIDIA NVIDIA A100-SXM4-40GB, 535.104.12, Python 3.6.8
See https://gist.github.com/casparvl/48361cb9a86eaf4c69d163bbb48c7454 for a full test report.

@Flamefire
Contributor Author

Closing this after 2.1.2 in #19445 has been merged

@Flamefire Flamefire closed this Dec 27, 2023
@Flamefire Flamefire deleted the 20231026142937_new_pr_PyTorch210 branch December 27, 2023 10:08