
{tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... #19184

Closed

Conversation

schiotz
Contributor

@schiotz schiotz commented Nov 9, 2023

(created using eb --new-pr)

Now requires that GCC has been rebuilt with PR #19253 (and #19180), fixing compiler bugs.

….1.0-GCCcore-12.3.0.eb, pytest-shard-0.1.2-GCCcore-12.3.0.eb, z3-solver-4.12.2.0-GCC-12.3.0.eb and patches: PyTorch-2.0.1_workaround-gcc12-avx512-optimizer-bug.patch, PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch, PyTorch-2.1.0_disable-gcc12-warning.patch, PyTorch-2.1.0_fix-test_invalid_input_csr_large.patch, PyTorch-2.1.0_fix-test_numpy_torch_operators.patch, PyTorch-2.1.0_fix-test-tolerances.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.1.0_skip-diff-test-on-ppc.patch, PyTorch-2.1.0_skip-float16-part-test_batchnorm_nhwc_cpu.patch, PyTorch-2.1.0_skip-test_indexing_duplicates.patch, PyTorch-2.1.0_skip_test_linear_fp32.patch
@schiotz
Contributor Author

schiotz commented Nov 9, 2023

Many of these patches are taken from PR #19087 by @Flamefire

Note that the test suite takes forever to run (like 8 hours), and that it sometimes hangs in the distributed/test_functional_api.py test.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 19 out of 19 (4 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/c254c628e6883d8133d9f250f4d48941 for a full test report.

@schiotz schiotz requested a review from Flamefire November 11, 2023 22:40
-if not (TEST_WITH_DEV_DBG_ASAN or IS_WINDOWS or IS_MACOS or IS_CI):
+# Test disabled for EasyBuild, it fails with python_std.fileno() raising
+# io.UnsupportedOperation: fileno
+if False:
Contributor

This part of the patch doesn't match the description and the rest. Might be worth doing that in a separate test with a description and if possible stacktrace of the error. Actually a stacktrace including the diff exceeding the tolerance would also be good here such that one can find this patch if one encounters the error without this.
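For reference, replacing a test guard with `if False:` hides the reason at runtime, whereas a skip with a recorded reason keeps the explanation visible in the test output. A minimal sketch, with an illustrative test class and reason string (not the real PyTorch test names):

```python
import io
import sys
import unittest

class FunctionalApiTests(unittest.TestCase):
    # Illustrative stand-in for the patched test: skipping with a reason
    # records the explanation in the test results, unlike `if False:`.
    @unittest.skip("sys.stdout.fileno() raises io.UnsupportedOperation under EasyBuild")
    def test_fileno_dependent(self):
        sys.stdout.fileno()

suite = unittest.defaultTestLoader.loadTestsFromTestCase(FunctionalApiTests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped))  # one recorded skip, with its reason preserved
```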

Contributor Author

Oops, I messed up when I tried to use a local git repo to keep track of the patches I made. I have separated it into two patches.

As for including stack traces: I assume you mean including the stack trace in the text at the top of the patch. I have not seen that done before, but it certainly makes a lot of sense; I can see it will make it easier for you or others to evaluate whether to include the patch in future versions. Stack traces added.

Thank you @Flamefire for your review!

Contributor

Thank you!
And yes, this hasn't been done consistently before. At least in more recent patches I included either the (reduced) stack trace, i.e. condensed to just enough to be discoverable, once I noticed the failures are pretty much the same, e.g.

AssertionError: Tensor-likes are not close!
Mismatched elements: 4 / 16 (25.0%)
Greatest absolute difference: 1.0 at index (1, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.3333333432674408 at index (1, 0) (up to 1.3e-06 allowed)

Together with the name of the test this is often already enough.
Alternatively, as I usually report each issue in the PyTorch GitHub repo, I link that with a very short description, as the issue contains much more complete information already.
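For context, the "allowed" numbers in a trace like the one above come from an element-wise tolerance check of the form |actual - expected| <= atol + rtol * |expected|. A minimal sketch in plain Python, with the rtol/atol defaults taken from the trace above (1.3e-06 relative, 1e-05 absolute):

```python
# Sketch of the element-wise tolerance check behind "Tensor-likes are not
# close!" failures: an element passes if
#   |actual - expected| <= atol + rtol * |expected|
def within_tolerance(actual, expected, rtol=1.3e-06, atol=1e-05):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# An absolute difference of 1.0 (as reported in the trace above) is far
# outside the allowed band, while a tiny rounding error is not.
print(within_tolerance(4.0, 3.0))        # False: diff of 1.0 exceeds ~1.4e-05
print(within_tolerance(3.0000001, 3.0))  # True: within the allowed band
```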

@@ -0,0 +1,19 @@
Skip test test_indexing_duplicates.
This failure is a bit worrying, and not understood.
Contributor

At least including the error trace would be good for further investigation and discoverability of the need for this patch

Contributor Author

Stack trace added here, too.

Contributor

@Flamefire Flamefire left a comment

Looks good to me but I'd like to suggest adding the error traces to the patches skipping/fixing test failures or linking to an upstream issue with further information. This helps with finding what is required for future releases, understanding new and old failures, and tracking later on, especially when patches need to be adapted for future versions or could be removed.

@schiotz
Contributor Author

schiotz commented Nov 13, 2023

Looks good to me but I'd like to suggest adding the error traces to the patches skipping/fixing test failures or linking to an upstream issue with further information.

I have added stack traces to the patches you commented on, but I should of course also have done that for the remaining patches. I'll do that in a few hours and update once more.

@schiotz
Contributor Author

schiotz commented Nov 13, 2023

@Flamefire
Thank you once again for reviewing this. I have now updated all the patches with stack traces.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/4b3e738c8426354fb0f74116356026bc for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 33 out of 34 (4 easyconfigs in total)
taurusi8022 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/4824f53451827cc7e0bf98fe2037f443 for a full test report.

@Flamefire
Contributor

I'm seeing similar failures as @casparvl reported at #19087 (comment)

Especially dynamo/test_dynamic_shapes:

_____ DynamicShapesExportTests.test_predispatch_with_for_out_dtype_nested_dynamic_shapes ______
...
raise DataDependentOutputException(func)
torch._subclasses.fake_tensor.DataDependentOutputException: aten.allclose.default

To execute this test, run the following from the base repo dir:
python testing.py -k test_predispatch_with_for_out_dtype_nested_dynamic_shapes

I can easily reproduce this and it seems to be known: pytorch/pytorch#107980 and point 2 here

I'm surprised nobody else sees that, especially as it reproduces easily when I run it. (There is an ordering dependency: if torch.dynamo.export is called twice, e.g. by the two related tests test_predispatch_with_for_out_dtype_nested and test_predispatch_with_for_out_dtype_nested_dynamic_shapes, it happens; it does not for a single run.)

Maybe we can just disable one of them; I will look into it.
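The ordering dependency described above is the classic leaked-state pattern: the first call leaves something behind (e.g. a cache) that changes the code path of the second call. A minimal sketch; the cache, function, and exception message are illustrative stand-ins, not PyTorch internals:

```python
# Illustrative only: module-level state left behind by the first call
# changes the behaviour of the second call, so the failure appears only
# when both tests run in the same process, never in a single run.
_export_cache = {}

def fake_export(fn, dynamic_shapes=False):
    if fn.__name__ in _export_cache:  # second call sees leftover state
        raise RuntimeError("aten.allclose.default")  # stand-in for the real error
    _export_cache[fn.__name__] = dynamic_shapes
    return "ok"

def model():
    pass

print(fake_export(model))  # first test in the process: passes
try:
    fake_export(model, dynamic_shapes=True)  # second test: fails
except RuntimeError as err:
    print("failed:", err)
```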

@schiotz
Contributor Author

schiotz commented Nov 16, 2023

I must admit that I am beginning to doubt the value of running the PyTorch self test. It seems to be very flaky, and we mostly react to errors by disabling the tests (although admittedly @Flamefire seems to have found and fixed a GCC bug!).

I have tried to expand this easyconfig to a CUDA version of PyTorch, but I get tons of errors, mostly of the kinds that hint at errors with the test suite itself:

  • RuntimeError: DeviceMesh only support homogeneous hardware, but found 4 ranks and 10 cuda devices!
  • RuntimeError: failed to run: None, with (*[(), ()], **{'requires_grad': True, 'dtype': torch.float32, 'device': 'cpu'}) (hmm...)
  • RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel
  • RuntimeError: Mesh should not be bigger than default world size, but found 4 ranks!
  • PML cm cannot be selected

The machine I used has 80 cores and 10 GPUs, I do not know why it chose 4 ranks.

I also see the aten.allclose.default error you report above, and there are a worrying number of numerical failures, including CPU tests that fail although they passed without CUDA. I'll give up on that one, and try to work with a pip-installed pytorch in a venv (and hope for the best).

@Flamefire
Contributor

I must admit that I am beginning to doubt the value of running the PyTorch self test. It seems to be very flaky, and we mostly react to errors by disabling the tests (although admittedly @Flamefire seems to have found and fixed a GCC bug!).

I do agree that the PyTorch test suite is annoyingly unstable. However, it serves as a test suite for most of the software stack. By working through those failures, assuming they must have passed at some point, bugs in our configuration (e.g. wrong Python or Python package versions) and in other software (GCC & OpenBLAS came up a few times) become visible. And while likely half of the patches simply skip tests or increase tolerances such that they pass, some actually fix issues, e.g. PyTorch-2.0.1_fix-ub-in-inductor-codegen, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch

@schiotz
Contributor Author

schiotz commented Nov 16, 2023

I see the value of doing this, and I very much appreciate what you are doing. I guess I am just frustrated because I feel I have hit a wall where making any progress is no longer possible for me. :-)

@Flamefire
Contributor

I feel you... It's so hard to find out whether a failure is due to a wrong test, a different platform/architecture, a different compiler, a flake, or just a bug in PyTorch itself... And with each release you get new stuff failing, making it like playing whack-a-mole, just without the fun -.-

FWIW: PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is no longer required as I added the new patch for GCC at #19253 :)

@schiotz
Contributor Author

schiotz commented Nov 17, 2023

I feel you... It's so hard to find out whether a failure is due to a wrong test, a different platform/architecture, a different compiler, a flake, or just a bug in PyTorch itself... And with each release you get new stuff failing, making it like playing whack-a-mole, just without the fun -.-

And I am not a computer specialist, I am a computational physicist and should use my time on that :-)

FWIW: PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is no longer required as I added the new patch for GCC at #19253 :)

Well, at least that mole was properly whacked! I'll remove it and rebuild GCC.

@boegel boegel changed the title {tools}[GCCcore/12.3.0] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... {tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... Nov 21, 2023
@Flamefire
Contributor

Flamefire commented Dec 1, 2023

While working on this I noticed the Z3-solver EC.
We already have ECs for Z3 (the C++ library), and even for Z3 with the Python bindings by @lexming, which massage the pip-installed package to be usable from C++/C.
I have seen that we can even use CMakePythonPackage to install Z3 and the Python bindings.
We should decide on which way we want to go.
I see that the CMake-installed version integrates nicely with CMake, while the PyPI variant integrates nicely with pip. But the other side always seems to have a slight disadvantage (a missing CMake config file, or not showing up in "pip list").
So maybe we need both Z3 (C++/CMake) and Z3-solver, but that would build the library twice, possibly requiring duplicated flags and patches, with potentially incompatible libs when both modules are used together.

Best would be a pure Python package depending on the C++ module, similar to e.g. Protobuf. Got to check if that's possible.

Still: what to do with the different variants of the EC we have? (currently with and without the python suffix) @boegel

Note that the software is named Z3, but the Python package is named z3-solver and imported as z3 (there is another existing, unrelated z3 Python package on PyPI).

Edit: IMO the current approach of having a Z3-*-Python module is the least bad option until Z3Prover/z3#7035 gets resolved, allowing the Python bindings to be installed without rebuilding Z3. I have hence opened #19348 adding the ECs for 12.2 & 12.3 (2022b/2023a).
Having this as a PythonBundle makes "z3-solver" show up as an extension in the module list/search, when enabled and supported by the modules tool.

@Flamefire
Contributor

@schiotz I have added some more patches in the other 2.1.0 PRs and prepared a branch I'll PR against yours to add those and some other changes to get this further. This keeps you as the author for most stuff at least. If you don't want that you can close this PR and I'll open a new one for 2023a 2.1.0 with my changes.
