
{tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... #19184

Closed

Conversation

schiotz
Contributor

@schiotz schiotz commented Nov 9, 2023

(created using eb --new-pr)

Now requires that GCC has been rebuilt with PR #19253 (and #19180), fixing compiler bugs.

….1.0-GCCcore-12.3.0.eb, pytest-shard-0.1.2-GCCcore-12.3.0.eb, z3-solver-4.12.2.0-GCC-12.3.0.eb and patches: PyTorch-2.0.1_workaround-gcc12-avx512-optimizer-bug.patch, PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch, PyTorch-2.1.0_disable-gcc12-warning.patch, PyTorch-2.1.0_fix-test_invalid_input_csr_large.patch, PyTorch-2.1.0_fix-test_numpy_torch_operators.patch, PyTorch-2.1.0_fix-test-tolerances.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.1.0_skip-diff-test-on-ppc.patch, PyTorch-2.1.0_skip-float16-part-test_batchnorm_nhwc_cpu.patch, PyTorch-2.1.0_skip-test_indexing_duplicates.patch, PyTorch-2.1.0_skip_test_linear_fp32.patch
@schiotz
Contributor Author

schiotz commented Nov 9, 2023

Many of these patches are taken from PR #19087 by @Flamefire

Note that the test suite takes forever to run (like 8 hours), and that it sometimes hangs in the distributed/test_functional_api.py test.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 19 out of 19 (4 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/c254c628e6883d8133d9f250f4d48941 for a full test report.

@schiotz schiotz requested a review from Flamefire November 11, 2023 22:40
-if not (TEST_WITH_DEV_DBG_ASAN or IS_WINDOWS or IS_MACOS or IS_CI):
+# Test disabled for EasyBuild, it fails with python_std.fileno() raising
+# io.UnsupportedOperation: fileno
+if False:
Contributor

This part of the patch doesn't match the description and the rest. Might be worth doing that in a separate test with a description and if possible stacktrace of the error. Actually a stacktrace including the diff exceeding the tolerance would also be good here such that one can find this patch if one encounters the error without this.
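For reference, replacing a test guard with `if False:` hides the reason at runtime, whereas a skip with a recorded reason keeps the explanation visible in the test output. A minimal sketch, with an illustrative test class and reason string (not the real PyTorch test names):

```python
import io
import sys
import unittest

class FunctionalApiTests(unittest.TestCase):
    # Illustrative stand-in for the patched test: skipping with a reason
    # records the explanation in the test results, unlike `if False:`.
    @unittest.skip("sys.stdout.fileno() raises io.UnsupportedOperation under EasyBuild")
    def test_fileno_dependent(self):
        sys.stdout.fileno()

suite = unittest.defaultTestLoader.loadTestsFromTestCase(FunctionalApiTests)
result = unittest.TextTestRunner(stream=io.StringIO()).run(suite)
print(len(result.skipped))  # one recorded skip, with its reason preserved
```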

Contributor Author

Oops, I messed up when I tried to use a local git repo to keep track of the patches I made. I have separated it into two patches.

As for including stack traces: I assume you mean including the stack trace in the text at the top of the patch. I have not seen that done before, but it certainly makes a lot of sense; I can see it will make it easier for you or others to evaluate whether to include the patch in future versions. Stack traces added.

Thank you @Flamefire for your review!

Contributor

Thank you!
And yes, this hasn't been done consistently before. At least in more recent patches I included either the (reduced) stack trace, i.e. condensed to just enough to be discoverable, once I noticed the failures are pretty much the same, e.g.

AssertionError: Tensor-likes are not close!
Mismatched elements: 4 / 16 (25.0%)
Greatest absolute difference: 1.0 at index (1, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.3333333432674408 at index (1, 0) (up to 1.3e-06 allowed)

Together with the name of the test this is often already enough.
Alternatively, as I usually report each issue in the PyTorch GitHub repo, I link that with a very short description, as the issue contains much more complete information already.
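For context, the "allowed" numbers in a trace like the one above come from an element-wise tolerance check of the form |actual - expected| <= atol + rtol * |expected|. A minimal sketch in plain Python, with the rtol/atol defaults taken from the trace above (1.3e-06 relative, 1e-05 absolute):

```python
# Sketch of the element-wise tolerance check behind "Tensor-likes are not
# close!" failures: an element passes if
#   |actual - expected| <= atol + rtol * |expected|
def within_tolerance(actual, expected, rtol=1.3e-06, atol=1e-05):
    return abs(actual - expected) <= atol + rtol * abs(expected)

# An absolute difference of 1.0 (as reported in the trace above) is far
# outside the allowed band, while a tiny rounding error is not.
print(within_tolerance(4.0, 3.0))        # False: diff of 1.0 exceeds ~1.4e-05
print(within_tolerance(3.0000001, 3.0))  # True: within the allowed band
```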

@@ -0,0 +1,19 @@
Skip test test_indexing_duplicates.
This failure is a bit worrying, and not understood.
Contributor

At least including the error trace would be good for further investigation and discoverability of the need for this patch

Contributor Author

Stack trace added here, too.

Contributor

@Flamefire Flamefire left a comment

Looks good to me but I'd like to suggest adding the error traces to the patches skipping/fixing test failures or linking to an upstream issue with further information. This helps with finding what is required for future releases, understanding new and old failures, and tracking later on, especially when patches need to be adapted for future versions or could be removed.

@schiotz
Contributor Author

schiotz commented Nov 13, 2023

Looks good to me but I'd like to suggest adding the error traces to the patches skipping/fixing test failures or linking to an upstream issue with further information.

I have added stack traces to the patches you commented on, but I should of course also have done that for the remaining patches. I'll do that in a few hours and update once more.

@schiotz
Contributor Author

schiotz commented Nov 13, 2023

@Flamefire
Thank you once again for reviewing this. I have now updated all the patches with stack traces.

@casparvl
Contributor

Test report by @casparvl
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/4b3e738c8426354fb0f74116356026bc for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 33 out of 34 (4 easyconfigs in total)
taurusi8022 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/Flamefire/4824f53451827cc7e0bf98fe2037f443 for a full test report.

@Flamefire
Contributor

I'm seeing similar failures as @casparvl reported at #19087 (comment)

Especially dynamo/test_dynamic_shapes:

_____ DynamicShapesExportTests.test_predispatch_with_for_out_dtype_nested_dynamic_shapes ______
...
raise DataDependentOutputException(func)
torch._subclasses.fake_tensor.DataDependentOutputException: aten.allclose.default

To execute this test, run the following from the base repo dir:
python testing.py -k test_predispatch_with_for_out_dtype_nested_dynamic_shapes

I can easily reproduce this and it seems to be known: pytorch/pytorch#107980 and point 2 here

I'm surprised nobody else sees that, especially as it reproduces easily when I run it. (There is an ordering dependency: if torch.dynamo.export is called twice, e.g. by the two related tests test_predispatch_with_for_out_dtype_nested and test_predispatch_with_for_out_dtype_nested_dynamic_shapes, it happens; it does not for a single run.)

Maybe we can just disable one of them; I will look into it.
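The ordering dependency described above is the classic leaked-state pattern: the first call leaves something behind (e.g. a cache) that changes the code path of the second call. A minimal sketch; the cache, function, and exception message are illustrative stand-ins, not PyTorch internals:

```python
# Illustrative only: module-level state left behind by the first call
# changes the behaviour of the second call, so the failure appears only
# when both tests run in the same process, never in a single run.
_export_cache = {}

def fake_export(fn, dynamic_shapes=False):
    if fn.__name__ in _export_cache:  # second call sees leftover state
        raise RuntimeError("aten.allclose.default")  # stand-in for the real error
    _export_cache[fn.__name__] = dynamic_shapes
    return "ok"

def model():
    pass

print(fake_export(model))  # first test in the process: passes
try:
    fake_export(model, dynamic_shapes=True)  # second test: fails
except RuntimeError as err:
    print("failed:", err)
```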

@schiotz
Contributor Author

schiotz commented Nov 16, 2023

I must admit that I am beginning to doubt the value of running the PyTorch self test. It seems to be very flaky, and we mostly react to errors by disabling the tests (although admittedly @Flamefire seems to have found and fixed a GCC bug!).

I have tried to expand this easyconfig to a CUDA version of PyTorch, but I get tons of errors, mostly of the kinds that hint at errors with the test suite itself:

  • RuntimeError: DeviceMesh only support homogeneous hardware, but found 4 ranks and 10 cuda devices!
  • RuntimeError: failed to run: None, with (*[(), ()], **{'requires_grad': True, 'dtype': torch.float32, 'device': 'cpu'}) (hmm...)
  • RuntimeError: Error(s) in loading state_dict for FullyShardedDataParallel
  • RuntimeError: Mesh should not be bigger than default world size, but found 4 ranks!
  • PML cm cannot be selected

The machine I used has 80 cores and 10 GPUs, I do not know why it chose 4 ranks.

I also see the aten.allclose.default error you report above, and there are a worrying number of numerical failures, including CPU tests that fail although they passed without CUDA. I'll give up on that one, and try to work with a pip-installed pytorch in a venv (and hope for the best).

@Flamefire
Contributor

I must admit that I am beginning to doubt the value of running the PyTorch self test. It seems to be very flaky, and we mostly react to errors by disabling the tests (although admittedly @Flamefire seems to have found and fixed a GCC bug!).

I do agree that the PyTorch test suite is annoyingly unstable. However, it serves as a test suite for most of the software stack. By working through those failures, assuming they must have passed at some point, bugs in our configuration (e.g. wrong Python or Python package versions) and in other software (GCC & OpenBLAS came up a few times) become visible. And while likely half of the patches simply skip tests or increase tolerances such that they pass, some actually fix issues, e.g. PyTorch-2.0.1_fix-ub-in-inductor-codegen, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch, PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch

@schiotz
Contributor Author

schiotz commented Nov 16, 2023

I see the value of doing this, and I very much appreciate what you are doing. I guess I am just frustrated because I feel I have hit a wall where making any progress is no longer possible for me. :-)

@Flamefire
Contributor

I feel you... It's so hard to find out whether a failure is due to a wrong test, a different platform/architecture, a different compiler, a flake, or just a bug in PyTorch itself... And with each release you get new stuff failing, making it like playing whack-a-mole, just without the fun -.-

FWIW: PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is no longer required as I added the new patch for GCC at #19253 :)

@schiotz
Contributor Author

schiotz commented Nov 17, 2023

I feel you... It's so hard to find out whether a failure is due to a wrong test, a different platform/architecture, a different compiler, a flake, or just a bug in PyTorch itself... And with each release you get new stuff failing, making it like playing whack-a-mole, just without the fun -.-

And I am not a computer specialist, I am a computational physicist and should use my time on that :-)

FWIW: PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is no longer required as I added the new patch for GCC at #19253 :)

Well, at least that mole was properly whacked! I'll remove it and rebuild GCC.

@boegel boegel changed the title {tools}[GCCcore/12.3.0] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... {tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... Nov 21, 2023
@Flamefire
Contributor

Flamefire commented Dec 1, 2023

While working on this I noticed the Z3-solver EC.
We already have ECs for Z3 (the C++ library), and even for Z3 with the Python bindings by @lexming, which massage the pip-installed package to be usable from C++/C.
I have seen that we can even use CMakePythonPackage to install Z3 and the Python bindings.
We should decide on which way we want to go.
I see that the CMake-installed version integrates nicely with CMake, while the PyPI variant integrates nicely with pip. But the other side always seems to have a slight disadvantage (a missing CMake config file, or not showing up in "pip list").
So maybe we need both Z3 (C++/CMake) and Z3-solver, but that would build the library twice, possibly requiring duplicated flags and patches, with potentially incompatible libs when both modules are used together.

Best would be a pure Python package depending on the C++ module, similar to e.g. Protobuf. Got to check if that's possible.

Still: what to do with the different variants of the EC we have? (currently with and without the python suffix) @boegel

Note that the software is named Z3, but the Python package is named z3-solver and imported as z3 (there is another existing, unrelated z3 Python package on PyPI).

Edit: IMO the current approach of having a Z3-*-Python module is the least bad option until Z3Prover/z3#7035 gets resolved, allowing the Python bindings to be installed without rebuilding Z3. I have hence opened #19348 adding the ECs for 12.2 & 12.3 (2022b/2023a).
Having this as a PythonBundle makes "z3-solver" show up as an extension in the module list/search, when enabled and supported by the modules tool.

@Flamefire
Contributor

@schiotz I have added some more patches in the other 2.1.0 PRs and prepared a branch I'll PR against yours to add those and some other changes to get this further. This keeps you as the author for most stuff at least. If you don't want that you can close this PR and I'll open a new one for 2023a 2.1.0 with my changes.
