-
Notifications
You must be signed in to change notification settings - Fork 705
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
{tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... #19184
{tools}[foss/2023a] PyTorch v2.1.0, pytest-flakefinder v1.1.0, pytest-shard v0.1.2, ... #19184
Conversation
….1.0-GCCcore-12.3.0.eb, pytest-shard-0.1.2-GCCcore-12.3.0.eb, z3-solver-4.12.2.0-GCC-12.3.0.eb and patches: PyTorch-2.0.1_workaround-gcc12-avx512-optimizer-bug.patch, PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch, PyTorch-2.1.0_disable-gcc12-warning.patch, PyTorch-2.1.0_fix-test_invalid_input_csr_large.patch, PyTorch-2.1.0_fix-test_numpy_torch_operators.patch, PyTorch-2.1.0_fix-test-tolerances.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch, PyTorch-2.1.0_remove-test-requiring-online-access.patch, PyTorch-2.1.0_skip-diff-test-on-ppc.patch, PyTorch-2.1.0_skip-float16-part-test_batchnorm_nhwc_cpu.patch, PyTorch-2.1.0_skip-test_indexing_duplicates.patch, PyTorch-2.1.0_skip_test_linear_fp32.patch
Many of these patches are taken from PR #19087 by @Flamefire Note that the test suite takes forever to run (like 8 hours), and that it sometimes hangs in the |
Test report by @casparvl |
-if not (TEST_WITH_DEV_DBG_ASAN or IS_WINDOWS or IS_MACOS or IS_CI): | ||
+# Test disabled for EasyBuild, it fails with python_std.fileno() raising | ||
+# io.UnsupportedOperation: fileno | ||
+if False: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This part of the patch doesn't match the description and the rest. Might be worth doing that in a separate test with a description and if possible stacktrace of the error. Actually a stacktrace including the diff exceeding the tolerance would also be good here such that one can find this patch if one encounters the error without this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops, I have messed up when I tried to use a local git repo to keep track of the patches I made. I have separated it into two patches.
As for including stack traces: I assume you mean including the stack trace in the text at the top of the patch. I have not seen that done before, but it certainly makes a lot of sense, I can see it will make it easier for you or others to evaluate whether to include the patch in future versions. Stack traces added.
Thank you @Flamefire for your review!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
And yes this hasn't been done consistently before. At least in more recent patches I included either the (reduced) stacktrace, i.e. condensed to enough to be discoverable after I noticed the failures are pretty much the same, like e.g.
AssertionError: Tensor-likes are not close!
Mismatched elements: 4 / 16 (25.0%)
Greatest absolute difference: 1.0 at index (1, 0) (up to 1e-05 allowed)
Greatest relative difference: 0.3333333432674408 at index (1, 0) (up to 1.3e-06 allowed)
Together with the name of the test this is often already enough.
Alternatively as I usually report each issue in the PyTorch Github repo I link that with a very short description as the issue contains much more complete information already.
@@ -0,0 +1,19 @@ | |||
Skip test test_indexing_duplicates. | |||
This failure is a bit worrying, and not understood. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At least including the error trace would be good for further investigation and discoverability of the need for this patch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stack trace added here, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me but I'd like to suggest adding the error traces to the patches skipping/fixing test failures or linking to an upstream issue with further information. This helps finding what is required for future releases, understanding new&old failures and tracking later on especially when patches need to be adapted for future versions or could be removed.
I have added stack traces to the patches you commented on, but I should of course also have done that for the remaining patches. I'll to that in a few hours and update once more. |
@Flamefire |
Test report by @casparvl |
Test report by @Flamefire |
I'm seeing similar failures as @casparvl reported at #19087 (comment) Especially dynamo/test_dynamic_shapes:
I can easily reproduce this and it seems to be known: pytorch/pytorch#107980 and point 2 here I'm surprised nobody else sees that especially as it reproduces easily when I run it (there is a dependency there: If the Maybe we can just disable one of them, will look into it |
I must admit that I am beginning to doubt the value of running the PyTorch self test. It seems to be very flaky, and we mostly react to errors by disabling the tests (although admittedly @Flamefire seems to have found and fixed a GCC bug!). I have tried to expand this easyconfig to a CUDA version of PyTorch, but I get tons of errors, mostly of the kinds that hint at errors with the test suite itself:
The machine I used has 80 cores and 10 GPUs, I do not know why it chose 4 ranks. I also see the |
I do agree that the PyTorch test suite is annoyingly unstable. However it serves as a test suite for most of the software stack. By working through those failures assuming they must have passed at some point bugs in our configuration (e.g. wrong Python or Python package versions) and in other software (GCC & OpenBLAS came up a few times) become visible. And while likely hlaf of the patches are simply skipping tests or increasing tolerances such that they pass some actually fix issues, e.g. PyTorch-2.0.1_fix-ub-in-inductor-codegen, PyTorch-2.0.1_fix-vsx-loadu.patch, PyTorch-2.0.1_no-cuda-stubs-rpath.patch', PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch, PyTorch-2.1.0_fix-vsx-vector-shift-functions.patch |
I see the value of doing this, and I very much appreciate what you are doing. I guess I am just frustrated because I feel I have hit a wall where making any progress is no longer possible for me. :-) |
I feel you... It's so hard find whether a failure is due to a wrong test, a different platform/architecture, a different compiler a flake or just a bug in PyTorch itself... And with each release you get new stuff failing making it like playing whack-a-mole just without the fun -.- FWIW: PyTorch-2.0.1_workaround-gcc12-destructor-exception-bug.patch is no longer required as I added the new patch for GCC at #19253 :) |
And I am not a computer specialist, I am a computational physicist and should use my time on that :-)
Well, at least that mole was properly wacked! I'll remove it, and rebuild GCC. |
While working on this I noticed the Z3-solver EC. Best would be a pure python package depending on the C++ module similar to e.g. Protobuf. Got to check if that's possible. Still: what to do with the different variants of the EC we have? (currently with and without the python suffix) @boegel Note that the software is named Z3, but the python package z3-solver imported as z3 (there is another existing, unrelated Z3 python package on PYPI) Edit: IMO the current approach of having a |
@schiotz I have added some more patches in the other 2.1.0 PRs and prepared a branch I'll PR against yours to add those and some other changes to get this further. This keeps you as the author for most stuff at least. If you don't want that you can close this PR and I'll open a new one for 2023a 2.1.0 with my changes. |
(created using
eb --new-pr
)Now requires that GCC has been rebuild with PR #19253 (and #19180) fixing compiler bugs.