{devel}[foss/2022a] PyTorch v1.12.0 w/ Python 3.10.4 + CUDA 11.7.0 #15924
…these - lines change with every PT version etc - and we now have the ability to allow for a small number of failed tests. Set the allowed number of failing tests to 20 (14 fail on my current system).
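A minimal sketch of what such a threshold looks like in an easyconfig (easyconfigs use plain Python syntax). This assumes the PyTorch easyblock's `max_failed_tests` parameter is the mechanism referred to here; the exact line used in this PR is not shown in the thread.

```python
# Illustrative easyconfig excerpt (assumption: the PyTorch easyblock reads a
# 'max_failed_tests' parameter). The test step is tolerated as long as no more
# than this many failures are counted.
max_failed_tests = 20
```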
… part of Python-3.10.4-GCCcore-11.3.0.eb
…asyconfigs into 20220728183230_new_pr_PyTorch1120
Test report by @casparvl
Should probably add pytorch/pytorch#81691
Hey Casper, have a look at this one: pytorch/pytorch#81691
@surak: I had a look, but that PR is still being actively worked on. I wouldn't be in favor of taking the current 'fix' and applying it as an EasyBuild patch as long as they haven't settled on what exactly that fix should look like. I propose we check whether they have settled on a solution by the time we are about to merge this PR, and if so, add that patch in as the last thing. If not, I'd propose to just merge this, and update this EasyConfig with a patch once a solution has been settled on.
Failing tests are:
Due to the … We should, however, think about patching how our EasyBlock counts the number of failed tests: I'm pretty sure that it currently counts a complete test suite as a single failure, in which case 20 could be quite a lot. As an example:
Analysis of the failing tests (part 1)
Analysis of the failing tests (part 2)
Analysis of the failing tests (part 3)
Analysis of the failing tests (part 4)
Trying this with
I would love to, but I get bitten by the libcurses bug and can't do anything anymore :-(
I just applied that manually and reinstalled all ncurses installations I had on the system, as this MR fails in an even weirder way:
In any case, this one passes for me on a dual-processor AMD EPYC system (48 cores, 92 SMT) with 4x RTX 3090! My GitHub setup seems broken: uploading the test report gives me a 404.
In my opinion, this PR is ready to be merged, so if anyone wants to formally review it: please do. Regarding the failing tests: as we know from previous EasyConfigs, the PyTorch test suite contains many tests that fail outside of PyTorch's own CI environment. In 99 out of 100 cases we've investigated before, it was simply the test that was broken. We used to patch these, but that's so much work that it delays the roll-out of new PyTorch EasyConfigs considerably. That's why we nowadays accept a number of failing tests, as long as it's "reasonable". In my opinion, the current set of test failures is reasonable. I've looked through the failing tests. One of the common failure patterns we see now is
These are the result of changes in implicit type conversion in Python 3.10, something that … no longer works. This has broken a large number of tests, and will probably also break some existing PyTorch code. It is, however, not something we can or should fix on the EasyBuild side: the official PyTorch wheel for Python 3.10 shows exactly the same behaviour. It should simply be considered a known issue of PyTorch 1.12.0 combined with Python 3.10. For more info, see pytorch/pytorch#69316 and pytorch/pytorch#72282.
@boegelbot please test @ generoso
@casparvl: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 1256242298 processed
Message to humans: this is just bookkeeping information for me,
Test report by @SebastianAchilles
…sybuild-easyblocks#2794 we'll actually start counting failing tests, instead of failing test suites. Thus, much higher numbers can be expected, since many test suites have multiple failing tests
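The difference between counting failing test suites and counting failing tests can be sketched as follows. This is not the code from the easyblocks change referenced above; it is an illustration under assumed log formats: individual failures are taken to appear as unittest-style `FAIL: <test> (<class>)` / `ERROR: ...` lines, and each failing suite is assumed to be reported as `<suite> failed!`. `count_failures` and `within_limit` are hypothetical helpers.

```python
import re

# Assumed (hypothetical) log formats, for illustration only.
TEST_FAILURE_RE = re.compile(r"^(?:FAIL|ERROR): (\S+) \((\S+)\)", re.MULTILINE)
SUITE_FAILURE_RE = re.compile(r"^(\S+) failed!", re.MULTILINE)


def count_failures(log: str) -> tuple[int, int]:
    """Return (number of failing suites, number of failing individual tests)."""
    suites = set(SUITE_FAILURE_RE.findall(log))
    tests = set(TEST_FAILURE_RE.findall(log))  # unique (test name, class) pairs
    return len(suites), len(tests)


def within_limit(num_failed: int, max_failed_tests: int) -> bool:
    """Accept the test step if no more than max_failed_tests failures were counted."""
    return num_failed <= max_failed_tests


if __name__ == "__main__":
    sample_log = """\
FAIL: test_add (__main__.TestBinaryUfuncs)
ERROR: test_mul (__main__.TestBinaryUfuncs)
test_binary_ufuncs failed!
FAIL: test_conv (__main__.TestNN)
test_nn failed!
"""
    n_suites, n_tests = count_failures(sample_log)
    print(n_suites, n_tests)  # 2 3
    # The same log passes a limit of 2 when counting suites, but not when counting tests.
    print(within_limit(n_suites, 2), within_limit(n_tests, 2))  # True False
```

Since a single suite can contain many failing tests, switching from the first count to the second can raise the reported number substantially for the same run, which is why the threshold needs revisiting after such a change.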
…build/wiki/Conference-call-notes-20220928 we'll stick to 4.4.2 for 2022a
re-add accidentally deleted test/easyconfigs/easyconfigs.py to PR15924
@boegelbot: please test @ generoso
@smoors: Request for testing this PR well received on login1 PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 1265050895 processed
Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot
Test report by @casparvl
Test report by @casparvl
Test report by @smoors
lgtm
Going in, thanks @casparvl!
@Flamefire Any thoughts on the failing tests here? See the detailed overview provided by @casparvl in #15924 (comment)
I'm currently working on PyTorch 1.12.1 and have already gotten pretty far with the patches I made for 1.11.0, but I'm still investigating some failures. Mine is for the older toolchain (2021b), though.
Test report by @casparvl
Test report by @casparvl
(created using eb --new-pr)
Depends on
Note that the current EasyConfig doesn't work yet: the patches need to be updated.