{devel}[foss/2020b] PyTorch v1.8.1 w/ Python 3.8.6 #12347
Conversation
@boegelbot please test @ generoso |
@branfosj: Request for testing this PR well received on generoso
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 791908884 processed
Message to humans: this is just bookkeeping information for me,
```python
'url': 'https://github.com/pytorch',
'repo_name': 'pytorch',
'tag': 'v%(version)s',
'recursive': True,
```
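For context, this is EasyBuild's `git_config`-based source specification: the source tarball is generated from a recursive git clone of the tag rather than downloaded as a release archive. A minimal sketch of how such an entry typically looks in an easyconfig (the `filename` value and surrounding structure here are illustrative, not copied from this PR):

```python
# Minimal sketch (assumed, not the exact easyconfig from this PR): a git_config-based
# source entry makes EasyBuild create the source tarball from a git checkout of the
# given tag, including all submodules, instead of fetching a pre-made archive.
sources = [{
    'filename': '%(name)s-%(version)s.tar.gz',  # assumed name for the generated tarball
    'git_config': {
        'url': 'https://github.com/pytorch',  # base URL of the git hosting account
        'repo_name': 'pytorch',                # repository to clone
        'tag': 'v%(version)s',                 # clone the v1.8.1 tag
        'recursive': True,                     # also clone the third_party submodules
    },
}]
```

Because the archive is produced locally from a clone, its contents (and hence any checksum) depend on what the tag and its submodules point to at download time, which is what the comments below are about.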
@branfosj Any concerns here w.r.t. reproducibility? Or are the submodules "locked" to a particular commit anyway?
They are all locked to specific commits - see https://github.com/pytorch/pytorch/tree/v1.8.0/third_party and subdirectories. The only issue we'd have is if PyTorch reused the tag - then we'd get a different download (with, potentially, a different set of items in the third_party directory).
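For anyone who wants to inspect those pins locally, a small sketch, assuming a checkout of the tag and `git` available on the PATH (the helper name is made up):

```python
# Hypothetical helper: list the commits that a local PyTorch checkout's third_party
# submodules are pinned to, using plain `git submodule status`.
import subprocess

def pinned_submodules(repo_dir):
    out = subprocess.run(
        ['git', '-C', repo_dir, 'submodule', 'status', '--recursive'],
        capture_output=True, text=True, check=True,
    ).stdout
    pins = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<commit> <path> (<describe>)"; a leading '-' or '+'
        # marks uninitialised or out-of-sync submodules, so strip it off.
        commit, path = line.split()[:2]
        pins[path] = commit.lstrip('-+')
    return pins

for path, commit in sorted(pinned_submodules('pytorch').items()):
    print(f'{path}: {commit}')
```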
Hm, a downside of this I see is that `--fetch` likely doesn't work, i.e. a full offline install fails, or does EB handle that?
Also no checksums...
BTW: There is a script in framework to create the sources list out of a valid git checkout (must have `git submodule update` done).
`--fetch` works (so long as you do not hit easybuilders/easybuild-framework#3619).
```python
'distributed/rpc/test_process_group_agent',
# Potentially problematic save/load issue with test_lstm on only some machines. Tell users to verify save&load!
# https://github.com/pytorch/pytorch/issues/43209
'test_quantization',
```
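For reference, a rough sketch of how such exclusions are expressed via the `excluded_tests` parameter understood by the PyTorch easyblock; only the shape of the setting is shown, not the exact content of this easyconfig:

```python
# Illustrative sketch of the excluded_tests easyconfig parameter: the key selects
# an architecture ('' applies to all architectures), the value lists test modules
# to skip, with comments recording why each one is excluded.
excluded_tests = {
    '': [
        'distributed/rpc/test_process_group_agent',
        # Potentially problematic save/load issue with test_lstm on only some machines:
        # https://github.com/pytorch/pytorch/issues/43209
        'test_quantization',
    ]
}
```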
@branfosj Did you check whether we still see failures?
I can test on our Cascade Lake system where I saw issues with this earlier (cfr. pytorch/pytorch#43209)
I've not yet checked that. I'll run a test on our Cascade Lake where we run that test - though I do not know if we saw the issue you saw or not.
I see the same failure with PyTorch 1.7.1 and 1.8.0 on our Cascade Lake.
```
======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 151, in test_fn
    qfunction(*args, **kwargs)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1198, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1165, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).
```
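To make the numbers in that assertion concrete, here is a small standalone sketch of the tolerance rule being applied (not the test itself): two values compare equal when |actual - expected| <= atol + rtol * |expected|.

```python
# Standalone illustration of the tolerance check behind the failure above,
# using the reported values from the greatest difference at index (3, 0, 6).
import torch

actual = torch.tensor(4.41188467448228e-06)   # value produced by the loaded module
expected = torch.tensor(0.9640479683876038)   # reference value it is compared against
atol, rtol = 1e-05, 1.3e-06                   # tolerances reported in the traceback

print(torch.isclose(actual, expected, atol=atol, rtol=rtol))        # tensor(False)
print((actual - expected).abs() <= atol + rtol * expected.abs())    # tensor(False)
```

The reported difference (~0.96) is orders of magnitude above these tolerances, so this is not a marginal tolerance issue but a genuinely different result.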
This test failure still occurs when I build with MKL.
Were you using a bare-metal Cascade Lake machine, or a VM on one?
I reproduced this issue with a Linux VM (KVM hypervisor) on a Cascade Lake machine.
However, if you were using a bare-metal machine, doesn't that suggest there may be latent issues with Cascade Lake machines that could surface later?
I can confirm that there are issues when optimizing for a Cascade Lake machine, e.g. tensorflow/tensorflow#47179.
I've seen that with 2019b, not with newer compilers, but it is possible.
Thanks a lot for the info! I had used gcc/g++ 9.3, but that TensorFlow issue you posted also seems quite relevant. I can try testing with a more recent version of gcc, although gcc 9.3 was released in March 2020.
FWIW: 2019b uses GCC 8.3.0, while 2020a (IIRC) uses 9.3.0, which solves the TF issue for us. But as it is a different toolchain generation, it might also be related to dependencies being updated, so maybe not only the compiler; still, the compiler is the best bet, as it looks like a misoptimization.
Test report by @branfosj Edit: failure was |
Test report by @boegel |
Test report by @boegelbot |
The three failures are all:
I'll investigate further. (And I'll mark this as a draft in the meantime.) |
I've checked through the tests:
Method:
|
Three bugs filed with PyTorch:
The first one will not normally be an issue for us, as a standard build will have the directories in the right place for it to work. The second one I can easily patch. The final one looks bad and I've not been able to debug it further.
The second issue pointed me to trying a build with MKL - with that included in the build, we do not see the segfault mentioned in the third issue.
Progress from PyTorch. Of the three bugs reported above:
|
Test report by @branfosj |
@boegelbot please test @ generoso |
@branfosj: Request for testing this PR well received on generoso
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 797769583 processed
Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot |
Test report by @verdurin |
Test report by @boegel |
Test report by @Flamefire |
Test report by @branfosj |
Test report by @sassy-crick |
Test report by @Flamefire |
@sassy-crick You should use |
@boegelbot please test @ generoso |
@boegel: Request for testing this PR well received on generoso
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 828643863 processed
Message to humans: this is just bookkeeping information for me,
@boegel That was me getting that going on our cluster. It is building now. :-) |
Test report by @Flamefire |
Test report by @boegel |
Test report by @boegelbot |
Test report by @sassy-crick |
Looks good on x86_64, so I'll go ahead and merge this.
For the problems on POWER, I suggest opening a follow-up PR...
Going in, thanks @branfosj! |
Test report by @boegel |
(created using `eb --new-pr`)

Notes: `PyTorch-1.7.0_fix_altivec_defines.patch`, `PyTorch-1.7.0_fix_test_DistributedDataParallel.patch`, and `PyTorch-1.7.0_fix-fbgemm-not-implemented-issue.patch` are no longer needed (compared to `PyTorch-1.7.1-foss-2020b.eb`).