{devel}[foss/2020b] PyTorch v1.8.1 w/ Python 3.8.6 #12347

Merged

branfosj
Member

@branfosj branfosj commented Mar 6, 2021

(created using eb --new-pr)

Notes:

  • I've switched to a recursive git clone instead of the long sources section.
  • PyTorch-1.7.0_fix_altivec_defines.patch, PyTorch-1.7.0_fix_test_DistributedDataParallel.patch, and PyTorch-1.7.0_fix-fbgemm-not-implemented-issue.patch are no longer needed (compared to PyTorch-1.7.1-foss-2020b.eb).

@branfosj branfosj added the update label Mar 6, 2021
@branfosj branfosj added this to the 4.x milestone Mar 6, 2021
@branfosj
Member Author

branfosj commented Mar 6, 2021

@boegelbot please test @ generoso

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on generoso

PR test command 'EB_PR=12347 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12347 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 16321

Test results coming soon (I hope)...

- notification for comment with ID 791908884 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

'url': 'https://github.com/pytorch',
'repo_name': 'pytorch',
'tag': 'v%(version)s',
'recursive': True,
Member

@branfosj Any concerns here w.r.t. reproducibility? Or are the submodules "locked" to a particular commit anyway?

Member Author

They are all locked to specific commits - see https://github.com/pytorch/pytorch/tree/v1.8.0/third_party and subdirectories. The only issue we'd have is if PyTorch reused the tag - then we'd get a different download (with, potentially, a different set of items in the third_party directory).
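
For reference, the `git_config` keys under review fit into a source specification along these lines. This is a minimal sketch: the `filename` value and the surrounding `sources` list are assumptions made for illustration (only the four `git_config` keys appear in the diff above), and the `resolve` helper merely mimics how the `%(version)s` template would be filled in.

```python
# Sketch of a source entry that uses a recursive git clone.
# Only the four git_config keys are taken from the diff; the 'filename'
# value is an assumption made for illustration.
version = '1.8.0'

sources = [{
    'filename': 'PyTorch-%(version)s.tar.gz',  # assumed name
    'git_config': {
        'url': 'https://github.com/pytorch',
        'repo_name': 'pytorch',
        'tag': 'v%(version)s',
        'recursive': True,  # also fetch the submodules pinned by the tag
    },
}]


def resolve(template, values):
    """Mimic template substitution with plain %-formatting."""
    return template % values


git_config = sources[0]['git_config']
tag = resolve(git_config['tag'], {'version': version})
print(tag)
```

Because the tag is the only moving part, the download stays the same for as long as upstream does not move that tag, which is the caveat raised in the reply above.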

Contributor

Hm, one downside I see is that --fetch likely doesn't work, i.e. a full offline install fails. Or does EB handle that?
Also, there are no checksums...
BTW: There is a script in the framework that creates the sources list from a valid git checkout (git submodule update must have been run).

Member Author

--fetch works (so long as you do not hit easybuilders/easybuild-framework#3619).

'distributed/rpc/test_process_group_agent',
# Potentially problematic save/load issue with test_lstm on only some machines. Tell users to verify save&load!
# https://github.com/pytorch/pytorch/issues/43209
'test_quantization',
Member

@branfosj Did you check whether we still see failures?

I can test on our Cascade Lake system, where I saw issues with this earlier (cf. pytorch/pytorch#43209).

Member Author

I've not checked that yet. I'll run it on our Cascade Lake system, where we do run that test, though I do not know whether we saw the same issue you did.

Member Author

I see the same failure with PyTorch 1.7.1 and 1.8.0 on our Cascade Lake.

======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 151, in test_fn
    qfunction(*args, **kwargs)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1198, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1165, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).
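
To see how far outside the tolerance this comparison is, the elementwise closeness rule can be checked by hand. The sketch below assumes the criterion `|first - second| <= atol + rtol * |second|` (the form documented for `torch.isclose`); the helper name `within_tolerance` is made up for this example, and treating the second value as the reference is an assumption. The numbers are copied from the assertion message above.

```python
def within_tolerance(first, second, rtol, atol):
    """Return True if |first - second| <= atol + rtol * |second|."""
    return abs(first - second) <= atol + rtol * abs(second)


# Values reported for index (3, 0, 6) in the assertion message.
first = 4.41188467448228e-06
second = 0.9640479683876038
rtol, atol = 1.3e-06, 1e-05

diff = abs(first - second)              # about 0.964
allowed = atol + rtol * abs(second)     # about 1.1e-05

print(diff, allowed)
print(within_tolerance(first, second, rtol, atol))
```

The difference exceeds the allowed margin by several orders of magnitude whichever value is taken as the reference, so this is not a rounding-level discrepancy.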

Member Author

This test failure still occurs when I build with MKL.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@boegel @branfosj

Were you using a bare-metal Cascade Lake machine, or a VM running on one?
With a Linux VM (with KVM hypervisor), I reproduced this issue on a Cascade Lake machine.
However, if you were using a bare-metal machine, isn't there a possibility of latent issues with Cascade Lake machines that might surface later?

Contributor

I can confirm that there are issues when optimizing for a Cascade Lake machine, e.g. tensorflow/tensorflow#47179
I've seen that with 2019b and not with newer compilers, but it is possible that they are affected too.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the info! I had used gcc/g++ 9.3, but that TensorFlow issue you posted also seems quite relevant. I can try testing with a more recent version of gcc, although gcc 9.3 was released in March 2020.

Contributor

FWIW: 2019b uses GCC 8.3.0 and 2020a (IIRC) uses 9.3.0, which solves the TF issue for us. Since that is a whole toolchain generation, updated dependencies might also play a role, so it may not be only the compiler. The compiler is still the best bet, as this looks like a misoptimization.

@branfosj
Member Author

branfosj commented Mar 6, 2021

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0211u38b.bear.cluster - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/3c81dbbda8e7f9fc2f10afd2592ec583 for a full test report.

Edit: the failure was RuntimeError: test_linalg failed! Received signal: SIGSEGV. Building in /dev/shm was not a good idea. However, I also see the issue with a build path outside /dev/shm.

@boegel
Member

boegel commented Mar 6, 2021

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3403.kirlia.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6240 CPU @ 2.60GHz (cascadelake), Python 3.6.8
See https://gist.github.com/5e224a576c7aaa27af3c0f24cb56f0c2 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
generoso-c1-s-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/fabc387bb5ae4f835d310bbf858f941b for a full test report.

@branfosj
Member Author

branfosj commented Mar 6, 2021

The three failures are all:

test_inverse_cpu_complex128 (__main__.TestLinalgCPU) ... Traceback (most recent call last):
  File "run_test.py", line 926, in <module>
    main()
  File "run_test.py", line 905, in main
    raise RuntimeError(err_message)
RuntimeError: test_linalg failed! Received signal: SIGSEGV

I'll investigate further. (And I'll mark this as a draft in the meantime.)

@branfosj branfosj marked this pull request as draft March 6, 2021 12:24
@branfosj
Member Author

branfosj commented Mar 6, 2021

I've checked through the tests:

  1. test_jit, test_jit_profiling, test_jit_legacy, and test_fx all assume that the build directory will be a sibling to the test directory. This will not normally be an issue for us.
  2. I'm seeing failures in test_linalg, test_ops, test_spectral_ops.

Method:

  1. eb --from-pr 12347 -Tr --skip-test-step
  2. module load PyTorch/1.8.0-foss-2020b hypothesis/5.41.5-GCCcore-10.2.0
  3. Clone a copy of the PyTorch source code (I unpacked PyTorch-1.8.0.tar.gz that EB had generated)
  4. python run_test.py --verbose -i [test]
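
The last step of the method can be scripted so that each suspect test module runs in its own process. A rough sketch, assuming it is started from the `test` directory of the unpacked source with the modules from step 2 loaded; the helper names are hypothetical, and only the `run_test.py --verbose -i` invocation comes from the steps above.

```python
import signal
import subprocess
import sys

# The three test modules reported as failing above.
TESTS = ['test_linalg', 'test_ops', 'test_spectral_ops']


def build_command(test_name):
    """Command line from step 4 of the method, for a single test module."""
    return [sys.executable, 'run_test.py', '--verbose', '-i', test_name]


def describe_returncode(returncode):
    """Translate a subprocess return code into a short status string."""
    if returncode == 0:
        return 'passed'
    if returncode < 0:
        # A negative value means the child was killed by that signal number.
        return 'killed by ' + signal.Signals(-returncode).name
    return 'failed with exit code %d' % returncode


def run_all(tests=TESTS):
    """Run each test module separately and collect a status per module."""
    results = {}
    for name in tests:
        proc = subprocess.run(build_command(name))
        results[name] = describe_returncode(proc.returncode)
    return results

# Usage (from the PyTorch 'test' directory): results = run_all()
```

As the tracebacks in this thread show, `run_test.py` catches the crash of its own child and reports the signal in its error message, so the wrapper would typically see a plain non-zero exit code; the negative-code branch only applies when the directly launched process is itself killed by a signal.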

test_spectral_ops error:

======================================================================
ERROR: test_stft_requires_complex_cpu (__main__.TestFFTCPU)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 295, in instantiated_test
    raise rte
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_device_type.py", line 290, in instantiated_test
    result = test_fn(self, *args)
  File "test_spectral_ops.py", line 939, in test_stft_requires_complex
    y = x.stft(10, pad_mode='constant')
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/tensor.py", line 453, in stft
    return torch.stft(self, n_fft, hop_length, win_length, window, center,
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/functional.py", line 580, in stft
    return _VF.stft(input, n_fft, hop_length, win_length, window,  # type: ignore
RuntimeError: fft: ATen not compiled with MKL support

----------------------------------------------------------------------

test_linalg error:

test_inverse_cpu_complex128 (__main__.TestLinalgCPU) ... Traceback (most recent call last):
  File "run_test.py", line 926, in <module>
    main()
  File "run_test.py", line 905, in main
    raise RuntimeError(err_message)
RuntimeError: test_linalg failed! Received signal: SIGSEGV

test_ops error:

test_out_linalg_inv_cpu_complex128 (__main__.TestCommonCPU) ... Traceback (most recent call last):
  File "run_test.py", line 926, in <module>
    main()
  File "run_test.py", line 905, in main
    raise RuntimeError(err_message)
RuntimeError: test_ops failed! Received signal: SIGSEGV

@branfosj
Member Author

branfosj commented Mar 6, 2021

Three bugs filed with PyTorch:

The first one will not normally be an issue for us, as a standard build has the directories in the right place for it to work. The second one I can easily patch. The final one looks bad, and I've not been able to debug it further.

@branfosj
Member Author

branfosj commented Mar 6, 2021

The second issue prompted me to try a build with MKL. With MKL included in the build, we do not see the segfault mentioned in the third issue.

@branfosj
Member Author

Progress from PyTorch. Of the three bugs reported above:

  • 53455 - does not impact us
  • 53456 - fixed by PyTorch-1.8.0_correct-skip-tests-decorators.patch which is based off the upstream fix and modified to allow it to be applied to the 1.8.0 code
  • 53454 - fixed by PyTorch-1.8.0_fix-noMKL-linear-algebra.patch which is the upstream fix
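
The kind of fix described for 53456 can be illustrated with a skip decorator that guards MKL-dependent tests. This is a sketch of the general technique only, not the content of the actual patch: the `HAS_MKL` flag and the test class are invented for the example (in a real PyTorch build the flag could be derived from `torch.backends.mkl.is_available()`).

```python
import unittest

# Assumption for this sketch: pretend the build has no MKL support.
HAS_MKL = False

# Reusable decorator: run the test only when MKL is available.
skip_unless_mkl = unittest.skipUnless(HAS_MKL, 'build lacks MKL support')


class ExampleFFTTest(unittest.TestCase):
    @skip_unless_mkl
    def test_needs_mkl(self):
        # Would exercise an FFT code path that requires MKL.
        self.fail('should have been skipped without MKL')

    def test_always_runs(self):
        self.assertEqual(1 + 1, 2)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(ExampleFFTTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(len(result.skipped), 'skipped;', len(result.failures), 'failed')
```

With the decorator in place, a build without MKL reports the test as skipped instead of raising the "ATen not compiled with MKL support" error seen earlier.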

@branfosj
Member Author

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0206u40a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/8da12a97d386233c84fe0b3ead22629a for a full test report.

@branfosj branfosj marked this pull request as ready for review March 12, 2021 21:36
@branfosj
Member Author

@boegelbot please test @ generoso

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on generoso

PR test command 'EB_PR=12347 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12347 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 16355

Test results coming soon (I hope)...

- notification for comment with ID 797769583 processed

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
generoso-c1-s-2 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/3c79dc58d25fe1d8a531d59a9ead847c for a full test report.

@verdurin
Member

Test report by @verdurin
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
nuc.lan - Linux Fedora 33, x86_64, Intel(R) Core(TM) i7-8650U CPU @ 1.90GHz, Python 3.9.2
See https://gist.github.com/3c5ed6f232f4510351fd5d083298587e for a full test report.

@boegel
Member

boegel commented Mar 17, 2021

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3125.skitty.os - Linux centos linux 7.9.2009, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz, Python 3.6.8
See https://gist.github.com/5a7584133c840abb312fcb323658c3a3 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 34 out of 36 (1 easyconfigs in total)
taurusml25 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/e31738473112d29ad1c69ec63f3baef9 for a full test report.

@branfosj branfosj changed the title {devel}[foss/2020b] PyTorch v1.8.0 w/ Python 3.8.6 {devel}[foss/2020b] PyTorch v1.8.1 w/ Python 3.8.6 Mar 31, 2021
@branfosj
Member Author

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0206u40a.bear.cluster - Linux RHEL 8.3, x86_64, Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz (cascadelake), Python 3.6.8
See https://gist.github.com/9b8f2e1771e39403c003f9f42a0bf845 for a full test report.

@sassy-crick
Collaborator

Test report by @sassy-crick
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
gpu05.pri.rosalind2.alces.network - Linux centos linux 7.6.1810, x86_64, AMD EPYC 7552 48-Core Processor, Python 3.6.8
See https://gist.github.com/8a1a099eb4d0fce579bfdde2ca63a379 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
taurusi8029 - Linux centos linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), Python 2.7.5
See https://gist.github.com/21262cff997f31a13709c3ba4d331fc2 for a full test report.

@boegel
Member

boegel commented Apr 28, 2021

@sassy-crick You should use --robot, or make sure the dependencies are in place already. ;)

@boegel
Member

boegel commented Apr 28, 2021

@boegelbot please test @ generoso
CORE_COUNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on generoso

PR test command 'EB_PR=12347 EB_ARGS= /apps/slurm/default/bin/sbatch --job-name test_PR_12347 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 16943

Test results coming soon (I hope)...

- notification for comment with ID 828643863 processed

@sassy-crick
Collaborator

> @sassy-crick You should use --robot, or make sure the dependencies are in place already. ;)

@boegel That was me getting that going on our cluster. It is building now. :-)

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml4 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), Python 2.7.5
See https://gist.github.com/eff3545f2eb2ed780e9ee9c547b5b047 for a full test report.

@boegel
Member

boegel commented Apr 28, 2021

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3501.doduo.os - Linux RHEL 8.2, x86_64, AMD EPYC 7552 48-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/c2822f66a9a41254d93bcc408ca48639 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
generoso-c1-s-1 - Linux centos linux 8.2.2004, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/f36a6776bcf03c8a4dac5c6c1b654bb3 for a full test report.

@sassy-crick
Collaborator

Test report by @sassy-crick
SUCCESS
Build succeeded for 21 out of 21 (1 easyconfigs in total)
gpu05.pri.rosalind2.alces.network - Linux centos linux 7.6.1810, x86_64, AMD EPYC 7552 48-Core Processor, Python 3.6.8
See https://gist.github.com/01aeafab0453c39ada6034e3073295c8 for a full test report.

Member

@boegel boegel left a comment

Looks good on x86_64, so I'll go ahead and merge this.

For the problems on POWER, I suggest opening a follow-up PR...

@boegel
Member

boegel commented May 4, 2021

Going in, thanks @branfosj!

@boegel boegel merged commit 8ab63eb into easybuilders:develop May 4, 2021
@branfosj branfosj deleted the 20210306101354_new_pr_PyTorch180 branch May 4, 2021 08:08
@boegel boegel modified the milestones: 4.x, next release (4.3.5?) May 4, 2021
@boegel
Member

boegel commented May 4, 2021

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
select-pika-c6gd-4xlarge-0001 - Linux centos linux 8.3.2011, AArch64, ARM UNKNOWN (graviton2), Python 3.6.8
See https://gist.github.com/95161b70bdea930435747d69cb9cdc30 for a full test report.
