{devel}[foss/2020b] PyTorch v1.8.1 w/ Python 3.8.6 #12347
Conversation
@boegelbot please test @ generoso |
@branfosj: Request for testing this PR well received on generoso
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 791908884 processed
Message to humans: this is just bookkeeping information for me,
```python
'url': 'https://github.com/pytorch',
'repo_name': 'pytorch',
'tag': 'v%(version)s',
'recursive': True,
```
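For context, this is EasyBuild's `git_config`-based source specification: the source tarball is generated from a recursive git clone of the tag rather than downloaded as a release archive. A minimal sketch of how such an entry typically looks in an easyconfig (the `filename` value and surrounding structure here are illustrative, not copied from this PR):

```python
# Minimal sketch (assumed, not the exact easyconfig from this PR): a git_config-based
# source entry makes EasyBuild create the source tarball from a git checkout of the
# given tag, including all submodules, instead of fetching a pre-made archive.
sources = [{
    'filename': '%(name)s-%(version)s.tar.gz',  # assumed name for the generated tarball
    'git_config': {
        'url': 'https://github.com/pytorch',  # base URL of the git hosting account
        'repo_name': 'pytorch',                # repository to clone
        'tag': 'v%(version)s',                 # clone the v1.8.1 tag
        'recursive': True,                     # also clone the third_party submodules
    },
}]
```

Because the archive is produced locally from a clone, its contents (and hence any checksum) depend on what the tag and its submodules point to at download time, which is what the comments below are about.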
@branfosj Any concerns here w.r.t. reproducibility? Or are the submodules "locked" to a particular commit anyway?
They are all locked to specific commits - see https://github.com/pytorch/pytorch/tree/v1.8.0/third_party and subdirectories. The only issue we'd have is if PyTorch reused the tag - then we'd get a different download (with, potentially, a different set of items in the third_party directory).
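For anyone who wants to inspect those pins locally, a small sketch, assuming a checkout of the tag and `git` available on the PATH (the helper name is made up):

```python
# Hypothetical helper: list the commits that a local PyTorch checkout's third_party
# submodules are pinned to, using plain `git submodule status`.
import subprocess

def pinned_submodules(repo_dir):
    out = subprocess.run(
        ['git', '-C', repo_dir, 'submodule', 'status', '--recursive'],
        capture_output=True, text=True, check=True,
    ).stdout
    pins = {}
    for line in out.splitlines():
        if not line.strip():
            continue
        # Each line looks like "<commit> <path> (<describe>)"; a leading '-' or '+'
        # marks uninitialised or out-of-sync submodules, so strip it off.
        commit, path = line.split()[:2]
        pins[path] = commit.lstrip('-+')
    return pins

for path, commit in sorted(pinned_submodules('pytorch').items()):
    print(f'{path}: {commit}')
```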
Hm, a downside of this I see is that `--fetch` likely doesn't work, i.e. a full offline install fails, or does EB handle that?
Also no checksums...
BTW: There is a script in framework to create the sources list out of a valid git checkout (must have `git submodule update` done).
`--fetch` works (so long as you do not hit easybuilders/easybuild-framework#3619).
```python
'distributed/rpc/test_process_group_agent',
# Potentially problematic save/load issue with test_lstm on only some machines. Tell users to verify save&load!
# https://github.com/pytorch/pytorch/issues/43209
'test_quantization',
```
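For reference, a rough sketch of how such exclusions are expressed via the `excluded_tests` parameter understood by the PyTorch easyblock; only the shape of the setting is shown, not the exact content of this easyconfig:

```python
# Illustrative sketch of the excluded_tests easyconfig parameter: the key selects
# an architecture ('' applies to all architectures), the value lists test modules
# to skip, with comments recording why each one is excluded.
excluded_tests = {
    '': [
        'distributed/rpc/test_process_group_agent',
        # Potentially problematic save/load issue with test_lstm on only some machines:
        # https://github.com/pytorch/pytorch/issues/43209
        'test_quantization',
    ]
}
```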
@branfosj Did you check whether we still see failures?
I can test on our Cascade Lake system where I saw issues with this earlier (cfr. pytorch/pytorch#43209)
I've not yet checked that. I'll run a test on our Cascade Lake where we run that test - though I do not know if we saw the issue you saw or not.
I see the same failure with PyTorch 1.7.1 and 1.8.0 on our Cascade Lake.
```
======================================================================
FAIL: test_lstm (quantization.test_backward_compatibility.TestSerialization)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_quantized.py", line 151, in test_fn
    qfunction(*args, **kwargs)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 230, in test_lstm
    self._test_op(mod, input_size=[4, 4, 3], input_quantized=False, generate=False, new_zipfile_serialization=True)
  File "/rds/projects/2017/branfosj-rse/ProblemSolving/pyt18/pytorch/test/quantization/test_backward_compatibility.py", line 76, in _test_op
    self.assertEqual(qmodule(input_tensor), expected, atol=prec)
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1198, in assertEqual
    self.assertEqual(x_, y_, atol=atol, rtol=rtol, msg=msg,
  File "/rds/bear-apps/devel/eb-sjb-up/EL8/EL8-cas/software/PyTorch/1.8.0-foss-2020b/lib/python3.8/site-packages/torch/testing/_internal/common_utils.py", line 1165, in assertEqual
    super().assertTrue(result, msg=self._get_assert_msg(msg, debug_msg=debug_msg))
AssertionError: False is not true : Tensors failed to compare as equal!With rtol=1.3e-06 and atol=1e-05, found 13 element(s) (out of 112) whose difference(s) exceeded the margin of error (including 0 nan comparisons). The greatest difference was 0.9640435565029293 (4.41188467448228e-06 vs. 0.9640479683876038), which occurred at index (3, 0, 6).
```
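To make the numbers in that assertion concrete, here is a small standalone sketch of the tolerance rule being applied (not the test itself): two values compare equal when |actual - expected| <= atol + rtol * |expected|.

```python
# Standalone illustration of the tolerance check behind the failure above,
# using the reported values from the greatest difference at index (3, 0, 6).
import torch

actual = torch.tensor(4.41188467448228e-06)   # value produced by the loaded module
expected = torch.tensor(0.9640479683876038)   # reference value it is compared against
atol, rtol = 1e-05, 1.3e-06                   # tolerances reported in the traceback

print(torch.isclose(actual, expected, atol=atol, rtol=rtol))        # tensor(False)
print((actual - expected).abs() <= atol + rtol * expected.abs())    # tensor(False)
```

The reported difference (~0.96) is orders of magnitude above these tolerances, so this is not a marginal tolerance issue but a genuinely different result.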
This test failure still occurs when I build with MKL.
Were you using a bare-metal Cascade Lake machine, or a VM on one?
I reproduced this issue with a Linux VM (KVM hypervisor) on a Cascade Lake machine.
However, if you were using a bare-metal machine, doesn't that suggest there may be latent issues with Cascade Lake machines that could surface later?
I can confirm that there are issues when optimizing for a Cascade Lake machine, e.g. tensorflow/tensorflow#47179.
I've seen that with 2019b, not with newer compilers, but it is possible.
Thanks a lot for the info! I had used gcc/g++ 9.3, but that TensorFlow issue you posted also seems quite relevant. I can try testing with a more recent version of gcc, although gcc 9.3 was released in March 2020.
FWIW: 2019b uses GCC 8.3.0, while 2020a (IIRC) uses 9.3.0, which solves the TF issue for us. But as it is a different toolchain generation, it might also be related to dependencies being updated, so maybe not only the compiler; still, the compiler is the best bet, as it looks like a misoptimization.
Test report by @branfosj Edit: failure was |
Test report by @boegel |
Test report by @boegelbot |
The three failures are all:
I'll investigate further. (And I'll mark this as a draft in the meantime.) |
I've checked through the tests:
Method:
|
Three bugs filed with PyTorch:
The first one will not normally be an issue for us, as a standard build will have the directories in the right place for it to work. The second one I can easily patch. The final one looks bad and I've not been able to debug it further.
The second issue pointed me to trying a build with MKL - with that included in the build, we do not see the segfault mentioned in the third issue.
Progress from PyTorch. Of the three bugs reported above:
|
Test report by @branfosj |
@boegelbot please test @ generoso |
@branfosj: Request for testing this PR well received on generoso
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 797769583 processed
Message to humans: this is just bookkeeping information for me,
Test report by @boegelbot |
Test report by @verdurin |
Test report by @boegel |
Test report by @Flamefire |
Test report by @branfosj |
Test report by @sassy-crick |
Test report by @Flamefire |
@sassy-crick You should use |
@boegelbot please test @ generoso |
@boegel: Request for testing this PR well received on generoso
PR test command '
Test results coming soon (I hope)...
- notification for comment with ID 828643863 processed
Message to humans: this is just bookkeeping information for me,
@boegel That was me getting that going on our cluster. It is building now. :-) |
Test report by @Flamefire |
Test report by @boegel |
Test report by @boegelbot |
Test report by @sassy-crick |
Looks good on x86_64, so I'll go ahead and merge this.
For the problems on POWER, I suggest opening a follow-up PR...
Going in, thanks @branfosj! |
Test report by @boegel |
(created using `eb --new-pr`)

Notes: `PyTorch-1.7.0_fix_altivec_defines.patch`, `PyTorch-1.7.0_fix_test_DistributedDataParallel.patch`, and `PyTorch-1.7.0_fix-fbgemm-not-implemented-issue.patch` are no longer needed (compared to `PyTorch-1.7.1-foss-2020b.eb`).