{ai}[foss/2022a] PyTorch v1.13.1 w/ Python 3.10.4 #17155

Merged

branfosj
Member

@branfosj branfosj commented Jan 19, 2023

…-1.13.1_fix-test-ops-conf.patch, PyTorch-1.13.1_no-cuda-stubs-rpath.patch, PyTorch-1.13.1_remove-flaky-test-in-testnn.patch, PyTorch-1.13.1_skip-ao-sparsity-test-without-fbgemm.patch

Flamefire and others added 3 commits February 10, 2023 11:43
Update patches based on PyTorch 1.13.1
Those tests require 2 pytest plugins and a bugfix.
Fix test_ops* startup failures

@branfosj branfosj marked this pull request as ready for review February 10, 2023 16:54
@branfosj
Member Author

@boegelbot please test @ jsc-zen2

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=17155 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_17155 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 2179

Test results coming soon (I hope)...

- notification for comment with ID 1426118567 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member Author

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/38b9646d67afe7ea346ceaa169839769 for a full test report.

@branfosj
Member Author

branfosj commented Feb 10, 2023

TestGradientsCPU.test_forward_mode_AD_nn_functional_max_unpool2d_cpu_float64
Unexpected success

I've lost PyTorch-1.12.1_skip-failing-grad-test.patch from the patch list.
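A dropped patch like this could be caught by comparing the patch lists of the old and new easyconfigs. What follows is a minimal sketch, assuming patch files follow the `PyTorch-<version>_<topic>.patch` naming seen in this PR. The helpers `patch_topic` and `dropped_patches` are hypothetical and not part of EasyBuild, and the lists are deliberately incomplete.

```python
import re


def patch_topic(filename):
    """Return the patch name without its 'PyTorch-<version>_' prefix and '.patch' suffix."""
    topic = re.sub(r'^PyTorch-[\d.]+_', '', filename)
    if topic.endswith('.patch'):
        topic = topic[:-len('.patch')]
    return topic


def dropped_patches(old_patches, new_patches):
    """List patches from the old easyconfig whose topic has no counterpart in the new one."""
    new_topics = {patch_topic(p) for p in new_patches}
    return [p for p in old_patches if patch_topic(p) not in new_topics]


# Patch names taken from this PR; neither list is the full easyconfig patch list.
old = ['PyTorch-1.12.1_skip-failing-grad-test.patch']
new = [
    'PyTorch-1.13.1_no-cuda-stubs-rpath.patch',
    'PyTorch-1.13.1_remove-flaky-test-in-testnn.patch',
    'PyTorch-1.13.1_skip-ao-sparsity-test-without-fbgemm.patch',
]
print(dropped_patches(old, new))
```

A patch that was intentionally retired would also show up in this list, so the result is a prompt for review rather than a definitive error.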

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusi8009 - Linux CentOS Linux 7.9.2009, x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 470.57.02, Python 2.7.5
See https://gist.github.com/0f4c49467e073b6697de4a6dd50923b2 for a full test report.

@branfosj
Member Author

Test report by @branfosj
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
bear-pg0210u06b.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz (broadwell), Python 3.6.8
See https://gist.github.com/bc0482f9ab1a4f51e2ed820ecf3deaf6 for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusml24 - Linux RHEL 7.6, POWER, 8335-GTX (power9le), 6 x NVIDIA Tesla V100-SXM2-32GB, 440.64.00, Python 2.7.5
See https://gist.github.com/564b9bf8970d481af146ddcb73b16cec for a full test report.

@Flamefire
Contributor

Test report by @Flamefire
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
taurusa12 - Linux CentOS Linux 7.7.1908, x86_64, Intel(R) Xeon(R) CPU E5-2603 v4 @ 1.70GHz (broadwell), 3 x NVIDIA GeForce GTX 1080 Ti, 460.32.03, Python 2.7.5
See https://gist.github.com/6247717cb51697c6752ae381309b3d79 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/3df87c14f170259201e78bb1620a10f1 for a full test report.

@branfosj
Member Author

branfosj commented Feb 11, 2023

Patches added:

  • test_ops_gradients: I'd missed adding PyTorch-1.13.1_skip-failing-grad-test.patch and that updated patch covers the unexpected successes that I see
  • test_cpp_extensions_aot_ninja (and test_cpp_extensions_aot_no_ninja?): PyTorch-1.13.1_install-vsx-vec-headers.patch should fix these
  • test_ops: PyTorch-1.13.1_increase-tolerance-test_ops.patch

To investigate:

  • test_ao_sparsity - POWER only
  • test_quantization - POWER only
  • distributed/rpc/test_tensorpipe_agent - POWER only
  • test_cpp_extensions_open_device_registration - POWER only
  • test_native_mha - I'm seeing this on broadwell only and also in PyTorch 1.12.1

@branfosj
Member Author

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/edc885781ddaa241515bca898ae0bfab for a full test report.

@boegel boegel added this to the 4.x milestone Feb 16, 2023
@boegel
Member

boegel commented Feb 16, 2023

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=17155 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_17155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10252

Test results coming soon (I hope)...

- notification for comment with ID 1432765219 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member Author

@branfosj distributed/rpc/test_tensorpipe_agent is failing, just like in #17156, so let's skip that too, and mention it in #17712?

Done in 52cbf0e

The generoso build also failed on distributed/rpc/test_tensorpipe_agent and hit the 502 error on uploading the test report.

@branfosj
Member Author

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@branfosj: Request for testing this PR well received on login1

PR test command 'EB_PR=17155 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_17155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 10651

Test results coming soon (I hope)...

- notification for comment with ID 1508148644 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@branfosj
Member Author

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0104u04a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/574855412ab5b55e20b9f3602df05ab9 for a full test report.

@boegel
Member

boegel commented Apr 14, 2023

Test report by @boegel
FAILED
Build succeeded for 0 out of 1 (1 easyconfigs in total)
node3101.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/51d67d8eff4c677e7dd06cd7a653d833 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis-c1 - Linux Rocky Linux 8.6, x86_64, Intel Xeon Processor (Skylake), Python 3.6.8
See https://gist.github.com/VRehnberg/8a400e560a8b60563c219156171c18c6 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis-s1 - Linux Rocky Linux 8.6, x86_64, Intel(R) Xeon(R) Silver 4314 CPU @ 2.40GHz, Python 3.6.8
See https://gist.github.com/VRehnberg/75c0c0548375006d6bb895020529a6d0 for a full test report.

@boegel
Member

boegel commented Jul 5, 2023

Two tests were failing for me:

test_quantization failed!
distributed/test_c10d_gloo failed!

test_quantization should be fixed by the patch PyTorch-1.12.1_add-hypothesis-suppression.patch, added in #17908.
distributed/test_c10d_gloo should be fixed by the patch PyTorch-1.12.1_skip-test_round_robin.patch, added in #16793.
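
The mapping from failing tests to existing patches can be written down as a small lookup, which makes it easier to see which failures in a new test report are still unexplained. This is only an illustrative sketch: the test names, patch names and PR numbers come from the comment above, while `KNOWN_FIXES` and `triage` are hypothetical names and not part of EasyBuild.

```python
# Failing test -> (patch file, PR that added it), as stated in the comment above.
KNOWN_FIXES = {
    'test_quantization': ('PyTorch-1.12.1_add-hypothesis-suppression.patch', 17908),
    'distributed/test_c10d_gloo': ('PyTorch-1.12.1_skip-test_round_robin.patch', 16793),
}


def triage(failed_tests):
    """Split failed tests into patches to add and tests without a known fix."""
    patches = [KNOWN_FIXES[t][0] for t in failed_tests if t in KNOWN_FIXES]
    unexplained = [t for t in failed_tests if t not in KNOWN_FIXES]
    return patches, unexplained


patches, unexplained = triage(['test_quantization', 'distributed/test_c10d_gloo'])
print(patches)
print(unexplained)
```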

@boegel
Member

boegel commented Jul 5, 2023

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on login1

PR test command 'EB_PR=17155 EB_ARGS= EB_CONTAINER= /opt/software/slurm/bin/sbatch --job-name test_PR_17155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 11214

Test results coming soon (I hope)...

- notification for comment with ID 1622173121 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegel
Member

boegel commented Jul 5, 2023

@boegelbot please test @ jsc-zen2
CORE_CNT=16

@boegelbot
Collaborator

@boegel: Request for testing this PR well received on jsczen2l1.int.jsc-zen2.easybuild-test.cluster

PR test command 'EB_PR=17155 EB_ARGS= /opt/software/slurm/bin/sbatch --mem-per-cpu=4000M --job-name test_PR_17155 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen2.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 2956

Test results coming soon (I hope)...

- notification for comment with ID 1622324495 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.5, x86_64, Intel(R) Xeon(R) CPU E5-2690 v3 @ 2.60GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/ed9e61ab9cc559265403338028d3769e for a full test report.

@branfosj
Member Author

branfosj commented Jul 5, 2023

Test report by @branfosj
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
bear-pg0105u03a.bear.cluster - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Platinum 8360Y CPU @ 2.40GHz (icelake), Python 3.6.8
See https://gist.github.com/branfosj/49deeeaa7871b8f65f61fe65b89d9cd1 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
jsczen2c1.int.jsc-zen2.easybuild-test.cluster - Linux Rocky Linux 8.5, x86_64, AMD EPYC 7742 64-Core Processor (zen2), Python 3.6.8
See https://gist.github.com/boegelbot/1a5b106a4ce60edddefc08041dd851f4 for a full test report.

@boegel
Member

boegel commented Jul 6, 2023

Test report by @boegel
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
node3108.skitty.os - Linux RHEL 8.6, x86_64, Intel(R) Xeon(R) Gold 6140 CPU @ 2.30GHz (skylake_avx512), Python 3.6.8
See https://gist.github.com/boegel/a1fd23ab14e246966ae037d4981a1285 for a full test report.

@boegel
Member

boegel commented Jul 6, 2023

Going in, thanks @branfosj!

@boegel boegel merged commit cdc3499 into easybuilders:develop Jul 6, 2023
@boegel boegel modified the milestones: release after 4.7.3, 4.7.3 Jul 6, 2023
@branfosj branfosj deleted the 20230119093142_new_pr_PyTorch1131 branch July 6, 2023 07:36