{ai}[foss/2023b] PyTorch v2.1.2 #20171

Merged

Flamefire
Contributor

(created using eb --new-pr)

@Micket
Contributor

Micket commented Mar 20, 2024

Don't we want to start working with pytorch 2.2?

@Flamefire
Contributor Author

> Don't we want to start working with pytorch 2.2?

Indeed. However, with #20156 we have just fixed the tests for this version, so I started porting it to 2023b to see if anything goes wrong. If not, we can update to 2.2.1 and see what new stuff breaks there. I have had bad experiences with updating the version and the toolchain at the same time.

@Flamefire Flamefire force-pushed the 20240320160212_new_pr_PyTorch212 branch from e36d1ac to e481b06 Compare March 21, 2024 09:23
@casparvl
Contributor

Test report by @casparvl
FAILED
Build succeeded for 18 out of 19 (1 easyconfigs in total)
tcn1.local.snellius.surf.nl - Linux RHEL 8.6, x86_64, AMD EPYC 7H12 64-Core Processor, Python 3.6.8
See https://gist.github.com/casparvl/0c768aa7d7be73121dfeb55dcc3db58a for a full test report.

@casparvl
Contributor

@boegelbot please test @ generoso
CORE_CNT=16

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on login1

PR test command 'EB_PR=20171 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_20171 --ntasks="16" ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 13178

Test results coming soon (I hope)...

- notification for comment with ID 2014681873 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl
Contributor

So, the three errors I see are all of the form:

FAILED [2.5679s] test_jit_legacy.py::TestScript::test_file_reader_no_memory_leak - AssertionError
=============================== 1 failed, 2336 passed, 151 skipped, 12 xfailed, 2 rerun in 135.92s (0:02:15) ===============================
FINISHED PRINTING LOG FILE of test_jit_legacy 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023b/pytorch-v2.1.2/test/test-reports/test_jit_legacy_8xqb7dp6.log)

test_jit_legacy 1/1 failed!
Running test_jit_profiling 1/1 ... [2024-03-21 20:33:19.721374]
Executing ['/scratch-nvme/1/casparl/generic/software/Python/3.11.5-GCCcore-13.2.0/bin/python', '-bb', 'test_jit_profiling.py', '--shard-id=0', '--num-shards=1', '-v', '-vv', '-rfEX', '-p', 'no:xdist', '--use-pytest', '--reruns=2'] ... [2024-03-21 20:33:19.721709]

Expand the folded group to see the beginning of the log file of test_jit_profiling 1/1
##[group]PRINTING BEGINNING OF LOG FILE of test_jit_profiling 1/1 (/gpfs/nvme1/1/casparl/ebbuildpath/PyTorch/2.1.2/foss-2023b/pytorch-v2.1.2/test/test-reports/test_jit_profiling_5ic2kydo.log)
CUDA not available, skipping tests
monkeytype is not installed. Skipping tests for Profile-Directed Typing
=========================================================== test session starts ============================================================

I'm not sure if that's something you recognize. It might be an issue on our system though: we have recently been having an issue with a memory leak in the kernel :\

@casparvl
Contributor

Also, even though this counts as three failures, they all seem to be identical, just in different test files. So... I'm actually not too concerned with this - I consider it to be one failure, not three, and there's a reasonable chance it's due to our kernel.
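
To illustrate the "one failure, not three" reading, here is a minimal sketch that groups pytest `FAILED` summary lines by test name and reason, ignoring the test file. The first example line is taken from the log above; the other two file names are placeholders made up for illustration, since the thread does not name the other two failing files. The helper `group_failures` is hypothetical and not part of EasyBuild or PyTorch.

```python
import re
from collections import defaultdict

# Matches pytest short-summary lines such as:
# FAILED [2.5679s] test_jit_legacy.py::TestScript::test_file_reader_no_memory_leak - AssertionError
FAILED_LINE = re.compile(
    r"^FAILED(?: \[[^\]]*\])? (?P<file>[^:\s]+)::(?P<test>\S+)(?: - (?P<reason>.*))?$"
)


def group_failures(log_lines):
    """Map (test id without the file name, reason) to the sorted list of files it failed in."""
    groups = defaultdict(set)
    for line in log_lines:
        match = FAILED_LINE.match(line.strip())
        if match:
            key = (match["test"], match["reason"] or "")
            groups[key].add(match["file"])
    return {key: sorted(files) for key, files in groups.items()}


example_lines = [
    "FAILED [2.5679s] test_jit_legacy.py::TestScript::test_file_reader_no_memory_leak - AssertionError",
    # The two file names below are hypothetical placeholders, not taken from the report.
    "FAILED [2.6s] test_placeholder_a.py::TestScript::test_file_reader_no_memory_leak - AssertionError",
    "FAILED [2.4s] test_placeholder_b.py::TestScript::test_file_reader_no_memory_leak - AssertionError",
]

for (test, reason), files in group_failures(example_lines).items():
    print(f"{test} ({reason}): failed in {len(files)} file(s)")
```

Grouping this way only shows that the failures share a test name and reason; it does not prove a common root cause such as the kernel memory leak.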

@casparvl
Contributor

Just to double check: do you recognize this error @Flamefire ?

@Flamefire
Contributor Author

> Just to double check: do you recognize this error @Flamefire ?

Doesn't ring a bell for me. I'll (re)start the tests for this on our systems and if I don't see it on either I'd either increase the number of allowed failures or leave it to your site to ignore it.

@casparvl
Contributor

@boegelbot please test @ jsc-zen3
CORE_CNT=16

@boegelbot
Collaborator

@casparvl: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=20171 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_20171 --ntasks="16" ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3837

Test results coming soon (I hope)...

- notification for comment with ID 2014990327 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@casparvl
Contributor

> I'll (re)start the tests for this on our systems and if I don't see it on either I'd either increase the number of allowed failures or leave it to your site to ignore it.

Ok. Either way is fine by me btw. Let's see how things go on the test clusters. If those pass with the current amount of max_failed_tests, I propose to leave it as is. It's easy for us to increase it with a hook if needed.
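
For reference, a site-side hook of the kind mentioned above could look roughly like the sketch below. It assumes EasyBuild's `parse_hook` mechanism (a hooks file passed via `eb --hooks=...`) and the `max_failed_tests` easyconfig parameter discussed in this thread. The threshold `SITE_MAX_FAILED_TESTS` and its value are hypothetical and not something proposed in this PR.

```python
# hooks.py: a site-specific EasyBuild hooks file, e.g. used as `eb --hooks=hooks.py ...`

# Hypothetical site-specific threshold; pick a value that fits your own system.
SITE_MAX_FAILED_TESTS = 5


def parse_hook(ec, *args, **kwargs):
    """Raise max_failed_tests for PyTorch easyconfigs, never lowering the packaged value."""
    if ec.name == 'PyTorch':
        current = ec['max_failed_tests'] or 0
        if current < SITE_MAX_FAILED_TESTS:
            ec['max_failed_tests'] = SITE_MAX_FAILED_TESTS
```

Keeping the override in a hook leaves the easyconfig in the repository unchanged, so other sites keep the stricter default.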

@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
n1420 - Linux RHEL 8.7 (Ootpa), x86_64, Intel(R) Xeon(R) Platinum 8470 (icelake), Python 3.8.13
See https://gist.github.com/Flamefire/46a17d478eb15aa6bb2392a1921f3502 for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
cnx1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/f44de677450348438b2928c39857a54b for a full test report.

@boegelbot
Collaborator

Test report by @boegelbot
SUCCESS
Build succeeded for 9 out of 9 (1 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/213862ece77e216a5e3d0469e2e967a2 for a full test report.

Contributor

@casparvl casparvl left a comment

Lgtm!

@casparvl casparvl added this to the release after 4.9.0 milestone Mar 24, 2024
@casparvl
Contributor

Going in, thanks @Flamefire!

@casparvl casparvl merged commit f0c8a01 into easybuilders:develop Mar 24, 2024
9 checks passed
@Flamefire Flamefire deleted the 20240320160212_new_pr_PyTorch212 branch March 25, 2024 09:40
@Flamefire
Contributor Author

Test report by @Flamefire
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
i8024 - Linux Rocky Linux 8.7 (Green Obsidian), x86_64, AMD EPYC 7352 24-Core Processor (zen2), 8 x NVIDIA NVIDIA A100-SXM4-40GB, 545.23.08, Python 3.8.13
See https://gist.github.com/Flamefire/1ebab8dc07a572b5d66adbddf95632bc for a full test report.


4 participants