{lib}[GCCcore/12.3.0] PyTorch-bundle v2.1.2, SoX v14.4.2, libmad v0.15.1b, ... w/ CUDA 12.1.1 #19987

Merged

Conversation

Micket
Contributor

@Micket Micket commented Feb 28, 2024

(created using eb --new-pr)

I have combined what I found from several conflicting PRs here to make what I think we want to have as the PyTorch-bundle.

I made some choices here:

  1. Borrowed pieces from other PRs and combined them into this one
  2. Used SentencePiece with the existing gperftools instead of introducing a duplicate
  3. Verified that torchvision and torchtext use versions supported for PyTorch 2.1
  4. Added torchaudio
  5. Kept torchdata despite it being abandoned (it was abandoned only recently)

(I don't really mind how this gets merged; if someone wants to update their PR instead, feel free to close this or just use it as a reference.)

…oX-14.4.2-GCCcore-12.3.0.eb, libmad-0.15.1b-GCCcore-12.3.0.eb, SentencePiece-0.2.0-GCC-12.3.0.eb
@easybuilders easybuilders deleted a comment from boegelbot Feb 28, 2024
@easybuilders easybuilders deleted a comment from boegelbot Feb 28, 2024
@jfgrimm jfgrimm added this to the 4.x milestone Feb 28, 2024
@jfgrimm
Member

jfgrimm commented Feb 28, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
gpu22.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/0c24f489ee602ccdae65b2561aa04572 for a full test report.

@jfgrimm
Member

jfgrimm commented Feb 28, 2024

@boegelbot: please test @ jsc-zen3

@boegelbot
Collaborator

@jfgrimm: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=19987 EB_ARGS= EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_19987 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3691

Test results coming soon (I hope)...

- notification for comment with ID 1968569676 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

boegelbot commented Feb 28, 2024

Test report by @boegelbot
FAILED
Build succeeded for 4 out of 5 (4 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/09cf6f9d7dc502b335ffba08e041ee14 for a full test report.

edit: succeeded apart from a sanity check that requires a GPU

@Micket
Contributor Author

Micket commented Feb 28, 2024

RuntimeError: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx

Huh, jsc-zen3 should have that, right?
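
For illustration only: a GPU-dependent sanity check fails with the error above on a node without an NVIDIA driver. The sketch below shows one heuristic way to detect that situation before running such a check. It is not taken from this PR or from easyblocks PR 3236; the function names are hypothetical, and the two probes (the `/proc/driver/nvidia/version` file and `nvidia-smi` on `PATH`) are assumptions about a typical Linux node.

```python
import os
import shutil


def nvidia_driver_present() -> bool:
    """Best-effort, heuristic check for an NVIDIA driver on a Linux node."""
    # The NVIDIA kernel module exposes this file once the driver is loaded.
    if os.path.exists("/proc/driver/nvidia/version"):
        return True
    # Fall back to looking for the nvidia-smi tool on PATH.
    return shutil.which("nvidia-smi") is not None


def should_run_gpu_sanity_check() -> bool:
    # Hypothetical policy: skip GPU-only checks on nodes without a driver.
    return nvidia_driver_present()


if __name__ == "__main__":
    print("run GPU sanity check:", should_run_gpu_sanity_check())
```

On a CPU-only build node this would report that the GPU check should be skipped rather than raising the `RuntimeError` quoted above.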

@jfgrimm
Member

jfgrimm commented Feb 28, 2024

@boegelbot: please test @ jsc-zen3
EB_ARGS=--include-easyblocks-from-pr=3236

@boegelbot
Collaborator

@jfgrimm: Request for testing this PR well received on jsczen3l1.int.jsc-zen3.fz-juelich.de

PR test command 'if [[ develop != 'develop' ]]; then EB_BRANCH=develop ./easybuild_develop.sh 2> /dev/null 1>&2; EB_PREFIX=/home/boegelbot/easybuild/develop source init_env_easybuild_develop.sh; fi; EB_PR=19987 EB_ARGS="--include-easyblocks-from-pr=3236" EB_CONTAINER= EB_REPO=easybuild-easyconfigs EB_BRANCH=develop /opt/software/slurm/bin/sbatch --job-name test_PR_19987 --ntasks=8 ~/boegelbot/eb_from_pr_upload_jsc-zen3.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 3696

Test results coming soon (I hope)...

- notification for comment with ID 1969491560 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3236
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
jsczen3c1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), Python 3.9.18
See https://gist.github.com/boegelbot/edc71ffb6e1d18e07ccea070c5c1191a for a full test report.

@SebastianAchilles
Member

Test report by @SebastianAchilles
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
jsczen3g1.int.jsc-zen3.fz-juelich.de - Linux Rocky Linux 9.3, x86_64, AMD EPYC-Milan Processor (zen3), 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.23.08, Python 3.9.18
See https://gist.github.com/SebastianAchilles/3d75091724ac38810a1d1d29c8ff4744 for a full test report.

@jfgrimm
Member

jfgrimm commented Feb 29, 2024

@boegelbot: please test @ generoso
EB_ARGS=--include-easyblocks-from-pr=3236

@boegelbot
Collaborator

@jfgrimm: Request for testing this PR well received on login1

PR test command 'EB_PR=19987 EB_ARGS="--include-easyblocks-from-pr=3236" EB_CONTAINER= EB_REPO=easybuild-easyconfigs /opt/software/slurm/bin/sbatch --job-name test_PR_19987 --ntasks=4 ~/boegelbot/eb_from_pr_upload_generoso.sh' executed!

  • exit code: 0
  • output:
Submitted batch job 12991

Test results coming soon (I hope)...

- notification for comment with ID 1970894376 processed

Message to humans: this is just bookkeeping information for me,
it is of no use to you (unless you think I have a bug, which I don't).

@boegelbot
Collaborator

Test report by @boegelbot
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3236
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
cns1 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz (haswell), Python 3.6.8
See https://gist.github.com/boegelbot/63054e19e7841ff7438dbc464e51aea8 for a full test report.

@jfgrimm
Member

jfgrimm commented Feb 29, 2024

@Micket I've made a PR to your fork that makes sure the EasyBuild-provided SoX is used for torchaudio: Micket#19

…dle212

patch torchaudio to use external sox; enable CUDA support
@jfgrimm
Member

jfgrimm commented Feb 29, 2024

Test report by @jfgrimm
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
gpu21.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7413 24-Core Processor, 1 x NVIDIA NVIDIA H100 PCIe, 535.86.10, Python 3.6.8
See https://gist.github.com/jfgrimm/de8c44168aed72593159b5fbc6822460 for a full test report.

@jfgrimm
Member

jfgrimm commented Feb 29, 2024

Test report by @jfgrimm
Using easyblocks from PR(s) easybuilders/easybuild-easyblocks#3236
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
himem01.viking2.yor.alces.network - Linux Rocky Linux 8.8, x86_64, AMD EPYC 7643 48-Core Processor, Python 3.6.8
See https://gist.github.com/jfgrimm/9887ac847fe7ff2acdcfe08bad3d9822 for a full test report.

'" and not kaldi_io_test"' # requires kaldi_io
'" and not test_dup_hw_acel"' # requires special render device permissions
'" and not test_h264_cuvid"' # requires special render device permissions
'" and not test_hecv_cuvid"' # requires special render device permissions
Contributor

Typo, should be hevc
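
The four fragments under review read like adjacent Python string literals, each wrapped in its own pair of shell double quotes. A minimal sketch of how such fragments would combine into a single pytest `-k` expression follows. The `pytest -v test/` prefix and the `some_first_test` clause are hypothetical placeholders that do not come from the diff, and `test_hevc_cuvid` uses the spelling suggested in the review comment.

```python
import shlex

# Adjacent Python string literals are concatenated at compile time.
# Only the '" and not ..."' fragments mirror the reviewed diff; the
# command prefix and the first clause are hypothetical placeholders.
runtest = (
    'pytest -v test/ -k '
    '"not some_first_test"'
    '" and not kaldi_io_test"'      # requires kaldi_io
    '" and not test_dup_hw_acel"'   # requires special render device permissions
    '" and not test_h264_cuvid"'    # requires special render device permissions
    '" and not test_hevc_cuvid"'    # requires special render device permissions
)

# A POSIX shell joins back-to-back quoted segments into one word;
# shlex.split follows the same rule, so -k receives a single argument.
argv = shlex.split(runtest)
k_expr = argv[argv.index('-k') + 1]

if __name__ == "__main__":
    print(k_expr)
```

Because every fragment opens and closes its own double quotes, a misspelled test name in one fragment does not break the command; it only means that test is not deselected.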

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-04 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/fac06f4e6ebcdcd4edb03ed44fcd5372 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-04 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/d42b620745b746b59e5a7f49c403dda7 for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-05 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/6b7113318ae567cef3a5ff97d39fdeeb for a full test report.

@VRehnberg
Contributor

Test report by @VRehnberg
SUCCESS
Build succeeded for 1 out of 1 (1 easyconfigs in total)
alvis1-04 - Linux Rocky Linux 8.9, x86_64, Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz, 1 x NVIDIA Tesla V100-SXM2-32GB, 550.54.14, Python 3.6.8
See https://gist.github.com/VRehnberg/e1a80da7762acaa5fa8467499586c470 for a full test report.

@akesandgren
Contributor

Most annoying...
On my Ubuntu 20.04, broadwell, V100 node tests/ignite/contrib/handlers/test_tqdm_logger.py passes
but on my Ubuntu 22.04 zen3, A100 node it fails.

I think I've rebuilt everything recently enough on both systems...

I get this:


capsys = <_pytest.capture.CaptureFixture object at 0x7ff16d613790>

    def test_pbar(capsys):
        n_epochs = 2
        loader = [1, 2]
        engine = Engine(update_fn)
    
        pbar = ProgressBar()
        pbar.attach(engine, ["a"])
    
        engine.run(loader, max_epochs=n_epochs)
    
        captured = capsys.readouterr()
        err = captured.err.split("\r")
        err = list(map(lambda x: x.strip(), err))
        err = list(filter(None, err))
        if get_tqdm_version() < Version("4.49.0"):
            expected = "Epoch [2/2]: [1/2]  50%|     , a=1 [00:00<00:00]"
        else:
            expected = "Epoch [2/2]: [1/2]  50%|     , a=1 [00:00<?]"
>       assert err[-1] == expected
E       AssertionError: assert 'Epoch [2/2]:...a=1 [00:00<?]' == 'Epoch [2/2]:...a=1 [00:00<?]'
E         - Epoch [2/2]: [1/2]  50%|     , a=1 [00:00<?]
E         + Epoch [2/2]: [1/2]  50%|                                                   , a=1 [00:00<?]

tests/ignite/contrib/handlers/test_tqdm_logger.py:61: AssertionError

and the same in a bunch of other tqdm tests

@VRehnberg
Contributor

So the padding whitespace in tqdm's progress bar is wider than the test expects. As far as failures go, it's about as benign as I can imagine. But I agree that it is annoying.

I suppose we can drop it along with all the other logger tests; it's probably an issue with tqdm rather than ignite. Still, it was nice to have a test showing that not all integration with third-party loggers had failed.
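
To make the nature of the mismatch concrete, here is a small sketch showing that the two strings from the failure differ only in the amount of padding. The `normalize_bar` helper is hypothetical and is not how ignite's test suite compares output; the 51-space padding in `actual` is an illustrative stand-in for the wider bar in the report.

```python
import re


def normalize_bar(line: str) -> str:
    """Collapse every run of whitespace into a single space (hypothetical helper)."""
    return re.sub(r"\s+", " ", line.strip())


# Expected string as written in the test.
expected = "Epoch [2/2]: [1/2]  50%|     , a=1 [00:00<?]"
# Same content with a wider bar; the exact width here is illustrative.
actual = "Epoch [2/2]: [1/2]  50%|" + " " * 51 + ", a=1 [00:00<?]"

if __name__ == "__main__":
    print(expected == actual)
    print(normalize_bar(expected) == normalize_bar(actual))
```

The strict comparison fails while the whitespace-insensitive one succeeds, which matches the reading that only the bar width changed.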

@Micket
Contributor Author

Micket commented Apr 11, 2024

I don't care anymore. I'm just ignoring it until this is merged.

@akesandgren
Contributor

Gah.... inject-checksums messed things up...
refixing...

@akesandgren
Contributor

Test report by @akesandgren
FAILED
Build succeeded for 3 out of 4 (4 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/6e4c65774270b117665773d9e099462b for a full test report.

@Micket
Contributor Author

Micket commented Apr 12, 2024

You also removed the --ignore=tests/ignite/contrib/handlers/test_tqdm_logger.py???

@akesandgren
Contributor

akesandgren commented Apr 12, 2024

Test report by @akesandgren
FAILED
Build succeeded for 3 out of 4 (4 easyconfigs in total)
b-cn1502.hpc2n.umu.se - Linux Ubuntu 20.04, x86_64, Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz, 2 x NVIDIA Tesla V100-PCIE-16GB, 545.29.06, Python 3.8.10
See https://gist.github.com/akesandgren/aefea610bba329c00498ceef17ae8e9d for a full test report.

Running out of CUDA memory... not much to do then, I guess.

@akesandgren
Contributor

> You also removed the --ignore=tests/ignite/contrib/handlers/test_tqdm_logger.py???

Yeah, I forgot to reload before adding my stuff...
Re-added it now.

@akesandgren
Contributor

Test report by @akesandgren
SUCCESS
Build succeeded for 4 out of 4 (4 easyconfigs in total)
b-cn1603.hpc2n.umu.se - Linux Ubuntu 22.04, x86_64, AMD EPYC 7313 16-Core Processor, 1 x NVIDIA NVIDIA A100 80GB PCIe, 545.29.06, Python 3.10.12
See https://gist.github.com/akesandgren/6c4753a4d668255c5a6ae5c2e53c6a48 for a full test report.

Contributor

@akesandgren akesandgren left a comment

LGTM

@akesandgren akesandgren modified the milestones: 4.x, release after 4.9.1 Apr 12, 2024
@akesandgren
Contributor

Going in, thanks @Micket!

@akesandgren akesandgren merged commit c04dd25 into easybuilders:develop Apr 12, 2024
9 checks passed
@Micket Micket deleted the 20240228001016_new_pr_PyTorch-bundle212 branch April 12, 2024 14:29