
Gloo timeout when training on multi-GPU configurations #80

Open
TheTrustedComputer opened this issue Jul 29, 2024 · 3 comments
Labels
bug, help wanted

Comments


TheTrustedComputer commented Jul 29, 2024

Describe the bug
I have two 8GB AMD Radeon RX 5500 XTs for creating RVC models; training on both is nearly twice as fast as training on a single card, and I greatly appreciate the support for distributed multi-GPU setups. However, communication between the worker processes occasionally stalls, resulting in a deadlock and an interrupted training session. Here's the runtime error output I saw after 30 minutes of inactivity:

Process Process-1:
Traceback (most recent call last):
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/thetrustedcomputer/Software/Git/RVC-Fumiama/infer/modules/train/train.py", line 278, in run
    train_and_evaluate(
  File "/home/thetrustedcomputer/Software/Git/RVC-Fumiama/infer/modules/train/train.py", line 508, in train_and_evaluate
    scaler.scale(loss_gen_all).backward()
  File "/home/thetrustedcomputer/Software/venv/RVC-Fumiama/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/home/thetrustedcomputer/Software/venv/RVC-Fumiama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 199, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
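
For what it's worth, the 1800000 ms in the message matches torch.distributed's default 30-minute collective timeout, so raising it would only delay the failure rather than fix the underlying stall, but it shows where the knob lives. A minimal sketch, assuming the training script initializes its own process group (the init_method, rank, and world_size values below are placeholders, not the repository's actual ones):

```python
import datetime

import torch.distributed as dist

# Hypothetical initialization; the repository's train.py supplies these values itself.
dist.init_process_group(
    backend="gloo",                       # the backend named in the traceback
    init_method="env://",                 # placeholder rendezvous method
    rank=0,                               # placeholder rank
    world_size=2,                         # placeholder world size (two GPUs)
    timeout=datetime.timedelta(hours=2),  # default is 30 minutes (1800000 ms)
)
```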

To Reproduce

  1. Open the web GUI, e.g. python gui.py.
  2. Go to the training tab and train a model on two or more GPUs, assuming you have the hardware, e.g. 0-1 or 0-1-2. The GPU fans will ramp up.
  3. Wait for the fans to spin down prematurely, or until there's no further logging output. This is when the issue described above occurs.

Expected behavior
Distributed training should continue without interruption until the last epoch or the user hits Ctrl+C.

Screenshots
I've attached two screenshots from radeontop showing the expected and actual GPU usage.

Expected (two GPUs sharing the load, happened minutes after initial training):
Screenshot_20240729_072231

Actual (one GPU at full load and the other idle, happened several hours later):
Screenshot_20240729_071916

Desktop (please complete the following information):

  • OS and version: Arch Linux
  • Python version: 3.10.13
  • Commit/Tag with the issue: Latest

Additional context
This isn't 100% reproducible due to the nondeterministic nature of parallelism, so it's important to run multiple training sessions to confirm it's truly fixed. I've tried changing the ROCm version (tested 5.2.3 and 5.4.3) and got the same symptoms, so the root cause may be in the training logic shown in the traceback. Although unlikely, it's also possible that the ROCm or PyTorch build I'm using has broken concurrency libraries.

At one point, an epoch took well over an hour to complete, apparently using shared or system RAM instead of the GPUs' VRAM, which was nowhere near full.

I'm uncertain if this error also affects NVIDIA GPUs. For those who have 2+ NVIDIA cards, please let us know if it applies to you.
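
One way to narrow down where the stall happens is to enable PyTorch's distributed debug logging before the training processes start. A minimal sketch using standard PyTorch environment variables (nothing here is specific to this repository; setting them in the shell before launching the GUI should work just as well):

```python
import os

# Standard PyTorch debug switches; they must be set before the worker
# processes initialize torch.distributed.
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # per-rank consistency checks and extra collective logging
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"        # surface c10d/Gloo C++ log messages
```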

fumiama (Owner) commented Jul 29, 2024

We will rewrite the whole training code later; then we can see whether this problem is solved or not.

> For those who have 2+ NVIDIA cards, please let us know if it applies to you.

Agree.

fumiama added the bug and help wanted labels on Jul 29, 2024
@charleswg

2 NVIDIA A5000 cards have no issues; both show GPU memory and compute usage.

@TheTrustedComputer (Author)

@charleswg Thank you for your insights. It appears NVIDIA cards aren't affected; this issue may only apply to AMD.
