Describe the bug
I have two 8GB AMD Radeon RX 5500 XTs for creating RVC models; training on both is nearly twice as fast as training on a single card. I greatly appreciate the support for distributed multi-GPU training setups. However, there appears to be a communication breakdown between the worker processes that deadlocks training and interrupts the session. Here's the runtime error output I saw after 30 minutes of inactivity:
Process Process-1:
Traceback (most recent call last):
File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 314, in _bootstrap
self.run()
File "/home/thetrustedcomputer/Software/Python-3.10.13/Lib/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/thetrustedcomputer/Software/Git/RVC-Fumiama/infer/modules/train/train.py", line 278, in run
train_and_evaluate(
File "/home/thetrustedcomputer/Software/Git/RVC-Fumiama/infer/modules/train/train.py", line 508, in train_and_evaluate
scaler.scale(loss_gen_all).backward()
File "/home/thetrustedcomputer/Software/venv/RVC-Fumiama/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/home/thetrustedcomputer/Software/venv/RVC-Fumiama/lib/python3.10/site-packages/torch/autograd/__init__.py", line 199, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:84] Timed out waiting 1800000ms for recv operation to complete
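For what it's worth, the 1800000 ms in that message matches torch.distributed's default 30-minute process-group timeout, which suggests one rank is stuck in a collective (most likely the gradient all-reduce inside backward()) that the other rank never enters. The sketch below only illustrates where that default comes from; it is not the project's actual initialization code, and the env:// rendezvous plus RANK/WORLD_SIZE environment variables are assumptions on my part.

# Sketch only: the Gloo backend with torch.distributed's default timeout,
# i.e. the 1800000 ms that shows up in the RuntimeError above.
import datetime
import os

import torch.distributed as dist

dist.init_process_group(
    backend="gloo",                          # Gloo appears in the error path
    init_method="env://",                    # assumed; requires MASTER_ADDR/MASTER_PORT
    rank=int(os.environ["RANK"]),            # assumed to come from the environment
    world_size=int(os.environ["WORLD_SIZE"]),
    timeout=datetime.timedelta(minutes=30),  # PyTorch's default for non-NCCL backends
)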
To Reproduce
1. Open the web GUI, e.g. python gui.py (an optional debug-logging launcher is sketched after these steps).
2. Go to the training tab and start training a model that uses two or more GPUs, assuming you have the hardware, e.g. 0-1 or 0-1-2. The GPU fans will ramp up.
3. Wait for the fans to spin down prematurely, or until there is no further logging output. This is when the issue described above occurs.
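To make the hang easier to capture while reproducing, it may help to launch the GUI with PyTorch's distributed debug logging enabled. TORCH_DISTRIBUTED_DEBUG and TORCH_CPP_LOG_LEVEL are documented PyTorch environment variables; the wrapper below is only a hypothetical convenience around the gui.py entry point named in step 1.

# Hypothetical launcher: runs gui.py with extra c10d/Gloo logging so the rank
# that stalls can be identified from the console output.
import os
import subprocess
import sys

env = dict(os.environ)
env["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # report mismatched/desynchronized collectives
env["TORCH_CPP_LOG_LEVEL"] = "INFO"        # surface c10d and Gloo log messages

subprocess.run([sys.executable, "gui.py"], env=env, check=True)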
Expected behavior
Distributed training should continue without interruption until the last epoch or the user hits Ctrl+C.
Screenshots
I've attached two screenshots from radeontop showing the expected and actual GPU usage.
Expected (two GPUs sharing the load, captured minutes after training started):
Actual (one GPU at full load and the other idle, captured several hours later):
Desktop (please complete the following information):
OS and version: Arch Linux
Python version: 3.10.13
Commit/Tag with the issue: Latest
Additional context
This isn't 100% reproducible due to the indeterministic nature of parallelism, so it's important to do multiple iterations to ensure it's absolutely fixed. I've tried changing the ROCm version (tested on 5.2.3 and 5.4.3) and got the same symptoms; the root cause may be within the training logic seen in the traceback. Although unlikely, it's also possible that the ROCm or PyTorch I'm using has broken concurrency libraries.
At one point, an epoch was completed in well over an hour, apparently using shared or system RAM instead of the GPU's VRAM, which is nowhere near full.
I'm uncertain if this error also affects NVIDIA GPUs. For those who have 2+ NVIDIA cards, please let us know if it applies to you.
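If it helps triage, the generic pattern below is one way a Gloo recv timeout like this can arise with DistributedDataParallel: if the ranks stop calling backward() in lockstep (for example because their loaders yield different numbers of batches), the rank still inside its gradient all-reduce blocks until the 30-minute timeout expires. This is an illustrative sketch only, not code from RVC-Fumiama; train_one_epoch, loader, and the loss are placeholders.

# Illustrative only: a generic DDP + GradScaler loop shape, not the project's code.
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_one_epoch(model: DDP, loader, optimizer, scaler):
    for batch in loader:                # if ranks see different batch counts...
        with torch.autocast("cuda"):
            loss = model(batch).mean()  # placeholder loss
        scaler.scale(loss).backward()   # ...the longer-running rank blocks here in the
                                        # gradient all-reduce until Gloo's recv times out
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad()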