
Command-line argument --cuda-device is ineffective for HIP/ROCm (AMD) backend #5585

Closed
Bratzmeister opened this issue Nov 11, 2024 · 1 comment
Labels
Potential Bug User is reporting a bug. This should be tested.

Comments

@Bratzmeister
Contributor

Expected Behavior

When using e.g. --cuda-device 1, I expect ComfyUI to use the device with ID 1.

Actual Behavior

It doesn't matter what I set in --cuda-device; ComfyUI uses the device with ID 0 in any case.

Steps to Reproduce

  1. start ComfyUI with --cuda-device <any ID other than 0> on an AMD/HIP system
  2. load any workflow
  3. queue/start inference or whatever else you do with comfy on your GPU
  4. see in system/OS monitoring tools that GPU with ID 0 is used no matter what was set in step 1

Debug Logs

It's not visible from the logs: no error is created per se, because the log output assumes the operation is complete once the environment variable CUDA_VISIBLE_DEVICES has been set, and that assignment succeeds. However, I can reproduce the behavior in plain Python/torch. See the next field for a PoC and explanation.

Other

For reference, I have two AMD RX 7900 XTs in my system. With --cuda-device 1, only the 2nd GPU is supposed to be exposed to torch/CUDA, but since the switch is not working (the environment variable is ignored by pytorch+rocm), torch falls back to the default CUDA device, cuda0, i.e. my 1st GPU instead of the 2nd. Below is an example showing that the correct environment variable yields the expected result.

Example: setting CUDA_VISIBLE_DEVICES to 1, as --cuda-device 1 implements it via

os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda_device)

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ CUDA_VISIBLE_DEVICES=1 python
Python 3.11.10 (main, Sep 20 2024, 14:12:56) [GCC 13.3.1 20240614] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda # output is empty because no cuda backend exists for this torch version  
>>> torch.version.hip
'6.1.40091-a8dbc0c19'
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 7900 XT'
>>> torch.cuda.get_device_name(1)
'AMD Radeon RX 7900 XT'

However, when setting HIP_VISIBLE_DEVICES instead, it actually works:

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ HIP_VISIBLE_DEVICES=1 python 
Python 3.11.10 (main, Sep 20 2024, 14:12:56) [GCC 13.3.1 20240614] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1

Sadly, the PyTorch and ROCm documentation is a bit misleading in this regard: one would assume the two env vars are interchangeable, but despite the assumption in the ROCm documentation (see links below), that is apparently not the case for PyTorch.

https://pytorch.org/docs/stable/notes/hip.html
https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html#cuda-visible-devices
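A possible fix would be to set both variables before torch is imported, so the restriction takes effect on CUDA builds (via CUDA_VISIBLE_DEVICES) and on ROCm/HIP builds (via HIP_VISIBLE_DEVICES). A hedged sketch, not ComfyUI's actual implementation; `restrict_to_cuda_device` is a hypothetical name:

```python
import os
import sys

def restrict_to_cuda_device(device_id):
    # Conservative check: both runtimes read these variables when they
    # initialize, so set them before torch is imported at all.
    assert 'torch' not in sys.modules, "set device visibility before importing torch"
    dev = str(device_id)
    os.environ['CUDA_VISIBLE_DEVICES'] = dev  # honored by CUDA builds
    os.environ['HIP_VISIBLE_DEVICES'] = dev   # honored by ROCm/HIP builds

restrict_to_cuda_device(1)
```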

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ rocm-smi -i

============================ ROCm System Management Interface ============================
=========================================== ID ===========================================
GPU[0]		: Device Name: 		Navi 31 [Radeon RX 7900 XT/7900 XTX/7900M]
GPU[0]		: Device ID: 		0x744c
GPU[0]		: Device Rev: 		0xcc
GPU[0]		: Subsystem ID: 	NITRO+ RX 7900 XT Vapor-X
GPU[0]		: GUID: 		56961
GPU[1]		: Device Name: 		Navi 31 [Radeon RX 7900 XT/7900 XTX/7900M]
GPU[1]		: Device ID: 		0x744c
GPU[1]		: Device Rev: 		0xcc
GPU[1]		: Subsystem ID: 	0x5317
GPU[1]		: GUID: 		28574
==========================================================================================
================================== End of ROCm SMI Log ===================================
@Bratzmeister Bratzmeister added the Potential Bug User is reporting a bug. This should be tested. label Nov 11, 2024
@Bratzmeister
Contributor Author

I changed the code myself as suggested above. However, there seems to be an issue with my dual-GPU setup: I get a segfault after the model is loaded, when inference is supposed to start. Maybe someone with a similar setup (integrated graphics might work too, but I don't have any that is supported by ROCm) can test this.
