
Command-line argument --cuda-device is ineffective for HIP/ROCm (AMD) backend #5585

Closed
Bratzmeister opened this issue Nov 11, 2024 · 1 comment
Labels
Potential Bug User is reporting a bug. This should be tested.

Comments

@Bratzmeister
Contributor

Expected Behavior

When using e.g. --cuda-device 1, I expect ComfyUI to use the device with ID 1.

Actual Behavior

It doesn't matter what I set in --cuda-device; ComfyUI uses the device with ID 0 in any case.

Steps to Reproduce

  1. start ComfyUI with --cuda-device <any ID other than 0> on an AMD/HIP system
  2. load any workflow
  3. queue/start inference or whatever else you do with comfy on your GPU
  4. see in system/OS monitoring tools that GPU with ID 0 is used no matter what was set in step 1

Debug Logs

It's not visible from the logs: no error is created per se, because the log output assumes the operation is complete once the environment variable CUDA_VISIBLE_DEVICES has been set, and that assignment succeeds. However, I can reproduce the behavior in plain Python/torch. See the next field for a PoC and explanation.

Other

For reference, I have two AMD RX 7900 XTs in my system. With --cuda-device 1, only the 2nd GPU is supposed to be exposed to torch/CUDA, but since the switch is not working (the environment variable is ignored by pytorch+rocm), torch falls back to the default CUDA device, cuda0, i.e. my 1st GPU instead of the 2nd. Below is an example showing that the correct environment variable yields the expected result.

Example: setting CUDA_VISIBLE_DEVICES to 1, as --cuda-device 1 implements it via

os.environ['CUDA_VISIBLE_DEVICES'] = str(args.cuda_device)

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ CUDA_VISIBLE_DEVICES=1 python
Python 3.11.10 (main, Sep 20 2024, 14:12:56) [GCC 13.3.1 20240614] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.version.cuda # output is empty because no cuda backend exists for this torch version  
>>> torch.version.hip
'6.1.40091-a8dbc0c19'
>>> torch.cuda.device_count()
2
>>> torch.cuda.get_device_name(0)
'AMD Radeon RX 7900 XT'
>>> torch.cuda.get_device_name(1)
'AMD Radeon RX 7900 XT'

However, when setting HIP_VISIBLE_DEVICES instead, it actually works:

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ HIP_VISIBLE_DEVICES=1 python 
Python 3.11.10 (main, Sep 20 2024, 14:12:56) [GCC 13.3.1 20240614] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.device_count()
1

Sadly, the PyTorch and ROCm documentation is a bit misleading in this regard: one would assume the two env vars are interchangeable, but despite the assumption in the ROCm documentation (see links below), that is apparently not the case for PyTorch.

https://pytorch.org/docs/stable/notes/hip.html
https://rocm.docs.amd.com/en/latest/conceptual/gpu-isolation.html#cuda-visible-devices
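A possible fix would be to set both variables before torch is imported, so the restriction takes effect on CUDA builds (via CUDA_VISIBLE_DEVICES) and on ROCm/HIP builds (via HIP_VISIBLE_DEVICES). A hedged sketch, not ComfyUI's actual implementation; `restrict_to_cuda_device` is a hypothetical name:

```python
import os
import sys

def restrict_to_cuda_device(device_id):
    # Conservative check: both runtimes read these variables when they
    # initialize, so set them before torch is imported at all.
    assert 'torch' not in sys.modules, "set device visibility before importing torch"
    dev = str(device_id)
    os.environ['CUDA_VISIBLE_DEVICES'] = dev  # honored by CUDA builds
    os.environ['HIP_VISIBLE_DEVICES'] = dev   # honored by ROCm/HIP builds

restrict_to_cuda_device(1)
```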

(comfy_env) axt@weilichskann ~/zeugs/AI/ComfyUI $ rocm-smi -i

============================ ROCm System Management Interface ============================
=========================================== ID ===========================================
GPU[0]		: Device Name: 		Navi 31 [Radeon RX 7900 XT/7900 XTX/7900M]
GPU[0]		: Device ID: 		0x744c
GPU[0]		: Device Rev: 		0xcc
GPU[0]		: Subsystem ID: 	NITRO+ RX 7900 XT Vapor-X
GPU[0]		: GUID: 		56961
GPU[1]		: Device Name: 		Navi 31 [Radeon RX 7900 XT/7900 XTX/7900M]
GPU[1]		: Device ID: 		0x744c
GPU[1]		: Device Rev: 		0xcc
GPU[1]		: Subsystem ID: 	0x5317
GPU[1]		: GUID: 		28574
==========================================================================================
================================== End of ROCm SMI Log ===================================
@Bratzmeister Bratzmeister added the Potential Bug User is reporting a bug. This should be tested. label Nov 11, 2024
@Bratzmeister
Contributor Author

I changed the code myself as suggested above. However, there seems to be an issue with my dual-GPU setup: I get a segfault after the model is loaded, when inference is supposed to start. Maybe someone with a similar setup (integrated graphics might work too, but I don't have any that is supported by ROCm) can test this.
