
[bug]: Installer installs torch CUDA even when ROCm is selected #7146

Closed
1 task done
max-maag opened this issue Oct 18, 2024 · 0 comments · Fixed by #7147
Labels
bug Something isn't working

Comments

@max-maag
Contributor

max-maag commented Oct 18, 2024

Is there an existing issue for this problem?

  • I have searched the existing issues

Operating system

Linux

GPU vendor

AMD (ROCm)

GPU model

RX 6650 XT

GPU VRAM

8GB

Version number

5.1.1, 5.2.0

Browser

n/a

Python dependencies

No response

What happened

When launching an Invoke server that was installed with ROCm support, the CPU is selected as the torch device.

What you expected to happen

The Invoke server should use the dedicated GPU.

How to reproduce the problem

Additional context

I tried manually installing torch with ROCm support in a fresh venv. Invoke's installer script tries to install torch 2.4.1 with ROCm 5.6, so those are the versions I tried to install:

>pip install torch==2.4.1 torchvision==0.19.1 --index-url https://download.pytorch.org/whl/rocm5.6
Looking in indexes: https://download.pytorch.org/whl/rocm5.6
ERROR: Could not find a version that satisfies the requirement torch==2.4.1 (from versions: 2.2.0+rocm5.6, 2.2.1+rocm5.6, 2.2.2+rocm5.6)
ERROR: No matching distribution found for torch==2.4.1

I then tried the most recent ROCm version, 6.2, with the same result, except that only version 2.5.0+rocm6.2 is reported as available.

Finally, I tried ROCm 6.1, which worked.
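The availability constraints above can be sketched as a small compatibility table. This is a hypothetical illustration, not part of Invoke's installer (the `TORCH_ROCM_COMPAT` name and the `rocm_index_url` helper are mine); the version pairs come from the pip experiments described above.

```python
# Hypothetical sketch: which ROCm wheel index each torch release is actually
# published on, based on the pip experiments described above.
TORCH_ROCM_COMPAT = {
    "2.2.2": "rocm5.6",  # newest torch release on the rocm5.6 index
    "2.4.1": "rocm6.1",  # the version Invoke pins; NOT available on rocm5.6
    "2.5.0": "rocm6.2",  # the only release on the rocm6.2 index
}

def rocm_index_url(torch_version: str) -> str:
    """Return the PyTorch wheel index URL for a pinned torch version,
    failing loudly instead of silently falling back to a CUDA build."""
    try:
        rocm = TORCH_ROCM_COMPAT[torch_version]
    except KeyError:
        raise ValueError(
            f"No known ROCm wheel for torch=={torch_version}; "
            "check https://pytorch.org/get-started/previous-versions/"
        ) from None
    return f"https://download.pytorch.org/whl/{rocm}"
```

For torch 2.4.1 this yields the rocm6.1 index URL that made the install work.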

Fixing the installer by changing the URL in installer.py:410 to "https://download.pytorch.org/whl/rocm6.1" results in the server using the dedicated GPU by default. Setting the CUDA_VERSION and HSA_OVERRIDE_GFX_VERSION environment variables is still necessary, though, as it was in the last Invoke version I used, 4.2.7post1: while the server starts and the log reports that the correct GPU is in use, attempting to generate an image fails with "RuntimeError: HIP error: invalid device function" if the variables are not set correctly.
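For anyone hitting that HIP error, a sketch of the environment this would need. Treat the value as an assumption to verify for your own card: 10.3.0 is the override commonly used for RDNA2 consumer GPUs like the RX 6650 XT (gfx1032), and the correct CUDA_VERSION value depends on your setup.

```shell
# Commonly used override for RDNA2 consumer cards (RX 6650 XT = gfx1032):
# report the gfx1030 ISA, which the official ROCm wheels ship kernels for.
# Verify this value for your own card.
export HSA_OVERRIDE_GFX_VERSION=10.3.0
# CUDA_VERSION must also be set as described above; its value depends on your setup.
```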

As far as I can tell, bumping ROCm from 5.6 to 6.1 works without issues. The pytorch documentation for installing 2.4.1 also uses ROCm 6.1.

It might also be worthwhile to think about why this issue happened and how to prevent it from recurring. Unfortunately, I don't have any good answers for that. Testing release candidates on all supported platforms would be ideal but also expensive.
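One cheap guard in that direction: after installation, check that the installed torch build actually matches the backend the user selected, instead of silently running on a CUDA or CPU wheel. A hypothetical sketch (the `check_backend` helper is not part of Invoke; it inspects the local version segment of `torch.__version__`, e.g. "2.4.1+rocm6.1"):

```python
# Hypothetical post-install sanity check, not part of Invoke's installer.
def check_backend(torch_version_string: str, selected: str) -> None:
    """Fail loudly if the installed torch build does not match the
    selected backend. torch_version_string is torch.__version__,
    e.g. '2.4.1+rocm6.1'; selected is 'rocm' or 'cuda'."""
    # The local version segment after '+' names the build: 'rocm6.1', 'cu124', ...
    local = torch_version_string.partition("+")[2]
    if selected == "rocm" and not local.startswith("rocm"):
        raise RuntimeError(
            f"Expected a ROCm build of torch but got '{torch_version_string}'; "
            "the installer probably fell back to a CUDA/CPU wheel."
        )
    if selected == "cuda" and not local.startswith("cu"):
        raise RuntimeError(
            f"Expected a CUDA build of torch but got '{torch_version_string}'."
        )
```

With the broken rocm5.6 index, pip would have installed a CUDA build and this check would have failed at install time rather than at first generation.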

Discord username

No response

@max-maag max-maag added the bug Something isn't working label Oct 18, 2024
max-maag added a commit to max-maag/InvokeAI that referenced this issue Oct 18, 2024
Each version of torch is only available for specific versions of CUDA and ROCm.
The Invoke installer tries to install torch 2.4.1 with ROCm 5.6 support, which
 does not exist. As a result, the installation falls back to the default CUDA
version so AMD GPUs aren't detected. This commit fixes that by bumping the
ROCm version to 6.1.

Closes invoke-ai#7146
max-maag added a commit to max-maag/InvokeAI that referenced this issue Oct 18, 2024
Each version of torch is only available for specific versions of CUDA and ROCm.
The Invoke installer tries to install torch 2.4.1 with ROCm 5.6 support, which
does not exist. As a result, the installation falls back to the default CUDA
version so AMD GPUs aren't detected. This commit fixes that by bumping the
ROCm version to 6.1, as suggested by the PyTorch documentation. [1]

The specified CUDA version of 12.4 is still correct according to [1], so it does
not need to be changed.

Closes invoke-ai#7146

[1]: https://pytorch.org/get-started/previous-versions/#v241
max-maag added a commit to max-maag/InvokeAI that referenced this issue Oct 19, 2024
Each version of torch is only available for specific versions of CUDA and ROCm.
The Invoke installer and dockerfile try to install torch 2.4.1 with ROCm 5.6
support, which does not exist. As a result, the installation falls back to the
default CUDA version so AMD GPUs aren't detected. This commit fixes that by
bumping the ROCm version to 6.1, as suggested by the PyTorch documentation. [1]

The specified CUDA version of 12.4 is still correct according to [1], so it does
not need to be changed.

Closes invoke-ai#7006
Closes invoke-ai#7146

[1]: https://pytorch.org/get-started/previous-versions/#v241
hipsterusername pushed a commit to max-maag/InvokeAI that referenced this issue Oct 20, 2024
psychedelicious pushed a commit to max-maag/InvokeAI that referenced this issue Oct 22, 2024