
[Bug]: gfx906 ROCM won't work with torch: 2.0.1+rocm5.4.2 but works with other AIs #10873

Open
1 task done
KEDI103 opened this issue May 30, 2023 · 15 comments
Labels
asking-for-help-with-local-system-issues This issue is asking for help related to local system; please offer assistance

Comments


KEDI103 commented May 30, 2023

Is there an existing issue for this?

  • I have searched the existing issues and checked the recent builds/commits

What happened?

Trying to make webui work with PyTorch 2.0.1 + ROCm 5.4.2, but it won't work.

Steps to reproduce the problem

  1. Download the latest dev version.
  2. Extract it.
  3. Run webui.sh.
  4. Generate.
  5. The terminal fills with the errors below.

What should have happened?

It should generate like normal.

Commit where the problem happens

b957dcf

What Python version are you running on ?

Python 3.10.x

What platforms do you use to access the UI ?

Linux

What device are you running WebUI on?

AMD GPUs (RX 6000 above), AMD GPUs (RX 5000 below)

What browsers do you use to access the UI ?

Mozilla Firefox

Command Line Arguments

Running webui.sh directly gives me the terminal errors below.
But if I run with
--no-half --disable-nan-check
it renders black images.
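
For reference, the two runs side by side (a sketch; the flags could equally go into COMMANDLINE_ARGS in webui-user.sh, which webui.sh reads at launch):

# plain launch: fails with the NansException shown in the logs below
./webui.sh

# workaround launch: finishes, but every image comes out black
./webui.sh --no-half --disable-nan-check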

List of extensions

No extra extensions; a stock install straight from GitHub.

Console logs

Calculating sha256 for /media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors: Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 807.5s (import torch: 5.4s, import gradio: 3.2s, import ldm: 3.7s, other imports: 3.0s, setup codeformer: 0.3s, list SD models: 781.1s, load scripts: 9.8s, create ui: 0.7s, gradio launch: 0.2s).
6ce0161689b3853acaa03779ec93eafe75a02f4ced659bee03f50797806fa2fa
Loading weights [6ce0161689] from /media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/models/Stable-diffusion/v1-5-pruned-emaonly.safetensors
Creating model from config: /media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
DiffusionWrapper has 859.52 M params.
Applying optimization: sdp-no-mem... done.
Textual inversion embeddings loaded(0): 
Model loaded in 41.0s (calculate hash: 14.1s, load weights from disk: 0.3s, create model: 1.6s, apply weights to model: 18.6s, apply half(): 1.9s, load VAE: 2.8s, move model to device: 0.8s, load textual inversion embeddings: 0.9s).
  0%|                                                    | 0/20 [00:02<?, ?it/s]
Error completing request
Arguments: ('task(i3jff5ltdmmulfl)', 'miku', '', [], 20, 0, False, False, 1, 1, 7, -1.0, -1.0, 0, 0, 0, False, 512, 512, False, 0.7, 2, 'Latent', 0, 0, 0, 0, '', '', [], 0, False, False, 'positive', 'comma', 0, False, False, '', 1, '', [], 0, '', [], 0, '', [], True, False, False, False, 0) {}
Traceback (most recent call last):
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/call_queue.py", line 57, in f
    res = list(func(*args, **kwargs))
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/call_queue.py", line 37, in f
    res = func(*args, **kwargs)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/txt2img.py", line 57, in txt2img
    processed = processing.process_images(p)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/processing.py", line 611, in process_images
    res = process_images_inner(p)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/processing.py", line 729, in process_images_inner
    samples_ddim = p.sample(conditioning=p.c, unconditional_conditioning=p.uc, seeds=p.seeds, subseeds=p.subseeds, subseed_strength=p.subseed_strength, prompts=p.prompts)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/processing.py", line 977, in sample
    samples = self.sampler.sample(self, x, conditioning, unconditional_conditioning, image_conditioning=self.txt2img_image_conditioning(x))
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/sd_samplers_kdiffusion.py", line 383, in sample
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/sd_samplers_kdiffusion.py", line 257, in launch_sampling
    return func()
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/sd_samplers_kdiffusion.py", line 383, in <lambda>
    samples = self.launch_sampling(steps, lambda: self.func(self.model_wrap_cfg, x, extra_args={
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/repositories/k-diffusion/k_diffusion/sampling.py", line 145, in sample_euler_ancestral
    denoised = model(x, sigmas[i] * s_in, **extra_args)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/sd_samplers_kdiffusion.py", line 169, in forward
    devices.test_for_nans(x_out, "unet")
  File "/media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/modules/devices.py", line 156, in test_for_nans
    raise NansException(message)
modules.devices.NansException: A tensor with all NaNs was produced in Unet. This could be either because there's not enough precision to represent the picture, or because your video card does not support half type. Try setting the "Upcast cross attention layer to float32" option in Settings > Stable Diffusion or using the --no-half commandline argument to fix this. Use --disable-nan-check commandline argument to disable this check.




Also ran with --no-half --disable-nan-check, but it renders black:
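
The error message also points at the "Upcast cross attention layer to float32" setting. If I understand it right, that corresponds to the upcast_attn option, so the non-UI equivalent would presumably be this line in the webui's config.json (an assumption on my part; I have not verified that it helps on gfx906):

"upcast_attn": true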



Creating model from config: /media/bcansin/ai/ai/mem/stable-diffusion-webui-dev/stable-diffusion-webui/configs/v1-inference.yaml
LatentDiffusion: Running in eps-prediction mode
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Startup time: 24.9s (import torch: 5.1s, import gradio: 2.7s, import ldm: 3.7s, other imports: 2.5s, setup codeformer: 0.2s, load scripts: 9.6s, create ui: 0.7s, gradio launch: 0.3s).
DiffusionWrapper has 859.52 M params.
Applying optimization: sdp-no-mem... done.
Textual inversion embeddings loaded(0): 
Model loaded in 5.7s (load weights from disk: 0.9s, create model: 1.3s, apply weights to model: 1.7s, load VAE: 0.3s, move model to device: 1.3s, load textual inversion embeddings: 0.1s).
100%|███████████████████████████████████████████| 20/20 [00:05<00:00,  3.42it/s]
Total progress: 100%|███████████████████████████| 20/20 [00:04<00:00,  4.26it/s]
Total progress: 100%|███████████████████████████| 20/20 [00:04<00:00,  4.25it/s]

Additional information

I have an AMD Radeon™ VII (GFX9, gfx906, Vega 20) and installed ROCm 5.5.
I only have this problem with webui; other AI tools on the same system work perfectly, directly with:
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm5.4.2

I also read #10465; that is why I am writing this.

Please fix it. I really want the latest PyTorch and ROCm versions, but I am stuck with:
pip install torch==1.13.0+rocm5.2 torchvision==0.14.0+rocm5.2 torchaudio==0.13.0 --extra-index-url https://download.pytorch.org/whl/rocm5.2
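
To double-check which build is actually active inside the webui's venv, a one-liner like this helps (torch.version.hip prints the ROCm version on ROCm builds, None otherwise):

python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available())"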

I am begging for help at this point. Please help me. I have given days to making this work, and every time I try, I fail.

@KEDI103 KEDI103 added the bug-report Report of a bug, yet to be confirmed label May 30, 2023
DGdev91 (Contributor) commented Jun 5, 2023

What do you mean by "other AIs"? Other UIs for Stable Diffusion, or something different, like oobabooga's text-generation-webui?

DGdev91 (Contributor) commented Jun 5, 2023

Anyway, I had issues too on my RX 5700 XT with the webui and PyTorch 2. As a workaround I kept PyTorch back for AMD cards, but then #10465 removed that.
...Which isn't completely wrong; it makes no sense to hold all cards back if the problem is only on older cards.

I made a PR for another workaround, which I hope makes everyone happy.
...But sadly it requires Python 3.10:
#11048

Meanwhile, if you tell me which AIs you were talking about, I can investigate further and try to find a proper solution.

KEDI103 (Author) commented Jun 6, 2023

Anyway, I had issues too on my RX 5700 XT with the webui and PyTorch 2. As a workaround I kept PyTorch back for AMD cards, but then #10465 removed that. ...Which isn't completely wrong; it makes no sense to hold all cards back if the problem is only on older cards.

I made a PR for another workaround, which I hope makes everyone happy. ...But sadly it requires Python 3.10: #11048

Meanwhile, if you tell me which AIs you were talking about, I can investigate further and try to find a proper solution.

My Radeon VII (gfx906) works even with the dev version of PyTorch; I tested InvokeAI with the latest PyTorch dev build.

Also, I noticed this warning disappeared after upgrading PyTorch for webui:

MIOpen(HIP): Warning [SQLiteBase] Missing system database file: gfx906_60.kdb Performance may degrade. Please follow instructions to install: https://github.com/ROCmSoftwarePlatform/MIOpen#installing-miopen-kernels-package

It doesn't print this anymore when you generate your first image; my thought is that it no longer recognizes the video card, it fails to detect it.
InvokeAI, by contrast, directly detects my card by name:


Generate images with a browser-based interface
Initializing, be patient...
>> Initialization file /media/bcansin/1519f428-b947-449a-a54a-0aeab6646be3/home/b_cansin/InvokeAI-main/invokeai.init found. Loading...
>> Internet connectivity is True
>> InvokeAI, version 2.3.4.post1
>> InvokeAI runtime directory is "/media/bcansin/1519f428-b947-449a-a54a-0aeab6646be3/home/b_cansin/InvokeAI-main"
>> GFPGAN Initialized
>> CodeFormer Initialized
>> ESRGAN Initialized
>> Using device_type cuda
>> CUDA device 'AMD Radeon VII' (GPU 0)


I also tried another AI art generator that hasn't been updated in a long time; it too works directly with the dev version of PyTorch.

But webui won't work with it, not even the stable version launched via webui.sh.

DGdev91 (Contributor) commented Jun 6, 2023

OK, but had you just got the UI running, or did you actually get it to generate an image?
I got InvokeAI to run too, and the card gets recognized, but it crashes with a segmentation fault if I try to make something with my 5700 XT.

KEDI103 (Author) commented Jun 6, 2023

OK, but had you just got the UI running, or did you actually get it to generate an image? I got InvokeAI to run too, and the card gets recognized, but it crashes with a segmentation fault if I try to make something with my 5700 XT.

Yes, I can generate with it, even on the latest dev PyTorch. Also, your GPU is not on the ROCm support list, but mine is:
https://rocm.docs.amd.com/en/latest/release/gpu_os_support.html
Your GPU is gfx1010, which isn't on the ROCm support list; mine is gfx906, which is still supported.
As for the segmentation fault: some models crash for me too. For example, DreamShaper V6 (baked VAE) gives me a segmentation fault with both webui and InvokeAI:
https://civitai.com/models/4384/dreamshaper
I can get it to run with --disable-nan-check, but then everything renders black; with the latest PyTorch it runs and uses the GPU, but the output is all black.

KEDI103 (Author) commented Jun 15, 2023

I still get the same thing on 1.4.0 dev 59419bd.
Edit:
I also tested with InvokeAI, and it rendered black there too. I thought I had installed PyTorch correctly, but I was wrong. Well, I think it's impossible to run PyTorch 2 on gfx906.


voidanix commented Jul 16, 2023

gfx906 is not the only one affected: gfx1031 (RDNA2) also suffers from this exact issue.

Running on Fedora 38 with an Intel KBL-R system.

EDIT: possibly relevant and related is #10296

KEDI103 (Author) commented Jul 17, 2023

gfx906 is not the only one affected: gfx1031 (RDNA2) also suffers from this exact issue.

Running on Fedora 38 with an Intel KBL-R system.

EDIT: possibly relevant and related is #10296

AMD promotes PyTorch, Hugging Face, etc. on their live streams, but I can't see it in practice. I posted about this to the official ROCm and PyTorch channels, and no one from them even replied.

This is the last time I make the mistake of buying AMD; it's not happening again on my next builds unless AMD fixes this mess. Instead of fixing it, they are killing ROCm support for my card in the next releases, so yes, NVIDIA is looking very good to my eyes now. And I have been buying AMD since 2005.
I have had enough of them. NVIDIA can be expensive, but AMD means throwing away not just all the money you spend but also your time: still no Windows support, battling over which amdgpu installer works for my card, hoping the installer won't drop an atomic bomb on my terminal in Linux...
I mean, this shouldn't be this hard for AMD; my disappointment and regret can't be put into words...

@catboxanon catboxanon added asking-for-help-with-local-system-issues This issue is asking for help related to local system; please offer assistance and removed bug-report Report of a bug, yet to be confirmed labels Aug 3, 2023

AndjayWa commented Aug 8, 2023

Generate images with a browser-based interface
Initializing, be patient...

Initialization file /media/bcansin/1519f428-b947-449a-a54a-0aeab6646be3/home/b_cansin/InvokeAI-main/invokeai.init found. Loading...
Internet connectivity is True
InvokeAI, version 2.3.4.post1
InvokeAI runtime directory is "/media/bcansin/1519f428-b947-449a-a54a-0aeab6646be3/home/b_cansin/InvokeAI-main"
GFPGAN Initialized
CodeFormer Initialized
ESRGAN Initialized
Using device_type cuda
CUDA device 'AMD Radeon VII' (GPU 0)

How can I get my Radeon VII recognized as a CUDA device?
I am a newbie; my issue is GPU device = cpu:

invokeai --web
[2023-08-08 13:34:56,875]::[InvokeAI]::INFO --> Patchmatch initialized
/home/suus/invokeai/.venv/lib/python3.10/site-packages/torchvision/transforms/functional_tensor.py:5: UserWarning: The torchvision.transforms.functional_tensor module is deprecated in 0.15 and will be removed in 0.17. Please don't rely on it. You probably just need to use APIs in torchvision.transforms.functional or in torchvision.transforms.v2.functional.
warnings.warn(
[2023-08-08 13:34:58,352]::[uvicorn.error]::INFO --> Started server process [24789]
[2023-08-08 13:34:58,352]::[uvicorn.error]::INFO --> Waiting for application startup.
[2023-08-08 13:34:58,352]::[InvokeAI]::INFO --> InvokeAI version 3.0.1post3
[2023-08-08 13:34:58,353]::[InvokeAI]::INFO --> Root directory = /home/suus/invokeai
[2023-08-08 13:34:58,354]::[InvokeAI]::INFO --> GPU device = cpu

Have you come across this approach?

https://www.reddit.com/r/StableDiffusion/comments/zu9w40/novices_guide_to_automatic1111_on_linux_with_amd/

To make sure your GPU is being detected,

Name: gfx1031
Marketing Name: AMD Radeon RX 6700 XT

should be somewhere in the output.
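
(The quoted guide omits the command itself at this point; presumably it is ROCm's rocminfo tool, e.g.:

rocminfo | grep -E "Name"

which prints both the gfx name and the marketing name of each agent.)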

You'll note it says gfx1031 in mine. Technically the 6700 XT isn't usable with ROCm for some reason, but in practice it is, so you run

export HSA_OVERRIDE_GFX_VERSION=10.3.0

to make the system lie about which GPU you have, and boom, it just works. We'll cover how to make this persistent further down if you want that.

Lastly, you want to add yourself to the render and video groups using

sudo usermod -a -G render $LOGNAME
sudo usermod -a -G video $LOGNAME
3 - Install Python. This bit seems pretty straightforward, but in my case it wasn't that clear-cut: ROCm depends on Python 2, but Stable Diffusion uses Python 3.

sudo apt-get install python3
Then you want to edit your .bashrc file to make a shortcut (called an alias) so that typing python runs python3. To do this, run

nano ~/.bashrc

and add

alias python=python3
export HSA_OVERRIDE_GFX_VERSION=10.3.0

to the bottom of the file. Now your system will default to python3, and the GPU lie is persistent, neat.

@voidanix

I would like to follow up on this with pytorch/pytorch#103973.

TL;DR: you need PCIe atomics support to get ROCm to work, even if it's said otherwise for post-Vega hardware. eGPU setups (even with integrated controllers) do not seem to expose the feature, basically requiring a full-fledged desktop with a PCIe x16 slot connected to the CPU.

It is still odd that it all used to work with PyTorch + ROCm 5.2, but AMD's documentation about atomics support has been pretty clear about it.
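
One way to check whether a slot actually exposes them, assuming lspci from pciutils, is to look for the AtomicOpsCap bits in the extended capabilities:

sudo lspci -vv | grep -i atomicops

On a slot that supports them, the GPU's entry should show something like AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+.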

DGdev91 (Contributor) commented Oct 5, 2023

As mentioned in pytorch/pytorch#106728, PyTorch 2 works just fine if compiled on ROCm 5.2, so I guess the problem here isn't PyTorch 1 vs. 2; it's ROCm 5.3 and newer breaking the support.

I would like to follow up on this with pytorch/pytorch#103973.

TL;DR: you need PCIe atomics support to get ROCm to work, even if it's said otherwise for post-Vega hardware. eGPU setups (even with integrated controllers) do not seem to expose the feature, basically requiring a full-fledged desktop with a PCIe x16 slot connected to the CPU.

It is still odd that it all used to work with PyTorch + ROCm 5.2, but AMD's documentation about atomics support has been pretty clear about it.

The PCIe atomics point is a good suggestion, but I don't think that's the case, at least for me; my machine should be able to handle them. Also, I tried compiling with the new ROCm 5.7 flag as described in the post you mentioned, but it didn't seem to make any difference, while PyTorch 2 compiled on ROCm 5.2 is indeed working.

I opened a new issue in ROCm's repo: ROCm/ROCm#2527


kode54 commented Nov 6, 2023

Well, that's just great: PyTorch deleted their rocm5.2 repo.

Edit: Oops, my bad, it's a Python 3.10-specific repo.


kode54 commented Nov 7, 2023

Hmm, it works fine with Python 3.11 and upstream PyTorch+rocm5.6 and TorchVision+rocm5.6 on gfx1031, if I specify the HSA GFX version override environment variable. It does not work with Arch's builds of pytorch or the AUR torchvision.

export HSA_OVERRIDE_GFX_VERSION=10.3.0

Not sure if there's a compatible override for 906 / 9.0.6. Maybe ask in the ROCm repository?

pytorch/pytorch#111355 (comment)
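
For a quick one-off test, the override can also be prefixed to a single command instead of exported (a sketch; 10.3.0 targets RDNA2/gfx103x cards, and as far as I know there is no equivalent spoof target for gfx906, which has its own native kernels):

HSA_OVERRIDE_GFX_VERSION=10.3.0 python -c "import torch; print(torch.cuda.get_device_name(0))"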

KEDI103 (Author) commented Nov 17, 2023

Okay, here, after months of trying, is the fix for the missing PCIe atomics problem:
pytorch/pytorch#103973 (comment)
It is going to be added to the nightly builds next week; for now, the build linked there fixes it:
pytorch/pytorch#103973 (comment)
Once the official build is released and works without problems, I will close this issue.
But it also needs this fix for AUTOMATIC1111:
#13985 (comment)

KEDI103 (Author) commented May 8, 2024

@KEDI103 You can use my repo to install Stable Diffusion on ROCm for RX 6000 cards; it solves AMD ROCm RDNA 2 & 3 problems a different way, with Docker containers on Linux: https://github.com/hqnicolas/StableDiffusionROCm. It was stable at 1.9.3 (latest). If you like this automation repo, please leave a star on it ⭐

I am on a Radeon VII, and after 5 months the ROCm team finally fixed it, but only after tons of problems and unsupported pieces for my Radeon VII. This is my last AMD card; unless AMD stops making us suffer so badly that we regret buying AMD, I am not going to buy AMD again. I have been using AMD since 2005 or so, but the Radeon VII made me give up: so many problems, support cut early, still no Windows support for PyTorch, etc.

I hope your repo helps suffering AMD users.
