Stable Diffusion PR optimizes VRAM, generate 576x1280 images with 6 GB VRAM #364
Check out my other CompVis PR: CompVis/stable-diffusion#177 |
Thanks for the tip! I'll check them both out. |
Would be cool to get this implemented! ❤️ |
Don't I know it! |
By the looks of it, without all the whitespace changes we get...
attached in diff -u format for patching |
Oh thank you very much for that! I actually just did the same thing with @neonsecret 's attention optimization and it works amazingly. Without any change to execution speed, my test prompt now uses 3.60G of VRAM. Previously it was using 4.42G. Now I'm looking to see if image quality is affected. Any reason to prefer basujindal's attention.py optimization? |
nothing to add to this conversation, except to say that i'm excited for this lol ;) |
I've merged @neonsecret 's optimizations into the development branch "refactoring-simplet2i" and would welcome people testing it and sending feedback. This branch probably still has major bugs in it, but I refactored the code to make it much easier to add optimizations and new features (particularly inpainting, which I'd hoped to have done by today). |
Hmmm... on my barely coping 8G M1 it's not so hot; the image is different and it took twice as long. But it's an old clone, let me try it on a fresher one |
Darn. I'd hoped that there was such a thing as a free lunch. I'm on an atypical system with 32G of VRAM, so maybe my results aren't representative. I did timing and peak VRAM usage, and then looked at two images generated with the same seed and they were indistinguishable to the eye. Let me know what you find out. Are you on an Apple? I didn't know there were clones. The M1 MPS support in this fork is really new, and I wouldn't be surprised if it needs additional tweaking to get it to work properly with the optimization. |
Sorry, I meant it's an old local clone of your repo; I didn't want to make changes in my local clone of the current one since that works quite nicely :-). But yes, I'm not surprised MPS is breaking things, and PyTorch is pretty buggy too; I raised a few MPS-related issues over there that Stable Diffusion hits. |
Okay, on the main branch the images are the same, but it is really slow, even compared to my normal times... 10/10 [06:49<00:00, 40.94s/it]. I'll do some more digging. @magnusviri any chance you can check this out on a bigger M1? |
Ok, this is @neonsecret 's PR, which I just tested and merged into the refactor branch. I'm seeing a 20% reduction in memory footprint, but unfortunately not the 35% reduction reported in the Reddit post. Presumably this is due to the earlier optimizations in basujindal's branch. I haven't really wanted to use those opts because the code is complex and I hear it has a performance hit. Advice? |
Last I looked at basujindal's there were loads of assumptions about using CUDA, and a big chunk of the memory saving seemed to come from forcing half precision. That was a week ago, things might have changed |
I've got half precision on already as the default. I think what I'm missing is basujindal's optimization of splitting the task into several chunks and loading them into the GPU sequentially. |
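For readers following along, a rough sketch of that sequential-loading idea (not basujindal's actual code; the `run_stages_sequentially` helper and its arguments are made up for illustration) might look like:

```python
import torch

def run_stages_sequentially(stages, x, device="cuda"):
    """Hypothetical illustration: keep only one chunk of the model on the GPU
    at a time, trading transfer time for a lower peak VRAM footprint."""
    for stage in stages:            # each stage is an nn.Module chunk of the model
        stage.to(device)            # load this chunk onto the GPU
        with torch.no_grad():
            x = stage(x.to(device))
        stage.to("cpu")             # move it back off the GPU before the next chunk
        torch.cuda.empty_cache()    # release the cached blocks it was using
    return x
```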
Frankly, I'm happy with the 20% savings for now. |
Seems the speed loss is coming from the twin calls to softmax.
If I change it to use a single softmax instead, I get all my speed back (I assume more memory usage, though I need better diagnostic tools than Activity Monitor). I can do 640x512 images now, so there do appear to be some memory savings even when reverting that change. It would be interesting to see what happens on a larger box, and whether it is worth wrapping the two variations in an "if mps" statement. EDIT: seems I can also now do 384x320 without it using swap. |
No measurable slowdown at all on CUDA. Maybe we make the twin softmax conditional on not MPS? |
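As a concrete illustration of that suggestion, here is a minimal sketch (not the actual attention.py code from either PR; the function name is made up) of a device-conditional softmax:

```python
import torch

def softmax_maybe_split(sim: torch.Tensor) -> torch.Tensor:
    """Softmax over the attention scores: split into two in-place halves on
    CUDA to lower peak memory, but fall back to a single softmax on MPS,
    where the split was reported to be slower."""
    if sim.device.type == "mps":
        return sim.softmax(dim=-1)
    half = sim.shape[0] // 2
    sim[:half] = sim[:half].softmax(dim=-1)   # overwrite scores with probabilities
    sim[half:] = sim[half:].softmax(dim=-1)
    return sim
```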
Test run: 751283a (main) vs. 89a7622 (refactoring-simplet2i) (EDIT: corrected hash). XOR of both images gave a pitch-black result -> seems to be no difference; maybe this helps. Inference even seems to be a little faster in this test run. |
CUDA platform? What hardware? |
Windows 10, NVIDIA GeForce RTX 2060 SUPER 8GB, CUDA. If there's info missing I'll edit it in. |
I made a new local PR with just the attention.py changes. Here are some of my test results after extensive testing - RTX 3080 8GB card. Base Repo:
Updated attention.py:
For a 512x768 image, the updated repo consumes 5.94GB of memory. That's approximately an 18% memory improvement. I saw no difference in performance or inference time when using a single or twin softmax; on CUDA, the difference seems to be negligible if there is any. tl;dr -- just the attention.py changes give roughly an 18% memory saving with no performance hit. |
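For anyone who wants to reproduce peak-VRAM figures like the ones quoted in this thread, one standard way on CUDA (plain PyTorch calls, not code from this PR; where you run the generation is up to you) is:

```python
import torch

torch.cuda.reset_peak_memory_stats()
# ... run one image generation here ...
peak_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"peak VRAM allocated: {peak_gb:.2f} GB")
```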
I think it's a side effect of the unified architecture; it looks okay at 256x256 when all of the Python 'image' fits in memory, but as soon as swapping kicks in I get half speed compared to the original code or single softmax. |
I just tried regenerating an image done from the current development branch and the new refactor branch and they are totally different... Not sure if it is the memory saving feature doing it or something else. Is there a switch to activate/deactivate the memory saving? Would be nice to isolate whether the diff is related to that or something else. Running on an RTX 3060. EDIT: OK... for some reason the picture I got when running the command the 1st time is different from the result when running the prompt logged in the prompt log file... Strange... but using the log file prompt on both the dev and the refactoring branches does indeed produce the same result with much less VRAM usage... I will try to reproduce the variation... This might be an issue with the variation code base not producing consistent results on the 1st run vs reruns from logs. EDIT 2: I tracked down the issue with the different outputs... it was a PEBKAC... I pasted the file name and directory info in front of the prompt in the log... this is why it resulted in a different output... so all good, it was my error. So as far as I can see the memory optimisation has no side effect on the time or quality of image generation. 6.4G on the dev branch vs 4.84G on the refactoring branch... so a 34% memory usage reduction and exactly the same run time. |
I did not do extensive testing to compare generation times, but so far I have gotten the exact same results visually when comparing to images I generated yesterday on main with the same prompt/seeds. And I can crank the resolution up from a max of 576x576 to 640x704, using 6.27G on my RTX 2070. Last time I tried basujindal's, I could manage 704x768 but it was very slow. However, if they implemented this PR on top of their original optimization and it used even less memory than it does now, I can imagine doing even higher resolutions. Very impressive. |
The basujindal fork is very slow. I would take a 20% memory improvement with no speed hit over a 35% improvement that is much slower, any day. |
On my 3060 with 12GB VRAM I am seeing a 34% memory improvement... so this is pretty great. That is when generating 512x704 images. |
My mistake... I believed what was written before checking... updated. |
I think it's possible to incorporate the best of both methods. I've been working on adapting Doggettx's dynamic threshold. Will update the branch as soon as I can |
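For context, the general shape of a dynamic threshold is to size the attention slices from the VRAM that is actually free at run time. A rough, hypothetical sketch (not Doggettx's actual code; the function name, overhead factor, and doubling strategy are assumptions):

```python
import torch

def choose_slice_count(q, k, element_size=2, overhead=3.5):
    """Pick how many slices to split the q @ k^T attention matrix into so that
    one slice (plus softmax temporaries) fits in currently free VRAM."""
    free_bytes, _ = torch.cuda.mem_get_info()
    full_bytes = q.shape[0] * q.shape[1] * k.shape[1] * element_size * overhead
    slices = 1
    while full_bytes / slices > free_bytes and slices < q.shape[1]:
        slices *= 2                 # halve the slice size until it fits
    return slices
```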
I'm porting @Doggettx optimizations to M1 and it looks promising :) Development branch
Doggettx-optimizations branch (M1 workarounds)
X means failed test. |
Great! I want to make a release soon and the choice of optimization(s) is a key milestone. |
After reviewing the back and forth on this thread, I do have one clarifying question: I've very much appreciated the set of optimizations and additional functionality provided by |
So overall, the 'doggettx-optimizations' branch seems to be superior in every regard. |
I don't have a 4 GB GPU to test that, but my guess would be that the 'doggettx-optimizations' branch works fine on 4 GB GPUs. |
I thought it might too, but with a Quadro P1000 (4GB), I get a CUDA OOM using |
All the memory optimizations have been directed at reducing memory requirements during image generation. Loading the model itself, which happens during initialization, requires 4.2 GB on its own. So 4GB cards are currently not supported by this fork. Sorry! |
Right, of course. Thanks! |
I should add that the basujindal fork does reduce the memory used during model loading and allows you to run SD on 4 GB cards. |
Have tested out these branches and just want to say the work being done here is amazing, both for people with less VRAM and those like myself who can squeeze out a larger res. Though I notice larger res is not always better since the model is still effectively 512x512; there's definitely a point where you hit tradeoffs. Just wondered if these changes would/could pass through into the upstitching methods like txt2imghd, allowing for larger-sized chunks to be processed during upscaling and detailing? |
Take a look at what I explained here, that way you can generate at high resolutions without tradeoffs: #364 (comment) |
Just piping in here to second what @tildebyte and @blessedcoolant said. 99% of this repository is other people's code, and I and my collaborators are just trying to pull together the best innovations that are out there to create a stable base and a good user experience. All contributions are gratefully accepted, but are subject to review and testing. |
No, I don't think they will. All the optimizations we have been working on affect VRAM usage during image inference and generation. The loading of the model that takes place during initialization takes more than 4 GB (I think it's 4.2, so tantalizingly close!) and will cause an OOM error before you get to the inference prompt on 4 GB cards. The basujindal optimizations reduce memory requirements at load time, but unfortunately slow down performance noticeably and so haven't been incorporated here. |
Personally I don't like the memory->speed tradeoff in optimizedSD. I'm hoping that stable-diffusion-v1.5 will have reduced memory requirements and will run on 4 GB cards out of the box. |
That makes a lot of sense, and I think all the testing above bears out that in any situation where the model loads it’s probably the right move. The only potential thing that might be worth considering (even with a potential sub-4GB later release) is that many of the potential users with 4GB cards may be precisely the people who can’t fully free the full 4GB if it’s their laptop or machine’s sole GPU. A colleague with a small-ish XPS laptop can get the optimizedSD fork to run, but only with full precision and all the optimizations active and only after closing out all other open apps. In an odd way, that kind of user would be especially appreciative of the other UX improvements here — batch queuing a set of commands to run, logging all the output in a clean out file for easier reference given that even low DDIM runs still can take some time, etc. I guess after this next round of optimization settles here, it might still be worth considering the possibility of including a flag that splits the model for the dream workflow — with all the serious performance caveats documented in a big way — as a singular final fallback for the machines that have no other choice. Separately — thank you so much to all for all of your work. Watching this develop so rapidly and collaboratively the last few weeks has been absolutely fascinating! |
Yes, I think we'll get the inference optimization squared away and then can look into the model loading optimizations as a user flag. Thanks for the kind words! This is a fun project to work on. |
@Any-Winter-4079, how's your progress on the M1 port? |
Check #431 (comment) |
Thank you everyone for the wonderful work you did benchmarking and debugging the various optimizations. I have chosen the @Doggettx optimizations, with fixes contributed by @Any-Winter-4079 to run correctly on Macintosh M1 hardware. These optimizations have gone into the development branch and will be in the soon-forthcoming (I hope) 1.14 release. |
I made some more minor improvements when running in auto_cast or half mode, which seems to have made it run a lot faster. Could use some testing though, since I don't know what it does on lower-VRAM cards. If someone is willing to try, I've put it in a separate branch at https://github.com/Doggettx/stable-diffusion/tree/autocast-improvements |
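For anyone unfamiliar with what auto_cast mode refers to here, a toy PyTorch illustration (not code from that branch; the linear layer just stands in for a model block) of running inference under autocast so matmul-heavy ops execute in float16:

```python
import torch
import torch.nn as nn

layer = nn.Linear(512, 512).cuda()           # stand-in for a model block
x = torch.randn(8, 512, device="cuda")
with torch.no_grad(), torch.autocast(device_type="cuda", dtype=torch.float16):
    y = layer(x)                             # runs in half precision
print(y.dtype)                               # torch.float16
```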
Just a heads up. Mac users who have been testing on the release candidate (which contains the previous set of @Doggettx optimizations) are reporting a 2-3x decrease in speed on M1 hardware. This needs to be fixed before we announce a release, unfortunately. |
Seen on HN, might be interesting to pull into this repo? (The PR looks a bit dirty with a lot of extra changes though.)
basujindal/stable-diffusion#103