Commit dab78e9
partial cherry-pick of @neonsecret's optimized attention CompVis#177
I didn't implement the most consequential part (splitting the softmax in two) because the M1 Mac is not so VRAM-constrained, but I implemented the reference-freeing and also freed `x` earlier.
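For context, a minimal sketch of what the reference-freeing looks like, assuming the CompVis-style `CrossAttention.forward` (mask handling omitted; this is an illustration, not the exact commit diff):

```python
# Sketch (assumption: CompVis-style CrossAttention.forward; mask branch omitted)
# of reference-freeing: drop each tensor as soon as it is no longer needed so
# inference doesn't keep extra activations alive.
from einops import rearrange
from torch import einsum

def forward(self, x, context=None, mask=None):
    h = self.heads

    q = self.to_q(x)
    context = context if context is not None else x
    del x  # "freed x earlier": once q is projected (and context captured), x is not needed

    k = self.to_k(context)
    v = self.to_v(context)
    del context

    q, k, v = map(lambda t: rearrange(t, 'b n (h d) -> (b h) n d', h=h), (q, k, v))

    sim = einsum('b i d, b j d -> b i j', q, k) * self.scale
    del q, k  # the projections are no longer needed once sim exists

    attn = sim.softmax(dim=-1)
    del sim

    out = einsum('b i j, b j d -> b i d', attn, v)
    del attn, v

    out = rearrange(out, '(b h) n d -> b n (h d)', h=h)
    return self.to_out(out)
```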
it's a lot more than that..
thanks for reviewing!
I think the only part I missed was this?
ah, certainly I neglected to free the reference to `sim`. ah, and you're re-using `sim`'s storage (to hold the `softmax()` result and then the `einsum()` result)? okay, I'll certainly add those.
but the `sim[4:]` split… what does this achieve? it reduces concurrency. so if I have enough VRAM (I have 64GB), presumably it's faster to avoid doing this?
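For anyone following along, a hedged sketch of the kind of split being asked about (the exact slicing in neonsecret's code may differ; `split_softmax_` is a name invented here):

```python
import torch

def split_softmax_(sim: torch.Tensor, split: int = 4) -> torch.Tensor:
    # Softmax each part of the (batch*heads) dimension separately, writing the
    # result back into sim's existing storage (the in-place reuse noted above).
    # Only a part-sized temporary is alive at a time, which caps peak memory,
    # but the parts run sequentially -- the concurrency cost questioned above.
    sim[:split] = sim[:split].softmax(dim=-1)
    sim[split:] = sim[split:].softmax(dim=-1)
    return sim
```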
take a look at my current fork https://github.com/neonsecret/stable-diffusion and its changes. it's a whole lot more than that..
thanks! I'll have a dig. it's a 91-file diff though, and in most cases it starts new files altogether, so it's hard to find the important bits…
CompVis/stable-diffusion@main...neonsecret:stable-diffusion:main
any key files you'd recommend looking at?
yeah, that's why you probably shouldn't try to merge it; it differs too much now
okay, I've reviewed the `attention.py` in your branch. certainly there's more there than was in CompVis#177.
however, to my understanding, the `attention.py` changes are only to reduce memory usage, and come at the expense of inference speed?
I have the opposite problem. M1 GPUs are slow at inference, but have loads of VRAM.
we also cannot use `torch.cuda.memory_stats(device)`, `torch.cuda.mem_get_info(torch.cuda.current_device())` or `torch.cuda.empty_cache()` (because we do not have CUDA), and cannot use FP16.
is there anything you'd recommend for improving inference speed?
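(For reference, CUDA-only calls like the above typically get guarded behind backend checks so the same code path runs on Apple Silicon; `torch.mps.empty_cache()` only exists in newer PyTorch releases, hence the `hasattr` check. This is a sketch, not code from either branch:)

```python
import torch

def empty_device_cache() -> None:
    # Guard CUDA-only cache calls so the same code path also runs on MPS.
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
    elif torch.backends.mps.is_available() and hasattr(torch, "mps") and hasattr(torch.mps, "empty_cache"):
        # Only available in newer PyTorch releases with the MPS backend.
        torch.mps.empty_cache()
```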
wait, are you load-balancing between CUDA and CPU? is that for speed or for memory?