Quantized models on multi-GPU #1813

Open
hugoabonizio opened this issue Mar 7, 2024 · 6 comments

Comments

@hugoabonizio
Contributor

hugoabonizio commented Mar 7, 2024

I'm experimenting with the new implementation of CUDA acceleration for quantized models and wondering how to use sharded tensors in this context. I'm having a hard time adapting the ShardedVarBuilder to load quantized weights the way quantized_var_builder::VarBuilder::from_gguf does.
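For reference, this is roughly the single-GPU path I have working (just a sketch; the exact from_gguf signature may differ across candle versions), and what I'm missing is the sharded equivalent:

```rust
use candle_core::Device;
use candle_transformers::quantized_var_builder::VarBuilder;

// Single-GPU case: build the quantized VarBuilder straight from a gguf file
// on one CUDA device. (Sketch only; signature may vary by candle version.)
fn load_single_gpu(path: &str) -> anyhow::Result<VarBuilder> {
    let device = Device::new_cuda(0)?;
    let vb = VarBuilder::from_gguf(path, &device)?;
    Ok(vb)
}
```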

Do you have any recommendations on the best approach in this case?

@LaurentMazare

@hugoabonizio
Contributor Author

hugoabonizio commented Mar 14, 2024

@LaurentMazare I'm sorry to bother, but I just want to ask: is it possible to use the current implementation of quantized models in a multi-GPU setup (like the llama_multiprocess example)? If not, is there any plan to support this feature in the future?

I appreciate your work on pushing forward the CUDA kernels for quantization.

@LaurentMazare
Collaborator

I'm not sure the technique used in llama-multiprocess would make sense here. That setup is useful when individual tensors have to be sharded across different GPUs, but I don't think quantized models are typically large enough for that to actually be useful.
If the goal is just to have multiple models living on different GPUs, that should be reasonably easy even with the current API: create one device per GPU you want to target. But maybe you're after something more complex than this?
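Something along these lines should already be doable with the current API (untested sketch, one independent copy of the quantized weights per GPU):

```rust
use candle_core::Device;
use candle_transformers::quantized_var_builder::VarBuilder;

// Untested sketch: create one Device per CUDA ordinal and load an
// independent copy of the quantized weights onto each of them.
fn load_one_per_gpu(path: &str, n_gpus: usize) -> anyhow::Result<Vec<VarBuilder>> {
    (0..n_gpus)
        .map(|ordinal| -> anyhow::Result<VarBuilder> {
            let device = Device::new_cuda(ordinal)?;
            Ok(VarBuilder::from_gguf(path, &device)?)
        })
        .collect()
}
```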

@hugoabonizio
Contributor Author

I'm after sharding larger models that wouldn't fit on a single 24GB GPU and could instead be split across, for example, 4 of them. If I'm not mistaken, llama.cpp now supports multi-GPU through pipeline parallelism, and it supported tensor splitting between GPUs before that.

@LaurentMazare
Collaborator

If there is no need to shard a single tensor across multiple GPUs, I would recommend doing something a lot simpler than llama-multiprocess and instead putting the different weights on different GPUs. I would guess that's essentially what llama.cpp's pipeline processing is doing.
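Roughly speaking, that simpler split could look like this (untested sketch with a placeholder layer type, just to show where the activations cross devices):

```rust
use candle_core::{Device, Result, Tensor};

// Placeholder for a transformer block; in practice this would be the model's
// own layer type, with its weights already allocated on a specific device.
struct Layer {
    w: Tensor,
}

impl Layer {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        xs.matmul(&self.w)
    }
}

// Untested sketch of a pipeline-style split: the first half of the layers
// lives on gpu0, the second half on gpu1, with a single activation transfer
// at the boundary.
struct Pipeline {
    gpu0_layers: Vec<Layer>,
    gpu1_layers: Vec<Layer>,
    gpu1: Device,
}

impl Pipeline {
    fn forward(&self, xs: &Tensor) -> Result<Tensor> {
        let mut xs = xs.clone();
        for layer in &self.gpu0_layers {
            xs = layer.forward(&xs)?;
        }
        // Move the activations to the second GPU at the split point.
        xs = xs.to_device(&self.gpu1)?;
        for layer in &self.gpu1_layers {
            xs = layer.forward(&xs)?;
        }
        Ok(xs)
    }
}
```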

@hugoabonizio
Contributor Author

Unfortunately, sharding the tensors is necessary both for larger models (40B+ params) and to speed up larger batch sizes. My use case is an API serving multiple concurrent requests.

Is the solution you're suggesting, putting different weights (layers?) on different GPUs, similar to transformers' device_map? I suppose it's slower than sharding, right?

@joshpopelka20

joshpopelka20 commented Jun 8, 2024

I have a similar use case, where I need to shard a large model (gradient.ai llama3 with 262K context) across multiple GPUs. It looks like PyTorch has "fully sharded data parallel": https://pytorch.org/blog/introducing-pytorch-fully-sharded-data-parallel-api/

Are there long-term plans to add something similar to candle?
