Model memory usage / quantization #91

Open · coder543 opened this issue Sep 5, 2023 · 2 comments

coder543 commented Sep 5, 2023

According to this Refact blog post:

Check out the docs on self-hosting to get your AI code assistant up and running.
To run StarCoder using 4-bit quantization, you’ll need a 12GB GPU, and for 8-bit you’ll need 24GB.
It’s currently available for VS Code, and JetBrains IDEs.

I am currently using a 12GB GPU (RTX 4070), so that sounds great.

However, the interface does not offer any options to select quantization:

[screenshot: model selection UI with no quantization options]

If I attempt to select codellama/7b or starcoder/15b/base, it claims that I don't have enough memory... which is strange, considering I've been running (quantized) 13B parameter llama2 models on my GPU just fine using other software.

[screenshot: "not enough memory" warning when selecting codellama/7b or starcoder/15b/base]


The memory usage of even the smallest models is rather weird.

According to this Refact blog post:

With the smaller size, running the model is much faster and affordable than ever: the model can be served on most of all modern GPUs requiring just 3Gb[sic, 0] RAM and works great for real-time code completion tasks.

I've noticed that the Refact/1.6B model starts at about 5.3GB of VRAM usage, but jumps up to the full 12GB of VRAM as soon as I start doing any completions. That seems confusingly inefficient for a 1.6B parameter model, and a stark contrast to the stated goal of running on GPUs with only 3GB of VRAM. When I kill the container, my VRAM usage drops back to around zero, so it's not some other program using all this VRAM.

[sic, 0]: 3Gb (gigabits) is only ~0.375GB (gigabytes); I'm assuming this should be 3GB.

[screenshot: VRAM usage while running the Refact/1.6B model]
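For a rough sanity check (my own back-of-envelope numbers, not anything from Refact's docs), the weights of a 1.6B-parameter model alone should come out to roughly:

```python
# Back-of-envelope weight memory for a 1.6B-parameter model at different
# precisions (ignores KV cache, activations, and CUDA/runtime overhead).
PARAMS = 1.6e9

for label, bytes_per_param in [("fp32", 4), ("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 1024**3
    print(f"{label}: ~{gib:.1f} GiB for weights alone")
```

That works out to about 3.0 GiB at fp16, which lines up with the blog post's 3GB figure, so the ~5.3GB idle / 12GB peak usage I'm seeing looks like it's going to something other than the weights themselves.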


I'm running the Docker container on Windows using Docker with the WSL2 backend, and everything works fine; it's just the memory usage that is confusing and concerning compared to other LLM software I have used (also under WSL2).

I'm not sure whether there are plans to offer quantized model options through the GUI, whether there is a way to select these quantized models without the GUI, or what other options exist.

mitya52 (Member) commented Sep 6, 2023

@coder543 hi!

We're using the auto-gptq backend for most of the models (except for Refact, CONTRASTcode, and CodeLlama). Those models are 4-bit quantized and should work with your setup (not sure about 15b). The "required memory exceeds" message is just a warning and may be confusing; it only means that you can hit an OOM with a large file/chat context.
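For reference, loading a 4-bit GPTQ checkpoint with auto-gptq looks roughly like this (a generic sketch rather than our exact server code; the repo id below is just a public example):

```python
# Generic sketch of loading a 4-bit GPTQ checkpoint with auto-gptq.
# The repo id is only a public example, not necessarily what the server uses.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_id = "TheBloke/WizardCoder-15B-1.0-GPTQ"  # example 4-bit GPTQ checkpoint

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```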

CodeLlama is 8-bit quantized dynamically with bitsandbytes. I think we'll move the models from auto-gptq to a bitsandbytes or ggml backend and add a quantization option.
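Dynamic 8-bit loading with bitsandbytes goes through transformers, roughly like this (again a generic sketch using the public CodeLlama checkpoint, not our exact code):

```python
# Generic sketch of dynamic 8-bit quantization via transformers + bitsandbytes.
# Uses the public CodeLlama checkpoint as an example; weights are quantized
# to int8 at load time, so no pre-quantized files are needed.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "codellama/CodeLlama-7b-hf"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    load_in_8bit=True,   # bitsandbytes int8 quantization on the fly
    device_map="auto",   # accelerate places layers on the available GPU(s)
)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(output[0]))
```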

The Refact model shouldn't use that much memory; your estimate is close to ours. I'd consider this a bug.

Thanks for your report!

olegklimov (Contributor) commented

We have sharding, so this should be solved! (not yet in Docker today)

olegklimov moved this from Implemented to "Released in Docker" in Self-hosted / Enterprise on Oct 5, 2023