According to this Refact blog post:

> Check out the docs on self-hosting to get your AI code assistant up and running.
> To run StarCoder using 4-bit quantization, you'll need a 12GB GPU, and for 8-bit you'll need 24GB.
> It's currently available for VS Code, and JetBrains IDEs.
I am currently using a 12GB GPU (RTX 4070), so that sounds great.
However, the interface does not offer any options to select quantization.
If I attempt to select codellama/7b or starcoder/15b/base, it claims that I don't have enough memory... which is strange, considering I've been running (quantized) 13B parameter llama2 models on my GPU just fine using other software.
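For a rough sanity check on those numbers, here's the back-of-the-envelope arithmetic I'm going by (my own sketch: it counts only the weights and ignores the KV cache, activations, and other runtime overhead):

```python
# Rough VRAM estimate for model weights alone (ignores KV cache, activations,
# CUDA context, and quantization overhead such as scales/zero-points).
def weight_vram_gib(n_params_billion: float, bits_per_weight: float) -> float:
    bytes_per_weight = bits_per_weight / 8
    return n_params_billion * 1e9 * bytes_per_weight / 1024**3

for name, params, bits in [
    ("starcoder 15B @ 4-bit", 15, 4),
    ("starcoder 15B @ 8-bit", 15, 8),
    ("llama2 13B @ 4-bit", 13, 4),
    ("refact 1.6B @ fp16", 1.6, 16),
]:
    print(f"{name}: ~{weight_vram_gib(params, bits):.1f} GiB for weights")
```

That gives roughly 7 GiB for a 4-bit 15B model, 14 GiB for 8-bit, about 6 GiB for a 4-bit 13B model, and about 3 GiB for the 1.6B model in fp16, which lines up with both the blog post's requirements and my experience with quantized 13B models on this card.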
The memory usage of even the smallest models is rather weird.

According to this Refact blog post:

> With the smaller size, running the model is much faster and affordable than ever: the model can be served on most of all modern GPUs requiring just 3Gb[sic, 0] RAM and works great for real-time code completion tasks.
I've noticed that the Refact/1.6B model starts at about 5.3GB of VRAM usage, but jumps up to the full 12GB of VRAM as soon as I start doing any completions, which seems confusingly inefficient for a 1.6B parameter model, and a stark contrast to the stated goal of running on GPUs with only 3GB of VRAM. When I kill the container, my VRAM usage drops back to around zero, so it's not some other program using all this VRAM.
[sic, 0]: 3Gb (gigabits) == ~0.375GB (gigabytes); I'm assuming this should be 3GB.
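For what it's worth, the 12GB reading might just be memory that is reserved rather than actually in use: PyTorch's caching allocator (and servers that pre-allocate buffers such as the KV cache) hold on to VRAM they aren't actively using, and nvidia-smi reports the reserved amount. A generic PyTorch sketch (not Refact-specific code) that shows the difference:

```python
import torch

# "allocated" is what tensors actually occupy; "reserved" is what the caching
# allocator has claimed from the driver. nvidia-smi shows something closer to
# the latter, plus CUDA context overhead.
def report_vram(tag: str) -> None:
    alloc = torch.cuda.memory_allocated() / 1024**3
    reserved = torch.cuda.memory_reserved() / 1024**3
    print(f"{tag}: allocated={alloc:.2f} GiB, reserved={reserved:.2f} GiB")

report_vram("before")
x = torch.empty(1024, 1024, 1024, dtype=torch.float16, device="cuda")  # ~2 GiB
report_vram("after big alloc")
del x
torch.cuda.empty_cache()  # hand cached blocks back to the driver
report_vram("after empty_cache")
```

If that's what is happening here, the memory isn't necessarily wasted, but it would still be nice to cap it so the GPU can be shared with other workloads.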
I'm just running the Docker container under Docker on Windows using WSL2, and everything works fine; it's only the memory usage that is confusing and concerning compared to other LLM software I have used (also under WSL2).
I'm not sure whether there are plans to offer quantized model options through the GUI, whether there are ways of selecting these quantized models without the GUI, or what other options exist.
We're using the auto-gptq backend for most of the models (except Refact, CONTRASTcode, and codellama). They are 4-bit quantized and should work with your setup (not sure about the 15b). The "required memory exceeds" message is just a warning and may be confusing: it only means that you can hit OOM with a large file/chat context.
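For context, loading a pre-quantized GPTQ checkpoint with the auto-gptq library looks roughly like this. This is a generic sketch, not the Refact server's actual code, and the model id is only an illustrative example:

```python
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Illustrative GPTQ checkpoint; the server downloads its own weights.
model_id = "TheBloke/WizardCoder-15B-1.0-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
)

# Quick smoke test: complete a short prompt.
inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```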
CodeLlama is 8-bit quantized dynamically with bitsandbytes. I think we'll move the models from auto-gptq to a bitsandbytes or ggml backend and add a quantization option.
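On-the-fly 8-bit quantization with bitsandbytes goes through transformers, roughly like the sketch below. Again, this is a generic illustration rather than the server's code, and the checkpoint name is just an example:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "codellama/CodeLlama-7b-hf"  # illustrative checkpoint

# Weights are loaded and quantized to int8 at load time by bitsandbytes,
# so no pre-quantized checkpoint is needed.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    torch_dtype=torch.float16,
)
tokenizer = AutoTokenizer.from_pretrained(model_id)

inputs = tokenizer("def quicksort(arr):", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=32)[0]))
```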
The Refact model shouldn't use that much memory, and your estimate is close to ours. I have to admit that's a bug.