Feature request
Could GPT4All be adapted so that llama.cpp can be launched with a chosen number of layers offloaded to the GPU?
At the moment it is all or nothing: either complete GPU offloading or CPU only.
Llama.cpp has supported partial GPU offloading for many months now.
For example, I am able to run Mistral 7B 4-bit (Q4_K_S) partially on a 4 GB GDDR6 GPU with about 75% of the layers offloaded.
On my low-end system that gives roughly a 50% speed boost compared to CPU only.
This works with llama.cpp from the command line via the -ngl parameter.
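For example, an invocation along these lines works for me (the model filename and layer count are only illustrative, and the binary may be named main or llama-cli depending on the build):

```sh
# Offload roughly 75% of a Mistral 7B Q4_K_S model's layers to a ~4 GB GPU.
# -ngl / --n-gpu-layers sets how many layers run on the GPU; raise it until
# VRAM runs out, then back off a layer or two.
./main -m mistral-7b-instruct-v0.2.Q4_K_S.gguf -ngl 24 -p "Hello"
```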
Motivation
Faster inference on low-end systems.
Partial GPU offloading has been supported in GPT4All for many versions now; it was introduced with PR #1890 back in January.
If you go to Settings > Model Settings and scroll down to GPU Layers, you can experiment with the number of offloaded layers to find the best performance for your hardware.
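If you use the Python bindings rather than the chat UI, the same knob should be reachable programmatically; to the best of my understanding the GPT4All constructor exposes an ngl argument for this, but treat the parameter name and values below as assumptions and check the API docs for your installed version:

```python
from gpt4all import GPT4All

# ngl = number of layers to offload to the GPU (assumed parameter name;
# verify against the gpt4all Python API for your installed version).
# 24 is only an illustrative value for a 7B Q4_K_S model on a ~4 GB card.
model = GPT4All("mistral-7b-instruct-v0.2.Q4_K_S.gguf", device="gpu", ngl=24)
print(model.generate("Hello", max_tokens=64))
```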