Cuda build and --n-gpu-layers set to 0 #10200
-
Hello, I have a LLamaSharp-related question. I commented on an issue in the LLamaSharp repo and was redirected here. I am using LLamaSharp version 1.18.0, which corresponds to commit c35e586e in llama.cpp.
Is this considered normal? I noticed the following in the llama.cpp build documentation:
Does that mean that when llama.cpp is built with CUDA acceleration we can't disable GPU inference?
Replies: 1 comment 2 replies
-
Even with 0 layers offloaded, some ops (like large matrix multiplications) can still be offloaded to the GPU. You can disable the GPU completely by setting the environment variable `CUDA_VISIBLE_DEVICES` to an empty value: `CUDA_VISIBLE_DEVICES= ./my-app ...`
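A minimal sketch of disabling the GPU this way. The binary name and flags below are illustrative, not taken from the thread; the point is that `CUDA_VISIBLE_DEVICES` set to an empty string hides every CUDA device from the process, so even a CUDA-enabled build runs CPU-only:

```shell
# Per-invocation: set CUDA_VISIBLE_DEVICES to empty for just this command
# (the binary and its arguments are placeholders):
#   CUDA_VISIBLE_DEVICES= ./my-app --n-gpu-layers 0 ...

# Session-wide: export an empty value so all subsequent commands see no GPU.
export CUDA_VISIBLE_DEVICES=""

# The variable is present in the environment but empty, which is what
# hides the devices (unsetting it entirely would re-enable them):
printenv CUDA_VISIBLE_DEVICES >/dev/null && [ -z "$CUDA_VISIBLE_DEVICES" ] && echo "GPU hidden"
```

Note the distinction: an *empty but set* variable hides all devices, whereas an *unset* variable lets the CUDA runtime see every GPU.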