
supports running on CPU for GGML_USE_CUBLAS=ON build #3946

Merged
3 commits merged into ggerganov:master on Nov 7, 2023

Conversation

@wsxiaoys (Contributor) commented Nov 4, 2023

The build should work on a machine without a CUDA runtime as long as model.n_gpu_layers = 0.

The current behavior on master is to throw the following error on a non-CUDA machine when GGML_USE_CUBLAS=ON:

master

CPU

CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128

CUDA error 100 at /home/ubuntu/sky_workdir/llama.cpp/ggml-cuda.cu:5830: no CUDA-capable device is detected
current device: 48

this PR

CPU

CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128

llama_print_timings:        load time =     198.51 ms
llama_print_timings:      sample time =     100.40 ms /   128 runs   (    0.78 ms per token,  1274.95 tokens per second)
llama_print_timings: prompt eval time =    1383.23 ms /    20 tokens (   69.16 ms per token,    14.46 tokens per second)
llama_print_timings:        eval time =    9991.21 ms /   127 runs   (   78.67 ms per token,    12.71 tokens per second)
llama_print_timings:       total time =   11540.18 ms

CPU but requesting ngl > 0

CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 999

CUDA error 100 at /home/ubuntu/sky_workdir/llama.cpp/ggml-cuda.cu:478: no CUDA-capable device is detected

CUDA

./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 999

llama_print_timings:        load time =     397.99 ms
llama_print_timings:      sample time =     101.63 ms /   128 runs   (    0.79 ms per token,  1259.51 tokens per second)
llama_print_timings: prompt eval time =      54.91 ms /    20 tokens (    2.75 ms per token,   364.23 tokens per second)
llama_print_timings:        eval time =    1768.41 ms /   127 runs   (   13.92 ms per token,    71.82 tokens per second)
llama_print_timings:       total time =    1979.19 ms
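
For downstream integrators, a possible usage pattern is to query ggml_cublas_loaded() (added in this PR) after backend initialization and only request GPU offload when it returns true. This is an illustrative sketch, not code from this PR: llama_model_default_params() and llama_load_model_from_file() are assumed from the llama.cpp API of this period, and ggml_init_cublas() is assumed to have already been run as part of normal initialization.

    #include "llama.h"
    #ifdef GGML_USE_CUBLAS
    #include "ggml-cuda.h"
    #endif

    // Sketch: choose n_gpu_layers based on whether the cuBLAS build actually
    // found a usable CUDA device at runtime.
    static struct llama_model * load_model(const char * path, int requested_ngl) {
        struct llama_model_params mparams = llama_model_default_params();
    #ifdef GGML_USE_CUBLAS
        mparams.n_gpu_layers = ggml_cublas_loaded() ? requested_ngl : 0;
    #else
        mparams.n_gpu_layers = 0;
        (void) requested_ngl;
    #endif
        return llama_load_model_from_file(path, mparams);
    }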

@wsxiaoys wsxiaoys marked this pull request as draft November 4, 2023 22:21
llama.cpp: review comment (outdated, resolved)
@wsxiaoys wsxiaoys changed the title Prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build Prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=ON build Nov 4, 2023
@wsxiaoys wsxiaoys force-pushed the support-cuda-separate-init branch 2 times, most recently from 1caf0c4 to 1a1ffd4 Compare November 4, 2023 23:22
@wsxiaoys wsxiaoys marked this pull request as ready for review November 4, 2023 23:46
@wsxiaoys wsxiaoys changed the title Prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=ON build supports running on CPU for GGML_USE_CUBLAS=ON build Nov 4, 2023
@ggerganov (Owner) commented:

We might want to merge something like this PR, so that 3rd party projects have an easier way to support optional CPU-only runs. Though, I'm not sure this is the best way to do it.
Any alternative suggestions?

@ggerganov added the "need feedback" label (Testing and feedback with results are needed) Nov 5, 2023
ggml.c: review comment (outdated, resolved)
@wsxiaoys (Contributor, Author) commented Nov 5, 2023

We might want to merge something like this PR, so that 3rd party projects have an easier way to support optional CPU-only runs. Though, I'm not sure this is the best way to do it. Any alternative suggestions?

Some alternative ideas considered when prototyping this pull request (assuming GGML_USE_CUBLAS=ON):

  1. Utilizing the CUDA_VISIBLE_DEVICES environment variable: if no valid device indices are present in CUDA_VISIBLE_DEVICES, we can treat CUDA as unavailable and expose an API that is valid without depending on ggml_cublas_init (a rough sketch follows this list).

  2. Relocating ggml_cublas_init to llama.cpp, similar to the 'metal' backend: I haven't delved deeply into this approach, as it appears to be incompatible with the ggml backend v2.
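
A rough illustration of alternative 1 (referenced above): honor the CUDA_VISIBLE_DEVICES=-1 / empty-list convention before ever touching the CUDA runtime. This approach was not pursued in the PR; the helper name cuda_disabled_by_env is hypothetical.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    // Hypothetical helper (not part of this PR): treat an unset variable as
    // "let the CUDA runtime decide", and an empty list or "-1" as
    // "CUDA explicitly disabled", without calling any CUDA runtime function.
    static bool cuda_disabled_by_env(void) {
        const char * env = getenv("CUDA_VISIBLE_DEVICES");
        if (env == NULL) {
            return false;
        }
        return env[0] == '\0' || strcmp(env, "-1") == 0;
    }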

@slaren (Collaborator) commented Nov 5, 2023

Most of this will become obsolete after llama.cpp is adapted to use ggml-backend. After that, the way this will be implemented is by making ggml_backend_cuda_init return NULL (or something to the same effect). I cannot say when ggml-backend will be ready to be used in llama.cpp, so I think it is ok to add a temporary solution, but considering that, the best solution IMO is whatever is easiest to remove in the future.
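
For illustration, the ggml-backend version of this idea lets the caller fall back to the CPU backend when CUDA initialization fails. This is only a sketch of the suggestion: the exact signature of ggml_backend_cuda_init (including whether it takes a device index) is an assumption here, while ggml_backend_cpu_init() is the existing CPU backend entry point.

    #include "ggml-backend.h"
    #ifdef GGML_USE_CUBLAS
    #include "ggml-cuda.h"
    #endif

    // Sketch: once llama.cpp uses ggml-backend, a NULL return from the CUDA
    // backend initializer can mean "no usable CUDA device", and the caller
    // simply falls back to the CPU backend.
    static ggml_backend_t init_backend(void) {
        ggml_backend_t backend = NULL;
    #ifdef GGML_USE_CUBLAS
        backend = ggml_backend_cuda_init(0);  // assumed signature: device index
    #endif
        if (backend == NULL) {
            backend = ggml_backend_cpu_init();
        }
        return backend;
    }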

@wsxiaoys (Contributor, Author) commented Nov 6, 2023

I cleaned it up a bit and think it should be relatively easy to migrate to ggml-backend.

Comment on lines +601 to +605
    if (ggml_cublas_loaded()) {
        return ggml_cuda_host_malloc(n);
    } else {
        return malloc(n);
    }
@ggerganov (Owner):

Should we move the ggml_cublas_loaded() checks inside the ggml_cuda_host_malloc() and ggml_cuda_host_free() calls?

@wsxiaoys (Contributor, Author):

I can do that, but I feel it might be better to keep it explicit. This way, downstream code will have a chance to tell whether the call is actually allocating CUDA memory, which is likely to involve certain memory alignment requirements.
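
For comparison, moving the check inside the allocator would look roughly like the sketch below (not the merged code): callers could then call ggml_cuda_host_malloc()/ggml_cuda_host_free() unconditionally, at the cost of hiding whether the returned memory is actually pinned CUDA host memory.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    bool ggml_cublas_loaded(void);  // as declared in ggml-cuda.h by this PR

    // Sketch only: fall back to plain malloc/free when no CUDA device was
    // detected during initialization.
    void * ggml_cuda_host_malloc(size_t size) {
        if (!ggml_cublas_loaded()) {
            return malloc(size);
        }
        void * ptr = NULL;
        if (cudaMallocHost(&ptr, size) != cudaSuccess) {
            return NULL;  // caller decides how to handle the failure
        }
        return ptr;
    }

    void ggml_cuda_host_free(void * ptr) {
        if (!ggml_cublas_loaded()) {
            free(ptr);
            return;
        }
        cudaFreeHost(ptr);
    }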

@ggerganov (Owner) left a review comment:

I think this can serve as a temp solution, until we start using the new backend and refactor this code.

One question: if I had a machine with a CUDA device but still wanted to force CPU-only computation, what would be my option? Setting CUDA_VISIBLE_DEVICES=-1? Note that simply setting -ngl 0 would not work, because ggml will keep moving data to the device and do some of the computations there instead of on the CPU.

@@ -18,6 +18,8 @@ extern "C" {
#define GGML_CUDA_MAX_DEVICES 16

GGML_API void ggml_init_cublas(void);
GGML_API bool ggml_cublas_loaded(void);
@ggerganov (Owner):

Add a comment noting that we differentiate between the "initialized" and "loaded" states. The former means we have called ggml_init_cublas(), but it is not guaranteed that a CUDA device was actually available, in which case ggml_cublas_loaded() is false.

@wsxiaoys (Contributor, Author):

done
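
For reference, the requested distinction could be documented directly in ggml-cuda.h along these lines (a sketch of the wording, not necessarily the exact comment committed in "doc: add comments to ggml_cublas_loaded()"):

    // "Initialized" vs. "loaded":
    //  - ggml_init_cublas() may be called unconditionally in a GGML_USE_CUBLAS=ON
    //    build; it only marks the CUDA code path as initialized.
    //  - ggml_cublas_loaded() returns true only if a CUDA device was actually
    //    detected during that initialization; otherwise callers must stay on
    //    the CPU code paths.
    GGML_API void ggml_init_cublas(void);
    GGML_API bool ggml_cublas_loaded(void);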

@wsxiaoys (Contributor, Author) commented Nov 6, 2023

Note that simply setting -ngl 0 would not work, because ggml will keep moving data to the device and do some of the computations there instead of on the CPU.

I experimented with -ngl 0 and it appeared to match the performance of the CPU-only option. Nevertheless, this may not be a very common use case, so we should consider recommending CUDA_VISIBLE_DEVICES=-1 instead.

@ggerganov (Owner) commented:

It's possible that the behaviour with -ngl 0 has changed and now would actually perform everything on the CPU without moving data back and forth as it used to do in the past. If you don't see any GPU usage in your test, then all should be good.

@ggerganov ggerganov requested a review from slaren November 6, 2023 08:24
llama.cpp: review comment (outdated, resolved)
@slaren (Collaborator) commented Nov 6, 2023

It's possible that the behaviour with -ngl 0 has changed and now would actually perform everything on the CPU without moving data back and forth as it used to do in the past. If you don't see any GPU usage in your test, then all should be good.

That's not the case; CUDA is still used for large matrix multiplications regardless of the value of -ngl.

@wsxiaoys (Contributor, Author) commented Nov 7, 2023

Kindly ping @cebtenzzre

@ggerganov ggerganov merged commit 46876d2 into ggerganov:master Nov 7, 2023
32 checks passed
wsxiaoys added a commit to TabbyML/llama.cpp that referenced this pull request Nov 7, 2023
…v#3946)

* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
@zhangp365 commented:

I compiled the code using the latest version and copied the resulting executable to a computer without a graphics card and without CUDA installed. When I attempted to run main.exe, it failed to execute. The relevant error message is as follows:
[screenshot of the error attached]

olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
…v#3946)

* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
@wsxiaoys (Contributor, Author) commented Feb 3, 2024

Due to newer changes that link against libcuda, the fix no longer works.

It generates the following error message for a CUDA build when running in a non-CUDA environment:

/opt/tabby/bin/tabby: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

#4606 seems to be the culprit.

Looking into a workaround.
