
supports running on CPU for GGML_USE_CUBLAS=ON build #3946

Merged
3 commits merged into ggerganov:master on Nov 7, 2023

Conversation

@wsxiaoys (Contributor) commented Nov 4, 2023

The build should work on a machine without a CUDA runtime as long as model.n_gpu_layers = 0.

The current behavior on master is to throw the following error on a non-CUDA machine when GGML_USE_CUBLAS=ON:

master

CPU

CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128

CUDA error 100 at /home/ubuntu/sky_workdir/llama.cpp/ggml-cuda.cu:5830: no CUDA-capable device is detected
current device: 48

this PR

CPU

CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128

llama_print_timings:        load time =     198.51 ms
llama_print_timings:      sample time =     100.40 ms /   128 runs   (    0.78 ms per token,  1274.95 tokens per second)
llama_print_timings: prompt eval time =    1383.23 ms /    20 tokens (   69.16 ms per token,    14.46 tokens per second)
llama_print_timings:        eval time =    9991.21 ms /   127 runs   (   78.67 ms per token,    12.71 tokens per second)
llama_print_timings:       total time =   11540.18 ms

CPU but requesting ngl > 0

CUDA_VISIBLE_DEVICES=-1 ./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 999

CUDA error 100 at /home/ubuntu/sky_workdir/llama.cpp/ggml-cuda.cu:478: no CUDA-capable device is detected

CUDA

./bin/main -m ../models/q8_0.v2.gguf -p "# Dijkstra's shortest path algorithm in Python (4 spaces indentation) + complexity analysis:\n\n" -e -t 4 --temp -1 -n 128 -ngl 999

llama_print_timings:        load time =     397.99 ms
llama_print_timings:      sample time =     101.63 ms /   128 runs   (    0.79 ms per token,  1259.51 tokens per second)
llama_print_timings: prompt eval time =      54.91 ms /    20 tokens (    2.75 ms per token,   364.23 tokens per second)
llama_print_timings:        eval time =    1768.41 ms /   127 runs   (   13.92 ms per token,    71.82 tokens per second)
llama_print_timings:       total time =    1979.19 ms
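
For downstream integrators, a possible usage pattern is to query ggml_cublas_loaded() (added in this PR) after backend initialization and only request GPU offload when it returns true. This is an illustrative sketch, not code from this PR: llama_model_default_params() and llama_load_model_from_file() are assumed from the llama.cpp API of this period, and ggml_init_cublas() is assumed to have already been run as part of normal initialization.

    #include "llama.h"
    #ifdef GGML_USE_CUBLAS
    #include "ggml-cuda.h"
    #endif

    // Sketch: choose n_gpu_layers based on whether the cuBLAS build actually
    // found a usable CUDA device at runtime.
    static struct llama_model * load_model(const char * path, int requested_ngl) {
        struct llama_model_params mparams = llama_model_default_params();
    #ifdef GGML_USE_CUBLAS
        mparams.n_gpu_layers = ggml_cublas_loaded() ? requested_ngl : 0;
    #else
        mparams.n_gpu_layers = 0;
        (void) requested_ngl;
    #endif
        return llama_load_model_from_file(path, mparams);
    }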

@wsxiaoys wsxiaoys marked this pull request as draft November 4, 2023 22:21
llama.cpp: review comment (outdated, resolved)
@wsxiaoys wsxiaoys changed the title Prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build Prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=ON build Nov 4, 2023
@wsxiaoys wsxiaoys force-pushed the support-cuda-separate-init branch 2 times, most recently from 1caf0c4 to 1a1ffd4 Compare November 4, 2023 23:22
@wsxiaoys wsxiaoys marked this pull request as ready for review November 4, 2023 23:46
@wsxiaoys wsxiaoys changed the title Prototyping the idea that supports running on CPU for a GGML_USE_CUBLAS=ON build supports running on CPU for GGML_USE_CUBLAS=ON build Nov 4, 2023
@ggerganov (Owner) commented:

We might want to merge something like this PR, so that 3rd party projects have an easier way to support optional CPU-only runs. Though, I'm not sure this is the best way to do it.
Any alternative suggestions?

@ggerganov added the "need feedback" label (Testing and feedback with results are needed) Nov 5, 2023
ggml.c: review comment (outdated, resolved)
@wsxiaoys (Contributor, Author) commented Nov 5, 2023

We might want to merge something like this PR, so that 3rd party projects have an easier way to support optional CPU-only runs. Though, I'm not sure this is the best way to do it. Any alternative suggestions?

Some alternative ideas considered when prototyping this pull request (assuming GGML_USE_CUBLAS=ON):

  1. Utilizing the CUDA_VISIBLE_DEVICES environment variable: if no valid device indices are present in CUDA_VISIBLE_DEVICES, we can treat CUDA as unavailable and expose an API that is valid without depending on ggml_cublas_init (a rough sketch follows this list).

  2. Relocating ggml_cublas_init to llama.cpp, similar to the 'metal' backend: I haven't delved deeply into this approach, as it appears to be incompatible with the ggml backend v2.
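
A rough illustration of alternative 1 (referenced above): honor the CUDA_VISIBLE_DEVICES=-1 / empty-list convention before ever touching the CUDA runtime. This approach was not pursued in the PR; the helper name cuda_disabled_by_env is hypothetical.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <string.h>

    // Hypothetical helper (not part of this PR): treat an unset variable as
    // "let the CUDA runtime decide", and an empty list or "-1" as
    // "CUDA explicitly disabled", without calling any CUDA runtime function.
    static bool cuda_disabled_by_env(void) {
        const char * env = getenv("CUDA_VISIBLE_DEVICES");
        if (env == NULL) {
            return false;
        }
        return env[0] == '\0' || strcmp(env, "-1") == 0;
    }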

@slaren (Collaborator) commented Nov 5, 2023

Most of this will become obsolete after llama.cpp is adapted to use ggml-backend. After that, the way this will be implemented is by making ggml_backend_cuda_init return NULL (or something to the same effect). I cannot say when ggml-backend will be ready to be used in llama.cpp, so I think it is ok to add a temporary solution, but considering that, the best solution IMO is whatever is easiest to remove in the future.
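
For illustration, the ggml-backend version of this idea lets the caller fall back to the CPU backend when CUDA initialization fails. This is only a sketch of the suggestion: the exact signature of ggml_backend_cuda_init (including whether it takes a device index) is an assumption here, while ggml_backend_cpu_init() is the existing CPU backend entry point.

    #include "ggml-backend.h"
    #ifdef GGML_USE_CUBLAS
    #include "ggml-cuda.h"
    #endif

    // Sketch: once llama.cpp uses ggml-backend, a NULL return from the CUDA
    // backend initializer can mean "no usable CUDA device", and the caller
    // simply falls back to the CPU backend.
    static ggml_backend_t init_backend(void) {
        ggml_backend_t backend = NULL;
    #ifdef GGML_USE_CUBLAS
        backend = ggml_backend_cuda_init(0);  // assumed signature: device index
    #endif
        if (backend == NULL) {
            backend = ggml_backend_cpu_init();
        }
        return backend;
    }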

@wsxiaoys (Contributor, Author) commented Nov 6, 2023

I cleaned it up a bit and think it should be relatively easy to migrate to ggml-backend.

Comment on lines +601 to +605
    if (ggml_cublas_loaded()) {
        return ggml_cuda_host_malloc(n);
    } else {
        return malloc(n);
    }
@ggerganov (Owner):

Should we move the ggml_cublas_loaded() checks inside the ggml_cuda_host_malloc() and ggml_cuda_host_free() calls?

@wsxiaoys (Contributor, Author):

I can do that, but I feel it might be better to keep it explicit. This way, downstream code will have a chance to tell whether the call is actually allocating CUDA memory, which is likely to involve certain memory alignment requirements.
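
For comparison, moving the check inside the allocator would look roughly like the sketch below (not the merged code): callers could then call ggml_cuda_host_malloc()/ggml_cuda_host_free() unconditionally, at the cost of hiding whether the returned memory is actually pinned CUDA host memory.

    #include <stdbool.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    bool ggml_cublas_loaded(void);  // as declared in ggml-cuda.h by this PR

    // Sketch only: fall back to plain malloc/free when no CUDA device was
    // detected during initialization.
    void * ggml_cuda_host_malloc(size_t size) {
        if (!ggml_cublas_loaded()) {
            return malloc(size);
        }
        void * ptr = NULL;
        if (cudaMallocHost(&ptr, size) != cudaSuccess) {
            return NULL;  // caller decides how to handle the failure
        }
        return ptr;
    }

    void ggml_cuda_host_free(void * ptr) {
        if (!ggml_cublas_loaded()) {
            free(ptr);
            return;
        }
        cudaFreeHost(ptr);
    }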

@ggerganov (Owner) left a review comment:

I think this can serve as a temp solution, until we start using the new backend and refactor this code.

One question: if I had a machine with a CUDA device but still wanted to force CPU-only computation, what would be my option? Setting CUDA_VISIBLE_DEVICES=-1? Note that simply setting -ngl 0 would not work, because ggml will keep moving data to the device and do some of the computations there instead of on the CPU.

@@ -18,6 +18,8 @@ extern "C" {
#define GGML_CUDA_MAX_DEVICES 16

GGML_API void ggml_init_cublas(void);
GGML_API bool ggml_cublas_loaded(void);
@ggerganov (Owner):

Add a comment noting that we differentiate between the "initialized" and "loaded" states. The former means we have called ggml_init_cublas(), but it is not guaranteed that a CUDA device was actually available, in which case ggml_cublas_loaded() is false.

@wsxiaoys (Contributor, Author):

done
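
For reference, the requested distinction could be documented directly in ggml-cuda.h along these lines (a sketch of the wording, not necessarily the exact comment committed in "doc: add comments to ggml_cublas_loaded()"):

    // "Initialized" vs. "loaded":
    //  - ggml_init_cublas() may be called unconditionally in a GGML_USE_CUBLAS=ON
    //    build; it only marks the CUDA code path as initialized.
    //  - ggml_cublas_loaded() returns true only if a CUDA device was actually
    //    detected during that initialization; otherwise callers must stay on
    //    the CPU code paths.
    GGML_API void ggml_init_cublas(void);
    GGML_API bool ggml_cublas_loaded(void);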

@wsxiaoys (Contributor, Author) commented Nov 6, 2023

Note that simply setting -ngl 0 would not work, because ggml will keep moving data to the device and do some of the computations there instead of on the CPU.

I experimented with -ngl 0 and it appeared to match the performance of the CPU-only option. Nevertheless, this may not be a very common use case, so we should consider recommending CUDA_VISIBLE_DEVICES=-1 instead.

@ggerganov (Owner) commented:

It's possible that the behaviour with -ngl 0 has changed and now would actually perform everything on the CPU without moving data back and forth as it used to do in the past. If you don't see any GPU usage in your test, then all should be good.

@ggerganov ggerganov requested a review from slaren November 6, 2023 08:24
llama.cpp: review comment (outdated, resolved)
@slaren (Collaborator) commented Nov 6, 2023

It's possible that the behaviour with -ngl 0 has changed and now would actually perform everything on the CPU without moving data back and forth as it used to do in the past. If you don't see any GPU usage in your test, then all should be good.

That's not the case; CUDA is still used for large matrix multiplications regardless of the value of -ngl.

@wsxiaoys (Contributor, Author) commented Nov 7, 2023

Kindly ping @cebtenzzre

@ggerganov ggerganov merged commit 46876d2 into ggerganov:master Nov 7, 2023
32 checks passed
wsxiaoys added a commit to TabbyML/llama.cpp that referenced this pull request Nov 7, 2023
…v#3946)

* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
@zhangp365 commented:

I compiled the code using the latest version and copied the resulting executable to a computer without a graphics card and without CUDA installed. When I attempted to run main.exe, it failed to execute. The relevant error message is as follows:
[screenshot of the error attached]

olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request Nov 23, 2023
…v#3946)

* protyping the idea that supports running on CPU for a GGML_USE_CUBLAS=on build

* doc: add comments to ggml_cublas_loaded()

* fix defined(...)
@wsxiaoys (Contributor, Author) commented Feb 3, 2024

Due to newer changes that link against libcuda, the fix no longer works.

It generates the following error message for a CUDA build when running in a non-CUDA environment:

/opt/tabby/bin/tabby: error while loading shared libraries: libcuda.so.1: cannot open shared object file: No such file or directory

#4606 seems to be the culprit.

Looking into a workaround.
