
Kudos on a great job! Need a little help with BLAS #32

Closed
regstuff opened this issue Apr 6, 2023 · 9 comments

regstuff commented Apr 6, 2023

Let me first congratulate everyone working on this for:

  1. Python bindings for llama.cpp
  2. Making them compatible with openai's api
  3. Superb documentation!

Was wondering if anyone can help me get this working with BLAS? Right now, when the model loads, I see BLAS = 0.
I've been using kobold.cpp, which has a compile-time flag that enables BLAS. It cuts prompt loading time by 3-4x, which is a major factor in handling longer prompts and chat-style messages.

P.S - Was also wondering what the difference is between create_embedding(input) and embed(input)?

abetlen (Owner) commented Apr 6, 2023

Thank you!

> P.S - Was also wondering what the difference is between create_embedding(input) and embed(input)?

Just the return signature: create_embedding returns an object identical to openai.Embedding.create's response, whereas embed just returns a list of floats.
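
For illustration, a minimal sketch of the two calls (the model path is a placeholder, and it assumes the Llama object is constructed with embedding=True):

from llama_cpp import Llama

# Load a local GGML model with embedding support enabled (path is a placeholder).
llm = Llama(model_path="./models/ggml-model-q4_0.bin", embedding=True)

# create_embedding() returns an OpenAI-style response object.
response = llm.create_embedding("Hello, world!")
vector_from_response = response["data"][0]["embedding"]

# embed() returns just the list of floats.
vector = llm.embed("Hello, world!")

# Both should hold the same numbers; only the wrapping differs.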

> Was wondering if anyone can help me get this working with BLAS? Right now when the model loads, I see BLAS=0.

At the moment, installing this library is equivalent to building llama.cpp as a shared library with cmake using more or less the default args. There's an open issue about loading a custom shared library version, but I don't think that's the right solution for configuration.

I think we could support, e.g., setting environment variables before installation to force certain features. Do you mind installing llama.cpp standalone with BLAS support and telling me the process, so I can add something to the setup.py? Thanks.
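
As a rough sketch of that idea (not the project's actual setup.py; the names and flag spelling are illustrative, using the LLAMA_OPENBLAS variable mentioned elsewhere in this thread), a scikit-build based setup script could forward an environment variable to the CMake build like this:

import os
from skbuild import setup  # scikit-build's drop-in replacement for setuptools.setup

# Opt-in: forward LLAMA_OPENBLAS from the environment to CMake as a cache entry.
cmake_args = []
if os.environ.get("LLAMA_OPENBLAS", "").lower() in ("1", "on", "true"):
    cmake_args.append("-DLLAMA_OPENBLAS=on")

setup(
    name="llama_cpp_python",
    version="0.0.0",  # placeholder version for the sketch
    packages=["llama_cpp"],
    cmake_args=cmake_args,
)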

regstuff (Author) commented Apr 6, 2023

I did: make LLAMA_OPENBLAS=1

ghost commented Apr 7, 2023

I got OpenBLAS working with llama-cpp-python, though it requires modifications to the CMakeLists.txt files. This provides a nice performance boost during prompt ingestion compared to builds without OpenBLAS.

This was tested on Ubuntu 22 and I'll leave the exercise of getting this configurable and working on all platforms to the devs 😀

In CMakeLists.txt, add after project(llama_cpp):

set(CMAKE_POLICY_DEFAULT_CMP0077 NEW)

set(LLAMA_OPENBLAS ON)

In vendor/llama.cpp/CMakeLists.txt replace line 247 with:

target_link_libraries(llama PRIVATE ggml ${LLAMA_EXTRA_LIBS} openblas)

For generating the shared llama.cpp library, -lopenblas was required to get the symbols to appear properly in the .so file (see ggerganov/llama.cpp#412 (comment)). This is not required when generating the regular executable version of llama.cpp.

ghost commented Apr 8, 2023

I got CMake OpenBLAS support into upstream llama.cpp (ggerganov/llama.cpp@f2d1c47), but it looks like you guys jumped the gun on me and switched to using the Makefile to build llama.cpp.

Since the Makefile is being used, we can easily enable OpenBLAS support using an environment variable (and I believe there are ways to append an argument to pip install so that we can send flags over to the installer). Or perhaps the setup script could detect whether the user has OpenBLAS installed and automatically enable it if so.
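
For the auto-detection idea, a hedged sketch of the kind of check a setup script could run (illustrative only; it uses Python's standard ctypes.util to look for a shared OpenBLAS library on the system):

import ctypes.util

def openblas_available() -> bool:
    # find_library() returns a library name/path if a shared OpenBLAS can be
    # located on this system, or None if it cannot.
    return ctypes.util.find_library("openblas") is not None

if __name__ == "__main__":
    print("OpenBLAS found:", openblas_available())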

abetlen (Owner) commented Apr 8, 2023

@eiery I think the environment variable approach is the way to go; we can document some common settings in the README and ask the user to run pip install --force-reinstall --ignore-installed llama-cpp-python.

ghost commented Apr 9, 2023

Great! For the record, the correct command to get OpenBLAS working in the pip install is:

LLAMA_OPENBLAS=on pip install --force-reinstall --ignore-installed --no-cache-dir llama-cpp-python

We need to clear the cache as well, or else pip just uses the cached build and does not recompile llama.cpp. Feel free to add this to the README.
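
After reinstalling, one quick way to confirm the wheel was actually built with BLAS, without loading a model, is to print llama.cpp's system info string from the low-level bindings and look for BLAS = 1 (a sketch; it assumes llama_cpp.llama_print_system_info() is exposed in the installed version):

import llama_cpp

# Returns the same feature string llama.cpp prints at model load
# (AVX, FMA, BLAS, ...), as bytes.
info = llama_cpp.llama_print_system_info().decode("utf-8")
print(info)
print("BLAS enabled:", "BLAS = 1" in info)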

Now to get this up into oobabooga...

gjmulder (Contributor) commented:

I can't get BLAS to enable:

$ rm -rf _skbuild/

$ LLAMA_OPENBLAS=on pip install --force-reinstall --ignore-installed --no-cache-dir llama-cpp-python
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.33.tar.gz (1.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.1/1.1 MB 18.5 MB/s eta 0:00:00
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Collecting typing-extensions>=4.5.0
  Downloading typing_extensions-4.5.0-py3-none-any.whl (27 kB)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml) ... done
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.1.33-cp310-cp310-linux_x86_64.whl size=136284 sha256=01c535e6d8a3245619b03971ed647dd657c09c069c6c0d12904f86b836d3899f
  Stored in directory: /data/tmp/pip-ephem-wheel-cache-j_6kc3tv/wheels/7d/56/a8/1f25f650cc0e65111f077cc49454a388ee6ae62de56236ee79
Successfully built llama-cpp-python
Installing collected packages: typing-extensions, llama-cpp-python
Successfully installed llama-cpp-python-0.1.33 typing-extensions-4.5.0

$ python3 -m llama_cpp.server
llama.cpp: loading model from /data/llama/alpaca-13B-ggml/ggml-model-q4_0.bin
llama.cpp: can't use mmap because tensors are not aligned; convert to new format to avoid this
llama_model_load_internal: format     = ggmf v1 (old version with no mmap support)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size = 7945693.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
...................................................................................................
.
llama_init_from_file: kv self size  = 1600.00 MB
AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 | 
INFO:     Started server process [917202]
INFO:     Waiting for application startup.
INFO:     Application startup complete.
INFO:     Uvicorn running on http://localhost:8000 (Press CTRL+C to quit)

abetlen (Owner) commented Apr 15, 2023

Are you on Windows? I think the environment variable passing only works for the Makefile builds, which are currently Unix-only. I'm not sure how to pass environment variables through to cmake; maybe it needs a change to the root CMakeLists.txt.

abetlen (Owner) commented Apr 15, 2023

@gjmulder I also wonder if this is related: ggerganov/llama.cpp#992
