Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Swiftui metal update #1

Closed
wants to merge 274 commits into from
Closed

Swiftui metal update #1

wants to merge 274 commits into from

Conversation

bachittle
Copy link
Owner

@bachittle bachittle commented Oct 6, 2023

NOTE: this merge request is more so for documenting purposes, but the actual merge will not occur.

trying to update to latest llama (the batch update), encountering some issues.

Some models are able to load successfully, but they crash when hitting the llama_decode function. This is the following error message and call stack, that is reproducible with multiple models:
image

Some other models do not even load at all, such as the starcoder-1b model. It states that there is an "invalid character", maybe some sort of utf-8 issue?
image

For now will just stick to using swiftui_metal with the older version of llama.cpp until a solution is found.

a10y and others added 30 commits September 29, 2023 14:15
* Add link to grammars app per @ggernagov suggestion

Adding a sentence in the Grammars section of README to point to grammar app, per ggerganov#2494 (comment)

* Update README.md
…rganov#3206)

* llama : enable mmap in quantize on Linux -> 31% faster

* also enable mmap on Windows

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* train : fix KQ_pos allocation

* make sure KQ_pos is not reallocated in finetune

---------

Co-authored-by: xaedes <xaedes@gmail.com>
…gerganov#3401)

* llama.cpp : add documentation about rope_freq_base and scale values

* add notice to hot topics
…ov#3412)

* ggml-cuda : perform cublas matrix multiplication of quantized types as fp16

* rename CC_TURING to CC_VOLTA

* disable fp16 mat mul completely with multi GPU
* vvhg-code-infill (#1)

* infill in separate example (#2)

* reverted changes to main and added infill example

* cleanup

* naming improvement

* make : add missing blank line

* fix missing semicolon

* brought infill up to current main code

* cleanup

---------

Co-authored-by: Cebtenzzre <cebtenzzre@gmail.com>
the shapes for init model of gqa models was wrong
* cmake : fix misuse of cxx_flags

* cmake : make CUDA flags more similar to the Makefile

* cmake : fix MSVC build
)

Broadcast src0 into src1 across dimensions 2 and 3 when needed.
This is required for models that use GQA.
* Work on the BPE tokenizer

Tokenizer tests work for Falcon-7B

* Try to fix build problem

* Fix debug assertion failure

* Fix MSVC Unicode BOM problem

* Cleanup and an improvement

* Fix compiler warning

* Cleanup

* Test doesn't work over the full range of Unicodes

* Update .gitignore and Makefile

* Another Makefile rule

* Testing Aquila

* Moving byte decoding back to `token_to_piece` ...

... because everyone is using it.

* Guarding some unusable code pathes

* Streamlining code and adding some more assertions

Important change: I'm classifying added tokens as control tokens now for BPE.

* Adding a comment

* Adding another assertion

* Fixed vocabulary guarding assertions

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fix PR for recent change

* Fix PR for recent change

* Fix PR for recent change

* Fix for compiler warning

* Fixes for more compiler warnings

* Remove unused code

* Fix initialization of static maps

* Add scores and token types back, adapt gptneox

* Update llama.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Update unicode.h

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* Ported Starcoder and added some assertions

* Fix coding style

* Apply @jploski 's fix for missing tokens

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…d by the processor (ggerganov#3273)

* fix LLAMA_NATIVE

* syntax

* alternate implementation

* my eyes must be getting bad...

* set cmake LLAMA_NATIVE=ON by default

* march=native doesn't work for ios/tvos, so disable for those targets. also see what happens if we use it on msvc

* revert 8283237 and only allow LLAMA_NATIVE on x86 like the Makefile

* remove -DLLAMA_MPI=ON

---------

Co-authored-by: netrunnereve <netrunnereve@users.noreply.github.com>
so it can be scaled further before creating a context.
* llama : fix session saving/loading

* llama : temp fix for clearing "future" tokens from the KV cache

* llama : fix handling of "future" tokens when loading sessions

* llama : fix comments for llama_kv_cache API
* Typo

* No `--in-prefix` coloring

The `--in-prefix` text was inconsistently colored. Now, it's never colored, just like the `--in-suffix` text.
…ng intrinsics (ggerganov#3453)

* Added RVV intrinsics support for Q8 quantize row and also improved the existing dot product function for risc-v.

The RVV intrinsics is added for the following quantize row functions
   quantize_row_q8_0
   quantize_row_q8_1

The following dot product functions have also been optimized by using LMUL = 1/2 instead of LMUL = 1
   ggml_vec_dot_q4_0_q8_0
   ggml_vec_dot_q4_1_q8_1
   ggml_vec_dot_q5_0_q8_0
   ggml_vec_dot_q5_1_q8_1

And vector initialization in Q5 by temporary array is also replaced by the vid intrinsics

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

* Added RVV intrinsics support for k_quants

This adds RISC-V Vector intrinsics support for the following K_quants functions for both QKK = 256 and QKK = 64
   ggml_vec_dot_q2_K_q8_K
   ggml_vec_dot_q3_K_q8_K
   ggml_vec_dot_q4_K_q8_K
   ggml_vec_dot_q5_K_q8_K
   ggml_vec_dot_q6_K_q8_K

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>

---------

Signed-off-by: Ahmad Tameem <ahmad.tameem@10xengineers.ai>
* sync : ggml (conv 1d + 2d updates)

ggml-ci

* ggml : fix UB in q5_0 and q5_1 quantize code

ggml.c:1033:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior

ggml.c:1081:39: runtime error: left shift of 1 by 31 places cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior

ggml-ci

* tests : fix UB in test-quantize-perf
* add refact model

* resolve comments

* rebase to the latest

* solve alibi cpu error

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
…erganov#3299)

Use local GGUF package when possible in Baichuan converter
ggerganov and others added 29 commits November 13, 2023 16:55
* add safetensors to convert.py help message

* Check for single-file safetensors model

* Update convert.py "model" option help message

* revert convert.py help message change
* Add support for stablelm-3b-4e1t
* Supports GPU offloading of (n-1) layers
Co-authored-by: Jared Van Bortel <jared@nomic.ai>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Bernhard Gstrein <gstrein@cs.uni-freiburg.de>
…#4040)

* gguf-py: gguf-dump: Respect --no-tensor flag in JSON mode.

* Respect add_bos_token GGUF metadata value

* gguf-py: Try to fix SpecialVocab giving up too easily for the Nth time
* llama : fix data units

ggml-ci

* Revert "llama : fix data units"

This reverts commit f5feac8.

* llama : disambiguate data units

ggml-ci
* Fix ggerganov#4017

* Update ggml-cuda.cu

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

* Update ggml-cuda.cu

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
* finetune : zero the loraB initial vectors

Without this, the first iteration is starting out far from the base model, instead of exactly on it.
Zeroing loraB is what the paper recommends. loralib also zeroes at least one of the init vector pairs
(though it departs from the paper in using a different distribution for the other vector, in some cases).

* tabs to spaces

* Use ggml_set_zero instead of adding a new function
…anov#4079)

* Remove logically superfluous assertions and order by dimension

* Use cblas_sgemm() to implement ggml_compute_forward_out_prod()

* Remove ggml_compute_forward_out_prod_use_blas(), fix compiling errors on cmake/zig, remove trailing whitespace

* Add openBLAS support for sgemm() in compute_forward_out_prod()
* llama : add functions to get the model's metadata

* format -> std::to_string

* better documentation
ggerganov#4074)

- introduces help entry for the argument
 - cuts '--gpu-layers' form in order to simplify usage and documentation.

Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
Signed-off-by: Jiri Podivin <jpodivin@gmail.com>
Co-authored-by: Jiri Podivin <jpodivin@redhat.com>
* logging: improve escaping in yaml output

* logging: include review feedback
* build: support ppc64le build for make and CMake

* build: keep __POWER9_VECTOR__ ifdef and extend with __powerpc64__

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
@bachittle bachittle closed this Nov 22, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.