Quantized dot products for CUDA mul mat vec #2067

Merged
merged 1 commit into from
Jul 5, 2023

Conversation

JohannesGaessler
Collaborator

This PR aims to implement CUDA kernels for matrix vector multiplication that utilize dot products with quantized data instead of dequantizing the data on-the-fly. So far this is only implemented for q4_0. In order to get good performance integer intrinsics are used. Unfortunately these have very poor performance on Pascal cards so the current implementation with dequantization should be kept. For my RTX 3090 I found:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 7b q4_0 | tg128 | 91.06 | 101.39 | 1.11 |
| RTX 3090 | 13b q4_0 | tg128 | 51.88 | 57.95 | 1.12 |
| RTX 3090 | 33b q4_0 | tg128 | 22.83 | 25.71 | 1.13 |

For master I used the option LLAMA_CUDA_DMMV_F16 which uses f16 intrinsics for the calculation. Since this option is also only beneficial on relatively new cards and seemingly inferior to integer intrinsics I would suggest that the f16 option be removed in favor of this implementation.
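
The idea of computing the dot product on quantized data can be written out as a short derivation. This is a sketch in notation that is not from the PR: it assumes a q4_0 weight block with scale $d_w$ and 4-bit quants $q_i \in [0, 15]$ stored with an offset of 8, and a vector block quantized to 8 bits with scale $d_x$ and quants $p_i$, both covering 32 values.

```latex
\[
\sum_{i=0}^{31} \bigl(d_w\,(q_i - 8)\bigr)\,\bigl(d_x\,p_i\bigr)
  \;=\; d_w\, d_x \sum_{i=0}^{31} (q_i - 8)\, p_i
\]
```

The sum on the right involves only small integers, so it can be evaluated with integer intrinsics, and the two floating-point scales are applied once per block. The result approximates the floating-point dot product because the vector itself has been quantized.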

@JohannesGaessler
Collaborator Author

I forgot: because I'm changing the way quantization is used in this PR I would like to prioritize it over #2043 and then think about how to approach the dequantization for that PR again.

@slaren
Collaborator

slaren commented Jul 1, 2023

This is probably not going to make much of a difference in practice, but the __syncthreads or __syncwarp before the warp shuffles shouldn't be necessary, since these already imply a sync (at least since they gained the _sync suffix, it wasn't always the case). See this for more details: https://developer.nvidia.com/blog/using-cuda-warp-level-primitives/

> All the participating threads must be synchronized for the collective operation to work correctly. Therefore, these primitives first synchronize the threads if they are not already synchronized.

You may also gain some additional performance if instead of quantizing the vector to DRAM, you do it to shared memory at the beginning of vec_dot_q. It should be small enough to fit, in most cases at least.
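
The first point in this comment, that the `_sync` shuffles are collective, can be illustrated with a host-side sketch of a warp sum reduction. This is plain C++ that emulates the effect of `v += __shfl_xor_sync(0xffffffff, v, mask, 32)` on all 32 lanes at once; the names `Warp`, `shfl_xor_add`, and `warp_reduce_sum` are made up for illustration and are not taken from the PR.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t WARP_SIZE = 32;

// Host-side stand-in for a warp: one partial sum per lane.
using Warp = std::array<float, WARP_SIZE>;

// Emulates `v += __shfl_xor_sync(0xffffffff, v, mask, 32)` for every lane.
// On the GPU the exchange is a collective operation over the participating
// lanes, which is why no separate barrier is required right before it.
inline void shfl_xor_add(Warp &lanes, std::size_t mask) {
    assert(mask > 0 && mask < WARP_SIZE);
    const Warp snapshot = lanes;  // all lanes read the pre-exchange values
    for (std::size_t lane = 0; lane < WARP_SIZE; ++lane) {
        lanes[lane] = snapshot[lane] + snapshot[lane ^ mask];
    }
}

// Butterfly reduction: after log2(32) = 5 steps every lane holds the full sum.
inline float warp_reduce_sum(Warp lanes) {
    for (std::size_t mask = WARP_SIZE / 2; mask > 0; mask >>= 1) {
        shfl_xor_add(lanes, mask);
    }
    return lanes[0];
}
```

In an actual kernel each lane would hold its own partial dot product, and the loop body would be the single shuffle-and-add statement.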

@JohannesGaessler
Collaborator Author

> You may also gain some additional performance if instead of quantizing the vector to DRAM, you do it to shared memory at the beginning of vec_dot_q. It should be small enough to fit, in most cases at least.

The problem is that the vector is loaded thousands of times by different blocks. So I think that quantizing once and then writing the quantized version to DRAM is faster than quantizing thousands of times.

@slaren
Collaborator

slaren commented Jul 1, 2023

You could use one block only and a lot more threads, and compute each row in a different warp. That worked for me in some tests I have been doing with the attention, but these matrices/vectors are very small, and it may not work so well for other (larger) matrix-vector multiplications.

@JohannesGaessler
Collaborator Author

If I remember correctly the maximum block size is 1024 threads/32 warps. So for 7b where the smallest matrix has 4096 rows that would still mean quantizing the vector 128 times. Currently the quantization to q8_0 takes up 2.7% of the runtime so I don't think doing that several times is going to be viable.
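
The figure of 128 follows from a short calculation, sketched here under the assumption that each warp computes one row and every block quantizes the vector once. The function name `blocks_needed` is hypothetical.

```cpp
#include <cassert>

constexpr int WARP_SIZE = 32;

// Number of thread blocks needed when each warp computes exactly one row.
// If every block quantizes the vector itself, this is also the number of
// times the vector gets quantized.
inline int blocks_needed(int n_rows, int threads_per_block) {
    assert(n_rows > 0);
    assert(threads_per_block > 0 && threads_per_block % WARP_SIZE == 0);
    const int rows_per_block = threads_per_block / WARP_SIZE;
    return (n_rows + rows_per_block - 1) / rows_per_block;  // ceiling division
}
```

With 4096 rows and the maximum block size of 1024 threads, `blocks_needed(4096, 1024)` evaluates to 128.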

@slaren
Collaborator

slaren commented Jul 1, 2023

If you adjust the block and grid size so that all blocks can be executed simultaneously, it may not matter that you have to quantize in each block, since it will be done simultaneously anyway. That may not work if the number of blocks is higher than the capacity of the GPU, but in that case you can still compute multiple rows in each warp and adjust the number of blocks accordingly. The 3090 can execute 1536 threads per SM, so a block size of 768 to fit two blocks in each SM may work best.
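
The sizing argument can be sketched as a small calculation. This is an illustration only: it considers the thread limit per SM and ignores register and shared-memory limits, and the SM count (82 for an RTX 3090) is an input that does not appear in the thread. `LaunchPlan` and `plan_launch` are hypothetical names.

```cpp
#include <cassert>

constexpr int WARP_SIZE = 32;

struct LaunchPlan {
    int blocks_per_sm;    // blocks that fit on one SM by thread count alone
    int resident_blocks;  // blocks that can run at the same time on the GPU
    int rows_per_warp;    // rows per warp needed to cover the matrix in one wave
};

inline LaunchPlan plan_launch(int n_rows, int block_size,
                              int max_threads_per_sm, int n_sm) {
    assert(n_rows > 0 && n_sm > 0);
    assert(block_size > 0 && block_size % WARP_SIZE == 0);
    assert(block_size <= max_threads_per_sm);
    LaunchPlan p{};
    p.blocks_per_sm   = max_threads_per_sm / block_size;
    p.resident_blocks = p.blocks_per_sm * n_sm;
    const int warps   = p.resident_blocks * (block_size / WARP_SIZE);
    p.rows_per_warp   = (n_rows + warps - 1) / warps;  // ceiling division
    return p;
}
```

For 4096 rows, `plan_launch(4096, 768, 1536, 82)` gives two blocks per SM, whereas a block size of 1024 would leave a third of each SM's 1536 thread slots unused.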

@Midaychi

Midaychi commented Jul 2, 2023

> Unfortunately these have very poor performance on Pascal cards so the current implementation with dequantization should be kept.

If you use fp32-based operations on Pascal cards instead of fp16, performance should be much better.

@JohannesGaessler
Collaborator Author

I'm not using any f16 intrinsics. The option for that is already on master. I'm using __vsub4 and __dp4a to do byte-wise subtractions and dot products on integers.
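
To make the two intrinsics concrete, here is a host-side sketch in plain C++ that emulates what `__vsub4` and the signed `__dp4a` compute and uses them for the integer part of a 4-bit times 8-bit dot product over eight values. The emulation functions, `pack4`, `dot_q4_q8_8`, and the nibble layout (low nibbles first, then high nibbles) are assumptions made for illustration, not the kernel code from this PR.

```cpp
#include <cassert>
#include <cstdint>

// Emulates __vsub4: per-byte subtraction with wrap-around on four packed bytes.
inline uint32_t vsub4(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const uint8_t x = (a >> (8 * i)) & 0xFF;
        const uint8_t y = (b >> (8 * i)) & 0xFF;
        r |= static_cast<uint32_t>(static_cast<uint8_t>(x - y)) << (8 * i);
    }
    return r;
}

// Emulates the signed __dp4a: dot product of four packed int8 pairs, plus c.
inline int32_t dp4a(uint32_t a, uint32_t b, int32_t c) {
    for (int i = 0; i < 4; ++i) {
        const int8_t x = static_cast<int8_t>((a >> (8 * i)) & 0xFF);
        const int8_t y = static_cast<int8_t>((b >> (8 * i)) & 0xFF);
        c += static_cast<int32_t>(x) * static_cast<int32_t>(y);
    }
    return c;
}

// Packs four int8 values into one 32-bit word, element 0 in the lowest byte.
inline uint32_t pack4(const int8_t *v) {
    assert(v != nullptr);
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        r |= static_cast<uint32_t>(static_cast<uint8_t>(v[i])) << (8 * i);
    }
    return r;
}

// Integer dot product of eight 4-bit quants (offset by 8) with eight int8 values.
// Assumed layout: the low nibbles of `packed` pair with u_lo, the high nibbles with u_hi.
inline int32_t dot_q4_q8_8(uint32_t packed, const int8_t *u_lo, const int8_t *u_hi) {
    const uint32_t lo = vsub4(packed & 0x0F0F0F0Fu, 0x08080808u);
    const uint32_t hi = vsub4((packed >> 4) & 0x0F0F0F0Fu, 0x08080808u);
    const int32_t sum = dp4a(lo, pack4(u_lo), 0);
    return dp4a(hi, pack4(u_hi), sum);
}
```

The full block result would then be this integer sum multiplied by the two block scales. Two `vsub4` and two `dp4a` calls replace eight subtractions and eight multiply-adds, which is where the gain comes from on hardware that implements them natively.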

@JohannesGaessler
Collaborator Author

I have implemented a kernel for q4_1 and to my surprise I've found that the performance is ~10% better than for q4_0. The reason seems to be that, because a q4_1 block is 20 bytes in size vs. the 18 bytes of q4_0, it is possible to directly cast the pointer for the quants to int instead of having to resort to memcpy. Since I'm currently still using memcpy for q8_0 this implies that performance could be significantly improved by padding or reordering the q8_0 vector; I'll investigate.

More generally this may also mean that reordering the weights in some way may be of benefit after all.
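
Why the block size matters for the pointer cast can be checked with a little arithmetic. The sketch below assumes a q4_0 block is a 2-byte scale followed by 16 bytes of quants (18 bytes) and a q4_1 block is a 2-byte scale and a 2-byte minimum followed by 16 bytes of quants (20 bytes), stored back to back in a 4-byte-aligned array. The type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Size of one block and the offset of its quants within the block, in bytes.
struct BlockLayout {
    std::size_t block_bytes;
    std::size_t quants_offset;
};

constexpr BlockLayout Q4_0{18, 2};  // assumed: fp16 scale + 16 bytes of quants
constexpr BlockLayout Q4_1{20, 4};  // assumed: fp16 scale + fp16 min + 16 bytes

// Address of the quants of block i, modulo 4, for a 4-byte-aligned array.
constexpr std::size_t quants_address_mod4(BlockLayout l, std::size_t i) {
    return (i * l.block_bytes + l.quants_offset) % 4;
}

// The quants can be read through an int pointer only if they are 4-byte
// aligned in every block, i.e. both the stride and the offset are multiples of 4.
constexpr bool int_cast_is_safe(BlockLayout l) {
    return l.block_bytes % 4 == 0 && l.quants_offset % 4 == 0;
}

static_assert(!int_cast_is_safe(Q4_0), "q4_0 quants are not always 4-byte aligned");
static_assert(int_cast_is_safe(Q4_1), "q4_1 quants are always 4-byte aligned");
```

For the 18-byte layout the quants of block 0 sit at an address that is 2 modulo 4, so a 32-bit load would be misaligned and the bytes have to be assembled some other way, for example with memcpy.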

@JohannesGaessler
Collaborator Author

JohannesGaessler commented Jul 3, 2023

I pushed a version in which the vector is quantized to q8_1 (36 bytes) instead of q8_0 (34 bytes). This allows you to directly cast the quant int 8 pointers to int 32 pointers which is significantly faster. With this I get 123 t/s for q4_0 using an RTX 3090. Reordering the data so that the scales and quants are in two separate blocks seems to have slightly worse performance, presumably due to cache locality.
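
A minimal sketch of quantizing one block of 32 floats to 8 bits is shown below. It is not the kernel from this PR: the struct uses `float` fields instead of fp16 for portability (so its size is not 36 bytes), and storing the block sum in the extra field is an assumption about what the two additional bytes hold. `BlockQ8` and `quantize_block` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr int QK = 32;  // values per block

struct BlockQ8 {
    float  d;       // scale: original value is approximately d * qs[i]
    float  s;       // sum of the original values (assumed use of the extra field)
    int8_t qs[QK];  // 8-bit quants
};

inline BlockQ8 quantize_block(const float *x) {
    assert(x != nullptr);
    float amax = 0.0f;
    float sum  = 0.0f;
    for (int i = 0; i < QK; ++i) {
        amax = std::fmax(amax, std::fabs(x[i]));
        sum += x[i];
    }
    BlockQ8 b{};
    b.d = amax / 127.0f;  // map the largest magnitude to +-127
    b.s = sum;
    for (int i = 0; i < QK; ++i) {
        b.qs[i] = (amax == 0.0f)
            ? static_cast<int8_t>(0)
            : static_cast<int8_t>(std::lround(x[i] / b.d));
    }
    return b;
}
```

Because the quants follow two 2-byte fields in the 36-byte layout described above, they start at a multiple of 4 bytes in every block, which is what permits reading them four at a time through an `int` pointer.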

@casper-hansen

> I have implemented a kernel for q4_1 and to my surprise I've found that the performance is ~10% better than for q4_0. The reason seems to be that, because a q4_1 block is 20 bytes in size vs. the 18 bytes of q4_0, it is possible to directly cast the pointer for the quants to int instead of having to resort to memcpy. Since I'm currently still using memcpy for q8_0 this implies that performance could be significantly improved by padding or reordering the q8_0 vector; I'll investigate.
>
> More generally this may also mean that reordering the weights in some way may be of benefit after all.

GPTQ implements a reordering approach based on quantization error: weights with the smallest error come first and weights with the largest error come last.

I'm not sure whether this is achievable in llama.cpp; in GPTQ, a side effect seemed to be performance issues.

@JohannesGaessler
Collaborator Author

I don't mean changing the order of the weights itself, I mean changing the way the data is laid out for better memory alignment.

@JohannesGaessler
Collaborator Author

I pushed implementations for q5_0, q5_1, and q8_0. I think I've picked the low-hanging fruit in terms of performance, so I'll now focus on making the new features usable. Since the integer intrinsics seem to rely on hardware implementations, I think I'll enable them based on compute capability. Ideally I can just set two compute capabilities in cmake and it will automatically use the highest one that a particular GPU supports.

@JohannesGaessler
Collaborator Author

@slaren do you think we should keep the dequantize_mul_mat_vec implementations using f16 intrinsics? They were slightly faster on recent NVIDIA cards but the integer intrinsics seem to be superior for those cases.

@slaren
Collaborator

slaren commented Jul 4, 2023

I think that can still be useful for f16 models, so I would say keep it.

@JohannesGaessler JohannesGaessler marked this pull request as ready for review July 4, 2023 16:20
@JohannesGaessler
Collaborator Author

Alright, I now consider this ready to be merged. By default the new kernels are used (if the compute capability is high enough), the old DMMV kernels can still be used by setting LLAMA_CUDA_FORCE_DMMV. These are the final performance numbers on my system:

| GPU | Model | Test | t/s master | t/s PR | Speedup |
|---|---|---|---|---|---|
| RTX 3090 | 7b q4_0 | tg128 | 90.40 | 121.52 | 1.34 |
| RTX 3090 | 13b q4_0 | tg128 | 51.32 | 69.23 | 1.35 |
| RTX 3090 | 33b q4_0 | tg128 | 22.65 | 31.91 | 1.41 |
| RTX 3090 | 7b q4_1 | tg128 | 84.30 | 115.00 | 1.36 |
| RTX 3090 | 7b q5_0 | tg128 | 60.75 | 103.35 | 1.70 |
| RTX 3090 | 7b q5_1 | tg128 | 60.69 | 99.08 | 1.63 |
| RTX 3090 | 7b q8_0 | tg128 | 72.89 | 77.55 | 1.06 |

@@ -263,7 +268,7 @@ if (LLAMA_CUBLAS)
     if (LLAMA_CUDA_DMMV_F16)
         set(CMAKE_CUDA_ARCHITECTURES "61") # needed for f16 CUDA intrinsics
     else()
-        set(CMAKE_CUDA_ARCHITECTURES "52") # lowest CUDA 12 standard
+        set(CMAKE_CUDA_ARCHITECTURES "52;61") # lowest CUDA 12 standard + lowest for integer intrinsics
Collaborator

Since the lowest for the integer intrinsics is 70 in practice, I think this could be changed too, if only for clarity.

Collaborator Author

For single GPU I would agree but for multi GPU settings that would be an issue. If you were to combine e.g. a Pascal and an Ampere card you would want to use the integer intrinsics with the 8.6 Ampere card (but not the 6.1 Pascal card). The decision which implementation to use can be done at runtime by checking the compute capability per card but only if the integer intrinsics are available at compile time.
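
The per-card decision described here could look roughly like the following sketch. It is an illustration, not the code from this PR: the enum, the function `choose_kernel`, the `force_dmmv` parameter, and the threshold constant are hypothetical, with the threshold taken from the remark in this review thread that 7.0 is the practical minimum. In a real program the compute capability would come from the CUDA runtime's device properties.

```cpp
#include <cassert>

enum class MulMatVecKernel {
    DequantizeMulMatVec,  // existing kernels that dequantize on the fly
    QuantizedDot          // new kernels based on integer intrinsics
};

// Hypothetical threshold: compute capability 7.0, encoded as 100*major + 10*minor.
constexpr int MIN_CC_QUANTIZED_DOT = 700;

// Chooses a kernel for one device. The integer-intrinsic code must have been
// compiled in for this choice to be possible at runtime.
inline MulMatVecKernel choose_kernel(int cc_major, int cc_minor, bool force_dmmv) {
    assert(cc_major >= 0 && cc_minor >= 0 && cc_minor < 10);
    const int cc = 100 * cc_major + 10 * cc_minor;
    if (force_dmmv || cc < MIN_CC_QUANTIZED_DOT) {
        return MulMatVecKernel::DequantizeMulMatVec;
    }
    return MulMatVecKernel::QuantizedDot;
}
```

In the mixed setup from the comment, a 6.1 Pascal card would get the dequantizing kernels and an 8.6 Ampere card the quantized dot product, each decided independently.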

@JohannesGaessler JohannesGaessler merged commit 924dd22 into ggerganov:master Jul 5, 2023
Owner

@ggerganov ggerganov left a comment

Nice speed-up! 🦙

My guess is that a similar approach for qmat x qmat should result in better performance than the existing mat x mat using cuBLAS.

@mirek190

mirek190 commented Jul 5, 2023

Is it possible to improve the q_K models in the same way?

@JohannesGaessler
Collaborator Author

It is very likely possible to apply the same techniques to q_K models. The reason I didn't do it is merely that the CUDA implementation for those was done very differently compared to the older quantization methods which use a template. So I would rather work out all of the details on the older quantization methods before I touch half a dozen different k-quant implementations.

@mirek190

mirek190 commented Jul 5, 2023

I am asking because q4_K_M has very similar perplexity to q5_1, but a 33B model with 63 layers in q5_1 cannot be put entirely on a consumer GPU (RTX 3090 or 4090 with 24 GB). q4_K_M, on the other hand, fits perfectly, and I get 18.5 t/s with it. Thinking that I could get something close to 30 t/s with 33B and q4_K_M... just OMG.

@JohannesGaessler
Collaborator Author

Sorry, but you'll just need to be patient.

@LostRuins
Collaborator

Ever since this was merged, I am getting rubbish outputs when using CUDA (ref #2136).

The outputs are normal if GGML_CUDA_FORCE_DMMV is set to true, or if 0 layers are offloaded. Otherwise, it ranges from a mix of garbled tokens to just a single repeated token.

YellowRoseCx added a commit to YellowRoseCx/koboldcpp-rocm that referenced this pull request Jul 10, 2023
commit 8432e9d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:55:30 2023 -0500

    Update Makefile

commit b58c189
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 9 16:20:00 2023 -0500

    Add multi-gpu CuBLAS support to new GUI

commit 0c1c71b
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 07:56:57 2023 -0500

    Update Makefile

commit f864f60
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sat Jul 8 00:25:15 2023 +0200

    CUDA: add __restrict__ to mul mat vec kernels (ggerganov#2140)

commit 4539bc2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 8 01:36:14 2023 -0500

    update makefile for changes

commit 912e31e
Merge: 74e2703 ddaa4f2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jul 7 23:15:37 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit ddaa4f2
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jul 7 22:14:14 2023 +0800

    fix cuda garbage results and gpu selection issues

commit 95eca51
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jul 7 18:39:47 2023 +0800

    add gpu choice for GUI for cuda

commit a689a66
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jul 7 17:52:34 2023 +0800

    make it work with pyinstaller

commit 9ee9a77
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jul 7 16:25:37 2023 +0800

    warn outdated GUI (+1 squashed commits)

    Squashed commits:

    [15aec3d] spelling error

commit 32102c2
Merge: 8424a35 481f793
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Fri Jul 7 14:15:39 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	README.md

commit 481f793
Author: Howard Su <howard0su@gmail.com>
Date:   Fri Jul 7 11:34:18 2023 +0800

    Fix opencl by wrap #if-else-endif with \n (ggerganov#2086)

commit dfd9fce
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Thu Jul 6 19:41:31 2023 +0300

    ggml : fix restrict usage

commit 36680f6
Author: Judd <foldl@users.noreply.github.com>
Date:   Fri Jul 7 00:23:49 2023 +0800

    convert : update for baichuan (ggerganov#2081)

    1. guess n_layers;
    2. relax warnings on context size;
    3. add a note that its derivations are also supported.

    Co-authored-by: Judd <foldl@boxvest.com>

commit a17a268
Author: tslmy <tslmy@users.noreply.github.com>
Date:   Thu Jul 6 09:17:50 2023 -0700

    alpaca.sh : update model file name (ggerganov#2074)

    The original file name, `ggml-alpaca-7b-q4.bin`, implied the first-generation GGML. After the breaking changes (mentioned in ggerganov#382), `llama.cpp` requires GGML V3 now. Those model files are named `*ggmlv3*.bin`. We should change the example to an actually working model file, so that this thing is more likely to run out-of-the-box for more people, and less people would waste time downloading the old Alpaca model.

commit 8424a35
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Thu Jul 6 23:24:21 2023 +0800

    added the ability to ban any substring tokens

commit 27a0907
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Thu Jul 6 22:33:46 2023 +0800

    backport MM256_SET_M128I to ggml_v2, updated lite, added support for selecting the GPU for cublas

commit 220aa70
Merge: 4d1700b 31cfbb1
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Thu Jul 6 15:40:40 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	.github/workflows/build.yml
    #	CMakeLists.txt
    #	Makefile
    #	README.md
    #	pocs/vdot/q8dot.cpp
    #	pocs/vdot/vdot.cpp
    #	scripts/sync-ggml.sh
    #	tests/test-grad0.c
    #	tests/test-quantize-fns.cpp
    #	tests/test-quantize-perf.cpp

commit 4d1700b
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Thu Jul 6 15:17:47 2023 +0800

    adjust some ui sizing

commit 1c80002
Author: Vali-98 <137794480+Vali-98@users.noreply.github.com>
Date:   Thu Jul 6 15:00:57 2023 +0800

    New UI using customtkinter (LostRuins#284)

    * Initial conversion to customtkinter.

    * Initial conversion to customtkinter.

    * Additions to UI, still non-functional

    * UI now functional, untested

    * UI now functional, untested

    * Added saving configs

    * Saving and loading now functional

    * Fixed sliders not loading

    * Cleaned up duplicate arrays

    * Cleaned up duplicate arrays

    * Fixed loading bugs

    * wip fixing all the broken parameters. PLEASE test before you commit

    * further cleaning

    * bugfix completed for gui. now evaluating save and load

    * cleanup prepare to merge

    ---------

    Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>

commit 31cfbb1
Author: Tobias Lütke <tobi@shopify.com>
Date:   Wed Jul 5 16:51:13 2023 -0400

    Expose generation timings from server & update completions.js (ggerganov#2116)

    * use javascript generators as much cleaner API

    Also add ways to access completion as promise and EventSource

    * export llama_timings as struct and expose them in server

    * update readme, update baked includes

    * llama : uniform variable names + struct init

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit 74e2703
Merge: cf65429 f9108ba
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jul 5 15:16:49 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit 983b555
Author: Jesse Jojo Johnson <williamsaintgeorge@gmail.com>
Date:   Wed Jul 5 18:03:19 2023 +0000

    Update Server Instructions (ggerganov#2113)

    * Update server instructions for web front end
    * Update server README
    * Remove duplicate OAI instructions
    * Fix duplicate text

    ---------

    Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>

commit ec326d3
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Jul 5 20:44:11 2023 +0300

    ggml : fix bug introduced in LostRuins#1237

commit 1b6efea
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Wed Jul 5 20:20:05 2023 +0300

    tests : fix test-grad0

commit 1b107b8
Author: Stephan Walter <stephan@walter.name>
Date:   Wed Jul 5 16:13:06 2023 +0000

    ggml : generalize `quantize_fns` for simpler FP16 handling (LostRuins#1237)

    * Generalize quantize_fns for simpler FP16 handling

    * Remove call to ggml_cuda_mul_mat_get_wsize

    * ci : disable FMA for mac os actions

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit 8567c76
Author: Jesse Jojo Johnson <williamsaintgeorge@gmail.com>
Date:   Wed Jul 5 15:13:35 2023 +0000

    Update server instructions for web front end (ggerganov#2103)

    Co-authored-by: Jesse Johnson <thatguy@jessejojojohnson.com>

commit 924dd22
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Wed Jul 5 14:19:42 2023 +0200

    Quantized dot products for CUDA mul mat vec (ggerganov#2067)

commit 051c70d
Author: Howard Su <howard0su@gmail.com>
Date:   Wed Jul 5 18:31:23 2023 +0800

    llama: Don't double count the sampling time (ggerganov#2107)

commit ea79e54
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Wed Jul 5 17:29:35 2023 +0800

    fixed refusing to quantize some models

commit 9e4475f
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Wed Jul 5 08:58:05 2023 +0200

    Fixed OpenCL offloading prints (ggerganov#2082)

commit 7f0e9a7
Author: Nigel Bosch <pnigelb@gmail.com>
Date:   Tue Jul 4 18:33:33 2023 -0500

    embd-input: Fix input embedding example unsigned int seed (ggerganov#2105)

commit b472f3f
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Tue Jul 4 22:25:22 2023 +0300

    readme : add link web chat PR

commit ed9a54e
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Tue Jul 4 21:54:11 2023 +0300

    ggml : sync latest (new ops, macros, refactoring) (ggerganov#2106)

    - add ggml_argmax()
    - add ggml_tanh()
    - add ggml_elu()
    - refactor ggml_conv_1d() and variants
    - refactor ggml_conv_2d() and variants
    - add helper macros to reduce code duplication in ggml.c

commit f257fd2
Author: jwj7140 <32943891+jwj7140@users.noreply.github.com>
Date:   Wed Jul 5 03:06:12 2023 +0900

    Add an API example using server.cpp similar to OAI. (ggerganov#2009)

    * add api_like_OAI.py
    * add evaluated token count to server
    * add /v1/ endpoints binding

commit 7ee76e4
Author: Tobias Lütke <tobi@shopify.com>
Date:   Tue Jul 4 10:05:27 2023 -0400

    Simple webchat for server (ggerganov#1998)

    * expose simple web interface on root domain

    * embed index and add --path for choosing static dir

    * allow server to multithread

    because web browsers send a lot of garbage requests we want the server
    to multithread when serving 404s for favicon's etc. To avoid blowing up
    llama we just take a mutex when it's invoked.

    * let's try this with the xxd tool instead and see if msvc is happier with that

    * enable server in Makefiles

    * add /completion.js file to make it easy to use the server from js

    * slightly nicer css

    * rework state management into session, expose historyTemplate to settings

    ---------

    Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

commit acc111c
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 4 15:38:04 2023 +0300

    Allow old Make to build server. (ggerganov#2098)

    Also make server build by default.

    Tested with Make 3.82

commit 23c7c6f
Author: ZhouYuChen <zhouyuchen@naver.com>
Date:   Tue Jul 4 20:15:16 2023 +0800

    Update Makefile: clean simple (ggerganov#2097)

commit 69add28
Merge: 00e35d0 698efad
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Tue Jul 4 18:51:42 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	.github/workflows/build.yml

commit 00e35d0
Merge: fff705d f9108ba
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Tue Jul 4 18:46:40 2023 +0800

    Merge branch 'concedo' into concedo_experimental

commit f9108ba
Author: Michael Moon <triffid.hunter@gmail.com>
Date:   Tue Jul 4 18:46:08 2023 +0800

    Make koboldcpp.py executable on Linux (LostRuins#293)

commit fff705d
Merge: 784628a c6c0afd
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Tue Jul 4 18:42:02 2023 +0800

    Merge remote-tracking branch 'ycros/improve-sampler-api-access' into concedo_experimental

commit c6c0afd
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Tue Jul 4 18:35:03 2023 +0800

    refactor to avoid code duplication

commit 784628a
Merge: ca9a116 309534d
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Tue Jul 4 16:38:32 2023 +0800

    Merge remote-tracking branch 'ycros/improve-sampler-api-access' into concedo_experimental

commit 698efad
Author: Erik Scholz <Green-Sky@users.noreply.github.com>
Date:   Tue Jul 4 01:50:12 2023 +0200

    CI: make the brew update temporarily optional. (ggerganov#2092)

    until they decide to fix the brew installation in the macos runners.
    see the open issues. eg actions/runner-images#7710

commit 14a2cc7
Author: Govlzkoy <gotope@users.noreply.github.com>
Date:   Tue Jul 4 07:50:00 2023 +0800

    [ggml] fix index for ne03 value in ggml_cl_mul_f32 (ggerganov#2088)

commit cf65429
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:56:40 2023 -0500

    print cuda or opencl based on what's used

commit 72c16d2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 16:45:39 2023 -0500

    Revert "fix my mistake that broke other arches"

    This reverts commit 777aed5.

commit 1cf14cc
Author: Henri Vasserman <henv@hot.ee>
Date:   Tue Jul 4 00:05:23 2023 +0300

    fix server crashes (ggerganov#2076)

commit 777aed5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jul 3 15:53:32 2023 -0500

    fix my mistake that broke other arches

commit cc45a7f
Author: Howard Su <howard0su@gmail.com>
Date:   Tue Jul 4 02:43:55 2023 +0800

    Fix crash of test-tokenizer-0 under Debug build (ggerganov#2064)

    * Fix crash of test-tokenizer-0 under Debug build

    * Change per comment

commit ca9a116
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Tue Jul 4 00:35:02 2023 +0800

    possibly slower, but cannot use larger batches without modifying ggml library.

commit bfeb347
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Mon Jul 3 21:36:42 2023 +0800

    fix typos

commit 55dbb91
Author: Howard Su <howard0su@gmail.com>
Date:   Mon Jul 3 19:58:58 2023 +0800

    [llama] No need to check file version when loading vocab score (ggerganov#2079)

commit d7d2e6a
Author: WangHaoranRobin <56047610+WangHaoranRobin@users.noreply.github.com>
Date:   Mon Jul 3 05:38:44 2023 +0800

    server: add option to output probabilities for completion (ggerganov#1962)

    * server: add option to output probabilities for completion
    * server: fix issue when handling probability output for incomplete tokens for multibyte character generation
    * server: fix llama_sample_top_k order
    * examples/common.h: put all bool variables in gpt_params together

commit 27780a9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:03:27 2023 -0500

    rocm fixes

commit f52c7d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 16:02:58 2023 -0500

    Revert "rocm fixes"

    This reverts commit 2fe9927.

commit 2fe9927
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:58:21 2023 -0500

    rocm fixes

commit efe7560
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:43 2023 -0500

    Revert "move HIPBLAS definitions into ggml-cuda.h"

    This reverts commit bf49a93.

commit 4fc0181
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 15:55:36 2023 -0500

    Revert "move hipblas definitions to header files"

    This reverts commit 2741ffb.

commit 89eb576
Merge: 2741ffb 3d2907d
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jul 2 14:44:13 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit 309534d
Author: Ycros <18012+ycros@users.noreply.github.com>
Date:   Sun Jul 2 18:15:34 2023 +0000

    implement sampler order, expose sampler order and mirostat in api

commit 3d2907d
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Sun Jul 2 18:28:09 2023 +0800

    make gptneox and gptj work with extended context too

commit d6b47e6
Merge: e17c849 46088f7
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Sun Jul 2 17:26:39 2023 +0800

    Merge branch 'master' into concedo_experimental

commit e17c849
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Sun Jul 2 17:25:08 2023 +0800

    switched to NTK aware scaling

commit e19483c
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Sun Jul 2 14:55:08 2023 +0800

    increase scratch for above 4096

commit 46088f7
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sun Jul 2 09:46:46 2023 +0300

    ggml : fix build with OpenBLAS (close ggerganov#2066)

commit b85ea58
Merge: ef3b8dc 0bc2cdf
Author: Concedo <39025047+LostRuins@users.noreply.github.com>
Date:   Sun Jul 2 14:45:25 2023 +0800

    Merge branch 'master' into concedo_experimental

    # Conflicts:
    #	README.md

commit 2741ffb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 17:07:42 2023 -0500

    move hipblas definitions to header files

commit bf49a93
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 16:38:50 2023 -0500

    move HIPBLAS definitions into ggml-cuda.h

commit 540f4e0
Merge: 2c3b46f eda663f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jul 1 14:58:32 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 0bc2cdf
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sat Jul 1 21:49:44 2023 +0200

    Better CUDA synchronization logic (ggerganov#2057)

commit befb3a3
Author: Johannes Gäßler <johannesg@5d6.de>
Date:   Sat Jul 1 21:47:26 2023 +0200

    Test-based VRAM scratch size + context adjustment (ggerganov#2056)

commit b213227
Author: Daniel Drake <drake@endlessos.org>
Date:   Sat Jul 1 20:31:44 2023 +0200

    cmake : don't force -mcpu=native on aarch64 (ggerganov#2063)

    It's currently not possible to cross-compile llama.cpp for aarch64
    because CMakeLists.txt forces -mcpu=native for that target.

    -mcpu=native doesn't make sense if your build host is not the
    target architecture, and clang rejects it for that reason, aborting the
    build. This can be easily reproduced using the current Android NDK to build
    for aarch64 on an x86_64 host.

    If there is not a specific CPU-tuning target for aarch64 then -mcpu
    should be omitted completely. I think that makes sense, there is not
    enough variance in the aarch64 instruction set to warrant a fixed -mcpu
    optimization at this point. And if someone is building natively and wishes
    to enable any possible optimizations for the host device, then there is
    already the LLAMA_NATIVE option available.

    Fixes LostRuins#495.

commit 2f8cd97
Author: Aaron Miller <apage43@ninjawhale.com>
Date:   Sat Jul 1 11:14:59 2023 -0700

    metal : release buffers when freeing metal context (ggerganov#2062)

commit 471aab6
Author: Judd <foldl@users.noreply.github.com>
Date:   Sun Jul 2 01:00:25 2023 +0800

    convert : add support of baichuan-7b (ggerganov#2055)

    Co-authored-by: Judd <foldl@boxvest.com>

commit 463f2f4
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sat Jul 1 19:05:09 2023 +0300

    llama : fix return value of llama_load_session_file_internal (ggerganov#2022)

commit cb44dbc
Author: Rand Xie <randxiexyy29@gmail.com>
Date:   Sun Jul 2 00:02:58 2023 +0800

    llama : catch llama_load_session_file_internal exceptions (ggerganov#2022)

    * convert checks in llama_load_session_file to throw and handle them

    * make llama_load_session_file_internal static

    * address feedbacks to avoid using exceptions

commit 79f634a
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sat Jul 1 18:46:00 2023 +0300

    embd-input : fix returning ptr to temporary

commit 04606a1
Author: Georgi Gerganov <ggerganov@gmail.com>
Date:   Sat Jul 1 18:45:44 2023 +0300

    train : fix compile warning

commit b1ca8f3
Author: Qingyou Meng <meng.qingyou@gmail.com>
Date:   Sat Jul 1 23:42:43 2023 +0800

    ggml : disable GGML_TASK_INIT and GGML_TASK_FINALIZE by default (ggerganov#1995)

    Will not be scheduled unless explicitly enabled.

commit 2c3b46f
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:43:43 2023 -0500

    changes to fix build

commit c9e1103
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 18:20:07 2023 -0500

    Update ggml_v2-cuda-legacy.cu for ROCM

commit b858fc5
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 17:49:39 2023 -0500

    changes to work with upstream

commit 69a0c25
Merge: 096f0b0 1347d3a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 29 16:59:06 2023 -0500

    Merge remote-tracking branch 'upstream/concedo'

commit 096f0b0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 15:27:02 2023 -0500

    revert unnecessary hipblas conditionals

commit d81e81a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 28 14:48:23 2023 -0500

    Update Makefile hipblas nvcc correction

commit 2579ecf
Merge: abed427 d2034ce
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 25 17:50:04 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit abed427
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 19:16:30 2023 -0500

    reorganize If statements to include proper headers

commit 06c3bf0
Merge: ea6d320 8342fe8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sat Jun 24 16:57:20 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit ea6d320
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Fri Jun 23 01:53:28 2023 -0500

    Update README.md

commit 4d56ad8
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 16:19:43 2023 -0500

    Update README.md

commit 21f9308
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 15:42:05 2023 -0500

    kquants_iter for hipblas and add gfx803

commit b6ff890
Merge: eb094f0 e6ddb15
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Thu Jun 22 12:42:09 2023 -0500

    Merge branch 'LostRuins:concedo' into main

commit eb094f0
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 23:59:18 2023 -0500

    lowvram parameter description

commit 3a5dfeb
Merge: 665cc11 b1f00fa
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 16:53:03 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit 665cc11
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Wed Jun 21 01:13:19 2023 -0500

    add lowvram parameter

commit 222cbbb
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 19:03:28 2023 -0500

    add additional hipblas conditions for cublas

commit e1f9581
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 16:51:59 2023 -0500

    Add hip def for cuda v2

commit 3bff5c0
Merge: a7e74b3 266d47a
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Tue Jun 20 13:38:06 2023 -0500

    Merge branch 'LostRuins:concedo' into koboldcpp-rocm

commit a7e74b3
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:04:18 2023 -0500

    Update README.md

commit 5e99b3c
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 22:03:42 2023 -0500

    Update Makefile

commit 9190b17
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Mon Jun 19 21:47:10 2023 -0500

    Update README.md

commit 2780ea2
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 15:48:00 2023 -0500

    Update Makefile

commit 04a3e64
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:33:39 2023 -0500

    remove extra line

commit cccbca9
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:17 2023 -0500

    attempt adding ROCM hipblas

commit a44a1d4
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:31:01 2023 -0500

    attempt adding ROCM hipblas

commit b088184
Author: YellowRoseCx <80486540+YellowRoseCx@users.noreply.github.com>
Date:   Sun Jun 18 14:30:54 2023 -0500

    attempt adding ROCM hipblas
7 participants