Good ideas from llama.cpp #15
I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

- Parallelize model loading with rayon: Faster loading of the model ggerganov/llama.cpp#85 (comment)
- Expose the ggml RMSNorm function once it's implemented on the C++ side 👀: Use RMSNorm ggerganov/llama.cpp#173 (comment)
- The llama.cpp tokenizer has some differences with sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a Rust crate for sentencepiece; we should check if a drop-in replacement is possible: Differences with the llama tokenizer ggerganov/llama.cpp#167

Comments
Suggest pinning this issue :>
For the tokenizer item, I suggest using https://github.com/huggingface/tokenizers/. It should work out of the box once converted (when this PR lands: huggingface/transformers#21955, it should become a simple …
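For illustration, a minimal sketch of what using the `tokenizers` crate could look like (the `tokenizer.json` path is hypothetical and assumes the LLaMA tokenizer has already been converted to the Hugging Face format):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: a converted Hugging Face tokenizer file exists at this path.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode without adding special tokens.
    let encoding = tokenizer.encode("Hello, llama!", false)?;
    println!("{:?}", encoding.get_ids());
    Ok(())
}
```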
RMS norm landed, but they've reported regressions. Need to keep an eye on that.
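For context, RMSNorm rescales activations by their root mean square instead of mean-centering them like LayerNorm does. A rough sketch in plain Rust (the epsilon value is an assumption; small values like 1e-5 or 1e-6 are common):

```rust
/// RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
/// Unlike LayerNorm, the input is not mean-centered first.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * scale * w)
        .collect()
}
```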
@Narsil LlamaTokenizer needs the byte fallback option. 🥹
Good news everyone! (If this goes through, I'll try to make a release soon after.)
Awesome! Looking forward to it :D
A small comment on the parallel loading: it is definitely possible to improve IO reads by parallelizing. This is much more effective on SSDs, but it still works on HDDs thanks to caching at different layers. However, this should be configurable, since performance can start to degrade past a certain degree of parallelism, depending on the storage medium and also on things like the kernel and buffer sizes.
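To make the idea concrete, here is a rough sketch of a configurable chunked parallel read using rayon (the chunk count, error handling, and the helper name `parallel_read` are all illustrative, not taken from llama-rs):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

use rayon::prelude::*;

/// Read `path` in `n_chunks` pieces in parallel (n_chunks must be >= 1).
/// Each task opens its own handle and seeks to its offset; `n_chunks`
/// is the tuning knob, since past some degree of parallelism throughput
/// degrades depending on the storage medium.
fn parallel_read(path: &str, n_chunks: u64) -> std::io::Result<Vec<u8>> {
    let len = std::fs::metadata(path)?.len();
    let chunk = (len + n_chunks - 1) / n_chunks;
    let parts: Vec<std::io::Result<Vec<u8>>> = (0..n_chunks)
        .into_par_iter()
        .map(|i| {
            let start = i * chunk;
            let size = chunk.min(len.saturating_sub(start)) as usize;
            let mut f = File::open(path)?;
            f.seek(SeekFrom::Start(start))?;
            let mut buf = vec![0u8; size];
            f.read_exact(&mut buf)?;
            Ok(buf)
        })
        .collect();
    // Reassemble the chunks in order.
    let mut out = Vec::with_capacity(len as usize);
    for part in parts {
        out.extend(part?);
    }
    Ok(out)
}
```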
@dnlmlr Do you have benchmarks to back that up? I didn't find that to be the case whenever I tried. Memory-mapping was always consistently better than reading the file (provided you need the whole file), and it doesn't require parallelism (at the user level, that is; no idea how the kernel handles it).
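For comparison, a minimal memory-mapping sketch using the memmap2 crate (the crate choice and file name are assumptions; llama-rs may do this differently):

```rust
use std::fs::File;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("model.bin")?; // path is hypothetical
    // Safety: the mapping is only valid as long as no other process
    // truncates or modifies the file underneath us.
    let mmap = unsafe { Mmap::map(&file)? };
    // The kernel pages data in on demand; no user-level parallelism needed.
    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```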
@setzer22 Are you okay with me closing this issue and splitting it into individual issues?
Yup, sounds good 👍 |