Good ideas from llama.cpp #15
I've been tracking the llama.cpp repo. I'll use this issue to list any good ideas / things we should be aware of to keep up with in Rust land:

- Parallelize model loading with rayon: Faster loading of the model ggerganov/llama.cpp#85 (comment)
- Expose the ggml RMSNorm function once it's implemented on the C++ side 👀: Use RMSNorm ggerganov/llama.cpp#173 (comment)
- The llama.cpp tokenizer has some differences with sentencepiece, which is the one that was used during the original LLaMA training. There seems to be a Rust crate for sentencepiece; we should check if a drop-in replacement is possible: Differences with the llama tokenizer ggerganov/llama.cpp#167

Comments
Suggest pinning this issue :>
For the tokenizer item, I suggest using https://github.com/huggingface/tokenizers/. It should work out of the box once converted (when this PR lands: huggingface/transformers#21955, it should become a simple …
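For illustration, a minimal sketch of what using the `tokenizers` crate could look like (the `tokenizer.json` path is hypothetical and assumes the LLaMA tokenizer has already been converted to the Hugging Face format):

```rust
use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Assumption: a converted Hugging Face tokenizer file exists at this path.
    let tokenizer = Tokenizer::from_file("tokenizer.json")?;

    // Encode without adding special tokens.
    let encoding = tokenizer.encode("Hello, llama!", false)?;
    println!("{:?}", encoding.get_ids());
    Ok(())
}
```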
RMS norm landed, but they've reported regressions. Need to keep an eye on that.
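For context, RMSNorm rescales activations by their root mean square instead of mean-centering them like LayerNorm does. A rough sketch in plain Rust (the epsilon value is an assumption; small values like 1e-5 or 1e-6 are common):

```rust
/// RMSNorm: y_i = x_i / sqrt(mean(x^2) + eps) * weight_i.
/// Unlike LayerNorm, the input is not mean-centered first.
fn rms_norm(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let scale = 1.0 / (mean_sq + eps).sqrt();
    x.iter()
        .zip(weight)
        .map(|(v, w)| v * scale * w)
        .collect()
}
```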
@Narsil LlamaTokenizer needs the byte fallback option. 🥹
Good news everyone! (If this goes through, I'll try to make a release soon after.)
Awesome! Looking forward to it :D
A small comment on the parallel loading: it is definitely possible to improve IO reads by parallelizing. This is much more effective on SSDs, but it still works on HDDs thanks to caching at different layers. However, this should be configurable, since performance can start to degrade past a certain degree of parallelism, depending on the storage medium and also on things like the kernel and buffer sizes.
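To make the idea concrete, here is a rough sketch of a configurable chunked parallel read using rayon (the chunk count, error handling, and the helper name `parallel_read` are all illustrative, not taken from llama-rs):

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom};

use rayon::prelude::*;

/// Read `path` in `n_chunks` pieces in parallel (n_chunks must be >= 1).
/// Each task opens its own handle and seeks to its offset; `n_chunks`
/// is the tuning knob, since past some degree of parallelism throughput
/// degrades depending on the storage medium.
fn parallel_read(path: &str, n_chunks: u64) -> std::io::Result<Vec<u8>> {
    let len = std::fs::metadata(path)?.len();
    let chunk = (len + n_chunks - 1) / n_chunks;
    let parts: Vec<std::io::Result<Vec<u8>>> = (0..n_chunks)
        .into_par_iter()
        .map(|i| {
            let start = i * chunk;
            let size = chunk.min(len.saturating_sub(start)) as usize;
            let mut f = File::open(path)?;
            f.seek(SeekFrom::Start(start))?;
            let mut buf = vec![0u8; size];
            f.read_exact(&mut buf)?;
            Ok(buf)
        })
        .collect();
    // Reassemble the chunks in order.
    let mut out = Vec::with_capacity(len as usize);
    for part in parts {
        out.extend(part?);
    }
    Ok(out)
}
```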
@dnlmlr Do you have benchmarks to back that up? I didn't find that to be the case whenever I tried. Memory-mapping was always consistently better than reading the file (provided you need the whole file), and it doesn't require parallelism (at the user level, that is; no idea how the kernel handles it).
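For comparison, a minimal memory-mapping sketch using the memmap2 crate (the crate choice and file name are assumptions; llama-rs may do this differently):

```rust
use std::fs::File;

use memmap2::Mmap;

fn main() -> std::io::Result<()> {
    let file = File::open("model.bin")?; // path is hypothetical
    // Safety: the mapping is only valid as long as no other process
    // truncates or modifies the file underneath us.
    let mmap = unsafe { Mmap::map(&file)? };
    // The kernel pages data in on demand; no user-level parallelism needed.
    println!("mapped {} bytes", mmap.len());
    Ok(())
}
```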
@setzer22 Are you okay with me closing this issue and splitting it into individual issues?
Yup, sounds good 👍 |