Added the fact that llama.cpp supports Mistral AI release 0.1 #3362
Conversation
Are there any details available about this model? All I could find about this release is a link to a torrent.
Inspecting the tokenizer model, there is evidence indicating a training dataset of 8T tokens. The convert script worked and I am currently evaluating the model...
F16 ppl looks good for a 7B model. Some generation:

[I believe the meaning of life is] simple, just like that “Life” episode of “The Twilight Zone”, with William Shatner. But sometimes it’s also easy to forget, and in those times a reminder from something or someone else can be very welcome. Sometimes those reminders are small things that don’t seem important at all, but later on become more meaningful than one would have thought. Other times they are big, life-altering events. But as long as you get to see the beauty in them when they happen, they will help you live your best possible life. Here is a list of reminders that I’ve had. Some may seem silly or irrelevant, but they’re all important and meaningful for me. These are my 50 things that make life worth living:
param.json:

```json
{
  "dim": 4096,
  "n_layers": 32,
  "head_dim": 128,
  "hidden_dim": 14336,
  "n_heads": 32,
  "n_kv_heads": 8,
  "norm_eps": 1e-05,
  "sliding_window": 4096,
  "vocab_size": 32000
}
```

Looks like it uses GQA (n_kv_heads = 8 vs n_heads = 32) and a 4096-token sliding window.
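(For reference, those numbers can be sanity-checked directly from the config file. A minimal Python sketch, assuming a local copy of the file shown above; the filename/path is an assumption:)

```python
import json

# Read the released config (path is an assumption; adjust to where the
# torrent's params file actually lives on your machine).
with open("params.json") as f:
    params = json.load(f)

# n_heads / n_kv_heads > 1 implies grouped-query attention (GQA).
gqa_groups = params["n_heads"] // params["n_kv_heads"]   # 32 // 8 = 4
print("query heads per KV head:", gqa_groups)
print("sliding window size:    ", params["sliding_window"])  # 4096
```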
Does sliding window attention actually work here, or does it really only work with a 4096-token context in llama.cpp? What happens if we set the context length to 8192?
I did test it before they released the model card on HF. I'll try that.
Currently convert.py is failing for me on the vocab - it doesn't like that tokens 0, 1 and 2 are being added in added_tokens.json. If anyone has converted this successfully, how did you make the fp16? Oh never mind, I just deleted added_tokens.json, duh :)
Setting the context size to 8k actually works. I got the model (a q6_K version) to perform a summary and the results are promising.
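(For anyone wanting to reproduce this, a rough sketch using the third-party llama-cpp-python bindings; the model path and prompt are placeholders, not from this thread:)

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load a quantized Mistral GGUF with an 8k context window
# (model filename is a placeholder; a q6_K quant as mentioned above).
llm = Llama(model_path="mistral-7b-v0.1.Q6_K.gguf", n_ctx=8192)

out = llm("Summarize the following article:\n<paste text here>\n\nSummary:",
          max_tokens=256)
print(out["choices"][0]["text"])
```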
@TheBloke I just converted from the pth file in the torrent. There is no added_tokens.json there.
Ah OK, fair enough. I've been using the official release from https://huggingface.co/mistralai/Mistral-7B-v0.1, which is in HF format, and they added an added_tokens.json. Anyway, my quants are up here and seem to work fine: https://huggingface.co/TheBloke/Mistral-7B-v0.1-GGUF
Actually, no, my quants don't work fine! I needed that permute fix. Re-making now.
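(For context, the permute in question reorders Q/K projection rows from the checkpoint's interleaved RoPE layout into the layout llama.cpp expects; under GQA the K permutation has to use n_kv_heads rather than n_heads. A sketch approximating what convert.py does - not the exact patch:)

```python
import numpy as np

def permute(weights: np.ndarray, n_head: int, n_head_kv: int) -> np.ndarray:
    # Reorder Q/K projection rows for llama.cpp's RoPE convention.
    # The gist of the fix: for K under GQA, permute with n_kv_heads (8),
    # not n_heads (32), since K only has n_kv_heads heads of rows.
    if n_head_kv is not None and n_head != n_head_kv:
        n_head = n_head_kv
    return (weights.reshape(n_head, 2, weights.shape[0] // n_head // 2, *weights.shape[1:])
                   .swapaxes(1, 2)
                   .reshape(weights.shape))
```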
You have to tell us how you upload this fast to HF. For me it took forever!
OK all my quants are remade and re-uploaded and are working fine now.
10Gbit internet! :) I don't always have it sadly, but when only making GGUFs for a repo I use a Lambda Labs instance with a beautiful 10Gbit network - my record speed transferring to HF is 950MB/s 🤣
Considering that sliding window attention is not implemented, this shouldn't be added yet.
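(For anyone unfamiliar: sliding window attention lets each token attend only to the previous `sliding_window` tokens rather than the whole context. A rough numpy illustration of the mask - not how llama.cpp would implement it:)

```python
import numpy as np

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    # True where attention is allowed: causal AND within the last `window` tokens.
    i = np.arange(n_tokens)[:, None]   # query positions
    j = np.arange(n_tokens)[None, :]   # key positions
    return (j <= i) & (i - j < window)

# With window=4096, token 5000 only attends to tokens 905..5000, so an 8k
# context can "run" while still differing from full dense attention.
print(sliding_window_mask(8, 4).astype(int))
```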
They just produced a press release. It's a 7B model that apparently performs like LLaMA 2 13B and is under an Apache 2 license.
They released the instruct model. I tried quantizing but all I got was gibberish... I'll try again (with the fix you mentioned). EDIT: that was it (the fix).
Yeah, Instruct is working well for me (Q5_K_M).
Does GQA work with it?
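(GQA itself is just KV-head sharing: with n_heads=32 and n_kv_heads=8, each KV head serves 4 query heads. An illustrative numpy sketch of the usual broadcast, not llama.cpp's implementation:)

```python
import numpy as np

def repeat_kv(kv: np.ndarray, n_groups: int) -> np.ndarray:
    # kv: (n_kv_heads, seq_len, head_dim)
    # Each KV head is reused by n_groups query heads.
    return np.repeat(kv, n_groups, axis=0)

k = np.zeros((8, 16, 128))        # n_kv_heads=8, seq_len=16, head_dim=128
k_full = repeat_kv(k, 32 // 8)    # broadcast to n_heads=32
print(k_full.shape)               # (32, 16, 128)
```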
Sliding window will be tracked here: #3377
…example

* 'master' of github.com:ggerganov/llama.cpp:
  * ggml-cuda : perform cublas mat mul of quantized types as f16 (ggerganov#3412)
  * llama.cpp : add documentation about rope_freq_base and scale values (ggerganov#3401)
  * train : fix KQ_pos allocation (ggerganov#3392)
  * llama : quantize up to 31% faster on Linux and Windows with mmap (ggerganov#3206)
  * readme : update hot topics + model links (ggerganov#3399)
  * readme : add link to grammars app (ggerganov#3388)
  * swift : fix build on xcode 15 (ggerganov#3387)
  * build : enable more non-default compiler warnings (ggerganov#3200)
  * ggml_tensor: update the structure comments. (ggerganov#3283)
  * ggml : release the requested thread pool resource (ggerganov#3292)
  * llama.cpp : split llama_context_params into model and context params (ggerganov#3301)
  * ci : multithreaded builds (ggerganov#3311)
  * train : finetune LORA (ggerganov#2632)
  * gguf : basic type checking in gguf_get_* (ggerganov#3346)
  * gguf : make token scores and types optional (ggerganov#3347)
  * ci : disable freeBSD builds due to lack of VMs (ggerganov#3381)
  * llama : custom attention mask + parallel decoding + no context swaps (ggerganov#3228)
  * docs : mark code as Bash (ggerganov#3375)
  * readme : add Mistral AI release 0.1 (ggerganov#3362)
  * ggml-cuda : perform cublas fp16 matrix multiplication as fp16 (ggerganov#3370)
The Mistral AI v0.1 model works out of the box once converted with the convert.py script.