Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added mistral instruct chat format as "mistral-instruct" #799

Merged
merged 4 commits into from
Jan 29, 2024

Conversation

Rafaelblsilva
Copy link
Contributor

@abetlen Thanks for you work!

Here is my small contribution.

@alex4321
Copy link

Did you checked - does it (llama.cpp nowadays tokenizer) even tokenize text correctly?

Because from what I found at https://huggingface.co/TheBloke/Mistral-7B-Instruct-v0.1-GGUF - prompt format is:

<s>[INST] {prompt} [/INST]

And inside Llama::create_chat_completion prompt converted to a string before passing to the tokenizer into Llama:_create_completion:

        prompt_tokens: List[int] = (
            self.tokenize(prompt.encode("utf-8"))
            if prompt != ""
            else [self.token_bos()]
        )

While I tried the following code:

for token in model.tokenize("<s>[INST] Test prompt [/INST]".encode(), add_bos=False):
    print(model.detokenize([token]))

And the output was:

b' <'
b's'
b'>'
b'['
b'INST'
b']'
b' Test'
b' prompt'
b' ['
b'/'
b'INST'
b']'

I guess <s>, [INST] and [/INST] should be a special tokens?
However, I did not dive into the issue much yet and when the tokenization will work it will be useful.

So I am asking just in case I missed something and this stuff should work already.

@Rafaelblsilva

@Rafaelblsilva
Copy link
Contributor Author

@alex4321

I have not checked that deep.
If i understood correctly your observation suggest there might be underlying issues even with llama-2 tokenization?

Is this model dependent, or something that would need to be implemented on llama.cpp?

I've quickly looked on open issues there and found these:

ggerganov/llama.cpp#3475
ggerganov/llama.cpp#2820

But at least on mistral model with my changes it works a bit better than just using the llama-2 prompt.

@abetlen abetlen force-pushed the main branch 2 times, most recently from 8c93cf8 to cc0fe43 Compare November 14, 2023 20:24
@fakerybakery
Copy link
Contributor

Hi can we merge this?

@zhangp365
Copy link

zhangp365 commented Nov 26, 2023

Based on this document, I believe this pull request is accurate for a single round. However, according to the document, for multiple rounds of history, each round should conclude with '< /s >'.

@fakerybakery
Copy link
Contributor

Hi @abetlen might it be possible to merge this, now that Mistral Instruct v0.2 has been released?

@handshape
Copy link

handshape commented Dec 14, 2023

I'm not certain that the models are consistently emitting </s> as a single token. I'm hand-cranking the sampling/detokenization loop, and seeing that about 80% of the time, the eos token is coming out as a single token, and the other 20% of the time it's coming out as a sequence of four single-character tokens. I'm having to catch all permutations.

@abetlen
Copy link
Owner

abetlen commented Jan 29, 2024

Hey @Rafaelblsilva sorry for missing this, for some reason I thought it overlapped with another format that was already supported. Thank you for the contribution!

Note: I renamed the chat format to mistral-instruct so it's consistent with the finetuned model name.

@abetlen abetlen changed the title Added mistral instruct chat format as "mistral" Added mistral instruct chat format as "mistral-instruct" Jan 29, 2024
@abetlen abetlen merged commit ce38dbd into abetlen:main Jan 29, 2024
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants