llama : add phixtral support #4912
Force-pushed from 8bef31b to 0580f87
Force-pushed from 0580f87 to 9998ecd
I've been looking into why the Phi models have inconsistent inference, and I believe it may be related to the BOS and EOS tokens during conversion. This applies to all of the Phi models, including Phixtral. I borrowed the values directly from Phixtral. Testing seems promising.

```diff
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index a6ffd128..2e46a8e4 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1202,7 +1202,12 @@ class Phi2Model(Model):
         self.gguf_writer.add_layer_norm_eps(get_key_opts(self.hparams, ["layer_norm_epsilon", "layer_norm_eps"]))
         self.gguf_writer.add_rope_dimension_count(int(rot_pct * n_embd) // n_head)
         self.gguf_writer.add_file_type(self.ftype)
-        self.gguf_writer.add_add_bos_token(False)
+
+        self.gguf_writer.add_eos_token_id(50295)   # explicitly adding token ids
+        self.gguf_writer.add_bos_token_id(50296)   # NOTE: values are not defined in vocab
+
+        self.gguf_writer.add_add_eos_token(False)  # this is experimental
+        self.gguf_writer.add_add_bos_token(True)   # not sure if needed?
 
 
 class PlamoModel(Model):
```

I believe I noticed an immediate improvement in Phi-1's output compared to initial results when using the proper template. I'm interested in knowing whether this is just something I'm imagining or a legitimate improvement. The issues I'm currently facing with this approach are shown in the dumps below.
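As a programmatic complement to the gguf-dump.py output below, here is a minimal sketch for spot-checking the written metadata; it assumes the gguf-py package from this repo is importable, and the parts/data access pattern may need adjusting across gguf-py versions:

```python
# Sketch: spot-check the BOS/EOS metadata written by the patched converter.
# Assumes gguf-py (from this repo) is on PYTHONPATH; the parts/data access
# below mirrors what gguf-dump.py does and may change between versions.
from gguf import GGUFReader

reader = GGUFReader("local/models/microsoft/phi-1/ggml-model-f16.gguf")

for key in ("tokenizer.ggml.bos_token_id",
            "tokenizer.ggml.eos_token_id",
            "tokenizer.ggml.add_bos_token",
            "tokenizer.ggml.add_eos_token"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: <missing>")
        continue
    # scalar fields store their value as a one-element array in parts[data[0]]
    print(f"{key} = {field.parts[field.data[0]][0]}")
```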
#### GGUF Key/Value Pairs

```
$ python gguf-py/scripts/gguf-dump.py --no-tensors local/models/microsoft/phi-1/ggml-model-f16.gguf
* Loading: local/models/microsoft/phi-1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 23 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 341
3: UINT64 | 1 | GGUF.kv_count = 20
4: STRING | 1 | general.architecture = 'phi2'
5: STRING | 1 | general.name = 'Phi2'
6: UINT32 | 1 | phi2.context_length = 2048
7: UINT32 | 1 | phi2.embedding_length = 2048
8: UINT32 | 1 | phi2.feed_forward_length = 8192
9: UINT32 | 1 | phi2.block_count = 24
10: UINT32 | 1 | phi2.attention.head_count = 32
11: UINT32 | 1 | phi2.attention.head_count_kv = 32
12: FLOAT32 | 1 | phi2.attention.layer_norm_epsilon = 9.999999747378752e-06
13: UINT32 | 1 | phi2.rope.dimension_count = 32
14: UINT32 | 1 | general.file_type = 1
15: UINT32 | 1 | tokenizer.ggml.bos_token_id = 50296 <-
16: UINT32 | 1 | tokenizer.ggml.eos_token_id = 50295 <- tokens successfully added
17: BOOL | 1 | tokenizer.ggml.add_eos_token = True
18: BOOL | 1 | tokenizer.ggml.add_bos_token = True
19: STRING | 1 | tokenizer.ggml.model = 'gpt2'
20: [STRING] | 51200 | tokenizer.ggml.tokens
21: [INT32] | 51200 | tokenizer.ggml.token_type
22: [STRING] | 50000 | tokenizer.ggml.merges
23: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 50256
```

#### Parameter input for main

```sh
./main -m local/models/microsoft/phi-1/ggml-model-f16.gguf \
--color -e -s 1337 -c 2048 -n 512 \
--interactive --interactive-first --multiline-input \
--prompt "<|im_start|>system\nI will provide helpful responses. <|im_end|>\n" \
--in-prefix "<|im_start|>user\n" \
--in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
--verbose-prompt
```

#### Verbose Prompt Output

```
main: prompt: '<|im_start|>system
I will provide helpful responses. <|im_end|>
'
main: number of tokens in prompt = 13
50296 -> ''
50296 -> ''
10057 -> 'system'
198 -> '
'
40 -> 'I'
481 -> ' will'
2148 -> ' provide'
7613 -> ' helpful'
9109 -> ' responses'
13 -> '.'
220 -> ' '
50295 -> ''
198 -> '
'
main: interactive mode on.
Input prefix: '<|im_start|>user
'
50296 -> ''
50296 -> ''
7220 -> 'user'
198 -> '
'
Input suffix: '<|im_end|>
<|im_start|>assistant
'
50295 -> ''
198 -> '
'
50296 -> ''
562 -> 'ass'
10167 -> 'istant'
198 -> '
'

```

#### Model Output

```
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to LLaMa, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
system
I will provide helpful responses.
<|im_start|>user
Create a function that returns a list of prime numbers.\
<|im_end|>
<|im_start|>assistant
Create a function that takes in two integers and returns the sum of the two integers if they are both prime numbers.\
If either integer is not prime, return 0.
from typing import List
def binary_list(num: int) -> List[int]:
"""
Returns a list of binary digits (0s and 1s) for the given integer.
Args:
num: An integer to convert to binary.
Returns:
A list of integers representing the binary digits of num. The list is in
reverse order, with the least significant bit at index 0.
"""
binary = []
while num > 0:
binary.append(num % 2)
num //= 2
binary.reverse()
return binary
from typing import List
def sort_longest_string_alphabetically(strings: List[str]) -> List[str]:
"""
Sorts a list of strings in descending order by length and then alphabetically within each length group.
Args:
strings: A list of strings to be sorted.
Returns:
A new list of strings sorted in descending order by length and then alphabetically within each length group.
"""
return sorted(strings, key=lambda x: (-len(x), x))
from typing import List
def is_prime(n: int) -> bool:
if n < 2:
return False
for i in range(2, int(n**0.5)+1):
if n % i == 0:
return False
return True
def prime_pair_count(li: List[int]) -> int:
"""
Returns the number of pairs of prime numbers in the given list whose sum is also a prime number.
Args:
- li: A list of integers
Returns:
- An integer representing the number of pairs of prime numbers in the given list whose sum is also a prime number.
"""
count = 0
for i in range(len(li)):
for j in range(i+1, len(li)):
if is_prime(li[i]) and is_<|im_start|>user
llama_print_timings: load time = 246.65 ms
llama_print_timings: sample time = 76.98 ms / 499 runs ( 0.15 ms per token, 6482.46 tokens per second)
llama_print_timings: prompt eval time = 258.81 ms / 35 tokens ( 7.39 ms per token, 135.24 tokens per second)
llama_print_timings: eval time = 32841.17 ms / 499 runs ( 65.81 ms per token, 15.19 tokens per second)
llama_print_timings: total time = 349628.17 ms / 534 tokens
```

I know you're very busy with this project. I was studying the code attempting to understand it, and was wondering if I could model my implementation off the existing code. I'm very new to this, so I understand if you don't have the time or desire to answer this rudimentary inquiry. Either way, I wanted to attempt to implement it, but I could use a hint (or ideally some guidance).

Edit: Fixed parameters, inputs, and output inconsistencies. Added improved Phi-1 model output.
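One way to sanity-check the ids in the verbose prompt dump above is to compare them against the upstream tokenizer. A minimal sketch, assuming the Phixtral repo's tokenizer registers <|im_start|> and <|im_end|> as added tokens (ids 50296/50295 per the dump):

```python
# Sketch: compare main's verbose prompt ids against the HF tokenizer.
# Assumes mlabonne/phixtral-4x2_8 ships a tokenizer with <|im_start|> and
# <|im_end|> as added tokens; adjust the repo id for other Phi variants.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlabonne/phixtral-4x2_8")
print(tok.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"]))

prompt = "<|im_start|>system\nI will provide helpful responses. <|im_end|>\n"
print(tok(prompt)["input_ids"])  # should line up with the dump above
```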
Conversion and compute graph are ready. Tested with https://huggingface.co/mlabonne/phixtral-4x2_8
We are missing `ggml_add_id` for the expert bias (llama.cpp/llama.cpp, lines 5769 to 5782 at 8bef31b). It needs to be implemented for all backends.
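For anyone picking this up, the intended semantics are easy to prototype outside ggml first. Below is a minimal numpy sketch of what a `ggml_add_id`-style op would compute; the shapes and names are my assumptions for illustration, not the final ggml signature:

```python
# Reference-semantics sketch (not ggml code): an "add_id"-style op adds a
# per-expert bias row, selected by routing ids, to each token/expert slot.
import numpy as np

def add_id_ref(src: np.ndarray, bias: np.ndarray, ids: np.ndarray) -> np.ndarray:
    """src:  [n_tokens, n_expert_used, n_embd] expert activations
       bias: [n_expert, n_embd] per-expert bias rows
       ids:  [n_tokens, n_expert_used] expert indices chosen by the router"""
    # fancy indexing selects bias[ids[t, e]] for every (token, slot) pair
    return src + bias[ids]

# tiny smoke test with assumed shapes
n_tokens, n_expert, n_expert_used, n_embd = 3, 4, 2, 8
rng = np.random.default_rng(0)
src  = rng.standard_normal((n_tokens, n_expert_used, n_embd)).astype(np.float32)
bias = rng.standard_normal((n_expert, n_embd)).astype(np.float32)
ids  = rng.integers(0, n_expert, size=(n_tokens, n_expert_used))

out = add_id_ref(src, bias, ids)
assert out.shape == src.shape
assert np.allclose(out[1, 0], src[1, 0] + bias[ids[1, 0]])
```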