llama : add phixtral support #4912

Draft: wants to merge 1 commit into master
Conversation

ggerganov (Owner)

Conversion and compute graph are ready. Tested with https://huggingface.co/mlabonne/phixtral-4x2_8

python3 convert-hf-to-gguf.py ~/Data/huggingface/phixtral-4x2_8/ --outfile phixtral.gguf --outtype f16

make -j && ./main -m phixtral.gguf -p "test"

We are missing ggml_add_id for the expert bias:

llama.cpp/llama.cpp, lines 5769 to 5782 in 8bef31b:

ggml_tensor * cur_up = ggml_mul_mat_id(ctx0, model.layers[il].ffn_up_exp, n_expert, selected_experts, i, cur);
#pragma message "TODO: implement ggml_add_id"
//cur_up = ggml_add_id(ctx0, cur_up, model.layers[il].ffn_up_exp_b, n_expert, selected_experts, i);
cb(cur_up, "ffn_moe_up", il);
cur_up = ggml_silu(ctx0, cur_up);
cb(cur_up, "ffn_moe_silu", il);
cur_expert = ggml_mul_mat_id(ctx0, model.layers[il].ffn_down_exp, n_expert, selected_experts, i, cur_up); // [n_tokens, n_embd]
#pragma message "TODO: implement ggml_add_id"
//cur_expert = ggml_add_id(ctx0, cur_expert, model.layers[il].ffn_down_exp_b, n_expert, selected_experts, i);
cb(cur_expert, "ffn_moe_down", il);

It needs to be implemented for all backends.
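For reference, here is a naive sketch of what the op would have to compute (placeholder names, an assumed row-major layout for the ids and bias tensors, not an actual backend implementation):

// hypothetical reference for the semantics of ggml_add_id:
// for each token, add the bias row of the expert selected for slot i
//   src, dst         : [n_out, n_tokens]
//   bias_experts     : [n_out, n_expert]         (e.g. ffn_up_exp_b)
//   selected_experts : [n_expert_used, n_tokens]  int32 ids (same tensor passed to ggml_mul_mat_id)
static void add_id_reference(
        float         * dst,
        const float   * src,
        const float   * bias_experts,
        const int32_t * selected_experts,
        int n_out, int n_tokens, int n_expert_used, int i) {
    for (int t = 0; t < n_tokens; ++t) {
        const int32_t id = selected_experts[t*n_expert_used + i]; // expert chosen for slot i of token t
        const float * b  = bias_experts + (size_t) id*n_out;      // that expert's bias row
        for (int k = 0; k < n_out; ++k) {
            dst[t*n_out + k] = src[t*n_out + k] + b[k];
        }
    }
}

The real op would presumably mirror the ggml_mul_mat_id signature, taking the ids tensor and the slot index i as arguments.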

teleprint-me (Contributor) commented Feb 3, 2024

I've been looking into why the Phi models produce inconsistent inference results, and I believe it may be related to how the BOS and EOS tokens are handled during conversion. This applies to all of the Phi models, including Phixtral. I borrowed the token id values directly from Phixtral, and testing seems promising.

diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index a6ffd128..2e46a8e4 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1202,7 +1202,12 @@ class Phi2Model(Model):
         self.gguf_writer.add_layer_norm_eps(get_key_opts(self.hparams, ["layer_norm_epsilon", "layer_norm_eps"]))
         self.gguf_writer.add_rope_dimension_count(int(rot_pct * n_embd) // n_head)
         self.gguf_writer.add_file_type(self.ftype)
-        self.gguf_writer.add_add_bos_token(False)
+
+        self.gguf_writer.add_eos_token_id(50295)  # explicitly add the token ids
+        self.gguf_writer.add_bos_token_id(50296)  # NOTE: values are not defined in the vocab
+
+        self.gguf_writer.add_add_eos_token(False)  # experimental
+        self.gguf_writer.add_add_bos_token(True)   # not sure if needed
 
 
 class PlamoModel(Model):

I noticed an immediate improvement in Phi-1's output compared to the initial results when using the proper template. I'd like to know whether this is something I'm imagining or a legitimate improvement.

Issues I'm currently facing with this approach are:

  1. The BOS and EOS tokens vanish from the system prompt for Phi-1. Phi-2 behaves as expected here, which is inconsistent with Phi-1.
  2. Phi-1, Phi-1_5, and Phi-2 seem to handle the vocabulary slightly differently during conversion. I've been attempting to debug this; fixing Phi-1 and Phi-2 seems to break Phi-1_5. I haven't tested Phixtral yet, but I'm assuming the issue will propagate.
GGUF Key/Value Pairs
$ python gguf-py/scripts/gguf-dump.py --no-tensors local/models/microsoft/phi-1/ggml-model-f16.gguf
* Loading: local/models/microsoft/phi-1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.

* Dumping 23 key/value pair(s)
      1: UINT32     |        1 | GGUF.version = 3
      2: UINT64     |        1 | GGUF.tensor_count = 341
      3: UINT64     |        1 | GGUF.kv_count = 20
      4: STRING     |        1 | general.architecture = 'phi2'
      5: STRING     |        1 | general.name = 'Phi2'
      6: UINT32     |        1 | phi2.context_length = 2048
      7: UINT32     |        1 | phi2.embedding_length = 2048
      8: UINT32     |        1 | phi2.feed_forward_length = 8192
      9: UINT32     |        1 | phi2.block_count = 24
     10: UINT32     |        1 | phi2.attention.head_count = 32
     11: UINT32     |        1 | phi2.attention.head_count_kv = 32
     12: FLOAT32    |        1 | phi2.attention.layer_norm_epsilon = 9.999999747378752e-06
     13: UINT32     |        1 | phi2.rope.dimension_count = 32
     14: UINT32     |        1 | general.file_type = 1
     15: UINT32     |        1 | tokenizer.ggml.bos_token_id = 50296  <- tokens successfully added
     16: UINT32     |        1 | tokenizer.ggml.eos_token_id = 50295  <- tokens successfully added
     17: BOOL       |        1 | tokenizer.ggml.add_eos_token = True
     18: BOOL       |        1 | tokenizer.ggml.add_bos_token = True
     19: STRING     |        1 | tokenizer.ggml.model = 'gpt2'
     20: [STRING]   |    51200 | tokenizer.ggml.tokens
     21: [INT32]    |    51200 | tokenizer.ggml.token_type
     22: [STRING]   |    50000 | tokenizer.ggml.merges
     23: UINT32     |        1 | tokenizer.ggml.unknown_token_id = 50256
Parameter input for main
./main -m local/models/microsoft/phi-1/ggml-model-f16.gguf \
  --color -e -s 1337 -c 2048 -n 512 \
  --interactive --interactive-first --multiline-input \
  --prompt "<|im_start|>system\nI will provide helpful responses. <|im_end|>\n" \
  --in-prefix "<|im_start|>user\n" \
  --in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
  --verbose-prompt
Verbose Prompt Output
main: prompt: '<|im_start|>system
I will provide helpful responses. <|im_end|>
'
main: number of tokens in prompt = 13
 50296 -> ''
 50296 -> ''
 10057 -> 'system'
   198 -> '
'
    40 -> 'I'
   481 -> ' will'
  2148 -> ' provide'
  7613 -> ' helpful'
  9109 -> ' responses'
    13 -> '.'
   220 -> ' '
 50295 -> ''
   198 -> '
'

main: interactive mode on.
Input prefix: '<|im_start|>user
'
 50296 -> ''
 50296 -> ''
  7220 -> 'user'
   198 -> '
'
Input suffix: '<|im_end|>
<|im_start|>assistant
'
 50295 -> ''
   198 -> '
'
 50296 -> ''
   562 -> 'ass'
 10167 -> 'istant'
   198 -> '
'
Model Output
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp 
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0


== Running in interactive mode. ==
 - Press Ctrl+C to interject at any time.
 - To return control to LLaMa, end your input with '\'.
 - To return control without starting a new line, end your input with '/'.

system
I will provide helpful responses. 
<|im_start|>user
Create a function that returns a list of prime numbers.\                    
<|im_end|>
<|im_start|>assistant
Create a function that takes in two integers and returns the sum of the two integers if they are both prime numbers.\
If either integer is not prime, return 0.



from typing import List

def binary_list(num: int) -> List[int]:
    """
    Returns a list of binary digits (0s and 1s) for the given integer.

    Args:
        num: An integer to convert to binary.

    Returns:
        A list of integers representing the binary digits of num. The list is in
        reverse order, with the least significant bit at index 0.
    """
    binary = []
    while num > 0:
        binary.append(num % 2)
        num //= 2
    binary.reverse()
    return binary



from typing import List

def sort_longest_string_alphabetically(strings: List[str]) -> List[str]:
    """
    Sorts a list of strings in descending order by length and then alphabetically within each length group.

    Args:
        strings: A list of strings to be sorted.

    Returns:
        A new list of strings sorted in descending order by length and then alphabetically within each length group.
    """
    return sorted(strings, key=lambda x: (-len(x), x))



from typing import List

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for i in range(2, int(n**0.5)+1):
        if n % i == 0:
            return False
    return True

def prime_pair_count(li: List[int]) -> int:
    """
    Returns the number of pairs of prime numbers in the given list whose sum is also a prime number.

    Args:
    - li: A list of integers

    Returns:
    - An integer representing the number of pairs of prime numbers in the given list whose sum is also a prime number.
    """
    count = 0
    for i in range(len(li)):
        for j in range(i+1, len(li)):
            if is_prime(li[i]) and is_<|im_start|>user


llama_print_timings:        load time =     246.65 ms
llama_print_timings:      sample time =      76.98 ms /   499 runs   (    0.15 ms per token,  6482.46 tokens per second)
llama_print_timings: prompt eval time =     258.81 ms /    35 tokens (    7.39 ms per token,   135.24 tokens per second)
llama_print_timings:        eval time =   32841.17 ms /   499 runs   (   65.81 ms per token,    15.19 tokens per second)
llama_print_timings:       total time =  349628.17 ms /   534 tokens

I know you're very busy with this project, but I was wondering what ggml_add_id should entail.

I was studying the code to try to understand it, and was wondering if I could model it off ggml_mul_mat_id.

I'm very new to this, so I understand if you don't have the time or desire to answer this rudimentary inquiry. Either way, I wanted to attempt to implement it, but I could use a hint (or ideally some guidance).

Edit: Fixed parameters, inputs, and output inconsistencies. Added improved Phi-1 model output.
Edit: Add GGUF Key/Value pairs. Clean up comment formatting.

ggerganov added the demo label (Demonstrate some concept or idea, not intended to be merged) on Apr 29, 2024