llama : add phixtral support #4912
Force-pushed from 8bef31b to 0580f87
Force-pushed from 0580f87 to 9998ecd
I've been looking into why the Phi models have inconsistent inference, and I believe it may be related to the BOS and EOS tokens during conversion. This applies to all of the Phi models, including Phixtral. I borrowed the values directly from Phixtral. Testing seems promising.

```diff
diff --git a/convert-hf-to-gguf.py b/convert-hf-to-gguf.py
index a6ffd128..2e46a8e4 100755
--- a/convert-hf-to-gguf.py
+++ b/convert-hf-to-gguf.py
@@ -1202,7 +1202,12 @@ class Phi2Model(Model):
         self.gguf_writer.add_layer_norm_eps(get_key_opts(self.hparams, ["layer_norm_epsilon", "layer_norm_eps"]))
         self.gguf_writer.add_rope_dimension_count(int(rot_pct * n_embd) // n_head)
         self.gguf_writer.add_file_type(self.ftype)
-        self.gguf_writer.add_add_bos_token(False)
+
+        self.gguf_writer.add_eos_token_id(50295)   # explicitly adding token ids
+        self.gguf_writer.add_bos_token_id(50296)   # NOTE: values are not defined in vocab
+
+        self.gguf_writer.add_add_eos_token(False)  # this is experimental
+        self.gguf_writer.add_add_bos_token(True)   # not sure if needed?
 
 
 class PlamoModel(Model):
```

I believe I noticed an immediate improvement in Phi-1's output compared to initial results when using the proper template. I'm interested in knowing whether this is just something I'm imagining or a legitimate improvement. The issues I'm currently facing with this approach are shown in the dumps below.
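As a programmatic complement to the gguf-dump.py output below, here is a minimal sketch for spot-checking the written metadata; it assumes the gguf-py package from this repo is importable, and the parts/data access pattern may need adjusting across gguf-py versions:

```python
# Sketch: spot-check the BOS/EOS metadata written by the patched converter.
# Assumes gguf-py (from this repo) is on PYTHONPATH; the parts/data access
# below mirrors what gguf-dump.py does and may change between versions.
from gguf import GGUFReader

reader = GGUFReader("local/models/microsoft/phi-1/ggml-model-f16.gguf")

for key in ("tokenizer.ggml.bos_token_id",
            "tokenizer.ggml.eos_token_id",
            "tokenizer.ggml.add_bos_token",
            "tokenizer.ggml.add_eos_token"):
    field = reader.fields.get(key)
    if field is None:
        print(f"{key}: <missing>")
        continue
    # scalar fields store their value as a one-element array in parts[data[0]]
    print(f"{key} = {field.parts[field.data[0]][0]}")
```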
#### GGUF Key/Value Pairs

```
$ python gguf-py/scripts/gguf-dump.py --no-tensors local/models/microsoft/phi-1/ggml-model-f16.gguf
* Loading: local/models/microsoft/phi-1/ggml-model-f16.gguf
* File is LITTLE endian, script is running on a LITTLE endian host.
* Dumping 23 key/value pair(s)
1: UINT32 | 1 | GGUF.version = 3
2: UINT64 | 1 | GGUF.tensor_count = 341
3: UINT64 | 1 | GGUF.kv_count = 20
4: STRING | 1 | general.architecture = 'phi2'
5: STRING | 1 | general.name = 'Phi2'
6: UINT32 | 1 | phi2.context_length = 2048
7: UINT32 | 1 | phi2.embedding_length = 2048
8: UINT32 | 1 | phi2.feed_forward_length = 8192
9: UINT32 | 1 | phi2.block_count = 24
10: UINT32 | 1 | phi2.attention.head_count = 32
11: UINT32 | 1 | phi2.attention.head_count_kv = 32
12: FLOAT32 | 1 | phi2.attention.layer_norm_epsilon = 9.999999747378752e-06
13: UINT32 | 1 | phi2.rope.dimension_count = 32
14: UINT32 | 1 | general.file_type = 1
15: UINT32 | 1 | tokenizer.ggml.bos_token_id = 50296 <-
16: UINT32 | 1 | tokenizer.ggml.eos_token_id = 50295 <- tokens successfully added
17: BOOL | 1 | tokenizer.ggml.add_eos_token = True
18: BOOL | 1 | tokenizer.ggml.add_bos_token = True
19: STRING | 1 | tokenizer.ggml.model = 'gpt2'
20: [STRING] | 51200 | tokenizer.ggml.tokens
21: [INT32] | 51200 | tokenizer.ggml.token_type
22: [STRING] | 50000 | tokenizer.ggml.merges
23: UINT32 | 1 | tokenizer.ggml.unknown_token_id = 50256
```

#### Parameter input for main

```sh
./main -m local/models/microsoft/phi-1/ggml-model-f16.gguf \
--color -e -s 1337 -c 2048 -n 512 \
--interactive --interactive-first --multiline-input \
--prompt "<|im_start|>system\nI will provide helpful responses. <|im_end|>\n" \
--in-prefix "<|im_start|>user\n" \
--in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
--verbose-prompt
```

#### Verbose Prompt Output

```
main: prompt: '<|im_start|>system
I will provide helpful responses. <|im_end|>
'
main: number of tokens in prompt = 13
50296 -> ''
50296 -> ''
10057 -> 'system'
198 -> '
'
40 -> 'I'
481 -> ' will'
2148 -> ' provide'
7613 -> ' helpful'
9109 -> ' responses'
13 -> '.'
220 -> ' '
50295 -> ''
198 -> '
'
main: interactive mode on.
Input prefix: '<|im_start|>user
'
50296 -> ''
50296 -> ''
7220 -> 'user'
198 -> '
'
Input suffix: '<|im_end|>
<|im_start|>assistant
'
50295 -> ''
198 -> '
'
50296 -> ''
562 -> 'ass'
10167 -> 'istant'
198 -> '
'

```

#### Model Output

```
sampling:
repeat_last_n = 64, repeat_penalty = 1.100, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order:
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temp
generate: n_ctx = 2048, n_batch = 512, n_predict = 512, n_keep = 0
== Running in interactive mode. ==
- Press Ctrl+C to interject at any time.
- To return control to LLaMa, end your input with '\'.
- To return control without starting a new line, end your input with '/'.
system
I will provide helpful responses.
<|im_start|>user
Create a function that returns a list of prime numbers.\
<|im_end|>
<|im_start|>assistant
Create a function that takes in two integers and returns the sum of the two integers if they are both prime numbers.\
If either integer is not prime, return 0.
from typing import List
def binary_list(num: int) -> List[int]:
"""
Returns a list of binary digits (0s and 1s) for the given integer.
Args:
num: An integer to convert to binary.
Returns:
A list of integers representing the binary digits of num. The list is in
reverse order, with the least significant bit at index 0.
"""
binary = []
while num > 0:
binary.append(num % 2)
num //= 2
binary.reverse()
return binary
from typing import List
def sort_longest_string_alphabetically(strings: List[str]) -> List[str]:
"""
Sorts a list of strings in descending order by length and then alphabetically within each length group.
Args:
strings: A list of strings to be sorted.
Returns:
A new list of strings sorted in descending order by length and then alphabetically within each length group.
"""
return sorted(strings, key=lambda x: (-len(x), x))
from typing import List
def is_prime(n: int) -> bool:
if n < 2:
return False
for i in range(2, int(n**0.5)+1):
if n % i == 0:
return False
return True
def prime_pair_count(li: List[int]) -> int:
"""
Returns the number of pairs of prime numbers in the given list whose sum is also a prime number.
Args:
- li: A list of integers
Returns:
- An integer representing the number of pairs of prime numbers in the given list whose sum is also a prime number.
"""
count = 0
for i in range(len(li)):
for j in range(i+1, len(li)):
if is_prime(li[i]) and is_<|im_start|>user
llama_print_timings: load time = 246.65 ms
llama_print_timings: sample time = 76.98 ms / 499 runs ( 0.15 ms per token, 6482.46 tokens per second)
llama_print_timings: prompt eval time = 258.81 ms / 35 tokens ( 7.39 ms per token, 135.24 tokens per second)
llama_print_timings: eval time = 32841.17 ms / 499 runs ( 65.81 ms per token, 15.19 tokens per second)
llama_print_timings: total time = 349628.17 ms / 534 tokens
```

I know you're very busy with this project. I was studying the code attempting to understand it, and was wondering if I could model my implementation off the existing code. I'm very new to this, so I understand if you don't have the time or desire to answer this rudimentary inquiry. Either way, I wanted to attempt to implement it, but I could use a hint (or ideally some guidance).

Edit: Fixed parameters, inputs, and output inconsistencies. Added improved Phi-1 model output.
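One way to sanity-check the ids in the verbose prompt dump above is to compare them against the upstream tokenizer. A minimal sketch, assuming the Phixtral repo's tokenizer registers <|im_start|> and <|im_end|> as added tokens (ids 50296/50295 per the dump):

```python
# Sketch: compare main's verbose prompt ids against the HF tokenizer.
# Assumes mlabonne/phixtral-4x2_8 ships a tokenizer with <|im_start|> and
# <|im_end|> as added tokens; adjust the repo id for other Phi variants.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("mlabonne/phixtral-4x2_8")
print(tok.convert_tokens_to_ids(["<|im_start|>", "<|im_end|>"]))

prompt = "<|im_start|>system\nI will provide helpful responses. <|im_end|>\n"
print(tok(prompt)["input_ids"])  # should line up with the dump above
```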
Conversion and compute graph are ready. Tested with https://huggingface.co/mlabonne/phixtral-4x2_8
We are missing `ggml_add_id` for the expert bias (llama.cpp/llama.cpp, lines 5769 to 5782 at 8bef31b). It needs to be implemented for all backends.
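For anyone picking this up, the intended semantics are easy to prototype outside ggml first. Below is a minimal numpy sketch of what a `ggml_add_id`-style op would compute; the shapes and names are my assumptions for illustration, not the final ggml signature:

```python
# Reference-semantics sketch (not ggml code): an "add_id"-style op adds a
# per-expert bias row, selected by routing ids, to each token/expert slot.
import numpy as np

def add_id_ref(src: np.ndarray, bias: np.ndarray, ids: np.ndarray) -> np.ndarray:
    """src:  [n_tokens, n_expert_used, n_embd] expert activations
       bias: [n_expert, n_embd] per-expert bias rows
       ids:  [n_tokens, n_expert_used] expert indices chosen by the router"""
    # fancy indexing selects bias[ids[t, e]] for every (token, slot) pair
    return src + bias[ids]

# tiny smoke test with assumed shapes
n_tokens, n_expert, n_expert_used, n_embd = 3, 4, 2, 8
rng = np.random.default_rng(0)
src  = rng.standard_normal((n_tokens, n_expert_used, n_embd)).astype(np.float32)
bias = rng.standard_normal((n_expert, n_embd)).astype(np.float32)
ids  = rng.integers(0, n_expert, size=(n_tokens, n_expert_used))

out = add_id_ref(src, bias, ids)
assert out.shape == src.shape
assert np.allclose(out[1, 0], src[1, 0] + bias[ids[1, 0]])
```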