
model: support arch DbrxForCausalLM #6515

Merged (81 commits) on Apr 13, 2024
Conversation

@phymbert (Collaborator) commented Apr 6, 2024

Support of arch DbrxForCausalLM

DBRX, provided by Databricks, is a mixture-of-experts model in which each FFN layer is divided into 16 experts, of which only 4 are activated for any given token.

Notable differences from Mixtral, as presented by @abhi-mosaic, are:

The graph from modeling_dbrx.py is:
input>layers[Norm>Attention(qkv,clamp,rope)>Norm>MOE_ffn]>Norm>Output
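
To make the 16-choose-4 routing above concrete, here is a minimal, illustrative C++ sketch (plain standard C++, not the actual ggml/llama.cpp graph code; expert_ffn is a hypothetical stand-in for the per-expert gate/up/down projections, and details such as top-k weight re-normalization are omitted):

#include <algorithm>
#include <cmath>
#include <vector>

// Hypothetical stand-in for one expert's FFN (really gate/up/down projections + activation).
static std::vector<float> expert_ffn(int /*expert_id*/, const std::vector<float> & x) {
    return x; // placeholder
}

// Route one token's hidden state through 4 of 16 experts (illustrative only).
static std::vector<float> moe_ffn(const std::vector<float> & x,
                                  const std::vector<float> & router_logits,
                                  int n_expert = 16, int n_expert_used = 4) {
    // softmax over the router scores
    std::vector<float> probs(n_expert);
    const float max_logit = *std::max_element(router_logits.begin(), router_logits.end());
    float sum = 0.0f;
    for (int e = 0; e < n_expert; ++e) { probs[e] = std::exp(router_logits[e] - max_logit); sum += probs[e]; }
    for (float & p : probs) p /= sum;

    // select the top-k experts by routing probability
    std::vector<int> order(n_expert);
    for (int e = 0; e < n_expert; ++e) order[e] = e;
    std::partial_sort(order.begin(), order.begin() + n_expert_used, order.end(),
                      [&](int a, int b) { return probs[a] > probs[b]; });

    // weighted sum of the selected experts' outputs (top-k weight re-normalization omitted)
    std::vector<float> out(x.size(), 0.0f);
    for (int k = 0; k < n_expert_used; ++k) {
        const int e = order[k];
        const std::vector<float> y = expert_ffn(e, x);
        for (size_t i = 0; i < out.size(); ++i) out[i] += probs[e] * y[i];
    }
    return out;
}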

Thanks to @slaren, as this was pretty straightforward after the Grok-1 merged-experts example.
Special thanks to @megha95 for the review and fixes.

Closes #6344.

Changes

  • Introduce dbrx architecture in convert-hf-to-gguf.py, gguf-py and llama.cpp
  • dbrx graph implementation in llama.cpp
  • eval-callback also prints the last n elements of each dimension (see the sketch below)
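
As a side note, a minimal sketch of that kind of printing over a plain row-major buffer (illustrative only; the real eval-callback operates on ggml tensors):

#include <algorithm>
#include <cstdio>
#include <vector>

// Print the first and last n values of each row of a row-major [nrows x ncols] buffer.
void print_rows(const std::vector<float> & data, int nrows, int ncols, int n = 3) {
    for (int r = 0; r < nrows; ++r) {
        const float * row = data.data() + (size_t) r * ncols;
        std::printf("row %d: [", r);
        for (int i = 0; i < n && i < ncols; ++i) std::printf(" %8.4f", row[i]);
        std::printf(" ...");
        // start after the head section so values are not printed twice on short rows
        for (int i = std::max(n, ncols - n); i < ncols; ++i) std::printf(" %8.4f", row[i]);
        std::printf(" ]\n");
    }
}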

Tests (WIP, help welcomed)

0. Setup llama.cpp
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
git checkout hp/model/support-dbrx

python3 -m venv venv
source venv/bin/activate

pip install -r requirements.txt

mkdir build
cd build
cmake .. \
  -DLLAMA_NATIVE=OFF \
  -DLLAMA_BUILD_SERVER=ON \
  -DLLAMA_CURL=ON \
  -DLLAMA_CUDA=ON \
  -DCUDAToolkit_ROOT=/usr/local/cuda \
  -DCMAKE_CUDA_COMPILER=/usr/local/cuda/bin/nvcc \
  -DCMAKE_CUDA_ARCHITECTURES=native \
  -DCMAKE_BUILD_TYPE=Release;
cmake --build . --config Release -j
  1. Clone the original repo
# NOTE: DBRX requires accepting the Databricks Open Model License, so an HF read token is needed
nohup python -c 'from huggingface_hub import snapshot_download; snapshot_download(repo_id="databricks/dbrx-instruct", token="XXX", local_dir="models/dbrx-16x12b-instruct")' > dbrx_download.log &
tail -f dbrx_download.log
  2. Convert to GGUF
export PYTHONUNBUFFERED=1
nohup python ./convert-hf-to-gguf.py models/dbrx-16x12b-instruct --outfile models/dbrx-16x12b-instruct-f16.gguf --outtype f16 > dbrx_convert.log &
tail -f dbrx_convert.log

2.b Debug the graph

./build/bin/eval-callback \
                    --model     models/dbrx-16x12b-instruct-f16.gguf \
                    --prompt "hello world!" \
                    --seed 42 \
                    --chatml
  3. Quantize (optional if you are GPU rich)
nohup ./build/bin/quantize models/dbrx-16x12b-instruct-f16.gguf models/dbrx-16x12b-instruct-q4_0.gguf q4_0 > dbrx_gguf_q4_0.log &
tail -f dbrx_gguf_q4_0.log
  4. Prompt it
./build/bin/main \
   --model models/dbrx-16x12b-instruct-q4_0.gguf \
   -ngl 41 \
   --prompt "I believe the meaning of life is" \
   --seed 42

I believe the meaning of life is to give life meaning.  I think it's important to make an impact and do good in the world.  I feel like I've done good in my life and I think I've made a positive impact on many people.  I feel like I've made the world a better place just by being in it.  I'm proud of that and I think that's what gives my life meaning.

I also think that it's important to experience things in life and to learn and grow.  I've done a lot of traveling and experienced a lot of different cultures and I've learned a lot from that.  I've also learned a lot from the people I've met and the experiences I've had.  I think that's what gives my life depth and richness. [end of text]
  5. PPL q4_0
./scripts/get-wikitext-2.sh
./build/bin/perplexity \
   --model models/dbrx-16x12b-instruct-q4_0.gguf \
   -ngl 41 \
   -f wikitext-2-raw/wiki.test.raw \
   -b 512

# Results TODO (help welcomed)
  6. Hellaswag 400
./scripts/get-hellaswag.sh
./build/bin/perplexity \
   --model models/dbrx-16x12b-instruct-q4_0.gguf \
   -ngl 41 \
   -f hellaswag_val_full.txt \
   --hellaswag \
   --hellaswag-tasks 400

# Results TODO (help welcomed)
  7. Importance matrix
./scripts/get-wikitext-2.sh
./build/bin/imatrix \
  --model models/dbrx-16x12b-instruct-f16.gguf \
  -f wikitext-2-raw/wiki.train.raw \
  -o dbrx-16x12b-instruct-f16.imatrix \
  --seed 42 \
  --chatml

# Results TODO (help welcomed)

For convenience, the imatrix will be uploaded to https://huggingface.co/phymbert/dbrx-16x12b-instruct-f16/tree/main

8. Split and upload to HF (because we love it)
mkdir models/dbrx-16x12b-instruct-q4_0
nohup ./build/bin/gguf-split --split --split-max-tensors 33 models/dbrx-16x12b-instruct-q4_0.gguf models/dbrx-16x12b-instruct-q4_0/dbrx-16x12b-instruct-q4_0 > dbrx_q4_0_gguf_split.log &
tail -f dbrx_q4_0_gguf_split.log
export HF_HUB_ENABLE_HF_TRANSFER=1
huggingface-cli upload phymbert/dbrx-16x12b-instruct-gguf models/dbrx-16x12b-instruct

Examples

./build/bin/main \
    --model   models/dbrx-16x12b-instruct-f16.gguf \
    --seed 42 \
   --prompt "I believe the meaning of life is"

I believe the meaning of life is to learn and grow as a person. To become a better person in every possible way. To learn from your mistakes. To learn from other people and their experiences. To learn from different cultures and ways of life. To learn from different religions and philosophies. To learn from the world around you and the universe you live in. To learn from every single thing you encounter and experience in your life. To learn from every single person you meet and every single person who crosses your path. To learn from every single thing you see, hear, touch, taste, and smell. To learn from every single emotion you feel. To learn from every single thought you have. To learn from every single dream you have. To learn from every single moment of your life. To learn from every single experience you have. To learn from every single day of your life. To learn from every single year of your life. To learn from every single decade of your life. To learn from every single lifetime you live. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. To learn from every single thing you are. To learn from every single thing you become. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. To learn from every single thing you are. To learn from every single thing you become. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. To learn from every single thing you are. To learn from every single thing you become. To learn from every single thing you do. To learn from every single thing you say. To learn from every single thing you think. To learn from every single thing you feel. To learn from every single thing you experience. To learn from every single thing you encounter. To learn from every single thing you live. [end of text]
Q8_0
I believe the meaning of life is to learn and grow as a person. To become a better person, you need to go through challenges and learn from them. As you learn, you grow.
The meaning of life is to learn from your mistakes and to grow from them. To become a better person, you need to go through challenges and learn from them. As you learn, you grow.
Life is a journey of learning and growing. We learn from our experiences and use that knowledge to improve ourselves. We grow as individuals by facing challenges and overcoming them. We become better people by learning from our mistakes and making positive changes in our lives.
The meaning of life is to find happiness and fulfillment. We all have different ideas about what makes us happy, but ultimately, we all want to be happy. Finding happiness and fulfillment in life requires us to learn and grow as individuals. We must face challenges and overcome them, learn from our mistakes, and make positive changes in our lives.
So, the meaning of life is to learn and grow, to find happiness and fulfillment, and to become the best person we can be.
The meaning of life is a philosophical question concerning the purpose and significance of life or existence in general. This concept has been approached by many perspectives, including religious, philosophical, and scientific viewpoints.
Some people believe that the meaning of life is to seek happiness and fulfillment, while others believe that it is to serve a higher power or to achieve a specific goal. Many philosophers have proposed various theories about the meaning of life, such as existentialism, which suggests that life has no inherent meaning and that it is up to each individual to create their own purpose.
From a scientific perspective, some argue that the meaning of life is simply to survive and reproduce, as this is the primary function of all living organisms. However, others argue that this view is too reductionist and that there is more to life than just biological survival.
Ultimately, the meaning of life is a deeply personal and subjective question, and different people may have different answers. Some people may find meaning through personal growth and self-improvement, while others may find it through relationships, love, and connection with others. Ultimately, the meaning of life is something that each individual must discover for themselves. [end of text]
Q6_K
I believe the meaning of life is to learn and grow as a person. To become the best version of yourself possible. In the process of doing so, you may find that you have talents and skills that can be of benefit to others. By using those talents and skills to help others, you create a positive impact on the world and leave a lasting legacy.

But what does this have to do with a "purpose"?

I think the idea of a "purpose" is a bit misleading. It implies that there is some specific goal or outcome that we are supposed to achieve in life. But I don't believe that's the case. I think our purpose is simply to learn, grow, and use our talents and skills to make a positive impact on the world.

And I think that purpose is something that we all have within us, regardless of our circumstances. Whether we are rich or poor, healthy or sick, young or old, we all have the ability to learn and grow and make a difference in the world.

So, in short, I believe the meaning of life is to learn, grow, and make a positive impact on the world. And I believe that purpose is something that we all have within us, regardless of our circumstances. [end of text]
Q4_0
I believe the meaning of life is to learn and grow as a person. To become a better person, you need to go through challenges and learn from them. As you learn, you grow.
The meaning of life is to learn from your mistakes and to grow from them. To become a better person, you need to go through challenges and learn from them. As you learn, you grow.
Life is a journey of learning and growing. We learn from our experiences and use that knowledge to improve ourselves. We grow as individuals by facing challenges and overcoming them. We become better people by learning from our mistakes and making positive changes in our lives.
The meaning of life is to find happiness and fulfillment. We all have different ideas about what makes us happy, but ultimately, we all want to be happy. Finding happiness and fulfillment in life requires us to learn and grow as individuals. We must face challenges and overcome them, learn from our mistakes, and make positive changes in our lives.
So, the meaning of life is to learn and grow, to find happiness and fulfillment, and to become the best person we can be.
The meaning of life is a philosophical question concerning the purpose and significance of life or existence in general. This concept has been approached by many perspectives, including religious, philosophical, and scientific viewpoints.
Some people believe that the meaning of life is to seek happiness and fulfillment, while others believe that it is to serve a higher power or to achieve a specific goal. Many philosophers have proposed various theories about the meaning of life, such as existentialism, which suggests that life has no inherent meaning and that it is up to each individual to create their own purpose.
From a scientific perspective, some argue that the meaning of life is simply to survive and reproduce, as this is the primary function of all living organisms. However, others argue that this view is too reductionist and that there is more to life than just biological survival.
Ultimately, the meaning of life is a deeply personal and subjective question, and different people may have different answers. Some people may find meaning through personal growth and self-improvement, while others may find it through relationships, love, and connection with others. Ultimately, the meaning of life is something that each individual must discover for themselves. [end of text]
Q3_K_M
I believe the meaning of life is to make the most of it. We're here for a limited amount of time, we don't know what happens after we die, if anything, so we should live our lives in a way that makes us happy.

For me, that means I want to see as much of the world as I can. I want to experience different cultures, see natural wonders, and enjoy life. I want to experience things I can't when I'm older, like hiking up mountains or going on long backpacking trips.

I think everyone's meaning of life is different, and it's up to each person to figure it out. For some, it might be to have a family, for others it might be to make a difference in the world. Ultimately, I think it's about finding what makes you happy and fulfilled. [end of text]

IQ3_S
I believe the meaning of life is to be happy. And the best way to be happy is to be of service to others, to help them and to make them happy. And you can do that in all sorts of ways. You can do that through a career, through volunteering, through being a good friend, through being a good parent, through being a good partner. And you can do that through being creative and putting that creativity out into the world. And that's what I think the meaning of life is. [end of text]
IQ3_XXS
I believe the meaning of life is what you make it. I don't think there is one definitive meaning in life. I think people can find what they want in life and achieve it, but I don't think you can ever truly complete life. There is always something else to do or learn. I think that's what makes life interesting. It's an adventure, a journey.
by: Sarah
I think the meaning of life is to be happy and do what you want in life. I think you can achieve this by doing what makes you happy. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure, a journey.
by: Emily
I think the meaning of life is what you make it. I think you can achieve this by doing what you want in life. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure, a journey.
by: Jessica
I think the meaning of life is what you make it. I think you can achieve this by doing what you want in life. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure, a journey.
by: Kelly
I think the meaning of life is what you make it. I think you can achieve this by doing what you want in life. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure, a journey.
by: Katie
I think the meaning of life is what you make it. I think you can achieve this by doing what you want in life. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure, a journey.
by: Allie
I think the meaning of life is what you make it. I think you can achieve this by doing what you want in life. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure, a journey.
by: Molly
I think the meaning of life is what you make it. I think you can achieve this by doing what you want in life. I think you can complete life by doing what you want and being happy. I think that's what makes life fun. It's an adventure,
by: [end of text]

All quantized models are uploaded to the phymbert/dbrx-16x12b-instruct-gguf collection.

Usage in the server

server --host 0.0.0.0 --port 8080 \
    --hf-repo phymbert/dbrx-16x12b-instruct-q4_0-gguf \
    --hf-file dbrx-16x12b-instruct-q4_0-00001-of-00010.gguf \
    --model models/dbrx-16x12b-instruct-q4_0-00001-of-00010.gguf \
    -ngl 41 \
    --parallel 2 \
    --metrics \
    --ctx-size 4096 \
    --batch-size 2048 \
    --ubatch-size 256 \
    --log-format text

Tasks

License

DBRX is distributed under the Databricks Open Model License agreement.

@phymbert added the labels: model (Model specific), help wanted (Extra attention is needed), need feedback (Testing and feedback with results are needed) on Apr 6, 2024

@dranger003 (Contributor) commented Apr 6, 2024

./build/bin/imatrix \
  --model models/dbrx-instruct-iq3_s.gguf \
  -f wikitext-2-raw/wiki.train.raw \
  -o dbrx-instruct-iq3_s-imat.gguf \
  -ngl 40

I think you meant --model models/dbrx-instruct-f16.gguf and -o dbrx-instruct-iq3_s-imat.dat? Also, I can provide the results once I have them. And it may be worth adding --chunks 200 as well.

@phymbert (Collaborator, Author) commented Apr 6, 2024

I think you meant --model models/dbrx-instruct-f16.gguf and -o dbrx-instruct-iq3_s-imat.dat? Also, I can provide the results once I have them. And it may be worth adding --chunks 200 as well.

I don't have enough VRAM to compute the imatrix on f16, so I was planning to do it on iq3_s, as @ggerganov did for Grok-1.

Please note that at the moment I have not reached the model loading step, as I am still converting the weights to GGUF f16. So there are probably still some adjustments to make in the graph.

But your help is very welcome 👍

EDIT: updated the imatrix output filename, thanks

@dranger003 (Contributor) commented:

./build/bin/quantize models/dbrx-instruct-f16.gguf models/dbrx-instruct-iq3_s.gguf iq3_s

@phymbert Great work! Also for IQ3_S I think the imatrix is required. And you're right, I'm not rich enough to imatrix grok-1 on f16.

@phymbert (Collaborator, Author) commented Apr 6, 2024

@phymbert Great work! Also for IQ3_S I think the imatrix is required. And you're right, I'm not rich enough to imatrix grok-1 on f16.

Thanks, we will see once it generates something :)
I understand iq3_s can be produced without an imatrix, but I don't have much hope for the ppl, yes:

fprintf(stderr, "Please do not use IQ1_S, IQ1_M, IQ2_S, IQ2_XXS, IQ2_XS or Q2_K_S quantization without an importance matrix\n");

@dranger003 (Contributor) commented:

I stand corrected, and I learn yet again :) FYI, I'm finishing up Command-R+ and I'll upload DBRX here.

@dranger003 (Contributor) commented:

Trying to quantize, I get a key-not-found error:

llama_model_quantize: failed to quantize: key not found in model: dbrx.feed_forward_length

@jukofyork (Contributor) commented Apr 14, 2024

Can anybody see anything wrong with this FP16 model:

./main --model ./dbrx:16x12b-instruct-f16.gguf --seed 42 --prompt "I believe the meaning of life is " 
Log start
main: build = 2665 (4bd0f93e)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 42
llama_model_loader: loaded meta data with 24 key-value pairs and 323 tensors from ./dbrx:16x12b-instruct-f16.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = dbrx
llama_model_loader: - kv   1:                               general.name str              = dbrx
llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
llama_model_loader: - kv  10:                          general.file_type u32              = 1
llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                          general.file_type u32              = 1
llama_model_loader: - kv  15:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  16:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  17:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  18:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  19:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  20:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  21:            tokenizer.ggml.unknown_token_id u32              = 100257
llama_model_loader: - kv  22:            tokenizer.ggml.padding_token_id u32              = 100277
llama_model_loader: - kv  23:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:  242 tensors
llm_load_vocab: special tokens definition check successful ( 96/100352 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = dbrx
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 100352
llm_load_print_meta: n_merges         = 100000
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 8.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10752
llm_load_print_meta: n_expert         = 16
llm_load_print_meta: n_expert_used    = 4
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16x12B
llm_load_print_meta: model ftype      = F16
llm_load_print_meta: model params     = 131.60 B
llm_load_print_meta: model size       = 245.12 GiB (16.00 BPW) 
llm_load_print_meta: general.name     = dbrx
llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
llm_load_print_meta: PAD token        = 100277 '<|pad|>'
llm_load_print_meta: LF token         = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size = 251001.40 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    80.00 MiB
llama_new_context_with_model: KV self size  =   80.00 MiB, K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.38 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  6168.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    13.01 MiB
llama_new_context_with_model: graph nodes  = 2886
llama_new_context_with_model: graph splits = 404

system_info: n_threads = 44 / 88 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


I believe the meaning of life is 42.

But really, I don't

I'm not expecting it to match @phymbert's output exactly, but it's just absolute garbage compared to his for some reason (the Q4_0 I made from this gave the same output and was equally bad).

It's not just slightly bad either - it's sub-broken-frankenmerge level of bad! It almost feels like some of its tensors are lost or scrambled because it completely failed all the simple coding tasks phymbert's Q4_0 passed yesterday and doesn't seem to have a clue about things like Boost, GSL, Eclipse, etc (it just hallucinates random weirdness).

It was created with:

./convert-hf-to-gguf.py ./dbrx-instruct --outfile ./dbrx:16x12b-instruct-f16.gguf --outtype f16

and I definitely pulled the instruct repo:

./hfdownloader -j databricks/dbrx-instruct -t <TOKEN>

I don't normally use hfdownloader and just use wget, but this was passworded and git clone sometimes pulls double the data from the .git folder, etc.

@phymbert Can you confirm the FP16 you're uploading to HF now is the same one that the working Q4_0 quant tested above was made from? I can't really face downloading another broken HF copy just to check, and might just use your FP16 instead... ☹️

@phymbert (Collaborator, Author) commented:

I'm not expecting it to match the output

I run only on CPU, so the output will differ from CUDA, yes.

Can you confirm the FP16 you're uploading to HF now

I did not manage to complete the upload, sorry; better to restart from the original repo, yes. The quantized models I have uploaded were computed without an importance matrix.

@jukofyork (Contributor) commented:

I'm not expecting it to match the output

I run only on CPU, so the output will differ from CUDA, yes.

Can you confirm the FP16 you're uploading to HF now

I did not manage to complete the upload, sorry; better to restart from the original repo, yes. The quantized models I have uploaded were computed without an importance matrix.

I think it was using CPU? I reran it with -ngl 0 just to be sure and got the same answer of "42\nBut really" too.

Interestingly, @dranger003's IQ4_XS gives the same answer too:

./main --model ./dbrx:16x12b-instruct-iq4_xs.gguf --seed 42 --prompt "I believe the meaning of life is " -ngl 0
Log start
main: build = 2665 (4bd0f93e)
main: built with cc (Debian 12.2.0-14) 12.2.0 for x86_64-linux-gnu
main: seed  = 42
llama_model_loader: loaded meta data with 27 key-value pairs and 323 tensors from ./dbrx:16x12b-instruct-iq4_xs.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = dbrx
llama_model_loader: - kv   1:                               general.name str              = dbrx
llama_model_loader: - kv   2:                           dbrx.block_count u32              = 40
llama_model_loader: - kv   3:                        dbrx.context_length u32              = 32768
llama_model_loader: - kv   4:                      dbrx.embedding_length u32              = 6144
llama_model_loader: - kv   5:                   dbrx.feed_forward_length u32              = 10752
llama_model_loader: - kv   6:                  dbrx.attention.head_count u32              = 48
llama_model_loader: - kv   7:               dbrx.attention.head_count_kv u32              = 8
llama_model_loader: - kv   8:                        dbrx.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv   9:                   dbrx.attention.clamp_kqv f32              = 8.000000
llama_model_loader: - kv  10:                          general.file_type u32              = 30
llama_model_loader: - kv  11:                          dbrx.expert_count u32              = 16
llama_model_loader: - kv  12:                     dbrx.expert_used_count u32              = 4
llama_model_loader: - kv  13:          dbrx.attention.layer_norm_epsilon f32              = 0.000010
llama_model_loader: - kv  14:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  15:                      tokenizer.ggml.tokens arr[str,100352]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  16:                  tokenizer.ggml.token_type arr[i32,100352]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  17:                      tokenizer.ggml.merges arr[str,100000]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  18:                tokenizer.ggml.bos_token_id u32              = 100257
llama_model_loader: - kv  19:                tokenizer.ggml.eos_token_id u32              = 100257
llama_model_loader: - kv  20:            tokenizer.ggml.unknown_token_id u32              = 100257
llama_model_loader: - kv  21:            tokenizer.ggml.padding_token_id u32              = 100277
llama_model_loader: - kv  22:                    tokenizer.chat_template str              = {% if messages[0]['role'] == 'system'...
llama_model_loader: - kv  23:               general.quantization_version u32              = 2
llama_model_loader: - kv  24:                                   split.no u16              = 0
llama_model_loader: - kv  25:                                split.count u16              = 0
llama_model_loader: - kv  26:                        split.tensors.count i32              = 323
llama_model_loader: - type  f32:   81 tensors
llama_model_loader: - type  f16:   40 tensors
llama_model_loader: - type q6_K:    1 tensors
llama_model_loader: - type iq4_xs:  201 tensors
llm_load_vocab: special tokens definition check successful ( 96/100352 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = dbrx
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 100352
llm_load_print_meta: n_merges         = 100000
llm_load_print_meta: n_ctx_train      = 32768
llm_load_print_meta: n_embd           = 6144
llm_load_print_meta: n_head           = 48
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 40
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 6
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 1.0e-05
llm_load_print_meta: f_norm_rms_eps   = 0.0e+00
llm_load_print_meta: f_clamp_kqv      = 8.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 10752
llm_load_print_meta: n_expert         = 16
llm_load_print_meta: n_expert_used    = 4
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 2
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 32768
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: model type       = 16x12B
llm_load_print_meta: model ftype      = IQ4_XS - 4.25 bpw
llm_load_print_meta: model params     = 131.60 B
llm_load_print_meta: model size       = 65.28 GiB (4.26 BPW) 
llm_load_print_meta: general.name     = dbrx
llm_load_print_meta: BOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: EOS token        = 100257 '<|endoftext|>'
llm_load_print_meta: UNK token        = 100257 '<|endoftext|>'
llm_load_print_meta: PAD token        = 100277 '<|pad|>'
llm_load_print_meta: LF token         = 128 'Ä'
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 2 CUDA devices:
  Device 0: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
  Device 1: NVIDIA RTX A6000, compute capability 8.6, VMM: yes
llm_load_tensors: ggml ctx size =    0.37 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/41 layers to GPU
llm_load_tensors:        CPU buffer size = 66849.12 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: n_batch    = 512
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    80.00 MiB
llama_new_context_with_model: KV self size  =   80.00 MiB, K (f16):   40.00 MiB, V (f16):   40.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.38 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  1698.00 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    13.01 MiB
llama_new_context_with_model: graph nodes  = 2886
llama_new_context_with_model: graph splits = 404

system_info: n_threads = 44 / 88 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | 
sampling: 
	repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
	top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
	mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampling order: 
CFG -> Penalties -> top_k -> tfs_z -> typical_p -> top_p -> min_p -> temperature 
generate: n_ctx = 512, n_batch = 2048, n_predict = -1, n_keep = 0


I believe the meaning of life is 42.

But really

It's harder for me to test this one on the coding tasks as Ollama won't import IQ models, but I will try running it using server and asking it some stuff in OpenWebUI.

@phymbert (Collaborator, Author) commented:

I think it was using CPU? I reran it with -ngl 0 just to be sure and got the same answer of "42\nBut really" too.

I meant the CPU ggml backend, not CUDA.

@jukofyork (Contributor) commented:

I think it was using CPU? I reran it with -ngl 0 just to be sure and got the same answer of "42\nBut really" too.

I meant the CPU ggml backend, not CUDA.

Oh, I see, no problem.

@dranger003 (Contributor) commented:

@jukofyork This model is an instruct model; without its template I don't think the output will be reliable. Below is what I use, and the output seems consistent and reliable.

./build/bin/main -ngl 41 -c 4096 -s 0 --temp 0 -e -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nWrite an essay about AI.<|im_end|>\n<|im_start|>assistant\n" -m /md0/models/databricks/ggml-dbrx-instruct-16x12b-iq4_xs.gguf

@phymbert (Collaborator, Author) commented Apr 14, 2024

without its template I don't think the output will be reliable

Or use --chatml

@dranger003 (Contributor) commented:

without its template I don't think the output will be reliable

Or use --chatml

Yes, but that kicks you into interactive instruct mode. Unless there is an option to use the template without auto activating interactive?

@phymbert (Collaborator, Author) commented:

Yes, but that kicks you into interactive instruct mode. Unless there is an option to use the template without auto activating interactive?

I do not remember facing such an issue.

@phymbert (Collaborator, Author) commented:

These results are not really unexpected, but I thought maybe the FP16 imatrix would help a bit more than that.

| Quant  | IMatrix Quant/Dataset/Chunks | Size (GiB) | PPL (wiki.test)    |
|--------|------------------------------|------------|--------------------|
| IQ4_XS | Q8_0/wiki.train/200          | 65.29      | 5.2260 +/- 0.03558 |
| IQ4_XS | FP16/wiki.train/2000         | 65.29      | 5.2241 +/- 0.03559 |
| IQ4_XS | -                            | 66.05      | 5.2546 +/- 0.03570 |

@abhi-mosaic any idea why the quantized models are performing so badly? Is there an issue with the llama.cpp quantization methods for this architecture?

@dranger003 (Contributor) commented:

Yes, but that kicks you into interactive instruct mode. Unless there is an option to use the template without auto activating interactive?

I do not remember facing such an issue.

Can you try this? This gives me an input prompt in interactive mode.

./build/bin/main -ngl 41 -s 0 --temp 0 --chatml -p "Write an essay about AI." -m /md0/models/databricks/ggml-dbrx-instruct-16x12b-iq4_xs_imatrix-wiki.gguf

@jukofyork (Contributor) commented:

Yeah, sorry for not making it clear: that was just a test to match the one here.

I've been using the --chatml option and have the correct template in the Ollama modelfile too.

The odd thing is that dranger003's IQ4_XS seems to be matching my terrible models' output, but not being able to import it into Ollama I can't make a detailed test.

I'm redownloading phymbert's Q4_0 now to double check what's happening but AFAIK the testing conditions were exactly the same with all the same settings (temperature=0 also).

I'll report back if I find out the problem - is there any command to dump the meta data from the GGUF files so I can run diff on them?

@dranger003 (Contributor) commented:

./gguf-py/scripts/gguf-dump.py -h
usage: gguf-dump.py [-h] [--no-tensors] [--json] [--json-array] model

Dump GGUF file metadata

positional arguments:
  model         GGUF format model filename

options:
  -h, --help    show this help message and exit
  --no-tensors  Don't dump tensor metadata
  --json        Produce JSON output
  --json-array  Include full array values in JSON output (long)

@chigkim commented Apr 15, 2024

I just pulled the latest master branch of llama.cpp and built it like an hour ago.
Then I downloaded iq3_xxs from the link below.
https://huggingface.co/phymbert/dbrx-16x12b-instruct-iq3_xxs-gguf
Then I merged it with gguf-split --merge ...
However, when I ran the model and asked a question, I just got numbers like 0 01 0 01...
Here's my command.

llama.cpp/main -m models/dbrx-16x12b-instruct-iq3_xxs.gguf -n -1 -c 2048 --temp 0.6 --interactive-first -r "<|im_end|>" --in-prefix "<|im_start|>user\n" --in-suffix "<|im_end|>\n<|im_start|>assistant\n" -e -p "<|im_start|>system\nYou're a friendly assistant.<|im_end|>\n"

I also tried --chatml, but got the same result.
I'd appreciate any tip! Thanks!

@phymbert (Collaborator, Author) commented:

The iq3_xxs has been quantized without an imatrix, so don't have too much hope for generation quality.
You do not need to merge anymore; loading from the sharded model is built-in.
Does a simple prompt like the one in the PR summary work fine? Have you tried running the server with the chat completion endpoint?

@chigkim commented Apr 15, 2024

Yes, -p "I believe the meaning of life is" does generate English words. I guess it's having a problem with the chatml prompt format then?

I merged bc I was hoping to use it with Ollama later.

@jukofyork (Contributor) commented Apr 15, 2024

I merged bc I was hoping to use it with Ollama later.

I was going to say it won't work in Ollama:

ollama/ollama#3622 (comment)

but it seems iq3_xxs is the only IQ type supported (?)

I did manage to hack in the latest llama.cpp and it does work (although I didn't alter the patches they make to server.cpp 👀). Hopefully they will bump to use the latest llama.cpp soon in the official branch. They've already bumped to use https://github.com/ggerganov/llama.cpp/tree/4bd0f93e4ab4fe6682e7d0241c1bdec1397e954a which looks to be after this PR was merged.

@jukofyork (Contributor) commented:

Yeah, sorry for not making it clear: that was just a test to match the one here.

I've been using the --chatml option and have the correct template in the Ollama modelfile too.

The odd thing is that dranger003's IQ4_XS seems to be matching my terrible models' output, but not being able to import it into Ollama I can't make a detailed test.

I'm redownloading phymbert's Q4_0 now to double check what's happening but AFAIK the testing conditions were exactly the same with all the same settings (temperature=0 also).

I'll report back if I find out the problem - is there any command to dump the meta data from the GGUF files so I can run diff on them?

Just to say I've now completely redone everything from scratch and have successfully quantized a Q4_K_M:

./convert-hf-to-gguf.py dbrx-instruct --outfile dbrx:16x12b-instruct-f16.gguf --outtype f16
./imatrix --chunks 200 -m dbrx:16x12b-instruct-f16.gguf -f groups_merged.txt -o dbrx:16x12b-instruct-f16.imatrix -ngl 12
./quantize --imatrix dbrx:16x12b-instruct-f16.imatrix dbrx:16x12b-instruct-f16.gguf dbrx:16x12b-instruct-q4_K_M.gguf Q4_K_M 12

The imatrix run only managed to create 91 chunks from groups_merged.txt but I can share it on HF if anybody wants?

I think the problem I had yesterday was the FP16 GGUF must have been corrupted or truncated somehow; I did notice convert-hf-to-gguf.py used an extraordinary amount of RAM at one point and I was using the machine for other stuff when I did the first run of convert-hf-to-gguf.py.

Anyway, panic over and the Q4_K_M + imatrix does seem (subjectively) slightly better than the Q4_0 on coding tasks so far.

Big thanks to @phymbert and @dranger003 for all the help! 👍

@jukofyork (Contributor) commented Apr 15, 2024

So I'm just converting c4ai-command-r-plus now and noticed a couple of differences:

  • convert-hf-to-gguf.py used a huge amount of RAM (250-300GB) for dbrx-instruct and only wrote the FP16 file right at the end, whereas c4ai-command-r-plus (and other models I've quantized) write the file out bit by bit and use a small amount of RAM.
  • imatrix using dbrx-instruct reported saving 10 chunks every 2-3 actual chunks, eg: [1], [2], .. "saving 10 chunks", [3], [4], [5], "saving 20 chunks" and so on, whereas c4ai-command-r-plus reported it could make 95 chunks and really did save every 10 chunks.

These may be known but just thought I should point them out in case relevant.

@dave-fl (Contributor) commented Apr 16, 2024

So I'm just converting c4ai-command-r-plus now and noticed a couple of differences:

  • convert-hf-to-gguf.py used a huge amount of RAM (250-300GB) for dbrx-instruct and only wrote the FP16 file right at the end, whereas c4ai-command-r-plus (and other models I've quantized) write the file out bit by bit and use a small amount of RAM.
  • imatrix using dbrx-instruct reported saving 10 chunks every 2-3 actual chunks, eg: [1], [2], .. "saving 10 chunks", [3], [4], [5], "saving 20 chunks" and so on, whereas c4ai-command-r-plus reported it could make 95 chunks and really did save every 10 chunks.

These may be known but just thought I should point them out in case relevant.

If you set the convert script to use temp files, it will keep the RAM in check.

This should probably be parameterized and also a default.

@chigkim commented Apr 17, 2024

Ollama 0.3.2 supports dbrx. I downloaded their version and tried it.
It looks like a quant problem. At least their Q2_K produces English, but it makes no sense. It reminds me of when I tried GPT-2 124M. It repeats and digresses like crazy.
Q4_K works fine though.
I wonder why Q2 for dbrx is particularly poor. Q2 for WizardLM2-8x22b, Zephyr-Orpo-8x22b, and command-r-plus all worked fine.

tybalex pushed a commit to rubra-ai/tools.cpp that referenced this pull request Apr 17, 2024
* model: dbrx convert to gguf
ggerganov#6344

* llama: support dbrx
ggerganov#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
@jukofyork (Contributor) commented May 4, 2024

Yeah, sorry for not making it clear: that was just a test to match the one here.
I've been using the --chatml option and have the correct template in the Ollama modelfile too.
The odd thing is that dranger003's IQ4_XS seems to be matching my terrible models' output, but not being able to import it into Ollama I can't make a detailed test.
I'm redownloading phymbert's Q4_0 now to double check what's happening but AFAIK the testing conditions were exactly the same with all the same settings (temperature=0 also).
I'll report back if I find out the problem - is there any command to dump the meta data from the GGUF files so I can run diff on them?

Just to say I've now completely redone everything from scratch and have successfully quantized a Q4_K_M:

./convert-hf-to-gguf.py dbrx-instruct --outfile dbrx:16x12b-instruct-f16.gguf --outtype f16
./imatrix --chunks 200 -m dbrx:16x12b-instruct-f16.gguf -f groups_merged.txt -o dbrx:16x12b-instruct-f16.imatrix -ngl 12
./quantize --imatrix dbrx:16x12b-instruct-f16.imatrix dbrx:16x12b-instruct-f16.gguf dbrx:16x12b-instruct-q4_K_M.gguf Q4_K_M 12

The imatrix run only managed to create 91 chunks from groups_merged.txt but I can share it on HF if anybody wants?

I think the problem I had yesterday was the FP16 GGUF must have been corrupted or truncated somehow; I did notice convert-hf-to-gguf.py used an extraordinary amount of RAM at one point and I was using the machine for other stuff when I did the first run of convert-hf-to-gguf.py.

Anyway, panic over and the Q4_K_M + imatrix does seem (subjectively) slightly better than the Q4_0 on coding tasks so far.

Big thanks to @phymbert and @dranger003 for all the help! 👍

This model is so weird... I tried to re-quant it today to use the new BPE stuff, found out it wouldn't work, and since I had already deleted the old model I went back to the old llama.cpp pull from mid-April (that I used above) to recreate it from scratch, and it's back to working like a broken frankenmerge again!? WTF???

I thought it might be because I used 22 threads instead of 12 in the example above (and a thread race is causing it, etc), but I think it's the FP16 that must be screwing up somehow as I noticed it seems to be getting a PPL score way bigger than any other model when creating the imatrix file:

Final estimate: PPL = 8.7773 +/- 0.17953

This is using groups_merged.txt, so it's hard to compare with anybody else's, but that PPL is between 2 and 3 times what other models get for the same file when creating the imatrix file, and @dranger003 gets around 5.2 for wiki.txt on his HF page.

Last time it did this, I completely re-downloaded and redid everything, but TBH the model isn't that great and I'm not sure I can be arsed with all that again (plus it seems unlikely I would have corrupted the same downloaded files that worked before yet again!?).

@jukofyork (Contributor) commented:

So I'm going to have one more go at creating the FP16 (using branch b2665 to be absolutely sure) and get the perplexity values for wiki.test.raw as well. Will report back if I find anything.

@jukofyork (Contributor) commented:

So I think there may be a thread race if you set the number of quantize threads too high; I don't get the gibberish-producing version if I keep to 12 threads (as suggested earlier in this PR).

But it also seems this model doesn't play nice with the imatrix calculation either (compared to phymbert's original Q4_0).

I'm now experimenting with turning on all 16 experts to create the imatrix file:

imatrix created with 4 experts

Final estimate: PPL = 8.7773 +/- 0.17953

imatrix created with 16 experts

Final estimate: PPL = 8.8745 +/- 0.17301

and using that imatrix to quantize the original FP16 (ie: with 4 experts) to see the effect...

It could be that some of the experts aren't getting triggered at all, or are getting weighted so low in the gating that they are hardly contributing anything. With regard to my comment on the potential use of Tikhonov regularization:

#5263 (comment)

this would be solved by setting the diagonals in proportion to sqrt(n) (where n is the number of times the expert was not skipped over) instead of using the identity matrix, to account for the sample-size differences. Depending on how the value of x is calculated in the loop, n may or may not need to be the weighted sum of the outputs of the active gating networks instead (but IIRC it's using backprop to get these, so that will likely be accounted for if so).
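
To make that concrete, a minimal sketch of the sqrt(n)-scaled damping idea, assuming hypothetical per-expert accumulators sums and call counts n_calls (illustrative only, not existing llama.cpp code):

#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative Tikhonov-style damping: each expert's diagonal importance entries get a
// ridge term proportional to sqrt(n_e), so experts that were selected less often are
// damped relatively more. sums[e][j] is the accumulated x_j^2 for expert e and
// n_calls[e] the number of times expert e was actually selected (hypothetical bookkeeping).
std::vector<std::vector<float>> damped_importance(
        const std::vector<std::vector<float>> & sums,
        const std::vector<int> & n_calls,
        float lambda0 = 1.0f) {
    std::vector<std::vector<float>> out(sums.size());
    for (size_t e = 0; e < sums.size(); ++e) {
        const float n      = (float) std::max(1, n_calls[e]);
        const float lambda = lambda0 * std::sqrt(n); // grows like sqrt(n), so it shrinks
        out[e].resize(sums[e].size());               // relative to sums (which grow ~n)
        for (size_t j = 0; j < sums[e].size(); ++j) {
            out[e][j] = (sums[e][j] + lambda) / n;   // normalize by the expert's own count
        }
    }
    return out;
}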

@jukofyork (Contributor) commented May 5, 2024

Using all 16 experts does somewhat work, but looking through the code I think I can see the root of the problem:

    // this has been adapted to the new format of storing merged experts in a single 3d tensor
    // ref: https://github.com/ggerganov/llama.cpp/pull/6387
    if (t->op == GGML_OP_MUL_MAT_ID) {
        const int idx  = ((int32_t *) t->op_params)[0];
        const ggml_tensor * ids = t->src[2];
        const int n_as = src0->ne[2];

        // the top-k selected expert ids are stored in the ids tensor
        // for simplicity, always copy ids to host, because it is small
        GGML_ASSERT(ids->ne[1] == src1->ne[1]);
        m_ids.resize(ggml_nbytes(ids)/sizeof(int));
        ggml_backend_tensor_get(ids, m_ids.data(), 0, ggml_nbytes(ids));

        auto & e = m_stats[wname];

        ++e.ncall;
        // NOTE: since we select top-k experts, the number of calls for the expert tensors will be k times larger
        //       using the following line, we can correct for that if needed by replacing the line above with:
        //if (idx == t->src[0]->ne[0] - 1) ++e.ncall;

        // loop over all possible experts, regardless if they are used or not in the batch
        for (int ex = 0; ex < n_as; ++ex) {
            size_t e_start = ex*src1->ne[0];
            if (e.values.empty()) {
                e.values.resize(src1->ne[0]*n_as, 0);
            }
            else if (e.values.size() != (size_t)src1->ne[0]*n_as) {
                fprintf(stderr, "Oops: inconsistent size for %s (%d vs %d)\n", wname.c_str(), (int)e.values.size(), (int)src1->ne[0]*n_as);
                exit(1); //GGML_ASSERT(false);
            }
            if (m_params.verbosity > 1) {
                printf("%s[%d]: %32s, %s, %5d x %5d, %d\n", __func__, m_last_call, wname.c_str(), ggml_op_name(t->op), (int)src1->ne[0], (int)src1->ne[1], (int)src1->type);
            }
            for (int row = 0; row < (int)src1->ne[1]; ++row) {
                const int excur = m_ids[row*n_as + idx];
                GGML_ASSERT(excur >= 0 && excur < n_as); // sanity check
                if (excur != ex) continue;
                const float * x = data + row * src1->ne[0];
                for (int j = 0; j < (int)src1->ne[0]; ++j) {
                    e.values[e_start + j] += x[j]*x[j];
                }
            }
            if (e.ncall > m_last_call) {
                m_last_call = e.ncall;
                if (m_last_call % m_params.n_output_frequency == 0) {
                    save_imatrix();
                }
                if (m_params.keep_every > 0 && m_last_call%m_params.keep_every == 0) {
                    keep_imatrix(m_last_call);
                }
            }
        }
    }

++e.ncall

static void load_imatrix(const std::string & imatrix_file, std::unordered_map<std::string, std::vector<float>> & imatrix_data) {
    std::ifstream in(imatrix_file.c_str(), std::ios::binary);
    if (!in) {
        printf("%s: failed to open %s\n",__func__, imatrix_file.c_str());
        exit(1);
    }
    int n_entries;
    in.read((char *)&n_entries, sizeof(n_entries));
    if (in.fail() || n_entries < 1) {
        printf("%s: no data in file %s\n", __func__, imatrix_file.c_str());
        exit(1);
    }
    for (int i = 0; i < n_entries; ++i) {
        int len; in.read((char *)&len, sizeof(len));
        std::vector<char> name_as_vec(len+1);
        in.read((char *)name_as_vec.data(), len);
        if (in.fail()) {
            printf("%s: failed reading name for entry %d from %s\n", __func__, i+1, imatrix_file.c_str());
            exit(1);
        }
        name_as_vec[len] = 0;
        std::string name{name_as_vec.data()};
        auto & e = imatrix_data[name];
        int ncall;
        in.read((char *)&ncall, sizeof(ncall));
        int nval;
        in.read((char *)&nval, sizeof(nval));
        if (in.fail() || nval < 1) {
            printf("%s: failed reading number of values for entry %d\n", __func__, i);
            imatrix_data = {};
            exit(1);
        }
        e.resize(nval);
        in.read((char *)e.data(), nval*sizeof(float));
        if (in.fail()) {
            printf("%s: failed reading data for entry %d\n", __func__, i);
            imatrix_data = {};
            exit(1);
        }
        if (ncall > 0) {
            for (auto& v : e) v /= ncall;
        }

        if (getenv("LLAMA_TRACE")) {
            printf("%s: loaded data (size = %6d, ncall = %6d) for '%s'\n", __func__, int(e.size()), ncall, name.c_str());
        }
    }
    printf("%s: loaded %d importance matrix entries from %s\n", __func__, int(imatrix_data.size()), imatrix_file.c_str());
}

for (auto& v : e) v /= ncall;

            if (quant_weights) {
                const float * qw = quant_weights + QK_K*ibl + 32*ib;
                for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
            } else {
                for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
            }

quant_weights is where the value ends up getting used.

By setting LLAMA_TRACE=1 we can see this too:

load_imatrix: loaded data (size = 172032, ncall =    364) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.attn_output.weight'
load_imatrix: loaded data (size = 172032, ncall =    364) for 'blk.37.ffn_down_exps.weight'
load_imatrix: loaded data (size =  98304, ncall =    364) for 'blk.37.ffn_gate_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.37.ffn_gate_inp.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.37.attn_output.weight'

I see 2 potential problems:

  1. The experts should have their own counts as it's very likely their selection frequencies (and softmax gate-weight factors) won't be uniformly distributed. Obviously this will be very hard to add now without breaking other stuff.

  2. It looks to me like the experts' tensors' final quant_weights values are getting divided by 16x more than they should be, but the code is very obtuse and I can't be 100% sure whether there is some double counting in the "// loop over all possible experts" loop, but I think the if (excur != ex) continue; line skips these...

We are dividing blk.*.ffn_*_exps.weight by 4 times more than blk.*.attn_*.weight, but in actual fact each of the expert MLPs in these combined tensors is activated at most 1:1 with the attention weights, and in expectation is only activated 1/4 as often as the attention weights. I think this means the for (auto& v : e) v /= ncall line is actually dividing the sample count down by 16x more than it should (I'm not 100% sure what the effect will be in ggml-quants.c though, as the code is very cryptic and I don't have time to read it all).

I think it would be a good idea for somebody who knows the code base to have a really close look at what is going on here and possibly also double check that the softmax gate-weight factors are properly taken into account via backprop, etc.

(I think) a simple (hacky) fix for (2) is:

e.values[e_start + j] += x[j]*x[j];

to:

e.values[e_start + j] += (x[j]*x[j])*static_cast<float>(n_as);

for the expert loop (assuming there isn't some double counting in the loops I can't see).
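
For comparison, here is a minimal sketch of the per-expert bookkeeping suggested in (1), using a hypothetical per-expert call counter instead of the single shared ncall (illustrative only, not a patch against the actual imatrix code; the names here are made up):

#include <vector>

// Hypothetical per-expert statistics for one merged 3D expert tensor.
// values is laid out as [n_expert][n_embd] and each expert keeps its own call count,
// so normalization can use the number of rows that actually hit that expert.
struct expert_stats {
    std::vector<float> values;   // n_expert * n_embd accumulated x^2 sums
    std::vector<int>   ncalls;   // one call count per expert
};

// Accumulate one batch: rows is the number of tokens, top_k the experts used per token,
// ids[row*top_k + k] the selected expert, x the activations feeding the expert tensor.
void accumulate(expert_stats & s, const float * x, const int * ids,
                int rows, int top_k, int n_expert, int n_embd) {
    s.values.resize((size_t) n_expert * n_embd, 0.0f);
    s.ncalls.resize(n_expert, 0);
    for (int row = 0; row < rows; ++row) {
        for (int k = 0; k < top_k; ++k) {
            const int ex = ids[row*top_k + k];
            s.ncalls[ex] += 1;                       // per-expert count, not a shared ncall
            const float * xr = x + (size_t) row * n_embd;
            for (int j = 0; j < n_embd; ++j) {
                s.values[(size_t) ex*n_embd + j] += xr[j]*xr[j];
            }
        }
    }
}
// At load time each expert's slice would then be divided by its own ncalls[ex],
// avoiding the 16x over-division described above.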

@ggerganov I don't know if it's worth making a separate issue for this, or if it's a problem introduced only when dbrx was merged and it used to work correctly (ignoring the uniformity assumption) before? IIRC, there was something about the expert tensors all being separate before this...

@jukofyork (Contributor) commented May 5, 2024

My head hurts thinking about this, but I'm pretty sure (2) is correct, via this example:

If only 1 expert were selected the counts would look like this:

load_imatrix: loaded data (size = 172032, ncall =    91) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.ffn_gate_inp.weight'

So for every 16 samples that went into the attention weighting factors 1 sample went into each of the expert's weighting factors (on average).

If all 16 experts were selected, the counts would look like this:

load_imatrix: loaded data (size = 172032, ncall =    1456) for 'blk.38.ffn_down_exps.weight'
load_imatrix: loaded data (size =   6144, ncall =     91) for 'blk.38.ffn_gate_inp.weight'

So for every 1 sample that went into the attention weighting factors 1 sample went into each of the expert's weighting factors also.

In both cases it looks like the expert's weighting factors are getting divided by 16 times more than they should be.

I'm trying the *static_cast<float>(n_as) fix now to see if anything about the dbrx quant changes...

What difference does it make to the code in ggml-quants.c:

        const float * xbl = x + QK_K*ibl;
        float sumx2 = 0;
        for (int i = 0; i < QK_K; ++i) sumx2 += xbl[i]*xbl[i];
        float sigma2 = 2*sumx2/QK_K;

        for (int ib = 0; ib < QK_K/32; ++ib) {
            const float * xb = xbl + 32*ib;
            if (quant_weights) {
                const float * qw = quant_weights + QK_K*ibl + 32*ib;
                for (int i = 0; i < 32; ++i) weight[i] = qw[i] * sqrtf(sigma2 + xb[i]*xb[i]);
            } else {
                for (int i = 0; i < 32; ++i) weight[i] = xb[i]*xb[i];
            }

Considering that the quantization is per-tensor anyway? The qw[i] * sqrtf(sigma2 + xb[i]*xb[i]) makes me think the wrong magnitude of qw[i] might have some effect that isn't just cancelled out, but that code is so obtuse I can't see what it is doing really...
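
For reference, my reading of the weighting in the snippet above, restated as a formula (illustrative only, with $q_i$ the corresponding quant_weights entry):

$$\sigma^2 = \frac{2}{QK\_K}\sum_i x_i^2, \qquad
w_i = \begin{cases} q_i \, \sqrt{\sigma^2 + x_i^2} & \text{with an imatrix} \\ x_i^2 & \text{without} \end{cases}$$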

@jukofyork (Contributor) commented May 5, 2024

The experts should have their own counts as it's very likely their selection frequencies (and softmax gate-weight factors) won't be uniformly distributed. Obviously this will be very hard to add now without breaking other stuff.

On further thought maybe this doesn't matter as the *n_as fix will mean that when for (auto& v : e) v /= ncall gets run in quantize the uniformity of the distribution won't matter (by thinking through the cases where one expert always gets activated vs all equally activated anyway).

On even further thought, is this actually intentional or accidental? If the old code created a diagonal Hessian approximation per expert and now it's all lumped into one huge tensor, the experts that don't get selected and/or have a lower softmax gate weight are going to have their importance downgraded? Is that what is really wanted or not?

It looks like this is the correct PR for this: #6387

Successfully merging this pull request may close these issues.

Add support for DBRX models: dbrx-base and dbrx-instruct