
Mistral large GPTQ model inference problem #308

Open
drdaliang opened this issue Aug 14, 2024 · 3 comments
Labels
investigation needed (need further investigation)

Comments

@drdaliang

Hello,

I am using ScaleLLM to run inference on the Mistral-Large-Instruct-2407-GPTQ model, and the output is all commas, like this:

You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.
word count: 19, token count: 29
hi
word count: 1, token count: 8
, , , ,,
word count: 0, token count: 15, tokens used: 46, model: mistral-large-latest(Mistral-Large-Instruct-2407-GPTQ)
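For reference, here is a minimal sketch of the request that produces this, using Python's requests library against the server's OpenAI-compatible endpoint. The host/port come from the logs below; the model name and messages are reconstructed from the output above, so treat them as assumptions:

```python
import requests

# Repro sketch against the ScaleLLM OpenAI-compatible endpoint
# (port 8080 and the /v1/chat/completions route appear in the logs below;
# the model name "mistral-large-latest" is taken from the output above).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mistral-large-latest",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "hi"},
        ],
    },
)
# Response shape assumes the standard OpenAI chat-completions schema.
print(resp.json()["choices"][0]["message"]["content"])  # prints ", , , ,," here
```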

I have successfully run the same local model with vLLM and SGLang. I got the model from this URL:

https://huggingface.co/TechxGenus/Mistral-Large-Instruct-2407-GPTQ
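For comparison, a minimal sketch of the working vLLM baseline might look like the following. The exact arguments are assumptions: the model path matches the one in the logs below, and tensor_parallel_size=4 matches the four-GPU setup there; adjust both for your environment.

```python
from vllm import LLM, SamplingParams

# Sketch of the working vLLM baseline (arguments are assumptions).
# GPTQ quantization is declared explicitly here, though vLLM can also
# detect it from the checkpoint's config.
llm = LLM(
    model="/home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,
)
out = llm.generate(["hi"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)  # coherent text here, unlike the ScaleLLM output
```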

Logs when running the model:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240814 22:35:40.531607 162675 llm_handler.cpp:171] Creating engine with devices: cuda:0,cuda:1,cuda:2,cuda:3
W20240814 22:35:43.234417 162675 model_loader.cpp:301] Overwriting dtype from bfloat16 to float16 for quantization
I20240814 22:35:43.234916 162675 llm_engine.cpp:138] Initializing model from: /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ
I20240814 22:35:43.234936 162675 model_loader.cpp:172] Using fast tokenizer.
I20240814 22:35:43.281715 162675 llm_engine.cpp:156] Block info, block_size: 16, n_local_kv_heads: 2, head_dim: 128, n_layers: 88, dtype: Half
I20240814 22:35:43.283596 162675 llm_engine.cpp:175] Initializing model with ModelArgs: [model_type: mistral, dtype: float16, hidden_size: 12288, hidden_act: silu, intermediate_size: 28672, n_layers: 88, head_dim: 128, n_heads: 96, n_kv_heads: 8, vocab_size: 32768, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 1e+06, rope_scaling_rope_type: , rope_scaling_factor: 0, rope_scaling_low_freq_factor: 0, rope_scaling_high_freq_factor: 0, rope_scaling_original_max_position_embeddings: 0, rotary_pct: 1, max_position_embeddings: 32768, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, linear_bias: 0, qkv_bias: 0, residual_post_layernorm: 0]
I20240814 22:35:43.283625 162675 llm_engine.cpp:176] Initializing model with quant args: QuantArgs: [quant_method: gptq, bits: 4, group_size: 128, desc_act: 1, true_sequential: 1]
I20240814 22:35:43.283633 162675 llm_engine.cpp:177] Initializing model with tokenizer args: TokenizerArgs: [tokenizer_type: sentencepiece, vocab_file: tokenizer.model, pattern: ]
I20240814 22:35:43.615725 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00001-of-00014.safetensors
I20240814 22:35:44.853797 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00002-of-00014.safetensors
I20240814 22:35:45.887975 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00003-of-00014.safetensors
I20240814 22:35:46.955483 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00004-of-00014.safetensors
I20240814 22:35:48.005149 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00005-of-00014.safetensors
I20240814 22:35:49.047144 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00006-of-00014.safetensors
I20240814 22:35:50.051784 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00007-of-00014.safetensors
I20240814 22:35:51.050997 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00008-of-00014.safetensors
I20240814 22:35:52.106201 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00009-of-00014.safetensors
I20240814 22:35:53.120800 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00010-of-00014.safetensors
I20240814 22:35:54.188920 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00011-of-00014.safetensors
I20240814 22:35:55.185894 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00012-of-00014.safetensors
I20240814 22:35:56.136348 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00013-of-00014.safetensors
I20240814 22:35:57.192777 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00014-of-00014.safetensors
I20240814 22:35:57.555701 162675 llm_engine.cpp:305] cuda:0: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555797 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555825 162675 llm_engine.cpp:305] cuda:1: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555845 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555866 162675 llm_engine.cpp:305] cuda:2: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555889 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555910 162675 llm_engine.cpp:305] cuda:3: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555932 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555955 162675 llm_engine.cpp:122] Initializing kv cache with size: 2.76 GB
I20240814 22:35:57.555976 162675 llm_engine.cpp:333] Initializing kv cache with shape: [2056 16 2 128]
I20240814 22:35:57.666657 162675 llm_engine.cpp:236] Capturing CUDA graphs: num_decoding_tokens: 1, batch sizes: 1 2 4 8 16 24 32 48 64
I20240814 22:36:00.271500 162675 llm_handler.cpp:224] Using default chat template for model type: mistral
INFO: Started server process [162675]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: 172.24.21.213:53482 - "POST /v1/chat/completions HTTP/1.1" 200 OK

What could be the problem?

@guocuimi
Collaborator

Thanks for reporting the issue. Looking into it.

@guocuimi added the investigation needed label on Aug 14, 2024
@guocuimi
Collaborator

guocuimi commented Aug 14, 2024

I've reproduced the issue on my end and identified two potential problems:

1. Stale chat template for Mistral.
2. Garbage output from the GPTQ kernel; the current kernel doesn't support desc_act=True (a quick way to check this flag is sketched after this list).
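A quick way to confirm the desc_act setting is to read it out of the checkpoint's quantization config. This is a sketch under the usual GPTQ layout: newer exports put the fields under "quantization_config" in config.json, while older AutoGPTQ exports use a separate quantize_config.json.

```python
import json
from pathlib import Path

# Sketch: read the GPTQ settings from the checkpoint's config files.
model_dir = Path("Mistral-Large-Instruct-2407-GPTQ")  # adjust to your local path
cfg = json.loads((model_dir / "config.json").read_text())
quant = cfg.get("quantization_config") or json.loads(
    (model_dir / "quantize_config.json").read_text()
)
# desc_act=True (activation reordering) is the setting the current
# ScaleLLM GPTQ kernel does not handle.
print({k: quant.get(k) for k in ("quant_method", "bits", "group_size", "desc_act")})
```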

A fix is on the way; you should be able to get a working version in the next release (ETA: end of this week).

@guocuimi
Collaborator

Just a quick update: a new Marlin kernel with desc_act support has landed. However, additional time is needed to thoroughly test scenarios where tp > 1. The work involved is more extensive than initially anticipated. The new ETA for the release is 08/21.
