
Mistral large GPTQ model inference problem #308

Open
drdaliang opened this issue Aug 14, 2024 · 3 comments
Labels
investigation needed (need further investigation)

Comments

@drdaliang

Hello,

I am using ScaleLLM to run inference on the Mistral-Large-Instruct-2407-GPTQ model, and the output is all commas, like this:

You are a helpful assistant. You can help me by answering my questions. You can also ask me questions.
word count: 19, token count: 29
hi
word count: 1, token count: 8
, , , ,,
word count: 0, token count: 15, tokens used: 46, model: mistral-large-latest(Mistral-Large-Instruct-2407-GPTQ)
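For reference, here is a minimal sketch of the request that produces this, using Python's requests library against the server's OpenAI-compatible endpoint. The host/port come from the logs below; the model name and messages are reconstructed from the output above, so treat them as assumptions:

```python
import requests

# Repro sketch against the ScaleLLM OpenAI-compatible endpoint
# (port 8080 and the /v1/chat/completions route appear in the logs below;
# the model name "mistral-large-latest" is taken from the output above).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "mistral-large-latest",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "hi"},
        ],
    },
)
# Response shape assumes the standard OpenAI chat-completions schema.
print(resp.json()["choices"][0]["message"]["content"])  # prints ", , , ,," here
```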

I have successfully run the same local model with vLLM and SGLang. I got the model from this URL:

https://huggingface.co/TechxGenus/Mistral-Large-Instruct-2407-GPTQ
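For comparison, a minimal sketch of the working vLLM baseline might look like the following. The exact arguments are assumptions: the model path matches the one in the logs below, and tensor_parallel_size=4 matches the four-GPU setup there; adjust both for your environment.

```python
from vllm import LLM, SamplingParams

# Sketch of the working vLLM baseline (arguments are assumptions).
# GPTQ quantization is declared explicitly here, though vLLM can also
# detect it from the checkpoint's config.
llm = LLM(
    model="/home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ",
    quantization="gptq",
    tensor_parallel_size=4,
)
out = llm.generate(["hi"], SamplingParams(max_tokens=32))
print(out[0].outputs[0].text)  # coherent text here, unlike the ScaleLLM output
```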

Logs when running the model:

WARNING: Logging before InitGoogleLogging() is written to STDERR
I20240814 22:35:40.531607 162675 llm_handler.cpp:171] Creating engine with devices: cuda:0,cuda:1,cuda:2,cuda:3
W20240814 22:35:43.234417 162675 model_loader.cpp:301] Overwriting dtype from bfloat16 to float16 for quantization
I20240814 22:35:43.234916 162675 llm_engine.cpp:138] Initializing model from: /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ
I20240814 22:35:43.234936 162675 model_loader.cpp:172] Using fast tokenizer.
I20240814 22:35:43.281715 162675 llm_engine.cpp:156] Block info, block_size: 16, n_local_kv_heads: 2, head_dim: 128, n_layers: 88, dtype: Half
I20240814 22:35:43.283596 162675 llm_engine.cpp:175] Initializing model with ModelArgs: [model_type: mistral, dtype: float16, hidden_size: 12288, hidden_act: silu, intermediate_size: 28672, n_layers: 88, head_dim: 128, n_heads: 96, n_kv_heads: 8, vocab_size: 32768, rms_norm_eps: 1e-05, layer_norm_eps: 0, rotary_dim: 0, rope_theta: 1e+06, rope_scaling_rope_type: , rope_scaling_factor: 0, rope_scaling_low_freq_factor: 0, rope_scaling_high_freq_factor: 0, rope_scaling_original_max_position_embeddings: 0, rotary_pct: 1, max_position_embeddings: 32768, bos_token_id: 1, eos_token_id: 2, use_parallel_residual: 0, attn_qkv_clip: 0, attn_qk_ln: 0, attn_alibi: 0, alibi_bias_max: 0, no_bias: 0, linear_bias: 0, qkv_bias: 0, residual_post_layernorm: 0]
I20240814 22:35:43.283625 162675 llm_engine.cpp:176] Initializing model with quant args: QuantArgs: [quant_method: gptq, bits: 4, group_size: 128, desc_act: 1, true_sequential: 1]
I20240814 22:35:43.283633 162675 llm_engine.cpp:177] Initializing model with tokenizer args: TokenizerArgs: [tokenizer_type: sentencepiece, vocab_file: tokenizer.model, pattern: ]
I20240814 22:35:43.615725 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00001-of-00014.safetensors
I20240814 22:35:44.853797 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00002-of-00014.safetensors
I20240814 22:35:45.887975 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00003-of-00014.safetensors
I20240814 22:35:46.955483 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00004-of-00014.safetensors
I20240814 22:35:48.005149 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00005-of-00014.safetensors
I20240814 22:35:49.047144 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00006-of-00014.safetensors
I20240814 22:35:50.051784 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00007-of-00014.safetensors
I20240814 22:35:51.050997 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00008-of-00014.safetensors
I20240814 22:35:52.106201 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00009-of-00014.safetensors
I20240814 22:35:53.120800 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00010-of-00014.safetensors
I20240814 22:35:54.188920 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00011-of-00014.safetensors
I20240814 22:35:55.185894 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00012-of-00014.safetensors
I20240814 22:35:56.136348 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00013-of-00014.safetensors
I20240814 22:35:57.192777 162675 model_loader.cpp:37] Loading model weights from /home/shiyu/Documents/mistral-large-gptq/Mistral-Large-Instruct-2407-GPTQ/model-00014-of-00014.safetensors
I20240814 22:35:57.555701 162675 llm_engine.cpp:305] cuda:0: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555797 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555825 162675 llm_engine.cpp:305] cuda:1: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555845 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555866 162675 llm_engine.cpp:305] cuda:2: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555889 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555910 162675 llm_engine.cpp:305] cuda:3: available memory: 5.12 GB, total memory: 23.55 GB
I20240814 22:35:57.555932 162675 llm_engine.cpp:309] Using max_memory_utilization: 0.9, max_cache_size: 0.00 B
I20240814 22:35:57.555955 162675 llm_engine.cpp:122] Initializing kv cache with size: 2.76 GB
I20240814 22:35:57.555976 162675 llm_engine.cpp:333] Initializing kv cache with shape: [2056 16 2 128]
I20240814 22:35:57.666657 162675 llm_engine.cpp:236] Capturing CUDA graphs: num_decoding_tokens: 1, batch sizes: 1 2 4 8 16 24 32 48 64
I20240814 22:36:00.271500 162675 llm_handler.cpp:224] Using default chat template for model type: mistral
INFO: Started server process [162675]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
INFO: 172.24.21.213:53482 - "POST /v1/chat/completions HTTP/1.1" 200 OK

What could be the problem?

@guocuimi
Collaborator

Thanks for reporting the issue. Looking into it.

@guocuimi added the investigation needed label on Aug 14, 2024
@guocuimi
Collaborator

guocuimi commented Aug 14, 2024

I've reproduced the issue on my end and identified two potential problems:

1. Stale chat template for Mistral.
2. Garbage output from the GPTQ kernel; the current kernel doesn't support desc_act=True (a quick way to check this flag is sketched after this list).
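A quick way to confirm the desc_act setting is to read it out of the checkpoint's quantization config. This is a sketch under the usual GPTQ layout: newer exports put the fields under "quantization_config" in config.json, while older AutoGPTQ exports use a separate quantize_config.json.

```python
import json
from pathlib import Path

# Sketch: read the GPTQ settings from the checkpoint's config files.
model_dir = Path("Mistral-Large-Instruct-2407-GPTQ")  # adjust to your local path
cfg = json.loads((model_dir / "config.json").read_text())
quant = cfg.get("quantization_config") or json.loads(
    (model_dir / "quantize_config.json").read_text()
)
# desc_act=True (activation reordering) is the setting the current
# ScaleLLM GPTQ kernel does not handle.
print({k: quant.get(k) for k in ("quant_method", "bits", "group_size", "desc_act")})
```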

A fix is on the way; you should be able to get a working version in the next release (ETA: end of this week).

@guocuimi
Collaborator

Just a quick update: a new Marlin kernel with desc_act support has landed. However, additional time is needed to thoroughly test scenarios where tp > 1. The work involved is more extensive than initially anticipated. The new ETA for the release is 08/21.
