Support Deepseek-V2 #4650
Conversation
ERROR 05-08 20:22:08 worker_base.py:145] ValueError: Model architectures ['DeepseekV2ForCausalLM'] are not supported for now. Supported architectures: ['AquilaModel', 'AquilaForCausalLM', 'BaiChuanForCausalLM', 'BaichuanForCausalLM', 'BloomForCausalLM', 'ChatGLMModel', 'ChatGLMForConditionalGeneration', 'CohereForCausalLM', 'DbrxForCausalLM', 'DeciLMForCausalLM', 'DeepseekForCausalLM', 'FalconForCausalLM', 'GemmaForCausalLM', 'GPT2LMHeadModel', 'GPTBigCodeForCausalLM', 'GPTJForCausalLM', 'GPTNeoXForCausalLM', 'InternLMForCausalLM', 'InternLM2ForCausalLM', 'JAISLMHeadModel', 'LlamaForCausalLM', 'LlavaForConditionalGeneration', 'LLaMAForCausalLM', 'MistralForCausalLM', 'MixtralForCausalLM', 'QuantMixtralForCausalLM', 'MptForCausalLM', 'MPTForCausalLM', 'MiniCPMForCausalLM', 'OlmoForCausalLM', 'OPTForCausalLM', 'OrionForCausalLM', 'PhiForCausalLM', 'Phi3ForCausalLM', 'QWenLMHeadModel', 'Qwen2ForCausalLM', 'Qwen2MoeForCausalLM', 'RWForCausalLM', 'StableLMEpochForCausalLM', 'StableLmForCausalLM', 'Starcoder2ForCausalLM', 'XverseForCausalLM']
It seems the model architecture is not supported in vLLM.
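As a quick sanity check, a minimal sketch of how to ask an installed vLLM which architectures it registers; the `ModelRegistry` helper and its `get_supported_archs()` method are assumed to be available under `vllm.model_executor.models`:

```python
# Sketch: check whether the installed vLLM build registers DeepSeek-V2.
from vllm.model_executor.models import ModelRegistry

supported = ModelRegistry.get_supported_archs()
print("DeepseekV2ForCausalLM" in supported)  # False on builds without this PR
```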
What's the reason it is not supported in this PR?
Hi, with only MHA, is it possible to reach max_model_len = 128k? In my test, only about 12k seemed possible.
The internal inference implementation supports MLA. The vLLM implementation here is more about enabling support quickly and matching the model parameters with the code, so its efficiency for LLM serving is not high enough. I think the current PR could be reviewed and merged asap; the community can then consider implementing an integrated version.
Hi @zwd003 May you merge the latest main branch and fix the conflicts? Thanks.
Is MLA support currently under development?
ok
Hi @zwd003, this error occurred during the deployment process. How can it be solved? Thanks! (RayWorkerWrapper pid=52311) ERROR 05-11 18:04:33 worker_base.py:145] File "/opt/vllm/vllm/model_executor/models/deepseek_v2.py", line 156, in forward
I encountered the same error.
|
Thanks! :D
Hello, I encountered this error when the QPS was increased to 2.
Could you show me the lines about KV compression? Thanks.
The following error occurred when loading the model: Cache shape torch.Size([163840, 64]) [repeated 6x across cluster] Process finished with exit code 1
Any update? Looking forward to it.
vllm/config.py (outdated)
@@ -250,6 +250,9 @@ def get_hidden_size(self) -> int:
        return self.hf_text_config.hidden_size

    def get_head_size(self) -> int:
        if hasattr(self.hf_text_config, "model_type") and self.hf_text_config.model_type == 'deepseek_v2':
Can you add the `head_dim` to the huggingface config instead of hard coding this here?
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com>
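A minimal sketch of what the reviewer is asking for, assuming the HuggingFace config exposes a `head_dim` field (with a fallback to the usual convention); this is illustrative, not the code that was merged:

```python
# Illustrative only: read the head size from the HF config rather than
# special-casing deepseek_v2 inside vllm/config.py.
def get_head_size(self) -> int:
    # Prefer an explicit head_dim if the checkpoint's config provides one.
    head_dim = getattr(self.hf_text_config, "head_dim", None)
    if head_dim is not None:
        return head_dim
    # Otherwise derive it from hidden_size / num_attention_heads as usual.
    return (self.hf_text_config.hidden_size //
            self.hf_text_config.num_attention_heads)
```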
I tested this commit and it solves the triton cache issue: #6140
I finally figured out that one previous commit works without this error; just try that one.
Has anybody had luck running the 236B DeepSeek V2 on vLLM yet? I was able to get the Lite Instruct model to run, but I can't get the full Instruct model to run, even on 8 H100s. I get told there's no available memory for cache blocks. I've been using these params:
Example error from logs:
I understand this is an enormous model, but their docs on Hugging Face say "If you want to utilize DeepSeek-Coder-V2 in BF16 format for inference, 80GB*8 GPUs are required." and I do have 8 80GB GPUs.
--max-model-len is too large; shrink it to less than 18000 together with all related args and retry.
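For reference, a hedged sketch of that advice using vLLM's offline Python API; the model id and numbers are illustrative assumptions, not a verified 8×H100 configuration:

```python
# Sketch only: keep the context well below 128k so the KV cache fits.
from vllm import LLM, SamplingParams

llm = LLM(
    model="deepseek-ai/DeepSeek-Coder-V2-Instruct",  # assumed model id
    tensor_parallel_size=8,        # shard the 236B weights across 8 GPUs
    trust_remote_code=True,
    max_model_len=16384,           # well under the ~18k suggested above
    gpu_memory_utilization=0.95,   # leave a little headroom for activations
)
out = llm.generate("def quicksort(arr):", SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)
```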
Thanks, @daoxian. That seems to have gotten me farther, with a new error about the KV cache. I don't see any options in vLLM to specify the KV cache size, but maybe I'm just missing something. UPDATE: I was able to get it to work with a low context of 4096. However, I really need the large-context capabilities of this model. It would be good to know what needs to be done to use something like 64k or 128k context with this huge model. I'm assuming I just need more VRAM, but I even tried 16 H100s split across two nodes and it still doesn't work, I'm guessing because pipeline parallelism isn't supported for DeepSeek yet.
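There is indeed no direct "KV cache size" flag; the two engine arguments that indirectly control the cache budget (option names assumed from vLLM's engine arguments, model id illustrative) are the GPU memory fraction and the cache dtype:

```python
# Sketch: the KV-cache budget is whatever is left of gpu_memory_utilization
# after the weights are loaded; storing the cache in fp8 roughly halves its size.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Lite-Chat",  # illustrative model id
    trust_remote_code=True,
    gpu_memory_utilization=0.95,  # fraction of each GPU vLLM may claim
    kv_cache_dtype="fp8",         # assumed option name; default is "auto"
    max_model_len=4096,
)
```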
@mphilippnv were you ever able to get past that 4k context limit? Does anyone have a better sense of what changes would need to be implemented to make that possible?
@gabrielgrant It's definitely a memory issue. After conversing with my hardware people more, I found out our system only supports pipeline parallelism using MPI; supposedly the Ray backend doesn't work in our system. Otherwise, with very large models like this, you basically need a multi-node deployment, for example 2 nodes with 8 GPUs each. Then you would set the pipeline-parallel flag accordingly (a sketch follows below this comment). Additionally, I was able to get it running at about 32k context using the neuralmagic FP8 version. These models are able to run on my 8-GPU setup and run pretty fast. Regardless, pipeline parallelism is still needed, I think, to get the max context out of it.
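For what it's worth, a hedged sketch of the 2-node layout described above (tensor parallel inside each node, pipeline parallel across nodes); it assumes a Ray cluster spanning both nodes, and the model id is illustrative:

```python
# Sketch only: 2 nodes x 8 GPUs, TP within a node, PP across nodes.
from vllm import LLM

llm = LLM(
    model="deepseek-ai/DeepSeek-V2-Chat",   # illustrative model id
    tensor_parallel_size=8,                 # 8 GPUs per node
    pipeline_parallel_size=2,               # 2 nodes
    distributed_executor_backend="ray",     # multi-node runs go through Ray
    trust_remote_code=True,
)
```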
@mphilippnv Is it still not able to run on 8×H100 with 128k context? Can you share your start command? Thanks.
@KylinMountain I'm running the vLLM OpenAI docker container v0.5.4. I'm passing these engine args:
That runs out of memory, saying "there's not enough memory for cache blocks". I've been able to get it to run with these settings:
The FP8 quantization helps. But notice the context is still 30k; I can't even get 64k running, let alone 120k, unfortunately.
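Roughly the kind of configuration being described, as a hedged sketch; the FP8 checkpoint id and the ~30k limit are taken from the discussion above and are not verified here:

```python
# Sketch: FP8-quantized checkpoint plus a reduced context length.
from vllm import LLM

llm = LLM(
    model="neuralmagic/DeepSeek-Coder-V2-Instruct-FP8",  # assumed checkpoint id
    tensor_parallel_size=8,
    trust_remote_code=True,
    max_model_len=30000,          # ~30k context, as reported above
    gpu_memory_utilization=0.95,
)
```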
Ah, cool, I hadn't seen that neuralmagic FP8 version. Very interesting that they claim it has better HumanEval+ performance than the original (bottom of the overview): "It achieves an average score of 88.98 on the HumanEval+ benchmark, whereas the unquantized model achieves 87.63." Curious if you've had a chance to try any of the more aggressive quantizations by [bartowski](https://huggingface.co/bartowski/DeepSeek-Coder-V2-Instruct-GGUF), LoneStriker, or legraphista?
@gabrielgrant I have not had a chance to try the more aggressive ones. I don't think vLLM supports GGUF yet, even though I know there is an open issue being worked on for it.
@mphilippnv AFAIU it just landed a few days ago! #5191
@mphilippnv A quick question on the parallelism setting:
Does TP+PP still work for an MoE model like DeepSeek-V2? If so, we can definitely use multi-host inference to support a higher context window size without quantization, right?
@Jeffwan I'm not sure. I haven't had a chance to really dive into getting our multi-node pipeline parallelism working. But yeah, if we can use multi-node, then I don't see why I wouldn't be able to get full context size across 16 80GB GPUs.
SGLang https://github.com/sgl-project/sglang/ now supports DeepSeek V2 MLA. It should be the fastest among all current open-source implementations. Give it a try! If you have any issues with usage, feel free to provide feedback.
@zhyncs Thank you very much. I will give it a try, but I want to know why this needs the radix cache disabled? I will run the 236B DeepSeek on 8×H100.
@KylinMountain You can enable it. It doesn't matter.
Any update on MLA?
OK, so I finally got my Helm chart set up so I can run pipeline parallelism on the large model. I have Ray set up on my pods and was able to serve 405B at full context. So I went to try DeepSeek 2.5 full and ran into this exception. It looks like maybe a Ray-specific exception and not vLLM related, but posting here anyway:
Here are my vllm args:
@mphilippnv can you try to see if #6751 helps?
@youkaichao this looks exactly like the issue. I guess I will wait for the merge. Hopefully it makes it into the next release. Thanks!
Can you try it first and report the benefit in #6751? That would help give us confidence to merge it.
@youkaichao sure. It will take me a day or so; I need to update my Dockerfile to install that branch and use it. Will report back on the issue you linked.
Co-authored-by: Philipp Moritz <pcmoritz@gmail.com> Signed-off-by: Alvant <alvasian@yandex.ru>
I am facing the same problem. What's the solution?
@SeveredAsif Upgrade your vLLM version.
Description:
This PR introduces support for the recently released DeepSeek-V2 model by DeepSeek-AI.
Key Updates:
Related Resources:
Todo:
We look forward to community feedback and suggestions to help us continuously improve and refine the integration and inference implementation of the DeepSeek-V2 model.
Testing
Note: Currently, only the inference method using the Multi-Head Attention (MHA) approach has been implemented; the efficient inference mode described in the paper (MLA) has not yet been realized.
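To make that note concrete, here is a conceptual sketch, not the PR's actual code, of why the MHA-style fallback is memory-hungry: the latent KV compression from the paper is applied, but the up-projected, full-size keys and values are what would be cached. RoPE and the decoupled rotary key are omitted, and the dimensions are illustrative, loosely following the DeepSeek-V2 config names.

```python
# Conceptual sketch of MLA executed as plain MHA (illustrative dimensions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLAAsPlainMHA(nn.Module):
    def __init__(self, hidden_size=5120, num_heads=128,
                 kv_lora_rank=512, head_dim=128):
        super().__init__()
        self.num_heads, self.head_dim = num_heads, head_dim
        self.q_proj = nn.Linear(hidden_size, num_heads * head_dim)
        self.kv_down = nn.Linear(hidden_size, kv_lora_rank)             # compress
        self.kv_up = nn.Linear(kv_lora_rank, 2 * num_heads * head_dim)  # expand
        self.o_proj = nn.Linear(num_heads * head_dim, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: [batch, seq, hidden]
        b, s, _ = x.shape
        q = self.q_proj(x).view(b, s, self.num_heads, self.head_dim)
        kv = self.kv_up(self.kv_down(x)).view(b, s, self.num_heads, 2 * self.head_dim)
        k, v = kv.split(self.head_dim, dim=-1)
        # In the MHA-style fallback, these full-size k/v are what the KV cache
        # would store, so memory scales like a standard 128-head MHA model
        # rather than the small latent vector that true MLA would cache.
        attn = F.scaled_dot_product_attention(
            q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2), is_causal=True)
        return self.o_proj(attn.transpose(1, 2).reshape(b, s, -1))
```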