[Question]: python run_vllm.py TypeError: 'type' object is not subscriptable #13
Comments
Hi @junior-zsy, thanks for your feedback. This issue is caused by the vllm version; we currently only support certain vllm versions. Please update MInference to 0.1.4.
@iofu728 I used multi-card inference and got an error. Additionally, I am using the Qwen2-7B-Instruct model. The official Qwen2-7B-Instruct model requires vllm 0.4.3 or later, otherwise long-context results may be problematic. Can I run inference with vllm 0.4.3?
I also have the same need for Qwen2-7B-Instruct running on vllm.
Hi @junior-zsy and @cyLi-Tiger, we fixed this issue in 0.1.4.post1. Please update MInference to version 0.1.4.post1. If the issue persists, feel free to reopen this issue.
Hi @iofu728, thanks for the fix! I tried python run_vllm.py, but still ran into an error.
Hi @cyLi-Tiger, thanks for your feedback. I tested vllm==0.4.1 with flash_attn==2.5.8 and vllm==0.4.3 with flash_attn==0.4.2, and both work well with Qwen2. Could you try reinstalling minference and flash_attn, and then run vllm again?
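For reference, here is a minimal environment check (standard library only; the distribution names minference, vllm, and flash_attn are assumptions) to confirm the installed versions against the combinations reported in this thread before rerunning:

```python
# Print installed versions so they can be compared against the setups reported
# as working in this thread (MInference >= 0.1.4.post1; vllm 0.4.1 + flash_attn 2.5.8,
# or vllm 0.4.3). Distribution names are assumed.
from importlib.metadata import PackageNotFoundError, version

for dist in ("minference", "vllm", "flash_attn"):
    try:
        print(f"{dist}: {version(dist)}")
    except PackageNotFoundError:
        print(f"{dist}: not installed in this environment")
```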
Describe the bug
python run_vllm.py
2024-07-05 15:25:04,647 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2024-07-05 15:25:05,859 INFO worker.py:1771 -- Started a local Ray instance.
INFO 07-05 15:25:11 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/xxx/model/Qwen2-7B-Instruct', tokenizer='/xxx/model/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:42823 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
INFO 07-05 15:25:22 model_runner.py:104] Loading model weights took 7.1441 GB
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:23 model_runner.py:104] Loading model weights took 7.1441 GB
INFO 07-05 15:25:38 ray_gpu_executor.py:240] # GPU blocks: 49742, # CPU blocks: 9362
Traceback (most recent call last):
File "/xxx/code/MInference/examples/run_vllm.py", line 31, in
llm = minference_patch(llm)
File "/xxx/code/MInference/minference/models_patch.py", line 39, in call
return self.patch_model(model)
File "/xxx/code/MInference/minference/models_patch.py", line 102, in patch_model
model = minference_patch_vllm(model, self.config.config_path)
File "/xxx/code/MInference/minference/patch.py", line 1072, in minference_patch_vllm
attn_forward = minference_vllm_forward(config)
File "/xxx/code/MInference/minference/modules/minference_forward.py", line 771, in minference_vllm_forward
attn_metadata: AttentionMetadata[FlashAttentionMetadata],
TypeError: 'type' object is not subscriptable
Dependencies:
vllm 0.4.0
flash-attn 2.5.9.post1
torch 2.1.2
triton 2.1.0
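For context, the failing annotation attn_metadata: AttentionMetadata[FlashAttentionMetadata] only works if AttentionMetadata supports class subscription (for example by subclassing typing.Generic), which appears not to be the case in vllm 0.4.0. Below is a minimal sketch of this failure mode using stand-in classes, not vllm's real ones:

```python
# Stand-in classes illustrating why subscripting a plain class raises
# "TypeError: 'type' object is not subscriptable" while a Generic class does not.
from typing import Generic, TypeVar

T = TypeVar("T")

class PlainMetadata:                # non-generic, like the class failing in the traceback
    pass

class GenericMetadata(Generic[T]):  # Generic provides __class_getitem__, so [] works
    pass

GenericMetadata[PlainMetadata]      # fine

try:
    PlainMetadata[GenericMetadata]  # raises the same TypeError as in the traceback
except TypeError as e:
    print(e)                        # 'type' object is not subscriptable
```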
Steps to reproduce
No response
Expected Behavior
No response
Logs
No response
Additional Information
No response