
[Question]: python run_vllm.py TypeError: 'type' object is not subscriptable #13

Closed
junior-zsy opened this issue Jul 5, 2024 · 6 comments · Fixed by #19

@junior-zsy

Describe the bug

python run_vllm.py
2024-07-05 15:25:04,647 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2024-07-05 15:25:05,859 INFO worker.py:1771 -- Started a local Ray instance.
INFO 07-05 15:25:11 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/xxx/model/Qwen2-7B-Instruct', tokenizer='/xxx/model/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:42823 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
INFO 07-05 15:25:22 model_runner.py:104] Loading model weights took 7.1441 GB
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:23 model_runner.py:104] Loading model weights took 7.1441 GB
INFO 07-05 15:25:38 ray_gpu_executor.py:240] # GPU blocks: 49742, # CPU blocks: 9362
Traceback (most recent call last):
File "/xxx/code/MInference/examples/run_vllm.py", line 31, in
llm = minference_patch(llm)
File "/xxx/code/MInference/minference/models_patch.py", line 39, in call
return self.patch_model(model)
File "/xxx/code/MInference/minference/models_patch.py", line 102, in patch_model
model = minference_patch_vllm(model, self.config.config_path)
File "/xxx/code/MInference/minference/patch.py", line 1072, in minference_patch_vllm
attn_forward = minference_vllm_forward(config)
File "/xxx/code/MInference/minference/modules/minference_forward.py", line 771, in minference_vllm_forward
attn_metadata: AttentionMetadata[FlashAttentionMetadata],
TypeError: 'type' object is not subscriptable

Dependencies:
vllm 0.4.0
flash-attn 2.5.9.post1
torch 2.1.2
triton 2.1.0

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

@junior-zsy junior-zsy added the bug Something isn't working label Jul 5, 2024
@iofu728 iofu728 self-assigned this Jul 5, 2024
@iofu728 iofu728 added question Further information is requested and removed bug Something isn't working labels Jul 5, 2024
iofu728 (Contributor) commented Jul 5, 2024

Hi @junior-zsy, thanks for your feedback.

This issue is caused by the vllm version. Currently, we support vllm==0.4.1.
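For context on the error itself: the failing line is the annotation attn_metadata: AttentionMetadata[FlashAttentionMetadata] in minference_forward.py. In vllm==0.4.0 the AttentionMetadata class apparently cannot be subscripted (it can in 0.4.1), so evaluating that annotation when the function is defined raises the TypeError. A minimal, self-contained sketch of the same failure mode (the class names below are illustrative, not vllm's actual definitions):

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class PlainMetadata:  # stands in for a class that does not support subscripting
    pass

class SubscriptableMetadata(Generic[T]):  # subclassing Generic makes Cls[X] valid
    pass

print(SubscriptableMetadata[int])  # works: a parameterized generic alias

try:
    PlainMetadata[int]
except TypeError as exc:
    print(exc)  # 'type' object is not subscriptable
```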

Please update minference to 0.1.4 with pip install minference==0.1.4, which also fixes some other bugs (#14). This update additionally makes minference potentially compatible with vllm==0.4.0.
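For reference, the overall flow in examples/run_vllm.py (as seen in the traceback) is to build the vLLM engine first and then apply the MInference patch to it. A rough sketch of that usage, following MInference's documented pattern; the model path, max length, and constructor arguments are placeholders and may differ between versions:

```python
from vllm import LLM, SamplingParams
from minference import MInference

# Build the vLLM engine first (model path and max length are placeholders).
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    max_model_len=128000,
    enforce_eager=True,
)

# Patch the engine in place; this is the minference_patch(llm) call from the traceback.
minference_patch = MInference("vllm", "Qwen/Qwen2-7B-Instruct")
llm = minference_patch(llm)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```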

@iofu728 iofu728 changed the title [Bug]: python run_vllm.py TypeError: 'type' object is not subscriptable [Question]: python run_vllm.py TypeError: 'type' object is not subscriptable Jul 5, 2024
junior-zsy (Author) commented Jul 5, 2024

@iofu728 I ran multi-GPU inference and got this error:
Traceback (most recent call last):
File "/xxx/code/MInference/examples/run_vllm.py", line 54, in
llm = minference_patch(llm)
File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/models_patch.py", line 39, in call
return self.patch_model(model)
File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/models_patch.py", line 102, in patch_model
model = minference_patch_vllm(model, self.config.config_path)
File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/patch.py", line 1091, in minference_patch_vllm
llm.llm_engine.model_executor.driver_worker.model_runner.model.apply(update_module)
AttributeError: 'RayWorkerWrapper' object has no attribute 'model_runner'
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Additionally, I am using the Qwen2-7B-Instruct model. The official Qwen2-7B-Instruct guidance requires vllm to be greater than 0.4.3; otherwise, long-context quality may be problematic. Can I use vllm 0.4.3 for inference?

@cyLi-Tiger

(Quoting @junior-zsy's comment above.)

I also have the same need for Qwen2-7B-Instruct running on vllm.

iofu728 (Contributor) commented Jul 7, 2024

Hi @junior-zsy and @cyLi-Tiger, we fixed this issue in 0.1.4.post1.

Please update MInference to version 0.1.4.post1. If the issue persists, feel free to reopen this issue.

@cyLi-Tiger

Hi @iofu728, thanks for the fix!

I tried python run_vllm.py again with vllm 0.4.1, torch 2.2.1, triton 2.2.0, flash-attn 2.5.9.post1, and minference 0.1.4.post1; the results for Llama-3-8B-Instruct-Gradient-1048k looked good. But I got an error when running with Qwen2-7B-Instruct. Have you tested vllm on Qwen2 models?

python run_vllm.py

INFO 07-09 02:59:37 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/xxx/weights/Qwen2/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/xxx/weights/Qwen2/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-09 02:59:37 utils.py:608] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
INFO 07-09 02:59:37 selector.py:28] Using FlashAttention backend.
INFO 07-09 02:59:45 model_runner.py:173] Loading model weights took 14.2487 GB
Traceback (most recent call last):
File "/xxx/experiment/kv_compress/MInference/examples/run_vllm.py", line 21, in
llm = LLM(
File "/xxx/experiment/vllm/vllm/entrypoints/llm.py", line 118, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 277, in from_engine_args
engine = cls(
File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 160, in init
self._initialize_kv_caches()
File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/xxx/experiment/vllm/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/worker/worker.py", line 138, in determine_num_available_blocks
self.model_runner.profile_run()
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/worker/model_runner.py", line 927, in profile_run
self.execute_model(seqs, kv_caches)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/worker/model_runner.py", line 848, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 315, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 252, in forward
hidden_states, residual = layer(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 205, in forward
hidden_states = self.self_attn(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 151, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/attention/layer.py", line 48, in forward
return self.impl.forward(query, key, value, kv_cache, attn_metadata,
File "/xxx/experiment/vllm/vllm/attention/backends/flash_attn.py", line 220, in forward
out = flash_attn_varlen_func(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1066, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 581, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 86, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

iofu728 (Contributor) commented Jul 9, 2024

(Quoting @cyLi-Tiger's comment above.)

Hi @cyLi-Tiger, thanks for your feedback. I tested vllm==0.4.1 with flash_attn==2.5.8 and vllm==0.4.3 with flash_attn==0.4.2, and both work well with Qwen2. Could you try reinstalling minference and flash_attn, and then running vllm again?
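As a quick sanity check before rerunning, you can print the versions that are actually installed in the environment; a small standard-library-only snippet (package names as published on PyPI):

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed versions of the packages discussed in this thread.
for pkg in ("vllm", "flash-attn", "minference", "torch", "triton"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```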
