
[Question]: python run_vllm.py TypeError: 'type' object is not subscriptable #13

Closed
junior-zsy opened this issue Jul 5, 2024 · 6 comments · Fixed by #19

@junior-zsy

Describe the bug

python run_vllm.py
2024-07-05 15:25:04,647 WARNING utils.py:580 -- Detecting docker specified CPUs. In previous versions of Ray, CPU detection in containers was incorrect. Please ensure that Ray has enough CPUs allocated. As a temporary workaround to revert to the prior behavior, set RAY_USE_MULTIPROCESSING_CPU_COUNT=1 as an env var before starting Ray. Set the env var: RAY_DISABLE_DOCKER_CPU_WARNING=1 to mute this warning.
2024-07-05 15:25:05,859 INFO worker.py:1771 -- Started a local Ray instance.
INFO 07-05 15:25:11 llm_engine.py:75] Initializing an LLM engine (v0.4.0) with config: model='/xxx/model/Qwen2-7B-Instruct', tokenizer='/xxx/model/Qwen2-7B-Instruct', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=2, disable_custom_all_reduce=True, quantization=None, enforce_eager=True, kv_cache_dtype=auto, device_config=cuda, seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
[W socket.cpp:436] [c10d] The server socket cannot be initialized on [::]:42823 (errno: 97 - Address family not supported by protocol).
[W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 selector.py:16] Using FlashAttention backend.
INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:17 pynccl_utils.py:45] vLLM is using nccl==2.18.1
(RayWorkerVllm pid=1499766) [W socket.cpp:663] [c10d] The client socket cannot be initialized to connect to [::ffff:10.178.172.129]:42823 (errno: 97 - Address family not supported by protocol).
INFO 07-05 15:25:22 model_runner.py:104] Loading model weights took 7.1441 GB
(RayWorkerVllm pid=1499766) INFO 07-05 15:25:23 model_runner.py:104] Loading model weights took 7.1441 GB
INFO 07-05 15:25:38 ray_gpu_executor.py:240] # GPU blocks: 49742, # CPU blocks: 9362
Traceback (most recent call last):
File "/xxx/code/MInference/examples/run_vllm.py", line 31, in
llm = minference_patch(llm)
File "/xxx/code/MInference/minference/models_patch.py", line 39, in call
return self.patch_model(model)
File "/xxx/code/MInference/minference/models_patch.py", line 102, in patch_model
model = minference_patch_vllm(model, self.config.config_path)
File "/xxx/code/MInference/minference/patch.py", line 1072, in minference_patch_vllm
attn_forward = minference_vllm_forward(config)
File "/xxx/code/MInference/minference/modules/minference_forward.py", line 771, in minference_vllm_forward
attn_metadata: AttentionMetadata[FlashAttentionMetadata],
TypeError: 'type' object is not subscriptable

Dependencies:
vllm 0.4.0
flash-attn 2.5.9.post1
torch 2.1.2
triton 2.1.0

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

@junior-zsy junior-zsy added the bug Something isn't working label Jul 5, 2024
@iofu728 iofu728 self-assigned this Jul 5, 2024
@iofu728 iofu728 added question Further information is requested and removed bug Something isn't working labels Jul 5, 2024
iofu728 (Contributor) commented Jul 5, 2024

Hi @junior-zsy, thanks for your feedback.

This issue is caused by the vllm version. Currently, we support vllm==0.4.1.
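For context on the error itself: the failing line is the annotation attn_metadata: AttentionMetadata[FlashAttentionMetadata] in minference_forward.py. In vllm==0.4.0 the AttentionMetadata class apparently cannot be subscripted (it can in 0.4.1), so evaluating that annotation when the function is defined raises the TypeError. A minimal, self-contained sketch of the same failure mode (the class names below are illustrative, not vllm's actual definitions):

```python
from typing import Generic, TypeVar

T = TypeVar("T")

class PlainMetadata:  # stands in for a class that does not support subscripting
    pass

class SubscriptableMetadata(Generic[T]):  # subclassing Generic makes Cls[X] valid
    pass

print(SubscriptableMetadata[int])  # works: a parameterized generic alias

try:
    PlainMetadata[int]
except TypeError as exc:
    print(exc)  # 'type' object is not subscriptable
```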

Please update minference to 0.1.4 with pip install minference==0.1.4, which also fixes some other bugs (#14). This update additionally makes minference potentially compatible with vllm==0.4.0.
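For reference, the overall flow in examples/run_vllm.py (as seen in the traceback) is to build the vLLM engine first and then apply the MInference patch to it. A rough sketch of that usage, following MInference's documented pattern; the model path, max length, and constructor arguments are placeholders and may differ between versions:

```python
from vllm import LLM, SamplingParams
from minference import MInference

# Build the vLLM engine first (model path and max length are placeholders).
llm = LLM(
    model="Qwen/Qwen2-7B-Instruct",
    max_model_len=128000,
    enforce_eager=True,
)

# Patch the engine in place; this is the minference_patch(llm) call from the traceback.
minference_patch = MInference("vllm", "Qwen/Qwen2-7B-Instruct")
llm = minference_patch(llm)

print(llm.generate(["Hello"], SamplingParams(max_tokens=16)))
```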

@iofu728 iofu728 changed the title [Bug]: python run_vllm.py TypeError: 'type' object is not subscriptable [Question]: python run_vllm.py TypeError: 'type' object is not subscriptable Jul 5, 2024
junior-zsy (Author) commented Jul 5, 2024

@iofu728 I ran multi-GPU inference and got this error:
Traceback (most recent call last):
File "/xxx/code/MInference/examples/run_vllm.py", line 54, in
llm = minference_patch(llm)
File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/models_patch.py", line 39, in call
return self.patch_model(model)
File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/models_patch.py", line 102, in patch_model
model = minference_patch_vllm(model, self.config.config_path)
File "/xxx/miniconda3/envs/minference/lib/python3.10/site-packages/minference/patch.py", line 1091, in minference_patch_vllm
llm.llm_engine.model_executor.driver_worker.model_runner.model.apply(update_module)
AttributeError: 'RayWorkerWrapper' object has no attribute 'model_runner'
[W CudaIPCTypes.cpp:16] Producer process has been terminated before all shared CUDA tensors released. See Note [Sharing CUDA tensors]

Additionally, I am using the Qwen2-7B-Instruct model. The official Qwen2-7B-Instruct guidance requires vllm to be greater than 0.4.3; otherwise, long-context quality may be problematic. Can I use vllm 0.4.3 for inference?

@cyLi-Tiger

(Quoting @junior-zsy's comment above.)

I also have the same need for Qwen2-7B-Instruct running on vllm.

iofu728 (Contributor) commented Jul 7, 2024

Hi @junior-zsy and @cyLi-Tiger, we fixed this issue in 0.1.4.post1.

Please update MInference to version 0.1.4.post1. If the issue persists, feel free to reopen this issue.

@cyLi-Tiger

Hi @iofu728, thanks for the fix!

I tried python run_vllm.py again with vllm 0.4.1, torch 2.2.1, triton 2.2.0, flash-attn 2.5.9.post1, and minference 0.1.4.post1; the results for Llama-3-8B-Instruct-Gradient-1048k looked good. But I got an error when running with Qwen2-7B-Instruct. Have you tested vllm on Qwen2 models?

python run_vllm.py

INFO 07-09 02:59:37 llm_engine.py:98] Initializing an LLM engine (v0.4.1) with config: model='/xxx/weights/Qwen2/Qwen2-7B-Instruct', speculative_config=None, tokenizer='/xxx/weights/Qwen2/Qwen2-7B-Instruct', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.bfloat16, max_seq_len=128000, download_dir=None, load_format=auto, tensor_parallel_size=1, disable_custom_all_reduce=False, quantization=None, enforce_eager=True, kv_cache_dtype=auto, quantization_param_path=None, device_config=cuda, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), seed=0)
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
INFO 07-09 02:59:37 utils.py:608] Found nccl from library /usr/lib/x86_64-linux-gnu/libnccl.so.2
INFO 07-09 02:59:37 selector.py:28] Using FlashAttention backend.
INFO 07-09 02:59:45 model_runner.py:173] Loading model weights took 14.2487 GB
Traceback (most recent call last):
File "/xxx/experiment/kv_compress/MInference/examples/run_vllm.py", line 21, in
llm = LLM(
File "/xxx/experiment/vllm/vllm/entrypoints/llm.py", line 118, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 277, in from_engine_args
engine = cls(
File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 160, in init
self._initialize_kv_caches()
File "/xxx/experiment/vllm/vllm/engine/llm_engine.py", line 236, in _initialize_kv_caches
self.model_executor.determine_num_available_blocks())
File "/xxx/experiment/vllm/vllm/executor/gpu_executor.py", line 111, in determine_num_available_blocks
return self.driver_worker.determine_num_available_blocks()
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/worker/worker.py", line 138, in determine_num_available_blocks
self.model_runner.profile_run()
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/worker/model_runner.py", line 927, in profile_run
self.execute_model(seqs, kv_caches)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/worker/model_runner.py", line 848, in execute_model
hidden_states = model_executable(**execute_model_kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 315, in forward
hidden_states = self.model(input_ids, positions, kv_caches,
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 252, in forward
hidden_states, residual = layer(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 205, in forward
hidden_states = self.self_attn(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/model_executor/models/qwen2.py", line 151, in forward
attn_output = self.attn(q, k, v, kv_cache, attn_metadata)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
return forward_call(*args, **kwargs)
File "/xxx/experiment/vllm/vllm/attention/layer.py", line 48, in forward
return self.impl.forward(query, key, value, kv_cache, attn_metadata,
File "/xxx/experiment/vllm/vllm/attention/backends/flash_attn.py", line 220, in forward
out = flash_attn_varlen_func(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 1066, in flash_attn_varlen_func
return FlashAttnVarlenFunc.apply(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/torch/autograd/function.py", line 553, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 581, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
File "/xxx/anaconda3/envs/vllm_cu121/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 86, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

iofu728 (Contributor) commented Jul 9, 2024

(Quoting @cyLi-Tiger's comment above.)

Hi @cyLi-Tiger, thanks for your feedback. I tested vllm==0.4.1 with flash_attn==2.5.8 and vllm==0.4.3 with flash_attn==0.4.2, and both work well with Qwen2. Could you try reinstalling minference and flash_attn, and then running vllm again?
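As a quick sanity check before rerunning, you can print the versions that are actually installed in the environment; a small standard-library-only snippet (package names as published on PyPI):

```python
from importlib.metadata import PackageNotFoundError, version

# Print the installed versions of the packages discussed in this thread.
for pkg in ("vllm", "flash-attn", "minference", "torch", "triton"):
    try:
        print(f"{pkg}: {version(pkg)}")
    except PackageNotFoundError:
        print(f"{pkg}: not installed")
```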
