Ensure buckets do not exceed the batch token limit #206

Merged: 2 commits merged into habana_main from private/kzawora/max_num_batched_tokens on Aug 27, 2024

Conversation


kzawora-intel commented on Aug 27, 2024

This PR ensures we don't capture buckets that exceed the specified token budget (as set by the `max_num_batched_tokens` argument).
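For context, here is a minimal, self-contained sketch of the idea (illustrative only, not the actual `habana_model_runner` code; `warmup_range` and `generate_buckets` are hypothetical names): buckets are expanded from the (min, step, max_warmup) configs shown in the logs below, and any `(batch_size, seq_len)` pair whose product exceeds `max_num_batched_tokens` is dropped.

```python
# Minimal sketch of token-budget-aware bucket generation. Illustrative only,
# not the actual habana_model_runner implementation; warmup_range and
# generate_buckets are hypothetical helper names.
from itertools import product
from typing import List, Tuple


def warmup_range(cfg: Tuple[int, int, int]) -> List[int]:
    """Expand a (min, step, max) config: double from min up to step, then
    increase linearly by step up to max (mirrors the ramps in the logs)."""
    lo, step, hi = cfg
    values, current = [], lo
    while current < step:      # exponential ramp-up below the step size
        values.append(current)
        current *= 2
    current = step
    while current <= hi:       # linear ramp from step to max
        values.append(current)
        current += step
    return values


def generate_buckets(bs_cfg: Tuple[int, int, int],
                     seq_cfg: Tuple[int, int, int],
                     max_num_batched_tokens: int) -> List[Tuple[int, int]]:
    """Return all (batch_size, seq_len) buckets that fit the token budget."""
    return sorted((bs, seq)
                  for bs, seq in product(warmup_range(bs_cfg),
                                         warmup_range(seq_cfg))
                  if bs * seq <= max_num_batched_tokens)  # enforce the budget


if __name__ == "__main__":
    # Configs taken from the log below; with a 2048-token budget this sketch
    # happens to reproduce the 23 prompt and 31 decode buckets shown there.
    prompt = generate_buckets((1, 32, 64), (128, 128, 1024), 2048)
    decode = generate_buckets((1, 128, 256), (128, 128, 2048), 2048)
    print(len(prompt), prompt)
    print(len(decode), decode)
```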

Example for a token budget of 2048 (`--max-num-batched-tokens 2048`):

```
$ python vllm_test.py --max-num-batched-tokens 2048
WARNING 08-27 14:48:55 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
INFO 08-27 14:48:56 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.91MB/s]
INFO 08-27 14:48:57 profiler.py:62] Profiler enabled for: vllm-instance-d356a015eeb349f7a4650e00bf6ce976
WARNING 08-27 14:48:57 utils.py:566] Pin memory is not supported on HPU.
INFO 08-27 14:48:57 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:48:57 habana_model_runner.py:532] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 08-27 14:48:57 habana_model_runner.py:545] Generated 23 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
INFO 08-27 14:48:57 habana_model_runner.py:550] Decode bucket config (min, step, max_warmup) bs:[1, 128, 256], seq:[128, 128, 2048]
INFO 08-27 14:48:57 habana_model_runner.py:561] Generated 31 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056398260 KB
------------------------------------------------------------------------------
INFO 08-27 14:49:00 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:49:00 loader.py:284] Loading weights on hpu ...
INFO 08-27 14:49:00 weight_utils.py:224] Using model weights format ['*.bin']
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:06<00:00, 35.9MB/s]
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.15it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.15it/s]

INFO 08-27 14:49:08 habana_model_runner.py:441] Pre-loading model weights on hpu:0 took 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 298.9 MiB of host memory (485.6 GiB/1007 GiB used)
INFO 08-27 14:49:08 habana_model_runner.py:486] Wrapping in HPU Graph took 0 B of device memory (244.4 MiB/94.62 GiB used) and 0 B of host memory (485.6 GiB/1007 GiB used)
INFO 08-27 14:49:08 habana_model_runner.py:490] Loading model weights took in total 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 298.2 MiB of host memory (485.6 GiB/1007 GiB used)
```

We can see that no bucket exceeds 2048 tokens, and we have `(16, 128)` as well as `(1, 2048)`. Previously, with the default bucket settings, we would also capture the `(16, 2048)` and `(64, 2048)` cases, which should not be allowed.
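As a quick sanity check (an illustrative snippet, not part of the PR), a printed bucket list can be pasted into Python to assert that every bucket stays within the budget:

```python
# Illustrative check: every (batch_size, seq_len) bucket in the log should
# satisfy batch_size * seq_len <= max_num_batched_tokens. Only an excerpt of
# the prompt buckets from the 2048-budget run is shown here.
buckets = [(1, 1024), (2, 1024), (4, 512), (8, 256), (16, 128)]  # excerpt
max_num_batched_tokens = 2048

worst = max(bs * seq for bs, seq in buckets)
assert worst <= max_num_batched_tokens, f"bucket over budget: {worst} tokens"
print(f"largest bucket uses {worst} of {max_num_batched_tokens} tokens")
```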

With `--max-num-batched-tokens 32768`:

```
$ python vllm_test.py --max-num-batched-tokens 32768
WARNING 08-27 14:54:39 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
INFO 08-27 14:54:41 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-27 14:54:41 profiler.py:62] Profiler enabled for: vllm-instance-be8ab3101609425ba60df601dc9de3a6
WARNING 08-27 14:54:41 utils.py:566] Pin memory is not supported on HPU.
INFO 08-27 14:54:41 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:54:41 habana_model_runner.py:533] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 08-27 14:54:41 habana_model_runner.py:546] Generated 52 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (64, 128), (64, 256), (64, 384), (64, 512)]
INFO 08-27 14:54:41 habana_model_runner.py:551] Decode bucket config (min, step, max_warmup) bs:[1, 128, 256], seq:[128, 128, 2048]
INFO 08-27 14:54:41 habana_model_runner.py:562] Generated 95 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (8, 1280), (8, 1408), (8, 1536), (8, 1664), (8, 1792), (8, 1920), (8, 2048), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (16, 1280), (16, 1408), (16, 1536), (16, 1664), (16, 1792), (16, 1920), (16, 2048), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (64, 128), (64, 256), (64, 384), (64, 512), (128, 128), (128, 256), (256, 128)]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056398260 KB
------------------------------------------------------------------------------
INFO 08-27 14:54:45 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:54:45 loader.py:284] Loading weights on hpu ...
INFO 08-27 14:54:45 weight_utils.py:224] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.99it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.99it/s]

INFO 08-27 14:54:45 habana_model_runner.py:442] Pre-loading model weights on hpu:0 took 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 279.7 MiB of host memory (485.8 GiB/1007 GiB used)
INFO 08-27 14:54:46 habana_model_runner.py:487] Wrapping in HPU Graph took 0 B of device memory (244.4 MiB/94.62 GiB used) and 48 KiB of host memory (485.8 GiB/1007 GiB used)
INFO 08-27 14:54:46 habana_model_runner.py:491] Loading model weights took in total 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 279.6 MiB of host memory (485.8 GiB/1007 GiB used)
```

The max model length (2048) is not exceeded at low batch sizes, as seen in the `(1, 2048)` bucket, but high batch sizes can still be captured up to 32k batched tokens, as seen in the `(256, 128)` bucket.
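To make the arithmetic behind that observation explicit (numbers derived from the log above; the snippet itself is illustrative only):

```python
# With --max-num-batched-tokens 32768, buckets are kept as long as
# batch_size * seq_len <= 32768, while seq_len itself stays <= max_seq_len (2048).
budget = 32768
print(256 * 128)   # 32768  -> (256, 128) exactly fills the budget and is kept
print(16 * 2048)   # 32768  -> (16, 2048) also fits
print(32 * 2048)   # 65536  -> over budget, so (32, 2048) is not generated
```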

@mswiniarsk

LGTM

kzawora-intel merged commit aefd336 into habana_main on Aug 27, 2024
13 checks passed
szutenberg added a commit that referenced this pull request Aug 27, 2024
kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label on Sep 5, 2024
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request Sep 13, 2024
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request Sep 20, 2024
kzawora-intel deleted the private/kzawora/max_num_batched_tokens branch on October 7, 2024 at 12:55