Ensure buckets do not exceed the batch token limit #206

Merged: 2 commits merged into habana_main from private/kzawora/max_num_batched_tokens on Aug 27, 2024

Conversation


kzawora-intel commented on Aug 27, 2024

This PR ensures we don't capture buckets that exceed the specified token budget (as set by the `max_num_batched_tokens` argument).
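For context, here is a minimal, self-contained sketch of the idea (illustrative only, not the actual `habana_model_runner` code; `warmup_range` and `generate_buckets` are hypothetical names): buckets are expanded from the (min, step, max_warmup) configs shown in the logs below, and any `(batch_size, seq_len)` pair whose product exceeds `max_num_batched_tokens` is dropped.

```python
# Minimal sketch of token-budget-aware bucket generation. Illustrative only,
# not the actual habana_model_runner implementation; warmup_range and
# generate_buckets are hypothetical helper names.
from itertools import product
from typing import List, Tuple


def warmup_range(cfg: Tuple[int, int, int]) -> List[int]:
    """Expand a (min, step, max) config: double from min up to step, then
    increase linearly by step up to max (mirrors the ramps in the logs)."""
    lo, step, hi = cfg
    values, current = [], lo
    while current < step:      # exponential ramp-up below the step size
        values.append(current)
        current *= 2
    current = step
    while current <= hi:       # linear ramp from step to max
        values.append(current)
        current += step
    return values


def generate_buckets(bs_cfg: Tuple[int, int, int],
                     seq_cfg: Tuple[int, int, int],
                     max_num_batched_tokens: int) -> List[Tuple[int, int]]:
    """Return all (batch_size, seq_len) buckets that fit the token budget."""
    return sorted((bs, seq)
                  for bs, seq in product(warmup_range(bs_cfg),
                                         warmup_range(seq_cfg))
                  if bs * seq <= max_num_batched_tokens)  # enforce the budget


if __name__ == "__main__":
    # Configs taken from the log below; with a 2048-token budget this sketch
    # happens to reproduce the 23 prompt and 31 decode buckets shown there.
    prompt = generate_buckets((1, 32, 64), (128, 128, 1024), 2048)
    decode = generate_buckets((1, 128, 256), (128, 128, 2048), 2048)
    print(len(prompt), prompt)
    print(len(decode), decode)
```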

Example for a token budget of 2048 (`--max-num-batched-tokens 2048`):

```
$ python vllm_test.py --max-num-batched-tokens 2048
WARNING 08-27 14:48:55 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
INFO 08-27 14:48:56 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
generation_config.json: 100%|███████████████████████████████████████████████████████████████████████████████████████| 137/137 [00:00<00:00, 1.91MB/s]
INFO 08-27 14:48:57 profiler.py:62] Profiler enabled for: vllm-instance-d356a015eeb349f7a4650e00bf6ce976
WARNING 08-27 14:48:57 utils.py:566] Pin memory is not supported on HPU.
INFO 08-27 14:48:57 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:48:57 habana_model_runner.py:532] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 08-27 14:48:57 habana_model_runner.py:545] Generated 23 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
INFO 08-27 14:48:57 habana_model_runner.py:550] Decode bucket config (min, step, max_warmup) bs:[1, 128, 256], seq:[128, 128, 2048]
INFO 08-27 14:48:57 habana_model_runner.py:561] Generated 31 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (8, 128), (8, 256), (16, 128)]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056398260 KB
------------------------------------------------------------------------------
INFO 08-27 14:49:00 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:49:00 loader.py:284] Loading weights on hpu ...
INFO 08-27 14:49:00 weight_utils.py:224] Using model weights format ['*.bin']
pytorch_model.bin: 100%|██████████████████████████████████████████████████████████████████████████████████████████| 251M/251M [00:06<00:00, 35.9MB/s]
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.15it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  4.15it/s]

INFO 08-27 14:49:08 habana_model_runner.py:441] Pre-loading model weights on hpu:0 took 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 298.9 MiB of host memory (485.6 GiB/1007 GiB used)
INFO 08-27 14:49:08 habana_model_runner.py:486] Wrapping in HPU Graph took 0 B of device memory (244.4 MiB/94.62 GiB used) and 0 B of host memory (485.6 GiB/1007 GiB used)
INFO 08-27 14:49:08 habana_model_runner.py:490] Loading model weights took in total 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 298.2 MiB of host memory (485.6 GiB/1007 GiB used)
```

We can see that no bucket exceeds 2048 tokens, and we have `(16, 128)` as well as `(1, 2048)`. Previously, with the default bucket settings, we would also capture the `(16, 2048)` and `(64, 2048)` cases, which should not be allowed.
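As a quick sanity check (an illustrative snippet, not part of the PR), a printed bucket list can be pasted into Python to assert that every bucket stays within the budget:

```python
# Illustrative check: every (batch_size, seq_len) bucket in the log should
# satisfy batch_size * seq_len <= max_num_batched_tokens. Only an excerpt of
# the prompt buckets from the 2048-budget run is shown here.
buckets = [(1, 1024), (2, 1024), (4, 512), (8, 256), (16, 128)]  # excerpt
max_num_batched_tokens = 2048

worst = max(bs * seq for bs, seq in buckets)
assert worst <= max_num_batched_tokens, f"bucket over budget: {worst} tokens"
print(f"largest bucket uses {worst} of {max_num_batched_tokens} tokens")
```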

With `--max-num-batched-tokens 32768`:

```
$ python vllm_test.py --max-num-batched-tokens 32768
WARNING 08-27 14:54:39 _custom_ops.py:14] Failed to import from vllm._C with ModuleNotFoundError("No module named 'vllm._C'")
/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py:366: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
  warnings.warn(
No chat template is set for this tokenizer, falling back to a default class-level template. This is very error-prone, because models are often trained with templates different from the class default! Default chat templates are a legacy feature and will be removed in Transformers v4.43, at which point any code depending on them will stop working. We recommend setting a valid chat template before then to ensure that this model continues working without issues.
INFO 08-27 14:54:41 llm_engine.py:176] Initializing an LLM engine (v0.5.3.post1) with config: model='facebook/opt-125m', speculative_config=None, tokenizer='facebook/opt-125m', skip_tokenizer_init=False, tokenizer_mode=auto, revision=None, rope_scaling=None, rope_theta=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=2048, download_dir=None, load_format=LoadFormat.AUTO, tensor_parallel_size=1, pipeline_parallel_size=1, disable_custom_all_reduce=False, quantization=None, weights_load_device=hpu, enforce_eager=False, kv_cache_dtype=auto, quantization_param_path=None, device_config=hpu, decoding_config=DecodingConfig(guided_decoding_backend='outlines'), observability_config=ObservabilityConfig(otlp_traces_endpoint=None), seed=0, served_model_name=facebook/opt-125m, use_v2_block_manager=False, enable_prefix_caching=False)
INFO 08-27 14:54:41 profiler.py:62] Profiler enabled for: vllm-instance-be8ab3101609425ba60df601dc9de3a6
WARNING 08-27 14:54:41 utils.py:566] Pin memory is not supported on HPU.
INFO 08-27 14:54:41 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:54:41 habana_model_runner.py:533] Prompt bucket config (min, step, max_warmup) bs:[1, 32, 64], seq:[128, 128, 1024]
INFO 08-27 14:54:41 habana_model_runner.py:546] Generated 52 prompt buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (64, 128), (64, 256), (64, 384), (64, 512)]
INFO 08-27 14:54:41 habana_model_runner.py:551] Decode bucket config (min, step, max_warmup) bs:[1, 128, 256], seq:[128, 128, 2048]
INFO 08-27 14:54:41 habana_model_runner.py:562] Generated 95 decode buckets: [(1, 128), (1, 256), (1, 384), (1, 512), (1, 640), (1, 768), (1, 896), (1, 1024), (1, 1152), (1, 1280), (1, 1408), (1, 1536), (1, 1664), (1, 1792), (1, 1920), (1, 2048), (2, 128), (2, 256), (2, 384), (2, 512), (2, 640), (2, 768), (2, 896), (2, 1024), (2, 1152), (2, 1280), (2, 1408), (2, 1536), (2, 1664), (2, 1792), (2, 1920), (2, 2048), (4, 128), (4, 256), (4, 384), (4, 512), (4, 640), (4, 768), (4, 896), (4, 1024), (4, 1152), (4, 1280), (4, 1408), (4, 1536), (4, 1664), (4, 1792), (4, 1920), (4, 2048), (8, 128), (8, 256), (8, 384), (8, 512), (8, 640), (8, 768), (8, 896), (8, 1024), (8, 1152), (8, 1280), (8, 1408), (8, 1536), (8, 1664), (8, 1792), (8, 1920), (8, 2048), (16, 128), (16, 256), (16, 384), (16, 512), (16, 640), (16, 768), (16, 896), (16, 1024), (16, 1152), (16, 1280), (16, 1408), (16, 1536), (16, 1664), (16, 1792), (16, 1920), (16, 2048), (32, 128), (32, 256), (32, 384), (32, 512), (32, 640), (32, 768), (32, 896), (32, 1024), (64, 128), (64, 256), (64, 384), (64, 512), (128, 128), (128, 256), (256, 128)]
============================= HABANA PT BRIDGE CONFIGURATION =========================== 
 PT_HPU_LAZY_MODE = 1
 PT_RECIPE_CACHE_PATH = 
 PT_CACHE_FOLDER_DELETE = 0
 PT_HPU_RECIPE_CACHE_CONFIG = 
 PT_HPU_MAX_COMPOUND_OP_SIZE = 9223372036854775807
 PT_HPU_LAZY_ACC_PAR_MODE = 1
 PT_HPU_ENABLE_REFINE_DYNAMIC_SHAPES = 0
---------------------------: System Configuration :---------------------------
Num CPU Cores : 160
CPU RAM       : 1056398260 KB
------------------------------------------------------------------------------
INFO 08-27 14:54:45 selector.py:85] Using HabanaAttention backend.
INFO 08-27 14:54:45 loader.py:284] Loading weights on hpu ...
INFO 08-27 14:54:45 weight_utils.py:224] Using model weights format ['*.bin']
Loading pt checkpoint shards:   0% Completed | 0/1 [00:00<?, ?it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.99it/s]
Loading pt checkpoint shards: 100% Completed | 1/1 [00:00<00:00,  3.99it/s]

INFO 08-27 14:54:45 habana_model_runner.py:442] Pre-loading model weights on hpu:0 took 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 279.7 MiB of host memory (485.8 GiB/1007 GiB used)
INFO 08-27 14:54:46 habana_model_runner.py:487] Wrapping in HPU Graph took 0 B of device memory (244.4 MiB/94.62 GiB used) and 48 KiB of host memory (485.8 GiB/1007 GiB used)
INFO 08-27 14:54:46 habana_model_runner.py:491] Loading model weights took in total 238.9 MiB of device memory (244.4 MiB/94.62 GiB used) and 279.6 MiB of host memory (485.8 GiB/1007 GiB used)
```

The max model length (2048) is not exceeded at low batch sizes, as seen in the `(1, 2048)` bucket, but high batch sizes can still be captured up to 32k batched tokens, as seen in the `(256, 128)` bucket.
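To make the arithmetic behind that observation explicit (numbers derived from the log above; the snippet itself is illustrative only):

```python
# With --max-num-batched-tokens 32768, buckets are kept as long as
# batch_size * seq_len <= 32768, while seq_len itself stays <= max_seq_len (2048).
budget = 32768
print(256 * 128)   # 32768  -> (256, 128) exactly fills the budget and is kept
print(16 * 2048)   # 32768  -> (16, 2048) also fits
print(32 * 2048)   # 65536  -> over budget, so (32, 2048) is not generated
```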

@mswiniarsk

LGTM

kzawora-intel merged commit aefd336 into habana_main on Aug 27, 2024
13 checks passed
szutenberg added a commit that referenced this pull request Aug 27, 2024
kzawora-intel added the habana (Issues or PRs submitted by Habana Labs) label on Sep 5, 2024
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request Sep 13, 2024
zhouyu5 pushed a commit to zhouyu5/vllm-fork that referenced this pull request Sep 20, 2024
kzawora-intel deleted the private/kzawora/max_num_batched_tokens branch on October 7, 2024 at 12:55