Update README.md with changes related to LLaVA-NeXT multi-card support #221

Merged 2 commits on Sep 7, 2024
README.md — 40 changes: 37 additions & 3 deletions
@@ -106,7 +106,7 @@ The following table contains models and configurations we have validated on Gaudi
| CodeLlama-13B | ✔ | ✔ | ✔ | |
| Mixtral-8x7B | ✔ | ✔ | ✔ | ✔ |
| Mistral-7B | ✔ | ✔ | ✔ | ✔ |
| Llava-v1.6-Mistral-7B | ✔ | ✔ | ✔ | |
| Llava-v1.6-Mistral-7B | ✔ | ✔ | ✔ | ✔ |


## Running TGI with BF16 Precision
@@ -245,8 +245,6 @@ docker run -p 8080:80 \

In Llava-v1.6-Mistral-7B, an image usually accounts for about 2000 input tokens; for example, a 512x512 image is represented by 2800 tokens. `max-input-tokens` must therefore be larger than the number of tokens associated with the image, otherwise the image may be truncated. We set `BASE_IMAGE_TOKENS=2048` as the default image token value; this is also the minimum value of `max-input-tokens`. You can override the environment variable `BASE_IMAGE_TOKENS` to change this value. The warmup will generate graphs with input lengths from `BASE_IMAGE_TOKENS` to `max-input-tokens`. For Llava-v1.6-Mistral-7B, `max-batch-prefill-tokens` is 16384, which is calculated as follows: `prefill_batch_size` = `max-batch-prefill-tokens` / `max-input-tokens`.
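
As a quick sanity check of that calculation, here is a minimal sketch using the values from the commands below (the shell variables are only for illustration):

```bash
# Worked example with the values used in the commands below:
#   prefill_batch_size = max-batch-prefill-tokens / max-input-tokens
max_input_tokens=4096
max_batch_prefill_tokens=16384
echo $(( max_batch_prefill_tokens / max_input_tokens ))   # prints 4
```

In other words, a prefill batch of 4 requests at the maximum input length fits within the configured prefill budget.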

> Note: Multi-card Llava-v1.6-Mistral-7B inference is currently not supported.

```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
@@ -256,6 +254,8 @@ docker run -p 8080:80 \
-v $volume:/data \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
> Collaborator comment: We need to set PT_HPU_ENABLE_LAZY_COLLECTIVES to True for single card usage?

-e HF_HUB_ENABLE_HF_TRANSFER=1 \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
@@ -444,6 +444,39 @@ docker run -p 8080:80 \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
-e USE_FLASH_ATTENTION=true \
-e FLASH_ATTENTION_RECOMPUTE=true \
-e PREFILL_BATCH_BUCKET_SIZE=1 \
-e BATCH_BUCKET_SIZE=1 \
--cap-add=sys_nice \
--ipc=host \
ghcr.io/huggingface/tgi-gaudi:2.0.4 \
--model-id $model \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-total-tokens 32768
```

### Llava-v1.6-Mistral-7B on 8 Cards

```bash
model=llava-hf/llava-v1.6-mistral-7b-hf
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run -p 8080:80 \
--runtime=habana \
-v $volume:/data \
-v $PWD/quantization_config:/usr/src/quantization_config \
-v $PWD/hqt_output:/usr/src/hqt_output \
-e QUANT_CONFIG=./quantization_config/maxabs_quant.json \
-e HABANA_VISIBLE_DEVICES=all \
-e OMPI_MCA_btl_vader_single_copy_mechanism=none \
-e TEXT_GENERATION_SERVER_IGNORE_EOS_TOKEN=true \
-e PT_HPU_ENABLE_LAZY_COLLECTIVES=true \
-e HF_HUB_ENABLE_HF_TRANSFER=1 \
-e ENABLE_HPU_GRAPH=true \
-e LIMIT_HPU_GRAPH=true \
@@ -455,6 +488,7 @@ docker run -p 8080:80 \
--ipc=host \
ghcr.io/huggingface/tgi-gaudi:2.0.4 \
--model-id $model \
--sharded true --num-shard 8 \
--max-input-tokens 4096 --max-batch-prefill-tokens 16384 \
--max-total-tokens 8192 --max-batch-total-tokens 32768
```
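
Once one of the servers above is running, you can send a request to the exposed port. The snippet below is a minimal sketch and not part of the original instructions: it assumes the `-p 8080:80` mapping from the commands above and uses a placeholder image URL. TGI vision-language models accept an image embedded in the prompt via Markdown image syntax.

```bash
# Minimal sketch of a request to the running server
# (assumes the -p 8080:80 mapping above; the image URL and prompt are placeholders).
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs":"![](https://example.com/cat.png)What is shown in this image?","parameters":{"max_new_tokens":64}}'
```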