How to integrate Multi-LoRA Setup at Inference with NVIDIA Triton / TensorRT-LLM? I built the engine... #2371

Open
JoJoLev opened this issue Oct 24, 2024 · 4 comments

JoJoLev commented Oct 24, 2024

I built the engine with two separate LoRA adapters on top of the base Llama 3.1 model. The output from the build is rank0.engine, config.json, and a lora folder with the following structure:
lora
|-- 0
|   |-- adapter_config.json
|   `-- adapter_model.safetensors
`-- 1
    |-- adapter_config.json
    `-- adapter_model.safetensors

Is this expected? I figured there would be rank engines for the LoRA weights as well. These are the LoRA directories I passed to the engine build:
trtllm-build --checkpoint_dir ./tllm_checkpoint_1gpu_tp1 --output_dir /opt/tensorrt_llm_engine --gemm_plugin auto --lora_plugin auto --max_batch_size 8 --max_input_len 512 --max_seq_len 562 --lora_dir "/opt/lora_1" "/opt/lora_2" --max_lora_rank 8 --lora_target_modules attn_q attn_k attn_v

Any advice is appreciated.

@Superjomn added the question, triaged, and build labels on Oct 26, 2024

syuoni commented Oct 28, 2024

Hi @JoJoLev,

I suppose the output folder is expected. You built the engine with TP=1, so there is one rank0.engine. The LoRA weights are saved in adapter_model.safetensors under each LoRA folder.
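As a quick sanity check before moving to Triton, the example runner in the TensorRT-LLM repo can exercise both adapters by LoRA task UID. This is only a sketch based on the repo's multi-LoRA example; it assumes the numbered folders 0 and 1 follow the order of the --lora_dir arguments at build time, and that your version of examples/run.py accepts --lora_task_uids (some versions also want --lora_dir at run time):

# one UID per input prompt; -1 runs that prompt on the plain base model
python3 examples/run.py \
    --engine_dir /opt/tensorrt_llm_engine \
    --tokenizer_dir <path_to_base_llama3.1> \
    --max_output_len 50 \
    --input_text "prompt for adapter 0" "prompt for adapter 1" "prompt without LoRA" \
    --lora_task_uids 0 1 -1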


JoJoLev commented Oct 28, 2024

Hi @syuoni

Thanks for the response. Yes, I have a rank0.engine file and a config. My question now is: when I deploy to a container, say NVIDIA Triton, do I have to include the LoRA weights, or have they been baked into the rank0.engine?


syuoni commented Oct 28, 2024

> Hi @syuoni
>
> Thanks for the response. Yes, I have a rank0.engine file and a config. My question now is: when I deploy to a container, say NVIDIA Triton, do I have to include the LoRA weights, or have they been baked into the rank0.engine?

Yes, you have to include the LoRA weights. They are not baked into the engine: because TRT-LLM supports multi-LoRA, it has to load the LoRA weights dynamically at runtime.
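On the Triton side (tensorrtllm_backend), the usual multi-LoRA flow is to convert each Hugging Face adapter into numpy tensors and send them along with the first request for a given LoRA task id; the backend caches them, so later requests can pass just the id. The sketch below follows the multi-LoRA example in the tensorrtllm_backend repo; the script and client flag names (hf_lora_convert.py, --lora-path, --lora-task-id) may differ between versions, so treat it as a pointer rather than exact commands:

# convert each HF adapter to model.lora_weights.npy / model.lora_config.npy
python3 tensorrt_llm/examples/hf_lora_convert.py -i /opt/lora_1 -o /opt/lora_1_converted --storage-type float16
python3 tensorrt_llm/examples/hf_lora_convert.py -i /opt/lora_2 -o /opt/lora_2_converted --storage-type float16

# first request for an adapter uploads its weights under a task id
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --text "example prompt" --request-output-len 50 \
    --tokenizer-dir <path_to_base_llama3.1> \
    --lora-path /opt/lora_1_converted --lora-task-id 1

# later requests can reference the cached adapter by id alone
python3 inflight_batcher_llm/client/inflight_batcher_llm_client.py \
    --text "another prompt" --request-output-len 50 \
    --tokenizer-dir <path_to_base_llama3.1> \
    --lora-task-id 1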


JoJoLev commented Oct 28, 2024

@syuoni got it!

Thank you! So, after running my engine build I have the aforementioned folder structure. If I deploy on NVIDIA Triton, would I include the LoRA weights in the 1/ subfolder where my rank0.engine file and config.json are, or would they be placed on a different path?
I believe this is the container we are going with on deployment.
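For context, the example model repository in tensorrtllm_backend is laid out roughly like this (an assumption from the backend's templates; the LoRA weights are usually not stored in the repository at all, but kept wherever the client can read them and sent per request, as sketched above):

model_repo/
|-- ensemble/
|-- preprocessing/
|-- postprocessing/
`-- tensorrt_llm/
    |-- config.pbtxt      # gpt_model_path points at the engine directory
    `-- 1/
        |-- rank0.engine
        `-- config.json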
