Unable to specify GPU usage in VLLM code #3012
You can specify the devices by using `CUDA_VISIBLE_DEVICES`. |
```python
from vllm import LLM
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

os.environ["CUDA_VISIBLE_DEVICES"] = "3"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)
```
This still loads the 2nd LLM on GPUs 1 and 2 and gives an out-of-memory error. |
Try instantiating them in different scripts? |
@simon-mo Separately they work, but my goal is to run two different LLMs: one LLM on 2 GPUs and a second LLM on the 3rd GPU.
```python
from vllm import LLM
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)
```
|
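(A minimal sketch of the separate-script approach suggested above, assuming a small launcher that starts one worker script per model with its own `CUDA_VISIBLE_DEVICES`; the script names are placeholders, not part of this thread:)
```python
# Hypothetical launcher: one process per model, each with its own
# CUDA_VISIBLE_DEVICES set before vLLM/CUDA initializes in the child.
import os
import subprocess

def launch(script: str, visible_devices: str) -> subprocess.Popen:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return subprocess.Popen(["python3", script], env=env)

if __name__ == "__main__":
    p1 = launch("serve_llm_1.py", "0,1")  # model 1 on GPUs 0 and 1 (tensor_parallel_size=2)
    p2 = launch("serve_llm_2.py", "2")    # model 2 on GPU 2 (tensor_parallel_size=1)
    p1.wait()
    p2.wait()
```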
I've had your exact same scenario; my solution was to run on docker-compose, because there you can specify which GPU ids to make available to each instance. |
Then expose their APIs and consume them with another script. It would be faster if you run the OpenAI-compatible API; however, if you want to add something custom like |
@KatIsCoding Thanks for your suggestion. Yeah, I ended up with the same thought, that I have to implement the Ray clustering by myself. What I have noticed is that when I initialize the 2nd LLM object, it recreates a cluster of GPUs/CPUs. If I manually change |
@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks |
@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it at the moment. There is a solution using Ray, but I am not sure how to implement it. |
I'm sorry for my late response on the topic. As @sAviOr287 mentioned, there is a Ray implementation out there; however, I could not find much information about it. So far my approach to the problem has been just using Docker and different instances for different models, like so:
```yaml
version: "3.8"

networks:
  load_balancing:
    name: load_balancing

services:
  sqlcoder:
    profiles: [ai]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: sqlcoder
    networks:
      - load_balancing
    environment:
      - MODEL_ID=defog/sqlcoder-7b-2
      - TP_SIZE=1
      - ACCEPT_EMPTY_IDS=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

  llama:
    profiles: [ai-exp]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: llama
    networks:
      - load_balancing
    environment:
      - AI_SERVICE_PORT=1337
      - MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
      - ACCEPT_EMPTY_IDS=1
      - TP_SIZE=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  nginx:
    image: nginx:1.15-alpine
    profiles: [ai]
    networks:
      - load_balancing
    depends_on:
      - sqlcoder
      - llama
    volumes:
      - ./nginx-conf:/etc/nginx/conf.d
    ports:
      - 6565:6565 # SQL Coder
      - 6566:6566 # Llama
```
It is a load-balancing approach; however, a different model gets hit depending on which port you are using. |
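(As a consumption-side illustration of the port-based routing described above: a minimal sketch, assuming each container exposes vLLM's OpenAI-compatible `/v1/completions` endpoint behind the published ports 6565 and 6566, as suggested earlier in the thread. The custom `aiplug.service.py` service may expose a different interface, so the URLs here are placeholders; the model names come from the compose file.)
```python
# Hypothetical client: pick a model by port, assuming an OpenAI-compatible
# completions endpoint is reachable on the ports published by nginx.
import requests

ENDPOINTS = {
    "defog/sqlcoder-7b-2": "http://localhost:6565/v1/completions",
    "meta-llama/Meta-Llama-3-8B-Instruct": "http://localhost:6566/v1/completions",
}

def complete(model: str, prompt: str, max_tokens: int = 64) -> str:
    resp = requests.post(
        ENDPOINTS[model],
        json={"model": model, "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("defog/sqlcoder-7b-2", "SELECT"))
```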
The most important thing about the configuration is the usage of:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0"]
          capabilities: [gpu]
```
By specifying a `device_ids` list for each service, each container only sees the GPUs assigned to it. |
Has anyone found a solution? I am trying to use it with accelerate but am getting the same error. |
I found that specifying GPU ids for the ray executor could be achieved by modifying `worker_node_and_gpu_ids` in vllm/executor/ray_gpu_executor.py (https://github.com/vllm-project/vllm/blob/5f6d10c14c17122e6d711a4829ee0ca672e07f6f/vllm/executor/ray_gpu_executor.py#L130). |
Thanks for the suggestion, do you have any example code for this? I don't think I fully understand your solution. Best
|
Hi @sAviOr287, I added the following code in vllm/executor/ray_gpu_executor.py (the GPU ids that I want to use are given in `self.GPUs`):
```python
# Update GPU IDs if specified.
if self.GPUs is not None:
    assert len(self.GPUs) == len(worker_node_and_gpu_ids), \
        "Number of GPUs specified does not match the number of workers."
    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        worker_node_and_gpu_ids[i] = (node_id, [self.GPUs[i]])
```
|
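(For completeness, a minimal sketch of how a `self.GPUs` list like the one used above might be populated, for example from an environment variable parsed in the modified executor's `__init__`. The `VLLM_RAY_GPU_IDS` name and this helper are assumptions, not part of vLLM:)
```python
# Hypothetical helper: parse a comma-separated GPU id list such as "2,3".
# In the patch above it could be assigned before the workers are created,
# e.g. self.GPUs = gpu_ids_from_env().
import os
from typing import List, Optional

def gpu_ids_from_env(var_name: str = "VLLM_RAY_GPU_IDS") -> Optional[List[int]]:
    value = os.environ.get(var_name, "").strip()
    if not value:
        return None
    return [int(part) for part in value.split(",")]
```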
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you! |
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you! |
I am facing difficulties in specifying GPU usage for different models in an LLM inference pipeline using vLLM. Specifically, I have 4 RTX 4090 GPUs available, and I aim to run an LLM with a size of 42GB on 2 RTX 4090 GPUs (~48GB) and a separate model with a size of 22GB on 1 RTX 4090 GPU (~24GB).
This is my code for running the 42GB model on two GPUs.
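(A minimal sketch, assuming the same call shown in the comments above; the model name is a placeholder:)
```python
# Hypothetical reconstruction: the ~42GB model sharded across two RTX 4090s
# via tensor parallelism.
from vllm import LLM

llm_1_name = "your-42gb-model"  # placeholder checkpoint name

llm_1 = LLM(
    llm_1_name,
    max_model_len=50,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=2,  # shard across 2 GPUs (~48GB total)
)
```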
However, I haven't found a straightforward method within the VLLM library to specify which GPU should be used for each model.