Unable to specify GPU usage in VLLM code #3012
You can specify the devices by using `CUDA_VISIBLE_DEVICES`. |
```python
from vllm import LLM
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

os.environ["CUDA_VISIBLE_DEVICES"] = "3"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)
```
This still loads the 2nd LLM on GPUs 1 and 2 and gives an out-of-memory error. |
Try instantiating them in different scripts? |
@simon-mo Separately they work, but my goal is to run two different LLMs: one LLM on 2 GPUs and a second LLM on the 3rd GPU.
```python
from vllm import LLM
import os

os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)
```
|
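(A minimal sketch of the separate-script approach suggested above, assuming a small launcher that starts one worker script per model with its own `CUDA_VISIBLE_DEVICES`; the script names are placeholders, not part of this thread:)
```python
# Hypothetical launcher: one process per model, each with its own
# CUDA_VISIBLE_DEVICES set before vLLM/CUDA initializes in the child.
import os
import subprocess

def launch(script: str, visible_devices: str) -> subprocess.Popen:
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = visible_devices
    return subprocess.Popen(["python3", script], env=env)

if __name__ == "__main__":
    p1 = launch("serve_llm_1.py", "0,1")  # model 1 on GPUs 0 and 1 (tensor_parallel_size=2)
    p2 = launch("serve_llm_2.py", "2")    # model 2 on GPU 2 (tensor_parallel_size=1)
    p1.wait()
    p2.wait()
```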
I've had your exact same scenario; my solution was to run on docker-compose, because there you can specify which GPU ids to make available to each instance. |
Then expose their APIs and consume them with another script. It would be faster if you run the OpenAI-compatible API; however, if you want to add something custom like |
@KatIsCoding Thanks for your suggestion. Yeah, I ended up with the same thought, that I have to implement the Ray clustering by myself. What I have noticed is that when I initialize the 2nd LLM object, it recreates a cluster of GPUs/CPUs. If I manually change |
@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks |
@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it at the moment. There is a solution using Ray, but I am not sure how to implement it. |
I'm sorry for my late response on the topic. As @sAviOr287 mentioned, there is a Ray implementation out there; however, I could not find much information about it. So far my approach to the problem has been just using Docker and different instances for different models, like so:
```yaml
version: "3.8"

networks:
  load_balancing:
    name: load_balancing

services:
  sqlcoder:
    profiles: [ai]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: sqlcoder
    networks:
      - load_balancing
    environment:
      - MODEL_ID=defog/sqlcoder-7b-2
      - TP_SIZE=1
      - ACCEPT_EMPTY_IDS=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

  llama:
    profiles: [ai-exp]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: llama
    networks:
      - load_balancing
    environment:
      - AI_SERVICE_PORT=1337
      - MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
      - ACCEPT_EMPTY_IDS=1
      - TP_SIZE=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

  nginx:
    image: nginx:1.15-alpine
    profiles: [ai]
    networks:
      - load_balancing
    depends_on:
      - sqlcoder
      - llama
    volumes:
      - ./nginx-conf:/etc/nginx/conf.d
    ports:
      - 6565:6565 # SQL Coder
      - 6566:6566 # Llama
```
It is a load-balancing approach; however, a different model gets hit depending on which port you are using. |
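(As a consumption-side illustration of the port-based routing described above: a minimal sketch, assuming each container exposes vLLM's OpenAI-compatible `/v1/completions` endpoint behind the published ports 6565 and 6566, as suggested earlier in the thread. The custom `aiplug.service.py` service may expose a different interface, so the URLs here are placeholders; the model names come from the compose file.)
```python
# Hypothetical client: pick a model by port, assuming an OpenAI-compatible
# completions endpoint is reachable on the ports published by nginx.
import requests

ENDPOINTS = {
    "defog/sqlcoder-7b-2": "http://localhost:6565/v1/completions",
    "meta-llama/Meta-Llama-3-8B-Instruct": "http://localhost:6566/v1/completions",
}

def complete(model: str, prompt: str, max_tokens: int = 64) -> str:
    resp = requests.post(
        ENDPOINTS[model],
        json={"model": model, "prompt": prompt, "max_tokens": max_tokens},
        timeout=60,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["text"]

if __name__ == "__main__":
    print(complete("defog/sqlcoder-7b-2", "SELECT"))
```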
The most important thing about the configuration is the usage of:
```yaml
deploy:
  resources:
    reservations:
      devices:
        - driver: nvidia
          device_ids: ["0"]
          capabilities: [gpu]
```
By specifying a `device_ids` list for each service, each container only sees the GPUs assigned to it. |
Has anyone found a solution? I am trying to use it with accelerate but am getting the same error. |
I found that specifying GPU ids for the ray executor could be achieved by modifying `worker_node_and_gpu_ids` in vllm/executor/ray_gpu_executor.py (https://github.com/vllm-project/vllm/blob/5f6d10c14c17122e6d711a4829ee0ca672e07f6f/vllm/executor/ray_gpu_executor.py#L130). |
Thanks for the suggestion, do you have any example code for this? I don't think I fully understand your solution. Best
|
Hi @sAviOr287, I added the following code in vllm/executor/ray_gpu_executor.py (the GPU ids that I want to use are given in `self.GPUs`):
```python
# Update GPU IDs if specified.
if self.GPUs is not None:
    assert len(self.GPUs) == len(worker_node_and_gpu_ids), \
        "Number of GPUs specified does not match the number of workers."
    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        worker_node_and_gpu_ids[i] = (node_id, [self.GPUs[i]])
```
|
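(For completeness, a minimal sketch of how a `self.GPUs` list like the one used above might be populated, for example from an environment variable parsed in the modified executor's `__init__`. The `VLLM_RAY_GPU_IDS` name and this helper are assumptions, not part of vLLM:)
```python
# Hypothetical helper: parse a comma-separated GPU id list such as "2,3".
# In the patch above it could be assigned before the workers are created,
# e.g. self.GPUs = gpu_ids_from_env().
import os
from typing import List, Optional

def gpu_ids_from_env(var_name: str = "VLLM_RAY_GPU_IDS") -> Optional[List[int]]:
    value = os.environ.get(var_name, "").strip()
    if not value:
        return None
    return [int(part) for part in value.split(",")]
```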
This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you! |
This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you! |
I am facing difficulties in specifying GPU usage for different models in an LLM inference pipeline using vLLM. Specifically, I have 4 RTX 4090 GPUs available, and I aim to run an LLM with a size of 42GB on 2 RTX 4090 GPUs (~48GB) and a separate model with a size of 22GB on 1 RTX 4090 GPU (~24GB).
This is my code for running the 42GB model on two GPUs.
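(A minimal sketch, assuming the same call shown in the comments above; the model name is a placeholder:)
```python
# Hypothetical reconstruction: the ~42GB model sharded across two RTX 4090s
# via tensor parallelism.
from vllm import LLM

llm_1_name = "your-42gb-model"  # placeholder checkpoint name

llm_1 = LLM(
    llm_1_name,
    max_model_len=50,
    gpu_memory_utilization=0.9,
    tensor_parallel_size=2,  # shard across 2 GPUs (~48GB total)
)
```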
However, I haven't found a straightforward method within the VLLM library to specify which GPU should be used for each model.