Unable to specify GPU usage in VLLM code #3012

Closed
humza-sami opened this issue Feb 23, 2024 · 17 comments

@humza-sami

I am facing difficulties specifying GPU usage for different models in an LLM inference pipeline using vLLM. Specifically, I have 4 RTX 4090 GPUs available, and I aim to run an LLM with a size of 42 GB on two RTX 4090s (~48 GB) and a separate model with a size of 22 GB on one RTX 4090 (~24 GB).
This is my code for running the 42 GB model on two GPUs:

from vllm import LLM

# Load the 42 GB model across two GPUs with tensor parallelism.
llm = LLM(model_name, max_model_len=50, tensor_parallel_size=2)
output = llm.generate(text)

However, I haven't found a straightforward method within the vLLM library to specify which GPU should be used for each model.

@simon-mo
Collaborator

You can specify the devices by setting the CUDA_VISIBLE_DEVICES environment variable.
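
For example (a minimal sketch, not part of the original reply; the model name is a placeholder), the variable must be set before vLLM/PyTorch initialize CUDA in the process:

import os

# Restrict this process to GPUs 1 and 2; must happen before CUDA is initialized.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

from vllm import LLM  # import after setting the variable

llm = LLM("your-model-name", max_model_len=50, tensor_parallel_size=2)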

@humza-sami
Author

You can specify the devices by setting the CUDA_VISIBLE_DEVICES environment variable.

@simon-mo

from vllm import LLM
import os

# Attempt to place the first LLM on GPUs 1 and 2 ...
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

# ... and the second LLM on GPU 3.
os.environ["CUDA_VISIBLE_DEVICES"] = "3"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)

This still loads the second LLM on GPUs 1 and 2 and gives a memory error.

@simon-mo
Collaborator

Try instantiating them in different scripts?
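
As a rough sketch of that idea (not from the thread; the model names are placeholders), each LLM can live in its own process so that every process sees its own CUDA_VISIBLE_DEVICES value:

import multiprocessing as mp
import os

def serve(model_name, devices, tp_size):
    # Must be set before vLLM/torch initialize CUDA in this child process.
    os.environ["CUDA_VISIBLE_DEVICES"] = devices
    from vllm import LLM  # import inside the child, after the variable is set
    llm = LLM(model_name, max_model_len=50, tensor_parallel_size=tp_size)
    print(llm.generate("Hello"))

if __name__ == "__main__":
    ctx = mp.get_context("spawn")  # spawn so each child starts with a clean CUDA state
    p1 = ctx.Process(target=serve, args=("llm_1_name", "0,1", 2))
    p2 = ctx.Process(target=serve, args=("llm_2_name", "2", 1))
    p1.start(); p2.start()
    p1.join(); p2.join()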

@humza-sami
Author

humza-sami commented Feb 23, 2024

@simon-mo Separately they work, but my goal is to run two different LLMs: one LLM on two GPUs and the second LLM on the third GPU.

from vllm import LLM
import os

# First LLM on GPUs 0 and 1.
os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"
llm_1 = LLM(llm_1_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=2)

# Reset the visible devices and try to place the second LLM on GPU 2.
os.environ["CUDA_VISIBLE_DEVICES"] = ""
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
llm_2 = LLM(llm_2_name, max_model_len=50, gpu_memory_utilization=0.9, tensor_parallel_size=1)

RuntimeError Traceback (most recent call last)
Cell In[11], line 3
1 os.environ["CUDA_VISIBLE_DEVICES"] = "2"
----> 3 llm_2 = LLM("codellama/CodeLlama-7b-Instruct-hf",max_model_len=4000,gpu_memory_utilization=0.9, tensor_parallel_size=1)

File /usr/local/lib/python3.8/dist-packages/vllm/entrypoints/llm.py:109, in LLM.__init__(self, model, tokenizer, tokenizer_mode, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, enforce_eager, max_context_len_to_capture, disable_custom_all_reduce, **kwargs)
90 kwargs["disable_log_stats"] = True
91 engine_args = EngineArgs(
92 model=model,
93 tokenizer=tokenizer,
(...)
107 **kwargs,
108 )
--> 109 self.llm_engine = LLMEngine.from_engine_args(engine_args)
110 self.request_counter = Counter()

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:371, in LLMEngine.from_engine_args(cls, engine_args)
369 placement_group = initialize_cluster(parallel_config)
370 # Create the LLM engine.
--> 371 engine = cls(*engine_configs,
372 placement_group,
373 log_stats=not engine_args.disable_log_stats)
374 return engine

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:120, in LLMEngine.__init__(self, model_config, cache_config, parallel_config, scheduler_config, device_config, lora_config, placement_group, log_stats)
118 self._init_workers_ray(placement_group)
119 else:
--> 120 self._init_workers()
122 # Profile the memory usage and initialize the cache.
123 self._init_cache()

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:163, in LLMEngine._init_workers(self)
149 distributed_init_method = get_distributed_init_method(
150 get_ip(), get_open_port())
151 self.driver_worker = Worker(
152 self.model_config,
153 self.parallel_config,
(...)
161 is_driver_worker=True,
162 )
--> 163 self._run_workers("init_model")
164 self._run_workers("load_model")

File /usr/local/lib/python3.8/dist-packages/vllm/engine/llm_engine.py:1014, in LLMEngine._run_workers(self, method, driver_args, driver_kwargs, max_concurrent_workers, use_ray_compiled_dag, *args, **kwargs)
1011 driver_kwargs = kwargs
1013 # Start the driver worker after all the ray workers.
-> 1014 driver_worker_output = getattr(self.driver_worker,
1015 method)(*driver_args, **driver_kwargs)
1017 # Get the results of the ray workers.
1018 if self.workers:

File /usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py:94, in Worker.init_model(self, cupy_port)
91 raise RuntimeError(
92 f"Not support device type: {self.device_config.device}")
93 # Initialize the distributed environment.
---> 94 init_distributed_environment(self.parallel_config, self.rank,
95 cupy_port, self.distributed_init_method)
96 # Initialize the model.
97 set_random_seed(self.model_config.seed)

File /usr/local/lib/python3.8/dist-packages/vllm/worker/worker.py:247, in init_distributed_environment(parallel_config, rank, cupy_port, distributed_init_method)
245 torch_world_size = torch.distributed.get_world_size()
246 if torch_world_size != parallel_config.world_size:
--> 247 raise RuntimeError(
248 "torch.distributed is already initialized but the torch world "
249 "size does not match parallel_config.world_size "
250 f"({torch_world_size} vs. {parallel_config.world_size}).")
251 elif not distributed_init_method:
252 raise ValueError(
253 "distributed_init_method must be set if torch.distributed "
254 "is not already initialized")

RuntimeError: torch.distributed is already initialized but the torch world size does not match parallel_config.world_size (2 vs. 1).

@KatIsCoding

I've had exactly the same scenario; my solution was to run with docker-compose, because there you can specify which GPU IDs to make available to each instance.

@KatIsCoding

Then expose their APIs and consume them from another script. It would be faster if you run the OpenAI-compatible API; however, if you want to add something custom like lm-format-enforcer, you might need to implement it yourself.
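
For instance (a hedged sketch, assuming two vLLM OpenAI-compatible servers are already running; the ports mirror the compose file shared later in this thread, and the model name is just an example), a single client script can consume both:

from openai import OpenAI

# One client per vLLM OpenAI-compatible server.
sqlcoder = OpenAI(base_url="http://localhost:6565/v1", api_key="EMPTY")
llama = OpenAI(base_url="http://localhost:6566/v1", api_key="EMPTY")

resp = sqlcoder.completions.create(
    model="defog/sqlcoder-7b-2",
    prompt="-- Return all users\nSELECT",
    max_tokens=64,
)
print(resp.choices[0].text)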

@humza-sami
Author

@KatIsCoding Thanks for your suggestion. Yeah, I ended up with the same thought, that I would have to implement the Ray clustering myself. What I noticed is that when I initialize the second LLM object, it recreates a GPU/CPU cluster. If I manually change CUDA_VISIBLE_DEVICES before creating the second LLM object in the same Python script, Ray gets confused and throws an error because the new configuration clashes with the first LLM object's cluster.
In a single process (script), you cannot create a second LLM object by changing CUDA_VISIBLE_DEVICES.

@humza-sami
Author

@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks.

@sAviOr287

@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it at the moment. There is a solution using Ray, but I am not sure how to implement it.
Do you have any news on this issue?

@KatIsCoding

@KatIsCoding can you share your Docker setup? I don't have much experience with Docker. Thanks.

@humza-sami were you able to figure out how to do this? I am facing the same problem and have no idea how to fix it at the moment. There is a solution using Ray, but I am not sure how to implement it. Do you have any news on this issue?

I'm sorry for my late response on the topic. As @sAviOr287 mentioned, there is a Ray implementation out there; however, I could not find much information about it.

So far my approach to the problem has been to use Docker with a different instance for each model, like so:

version: "3.8"

networks:
  load_balancing:
    name: load_balancing

services:
  sqlcoder:
    profiles: [ai]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: sqlcoder
    networks:
      - load_balancing
    environment:
      - MODEL_ID=defog/sqlcoder-7b-2
      - TP_SIZE=1
      - ACCEPT_EMPTY_IDS=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["1"]
              capabilities: [gpu]

  llama:
    profiles: [ai-exp]
    image: aiimage
    shm_size: "15gb"
    command: python3 ./aiplug.service.py
    hostname: llama
    networks:
      - load_balancing
    environment:
      - AI_SERVICE_PORT=1337
      - MODEL_ID=meta-llama/Meta-Llama-3-8B-Instruct
      - ACCEPT_EMPTY_IDS=1
      - TP_SIZE=1
    build:
      context: .
      dockerfile: ./apps/VLLM/ai-service.Dockerfile
    volumes:
      - ./apps/VLLM/:/app:ro
      - ./models:/aishared
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]
  nginx:
    image: nginx:1.15-alpine
    profiles: [ai]
    networks:
      - load_balancing
    depends_on:
      - sqlcoder
      - llama
    volumes:
      - ./nginx-conf:/etc/nginx/conf.d
    ports:
      - 6565:6565 #SQL Coder
      - 6566:6566 #Llama

It is a load-balancing approach, but a different model gets hit depending on which port you use.
My Dockerfile pretty much just installs vLLM plus some other things; it could be completely replaced with something like the OpenAI-compatible server implementation vLLM provides.

@KatIsCoding

The most important part of the configuration is the use of:

deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ["0"]
              capabilities: [gpu]

By specifying device_ids you are essentially telling Docker which GPUs to make available to each container.

@sparsh35

Has anyone found a solution? I am trying to use it with Accelerate but am getting the same error.

@Ash-Zheng

I found that specifying GPU IDs for the Ray executor can be achieved by modifying worker_node_and_gpu_ids in vllm/executor/ray_gpu_executor.py.

@sAviOr287

sAviOr287 commented May 22, 2024 via email

@Ash-Zheng

Thanks for the suggestion. Do you have any example code for this? I don't think I fully understand your solution.

Hi @sAviOr287, I added the following code in vllm/executor/ray_gpu_executor.py (the GPU IDs that I want to use are given in self.GPUs):

# update GPU IDs if specified.
if self.GPUs is not None:
    assert (len(self.GPUs) == len(worker_node_and_gpu_ids)), "Number of GPUs specified does not match the number of workers."
    for i, (node_id, gpu_ids) in enumerate(worker_node_and_gpu_ids):
        worker_node_and_gpu_ids[i] = (node_id, [self.GPUs[i]])


This issue has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this issue should remain open. Thank you!

@github-actions github-actions bot added the stale label Oct 30, 2024

This issue has been automatically closed due to inactivity. Please feel free to reopen if you feel it is still relevant. Thank you!

@github-actions github-actions bot closed this as not planned Nov 29, 2024