What would you like to be added?

Hi, I'm currently working on running inference on large-scale models in a k8s cluster, and in some cases this means running a model across multiple nodes. I have noticed that it is hard to deploy models in a distributed manner using Arena. Therefore, I suggest introducing a new serving type called `distributed` to Arena's serving module.
The `distributed` serving type aims to manage replicas of "groups". Each group consists of multiple pods. There are two types of pods in a group: master and worker, and users can specify the resources required for each type of pod independently. The basic usage will look something like the sketch below.
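To make the idea concrete, here is a rough sketch of what the proposed interface could look like. Everything in it is a proposal open to discussion: the `distributed` subcommand and the flags (`--replicas`, `--workers`, `--master-gpus`, `--worker-gpus`, `--master-command`, `--worker-command`) do not exist in Arena today, and the image and script paths are placeholders.

```shell
# Hypothetical usage sketch: the subcommand and all flag names below are
# proposals for discussion, not existing Arena flags.
arena serve distributed \
    --name=llama3-405b \
    --image=vllm/vllm-openai:latest \
    --restful-port=8000 \
    --replicas=1 \
    --workers=1 \
    --master-gpus=8 \
    --master-command="bash /workspace/start_master.sh" \
    --worker-gpus=8 \
    --worker-command="bash /workspace/start_worker.sh"
```

In this sketch, `--replicas` would control how many groups are created and `--workers` the number of worker pods per group (each group having a single master), while the paired master/worker flags make the independent per-pod-type resource and command specification explicit.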
Why is this needed?
Recent models are becoming increasingly sophisticated and larger in size. Especially after Meta released models like Llama-3.1-405B, it is hard to deploy such massive models on a single node. To address this, users tend to deploy this type of model distributed across multiple nodes.
Currently, Arena does not support distributed model deployment. This limitation affects users who wish to deploy large-scale models like Llama-3.1-405B using Arena. Therefore, I think there is a need to support a `distributed` serving type in order to meet user needs.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.