What would you like to be added?

Hi, I'm currently working on running inference on large-scale models in a k8s cluster, and in some cases this means running a model across multiple nodes. I have noticed that it is hard to deploy models in a distributed manner using Arena. Therefore, I suggest introducing a new serving type called `distributed` to Arena's serving module.
The `distributed` serving type aims to manage replicas of "groups". Each group consists of multiple pods. There are two types of pods in a group: master and worker, and users can specify the resources required for each type of pod independently. The basic usage will look something like the sketch below.
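To make the idea concrete, here is a rough sketch of what the proposed interface could look like. Everything in it is a proposal open to discussion: the `distributed` subcommand and the flags (`--replicas`, `--workers`, `--master-gpus`, `--worker-gpus`, `--master-command`, `--worker-command`) do not exist in Arena today, and the image and script paths are placeholders.

```shell
# Hypothetical usage sketch: the subcommand and all flag names below are
# proposals for discussion, not existing Arena flags.
arena serve distributed \
    --name=llama3-405b \
    --image=vllm/vllm-openai:latest \
    --restful-port=8000 \
    --replicas=1 \
    --workers=1 \
    --master-gpus=8 \
    --master-command="bash /workspace/start_master.sh" \
    --worker-gpus=8 \
    --worker-command="bash /workspace/start_worker.sh"
```

In this sketch, `--replicas` would control how many groups are created and `--workers` the number of worker pods per group (each group having a single master), while the paired master/worker flags make the independent per-pod-type resource and command specification explicit.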
Why is this needed?
Recent models are becoming increasingly sophisticated and larger in size. Especially after Meta released models like Llama-3.1-405B, it is hard to deploy such massive models on a single node. To address this, users tend to deploy this type of model distributed across multiple nodes.
Currently, Arena does not support distributed model deployment. This limitation affects users who wish to deploy large-scale models like Llama-3.1-405B using Arena. Therefore, I think there is a need to support a `distributed` serving type in order to meet user needs.
Love this feature?
Give it a 👍. We prioritize the features with the most 👍.