[Triton] Inference Service with multiple models #514

Open
haiminh2001 opened this issue Jun 18, 2024 · 0 comments

Is your feature request related to a problem? If so, please describe.

Context:

  • I am deploying multiple Triton Inference Servers on k8s; each one serves an API (for example Document OCR or Document Quality Check) and contains multiple models.
  • Problem: Each pod on K8s must have at least one GPU, which in my case is an NVIDIA A30 MIG slice with 6 GB of VRAM. However, an API / Triton Inference Server may only use 1–3 GB of VRAM, so resources are wasted.
  • Therefore, I am considering KServe ModelMesh to solve this problem. I expected to be able to map each of my Triton Inference Servers to one InferenceService, in KServe terms. That means an InferenceService should contain multiple models, and ModelMesh's responsibility would be to schedule my API onto an available ServingRuntime.
    Problem
  • As far as I know, the Triton Inference Server runtime (the ServingRuntime that bundles NVIDIA's Triton server with an adapter) expects each InferenceService to have only one model (a sketch of how a single model is declared today follows this list).
  • That prevents some of my Triton logic, such as Ensemble and BLS, from running.
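For context, here is a minimal sketch (not taken from my cluster) of how a single Triton-served model is declared today as a ModelMesh-managed InferenceService, using the KServe Python SDK. The names, namespace, model format, and storage URI are placeholders; the point is that the spec carries exactly one model, so a multi-model Triton repository has to be split across several InferenceServices.

```python
# Minimal sketch, assuming the KServe Python SDK and a ModelMesh deployment.
# All names, the namespace, and the storage URI are illustrative placeholders.
from kubernetes.client import V1ObjectMeta
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1ModelSpec,
    V1beta1ModelFormat,
    constants,
)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_V1BETA1,
    kind=constants.KSERVE_KIND,
    metadata=V1ObjectMeta(
        name="ocr-text-detection",      # one model -> one InferenceService
        namespace="modelmesh-serving",  # placeholder namespace
        annotations={"serving.kserve.io/deploymentMode": "ModelMesh"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="onnx"),
                storage_uri="s3://models/ocr/text_detection",  # placeholder
            )
        )
    ),
)

# Every other model in the same Triton repository (recognition, quality
# check, ensemble steps, ...) currently needs its own InferenceService;
# there is no field that points at a whole model repository.
KServeClient().create(isvc)
```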

Describe your proposed solution
First of all, excuse me if this issue is filed on the wrong project; I think it may belong on the adapter project, but I also want to know whether there is an alternative to ModelMesh that solves my problem. I am new to KServe.
My proposed solution is to make the InferenceService accept multiple models. The benefits of this approach are:

  • It is easy to migrate from plain Triton to KServe.
  • These models are usually tightly coupled, so scheduling them on the same server should reduce overhead. In addition, implementing the coupling logic becomes much simpler (see the BLS sketch after this list).
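To make the coupling concrete, below is a hedged sketch of a Triton BLS model (Python backend) where one model calls a sibling model by name on the same server. The model and tensor names ("text_recognition", "IMAGE", "TEXT") are invented for the example, and the callee is assumed to accept an input named "IMAGE". Splitting the pipeline into single-model InferenceServices breaks exactly this kind of in-server call.

```python
# Sketch of a Triton Python-backend model that uses BLS to call a sibling
# model on the same Triton server. Model and tensor names are placeholders.
import triton_python_backend_utils as pb_utils


class TritonPythonModel:
    def execute(self, requests):
        responses = []
        for request in requests:
            # Input tensor of this model; the callee is assumed to expect
            # an input with the same name ("IMAGE").
            image = pb_utils.get_input_tensor_by_name(request, "IMAGE")

            # BLS call: "text_recognition" must be loaded on the SAME
            # Triton server. With one model per InferenceService, it is
            # not, and this call fails.
            bls_request = pb_utils.InferenceRequest(
                model_name="text_recognition",
                requested_output_names=["TEXT"],
                inputs=[image],
            )
            bls_response = bls_request.exec()
            if bls_response.has_error():
                raise pb_utils.TritonModelException(
                    bls_response.error().message()
                )

            text = pb_utils.get_output_tensor_by_name(bls_response, "TEXT")
            responses.append(
                pb_utils.InferenceResponse(output_tensors=[text])
            )
        return responses
```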