Replies: 4 comments 15 replies
- Yes, and I think a public version is underway. If I don't send an update in two weeks, feel free to remind me.
- Hi all, you may check this updated implementation using TensorRT-LLM Whisper with the NVIDIA Triton Python backend: https://github.com/k2-fsa/sherpa/tree/master/triton/whisper. It is about 7x faster than the previous ONNX implementation pointed to by @IbrahimAmin1.
- Great idea. Subscribing to this thread.
- Has anyone been able to serve the Whisper model on NVIDIA's Triton Inference Server?