speed up whisper? #716
Replies: 23 comments 39 replies
-
What about deepspeed-mii?
-
we are also testing this idea to speed up whisper: https://github.com/zhuzilin/whisper-openvino
-
the most substantial way to do this is batch inference (it would lead to a ~10x speedup); currently it's batch_size 1. It's a bit tricky with the decoding fallback stuff, but it's possible
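To make the batching idea concrete, here is a minimal sketch of the step upstream of the model: splitting audio into Whisper's 30-second windows at 16 kHz and stacking them into batches. The function names are illustrative, not Whisper's actual API, and this ignores the decoding-fallback complication mentioned above.

```python
import numpy as np

SAMPLE_RATE = 16000   # Whisper's expected sample rate
CHUNK_SECONDS = 30    # Whisper processes fixed 30-second windows

def chunk_audio(audio: np.ndarray) -> list:
    """Split a mono waveform into 30 s windows, zero-padding the last one."""
    window = SAMPLE_RATE * CHUNK_SECONDS
    chunks = []
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        if len(chunk) < window:
            chunk = np.pad(chunk, (0, window - len(chunk)))
        chunks.append(chunk)
    return chunks

def make_batches(chunks, batch_size=8):
    """Group windows so the model sees batch_size of them per forward pass."""
    return [np.stack(chunks[i:i + batch_size])
            for i in range(0, len(chunks), batch_size)]

# 70 seconds of audio -> three 30 s windows -> one batch of shape (3, 480000)
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
batches = make_batches(chunk_audio(audio), batch_size=8)
```

The point is that once windows are batched like this, one forward pass amortizes the model overhead across many segments instead of paying it per segment.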
-
as i find possible ways to speed up whisper STT, i will post them here as well.
-
The naive way I maximise resource utilisation is by taking the
-
Shameless self-plug (CPU only): for CPU inference, model quantization is a very easy-to-apply method with great average speedups, and it is already built into PyTorch.

Based on anecdotal reports from users, the accuracy (here, the word error rate, WER) for all of the models remained the same, if not slightly improved, which matches evidence from the PyTorch tutorial on quantizing BERT, which also showed a very slight improvement. There is roughly a 3.2x speedup for the largest model (although for the largest model this method does seem to cost a little accuracy). I would highly recommend trying this for CPU deployment: dynamic quantization literally requires adding one line of code to a pre-trained model and gives you a large speedup with no or minimal accuracy degradation. I have also tried it on LSTM-based models (which it shrinks the most, in my experience) and found anywhere from a 3x to 4x inference speedup. It also reduces the mean and variance of per-inference latency, so if response time is important, it helps there as well.
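For reference, the "one line" being described is PyTorch's dynamic quantization. A self-contained sketch using a toy stand-in model (for Whisper you would pass the loaded model object instead; the stand-in is just so this runs anywhere):

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model. For Whisper you would first load it,
# e.g. model = whisper.load_model("base"), and quantize that instead.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# The one line: convert all Linear layers to int8 weights. Activations stay
# float and are quantized on the fly at inference time, which is why this
# needs no calibration data and works well on CPU.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 80)
with torch.no_grad():
    y = qmodel(x)
```

Note that dynamic quantization only accelerates CPU inference; on GPU it is a no-op for speed, which is why the parent comment scopes this to CPU deployment.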
-
question: does anyone know if whisper can support multiple GPUs? like whisper -numgpus 8 -model large test.wav
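As far as I know there is no such flag in the Whisper CLI; the usual workaround is to run one process per GPU and shard the files across them. A hedged sketch (the scheduling is real and runnable; the worker body is hypothetical and shown as comments, since it needs GPUs and the whisper package):

```python
from multiprocessing import Process

def assign_round_robin(files, num_gpus):
    """Pure scheduling step: map each file to a GPU id, round-robin."""
    buckets = {i: [] for i in range(num_gpus)}
    for idx, f in enumerate(files):
        buckets[idx % num_gpus].append(f)
    return buckets

def worker(gpu_id, files):
    # Hypothetical worker: load the model once per GPU, then transcribe.
    # import whisper
    # model = whisper.load_model("large", device=f"cuda:{gpu_id}")
    # for f in files:
    #     model.transcribe(f)
    pass

def transcribe_on_all_gpus(files, num_gpus=8):
    buckets = assign_round_robin(files, num_gpus)
    procs = [Process(target=worker, args=(g, fs)) for g, fs in buckets.items()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

# Example: 10 call recordings spread over 4 GPUs
buckets = assign_round_robin([f"call_{i}.wav" for i in range(10)], num_gpus=4)
```

This gives near-linear throughput scaling for batch workloads, since each GPU works through its own queue independently.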
-
@silvacarl2 Is there any new progress?
-
the best least-expensive option is AWS EC2 g4dn.xlarge. we have ours set up to auto-scale from 1 to 10 instances, and that is transcribing 450,000 calls a month. I also think whisperx may actually be more accurate because it uses forced alignment of the audio with the transcribed words. we have not tried whisper.cpp in production yet.
-
have you tested
-
LOL! that's a riot, first time i have seen it! AWS EC2 g4dn.xlarge is about 50 cents an hour, and for 450,000 transcriptions a month it is cheaper and more accurate than any other single STT method on the market, including all commercial tools or APIs. and this is just plain vanilla whisper in production LOL! imagine if we can make it even faster!
-
oh yeah, i forgot this: after setup, run sudo apt update && sudo apt upgrade -y
-
Hello @silvacarl2!
-
Huggingface released a JAX/Flax version of Whisper, with inference parallelization but no input_prompt support for now, it seems. Here: https://github.com/sanchit-gandhi/whisper-jax
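For anyone who wants to try it, usage looks roughly like the sketch below. The class and argument names are my recollection of the whisper-jax README (including the pipeline class really being spelled "Pipline" there, as far as I know), so verify against the repo before relying on them; the call is wrapped in a function since it needs the whisper_jax package and an accelerator to actually run.

```python
def transcribe_with_jax(audio_path):
    # Assumed whisper-jax API, per its README; treat these names as
    # unverified. The first call JIT-compiles with JAX and is slow;
    # subsequent calls reuse the compiled function and are fast.
    from whisper_jax import FlaxWhisperPipline
    import jax.numpy as jnp

    pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)
    return pipeline(audio_path)
```

The compile-once-then-reuse behavior means the speedup only shows up when you transcribe many files with the same pipeline object.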
-
I followed the same install steps on an Ubuntu 20.04 VM hosted on ESXi with a Tesla P100 (1:33 min audio). For faster_whisper (can only use float32), I had to install the NVIDIA cuDNN libraries for it to work. Definitely going to try this on an AWS VM.
-
yup, this is the cheapest/most useful: AWS g4dn.xlarge, using an Nvidia Tesla T4. have you seen any way to run whisper using int8 or peft/lora?
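On the int8 part of the question: faster-whisper (the CTranslate2 backend mentioned elsewhere in this thread) supports int8 inference directly via its compute_type option. A hedged sketch, wrapped in a function since it needs the faster_whisper package, a downloaded model, and a GPU:

```python
def transcribe_int8(audio_path):
    # faster-whisper loads CTranslate2-converted Whisper weights.
    # compute_type="int8_float16" keeps activations in fp16 on GPU while
    # using int8 weights; on CPU you would use compute_type="int8".
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
    segments, info = model.transcribe(audio_path)
    return [seg.text for seg in segments], info
```

On a T4 this also cuts memory enough that the large model fits comfortably. PEFT/LoRA is a separate question: it is a fine-tuning technique, not an inference speedup, so it would not help throughput by itself.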
-
Hi @silvacarl2, have you tried PyTorch 2?
-
have not tried PyTorch 2. we are running 10 AWS g4dn.xlarge EC2s that we spin on and off depending on call volumes. it works great, handling 450,000 calls a month. This is unoptimized, but based on AWS pricing for these instances at 50 cents an hour, and the fact that we can turn them on and off whenever we want, it's sufficiently cost-effective for a large volume. it will probably get modified once we finish fine-tuning a model to extract real-time info from the call transcriptions.
-
what are you running it on?
-
this should work fine with no issues using faster-whisper. also, a T4 should be able to support running the full model.
-
@silvacarl2 How are you doing the autoscaling of your g4dn.xlarge EC2s? Are you using karpenter or are you just using a regular autoscaler that is hooked into prometheus? ...or some other mechanism to do the autoscaling?
-
we just wrote our own tiny tool that turns EC2s on and off as needed. g4dn's are great and cheap. we transcribe 500,000 phone calls a month with this, which takes between 1 and 5 g4dns: 1 at night, and up to 5 during the day. 8-)
-
correct. But we have a little cheap front-end EC2 that directs API calls to whichever server is not busy. That server is even cheaper. and if it sees that it does not have enough servers, it turns them on and off by itself. the servers all have static IPs and are already preconfigured.
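The "turn EC2s on and off as needed" approach described above can be sketched in a few lines. The scaling-decision function below is pure and runnable; the glue function is hypothetical (the thresholds and instance list are made up), though boto3's start_instances/stop_instances calls themselves are real:

```python
def desired_servers(queue_depth, per_server_capacity=100,
                    min_servers=1, max_servers=5):
    """Pure scaling decision: one server per per_server_capacity queued
    calls, clamped to [min_servers, max_servers]."""
    needed = -(-queue_depth // per_server_capacity)  # ceiling division
    return max(min_servers, min(max_servers, needed))

def apply_scaling(queue_depth, instance_ids):
    # Hypothetical glue: the preconfigured, static-IP instances described
    # above are started or stopped to match the desired count.
    import boto3
    ec2 = boto3.client("ec2")
    n = desired_servers(queue_depth, max_servers=len(instance_ids))
    ec2.start_instances(InstanceIds=instance_ids[:n])
    if n < len(instance_ids):
        ec2.stop_instances(InstanceIds=instance_ids[n:])
```

Because stopped g4dn instances cost almost nothing, a crude controller like this already captures most of the savings of a "real" autoscaler for batch transcription.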
-
what are the best ways to squeeze as much performance out of whisper?
we have been testing it on various sizes of GPUs/CPUs but do not see a big difference from one to the other.
Any ideas or pointers would be awesome!
carl