speed up whisper? #716
Replies: 23 comments 39 replies
-
What about deepspeed-mii?
-
we are also testing this idea to speed up whisper: https://github.com/zhuzilin/whisper-openvino
-
the most substantial way to do this is batch inference (it would lead to a ~10x speedup); currently it's batch_size 1. It's a bit tricky with the decoding fallback stuff, but it's possible
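To make the batching idea concrete, here is a minimal sketch of the step upstream of the model: splitting audio into Whisper's 30-second windows at 16 kHz and stacking them into batches. The function names are illustrative, not Whisper's actual API, and this ignores the decoding-fallback complication mentioned above.

```python
import numpy as np

SAMPLE_RATE = 16000   # Whisper's expected sample rate
CHUNK_SECONDS = 30    # Whisper processes fixed 30-second windows

def chunk_audio(audio: np.ndarray) -> list:
    """Split a mono waveform into 30 s windows, zero-padding the last one."""
    window = SAMPLE_RATE * CHUNK_SECONDS
    chunks = []
    for start in range(0, len(audio), window):
        chunk = audio[start:start + window]
        if len(chunk) < window:
            chunk = np.pad(chunk, (0, window - len(chunk)))
        chunks.append(chunk)
    return chunks

def make_batches(chunks, batch_size=8):
    """Group windows so the model sees batch_size of them per forward pass."""
    return [np.stack(chunks[i:i + batch_size])
            for i in range(0, len(chunks), batch_size)]

# 70 seconds of audio -> three 30 s windows -> one batch of shape (3, 480000)
audio = np.zeros(70 * SAMPLE_RATE, dtype=np.float32)
batches = make_batches(chunk_audio(audio), batch_size=8)
```

The point is that once windows are batched like this, one forward pass amortizes the model overhead across many segments instead of paying it per segment.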
-
as i find possible ways to speed up whisper STT, i will post them here as well.
-
The naive way I maximise resource utilisation is by taking the
-
Shameless self-plug (CPU only): for CPU inference, model quantization is a very easy-to-apply method with great average speedups, and it is already built into PyTorch.

Based on anecdotal reports from users, the accuracy (here, the word error rate, WER) for all of the models remained the same, if not slightly improved, which matches evidence from the PyTorch tutorial on quantizing BERT, which also showed a very slight improvement. There is roughly a 3.2x speedup for the largest model (although for the largest model this method does seem to cost a little accuracy). I would highly recommend trying this for CPU deployment: dynamic quantization literally requires adding one line of code to a pre-trained model and gives you a large speedup with no or minimal accuracy degradation. I have also tried it on LSTM-based models (which it shrinks the most, in my experience) and found anywhere from a 3x to 4x inference speedup. It also reduces the mean and variance of per-inference latency, so if response time is important, it helps there as well.
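For reference, the "one line" being described is PyTorch's dynamic quantization. A self-contained sketch using a toy stand-in model (for Whisper you would pass the loaded model object instead; the stand-in is just so this runs anywhere):

```python
import torch
import torch.nn as nn

# Stand-in for a pre-trained model. For Whisper you would first load it,
# e.g. model = whisper.load_model("base"), and quantize that instead.
model = nn.Sequential(nn.Linear(80, 256), nn.ReLU(), nn.Linear(256, 80))

# The one line: convert all Linear layers to int8 weights. Activations stay
# float and are quantized on the fly at inference time, which is why this
# needs no calibration data and works well on CPU.
qmodel = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 80)
with torch.no_grad():
    y = qmodel(x)
```

Note that dynamic quantization only accelerates CPU inference; on GPU it is a no-op for speed, which is why the parent comment scopes this to CPU deployment.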
-
question: does anyone know if whisper can support multiple GPUs? like whisper -numgpus 8 -model large test.wav
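As far as I know there is no such flag in the Whisper CLI; the usual workaround is to run one process per GPU and shard the files across them. A hedged sketch (the scheduling is real and runnable; the worker body is hypothetical and shown as comments, since it needs GPUs and the whisper package):

```python
from multiprocessing import Process

def assign_round_robin(files, num_gpus):
    """Pure scheduling step: map each file to a GPU id, round-robin."""
    buckets = {i: [] for i in range(num_gpus)}
    for idx, f in enumerate(files):
        buckets[idx % num_gpus].append(f)
    return buckets

def worker(gpu_id, files):
    # Hypothetical worker: load the model once per GPU, then transcribe.
    # import whisper
    # model = whisper.load_model("large", device=f"cuda:{gpu_id}")
    # for f in files:
    #     model.transcribe(f)
    pass

def transcribe_on_all_gpus(files, num_gpus=8):
    buckets = assign_round_robin(files, num_gpus)
    procs = [Process(target=worker, args=(g, fs)) for g, fs in buckets.items()]
    for p in procs:
        p.start()
    for p in procs:
        p.join()

# Example: 10 call recordings spread over 4 GPUs
buckets = assign_round_robin([f"call_{i}.wav" for i in range(10)], num_gpus=4)
```

This gives near-linear throughput scaling for batch workloads, since each GPU works through its own queue independently.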
-
@silvacarl2 Is there any new progress?
-
the best least-expensive option is AWS EC2 g4dn.xlarge. we have ours set up to auto-scale from 1 to 10 instances, and that is transcribing 450,000 calls a month. I also think whisperx may actually be more accurate because it uses forced alignment of the audio with the transcribed words. we have not tried whisper.cpp in production yet.
-
have you tested
-
LOL! that's a riot, first time i have seen it! AWS EC2 g4dn.xlarge is about 50 cents an hour, and for 450,000 transcriptions a month it is cheaper and more accurate than any other single STT method on the market, including all commercial tools or APIs. and this is just plain vanilla whisper in production LOL! imagine if we can make it even faster!
-
oh yeah, i forgot this: after setup, run sudo apt update && sudo apt upgrade -y
-
Hello @silvacarl2!
-
Huggingface released a JAX/Flax version of Whisper, with inference parallelization but no input_prompt support for now, it seems. Here: https://github.com/sanchit-gandhi/whisper-jax
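For anyone who wants to try it, usage looks roughly like the sketch below. The class and argument names are my recollection of the whisper-jax README (including the pipeline class really being spelled "Pipline" there, as far as I know), so verify against the repo before relying on them; the call is wrapped in a function since it needs the whisper_jax package and an accelerator to actually run.

```python
def transcribe_with_jax(audio_path):
    # Assumed whisper-jax API, per its README; treat these names as
    # unverified. The first call JIT-compiles with JAX and is slow;
    # subsequent calls reuse the compiled function and are fast.
    from whisper_jax import FlaxWhisperPipline
    import jax.numpy as jnp

    pipeline = FlaxWhisperPipline("openai/whisper-large-v2", dtype=jnp.bfloat16)
    return pipeline(audio_path)
```

The compile-once-then-reuse behavior means the speedup only shows up when you transcribe many files with the same pipeline object.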
-
I followed the same install steps on an Ubuntu 20.04 VM hosted on ESXi with a Tesla P100 (1:33 min audio). For faster_whisper (can only use float32), I had to install the NVIDIA cuDNN libraries for it to work. Definitely going to try this on an AWS VM.
-
yup, this is the cheapest/most useful: AWS g4dn.xlarge, using an Nvidia Tesla T4. have you seen any way to run whisper using int8 or peft/lora?
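On the int8 part of the question: faster-whisper (the CTranslate2 backend mentioned elsewhere in this thread) supports int8 inference directly via its compute_type option. A hedged sketch, wrapped in a function since it needs the faster_whisper package, a downloaded model, and a GPU:

```python
def transcribe_int8(audio_path):
    # faster-whisper loads CTranslate2-converted Whisper weights.
    # compute_type="int8_float16" keeps activations in fp16 on GPU while
    # using int8 weights; on CPU you would use compute_type="int8".
    from faster_whisper import WhisperModel

    model = WhisperModel("large-v2", device="cuda", compute_type="int8_float16")
    segments, info = model.transcribe(audio_path)
    return [seg.text for seg in segments], info
```

On a T4 this also cuts memory enough that the large model fits comfortably. PEFT/LoRA is a separate question: it is a fine-tuning technique, not an inference speedup, so it would not help throughput by itself.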
-
Hi @silvacarl2, have you tried PyTorch 2?
-
have not tried PyTorch 2. we are running 10 AWS g4dn.xlarge EC2s that we spin on and off depending on call volumes. it works great, handling 450,000 calls a month. This is unoptimized, but based on AWS pricing for these instances at 50 cents an hour, and the fact that we can turn them on and off whenever we want, it's sufficiently cost-effective for a large volume. it will probably get modified once we finish fine-tuning a model to extract real-time info from the call transcriptions.
-
what are you running it on?
-
this should work fine with no issues using faster-whisper. also, a T4 should be able to support running the full model.
-
@silvacarl2 How are you doing the autoscaling of your g4dn.xlarge EC2s? Are you using karpenter or are you just using a regular autoscaler that is hooked into prometheus? ...or some other mechanism to do the autoscaling?
-
we just wrote our own tiny tool that turns EC2s on and off as needed. g4dn's are great and cheap. we transcribe 500,000 phone calls a month with this, which takes between 1 and 5 g4dns: 1 at night, and up to 5 during the day. 8-)
-
correct. But we have a little cheap front-end EC2 that directs API calls to whichever server is not busy. That server is even cheaper. and if it sees that it does not have enough servers, it turns them on and off by itself. the servers all have static IPs and are already preconfigured.
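The "turn EC2s on and off as needed" approach described above can be sketched in a few lines. The scaling-decision function below is pure and runnable; the glue function is hypothetical (the thresholds and instance list are made up), though boto3's start_instances/stop_instances calls themselves are real:

```python
def desired_servers(queue_depth, per_server_capacity=100,
                    min_servers=1, max_servers=5):
    """Pure scaling decision: one server per per_server_capacity queued
    calls, clamped to [min_servers, max_servers]."""
    needed = -(-queue_depth // per_server_capacity)  # ceiling division
    return max(min_servers, min(max_servers, needed))

def apply_scaling(queue_depth, instance_ids):
    # Hypothetical glue: the preconfigured, static-IP instances described
    # above are started or stopped to match the desired count.
    import boto3
    ec2 = boto3.client("ec2")
    n = desired_servers(queue_depth, max_servers=len(instance_ids))
    ec2.start_instances(InstanceIds=instance_ids[:n])
    if n < len(instance_ids):
        ec2.stop_instances(InstanceIds=instance_ids[n:])
```

Because stopped g4dn instances cost almost nothing, a crude controller like this already captures most of the savings of a "real" autoscaler for batch transcription.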
-
what are the best ways to squeeze as much performance out of whisper?
we have been testing it on various sizes of GPUs/CPUs but do not see a big difference from one to the other.
Any ideas or pointers would be awesome!
carl