
Support quantized models to save memory #14

Closed
osma opened this issue Dec 16, 2023 · 1 comment

Comments


osma commented Dec 16, 2023

First, thanks for creating a fantastic project! I was looking for a way to run Whisper or some other speech-to-text model in realtime. I found several potential solutions but this one is clearly the best, especially for implementing custom applications on top.

I noticed that faster-whisper supports quantized models but RealtimeSTT currently doesn't expose that option. With int8 quantization, models take up much less VRAM (or RAM, if run on CPU only). The quality of model output may suffer a little bit, but I think it's still a worthwhile optimization when memory is tight.

I have a laptop with an integrated NVIDIA GeForce MX150 GPU that only has 2GB VRAM. I was able to run the small model without problems (with tiny as the realtime model), but the medium and larger models gave a CUDA out of memory error.

To enable quantization, I tweaked the initialization of WhisperModel here

```python
self.realtime_model_type = faster_whisper.WhisperModel(
    model_size_or_path=self.realtime_model_type,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)
```
and here
```python
model = faster_whisper.WhisperModel(
    model_size_or_path=model_path,
    device='cuda' if torch.cuda.is_available() else 'cpu'
)
```

by adding the parameter compute_type='int8'. This resulted in quantized models, and the medium model can now fit on my feeble GPU; sadly, the large-v2 model is still too big.
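For clarity, here is a minimal sketch of what the tweaked initialization looks like with the quantization applied. compute_type is an existing faster-whisper parameter; the model_path value shown here is only a placeholder for the same variable as in the snippet above:

```python
import faster_whisper
import torch

model_path = 'medium'  # placeholder: a model size or local path, as in the snippet above

# Same initialization as above, with int8 quantization requested.
# compute_type='int8' makes faster-whisper load int8 weights, substantially
# reducing VRAM (or RAM) usage at a small cost in output quality.
model = faster_whisper.WhisperModel(
    model_size_or_path=model_path,
    device='cuda' if torch.cuda.is_available() else 'cpu',
    compute_type='int8'
)
```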

GPU VRAM requirements as reported by nvidia-smi with and without quantization of the main model (realtime model is always tiny with the same quantization applied as for the main model):

| model    | default       | int8          |
| -------- | ------------- | ------------- |
| tiny     | 542 MiB       | 246 MiB       |
| base     | 914 MiB       | 278 MiB       |
| small    | 1386 MiB      | 532 MiB       |
| medium   | out of memory | 980 MiB       |
| large-v2 | out of memory | out of memory |

This could be exposed as an additional compute_type parameter on AudioToTextRecorder, or possibly as two separate parameters: one for the realtime model and another for the main model. The value(s) would then simply be passed through as compute_type to the corresponding WhisperModel(s); a rough sketch of one possible shape is below.
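A minimal sketch of what such a constructor could look like with two separate parameters. The names compute_type and realtime_compute_type are only illustrative suggestions, not the actual RealtimeSTT interface, and the attribute names mirror the snippets above:

```python
import faster_whisper
import torch

class AudioToTextRecorder:
    # Hypothetical sketch: compute_type and realtime_compute_type are
    # illustrative parameter names, not the existing RealtimeSTT API.
    def __init__(self, model='small', realtime_model_type='tiny',
                 compute_type='default', realtime_compute_type='default'):
        device = 'cuda' if torch.cuda.is_available() else 'cpu'

        # Main transcription model; the value is passed straight through.
        self.model = faster_whisper.WhisperModel(
            model_size_or_path=model,
            device=device,
            compute_type=compute_type)

        # Realtime model; could be quantized independently of the main model.
        self.realtime_model_type = faster_whisper.WhisperModel(
            model_size_or_path=realtime_model_type,
            device=device,
            compute_type=realtime_compute_type)
```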


KoljaB (Owner) commented Dec 17, 2023

This is a good idea, thank you very much for this great input. Will integrate it in the next release.

KoljaB added a commit that referenced this issue Jan 29, 2024
KoljaB closed this as completed Nov 15, 2024