Can real-time transcription be achieved? #1653
Comments
I am the author of Caption Anything and Whisper Dictation, which transcribe either from the record monitor device (what you hear) or from the microphone. They can connect to either ... The challenge was making timely recordings: buffering enough audio to produce cogent text while keeping recording times short enough to minimize delays. I had to implement a queuing system to maintain smooth streaming and prevent delays when transcribing long clips. It stores incoming audio clips in a queue and processes them one by one as resources become available. This keeps transcription running continuously without overwhelming the system with too many simultaneous tasks. AI assistants were helpful in pointing out methods of doing this. Caption Anything uses a flood of 2-second clips from a mix of the actively playing desktop streams, while Whisper Dictation uses sound level detection to record longer clips from the microphone for better accuracy. Whisper falls behind on slow systems, and increasing the number of threads won't help, so real-time transcription is not an option on every machine. But speed can be improved by compiling with acceleration like cuBLAS or CLBlast, using the tiny models, or using a client-server setup. Quantizing the models to 4-bit also cuts transcription time in half on my old laptop, so it can keep up. For speed, it is critical that the loaded model fits easily in available RAM.
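The queuing approach described in this comment could be sketched roughly as follows. This is a minimal illustration, not the actual Caption Anything or Whisper Dictation code: `fake_transcribe`, `transcription_worker`, and `run_pipeline` are hypothetical names, and the transcription step is a placeholder rather than a real Whisper call.

```python
import queue
import threading


def fake_transcribe(clip):
    """Hypothetical stand-in for a call into Whisper; returns placeholder text."""
    return f"<text for {clip}>"


def transcription_worker(clips, results, transcribe):
    """Pull clips off the queue one at a time until a None sentinel arrives."""
    while True:
        clip = clips.get()
        if clip is None:  # sentinel: the recorder has finished
            break
        results.append(transcribe(clip))


def run_pipeline(recorded_clips, transcribe=fake_transcribe):
    """Feed clips to a single worker so only one transcription runs at a time."""
    clips = queue.Queue()
    results = []
    worker = threading.Thread(
        target=transcription_worker, args=(clips, results, transcribe)
    )
    worker.start()
    # In a real recorder this loop would be the capture side, pushing each
    # short clip as soon as it has been written.
    for clip in recorded_clips:
        clips.put(clip)
    clips.put(None)
    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipeline(["clip0.wav", "clip1.wav", "clip2.wav"]))
```

Because the queue is first-in, first-out and there is a single worker, the text comes back in recording order even when transcription is slower than capture; the backlog simply grows in the queue instead of spawning simultaneous jobs.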
Did you look at the examples? There are possible problems with accuracy: This pull request suggests someone else is at least working on the code
Thanks for the links. Those examples should work well if you have a fairly fast machine. The problems with accuracy are probably due to buffering, which was a big challenge I had to overcome. Once they get that sorted, I would be happy to use real-time inference in my code. But for now, I will be doing the audio buffering myself. :) Whisper Dictation does "everything else" outside the scope of whisper, such as pasting the resulting text into the active window or terminal, communicating with stable-diffusion and chat servers, speaking the results out loud, launching programs, and running editing commands with voice control. It is important to get good accuracy, so we do everything we can to achieve the best possible result.
Yeah, that's all useful. I did the automation a while ago for an internal app using native cpp, libcurl and imgui, but it is probably a helpful integration for the Python community. I haven't looked at Langchain, autogpt, etc. recently, but would they benefit? Incidentally, I've just updated some other PC components and moved to Win 11, and real-time inference is now working perfectly, so fingers crossed it lasts. I will have to test on some other machines to see what makes the difference.
Good ideas. By "buffering" I mean getting clean clips of audio, which depends a great deal on having the microphone input volume set appropriately. To improve accuracy, voice activity detection (VAD) waits for silence before sending the clip off to be transcribed, instead of cutting off in the middle. You may have fixed your mic volume level. The example uses -vth 0.6 for silence detection. But audio clipping from having the mic volume set too high can severely reduce accuracy. So can volume that is too low: only the loudest portions of speech are picked up, and parts of it are incorrectly identified as silence. Setting the volume can be tricky. And that too is outside of scope! But it should be possible for the software to detect clipping or an overall too-low volume, attempt to be smart about it, and adjust the mic mix. Or, if that fails, warn the user somehow via stderr.
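The level check suggested in this comment might look something like the sketch below. It assumes signed 16-bit PCM samples already loaded into a Python sequence; `check_levels`, `warn_if_bad`, and both thresholds are hypothetical and untuned, and none of this is part of whisper.cpp or its examples.

```python
import math
import sys

FULL_SCALE = 32767  # largest positive value of a signed 16-bit sample


def check_levels(samples, clip_ratio=0.01, quiet_rms=0.01):
    """Classify a clip as 'clipping', 'too quiet', or 'ok'.

    clip_ratio and quiet_rms are illustrative guesses, not tuned values.
    """
    if not samples:
        return "too quiet"
    # Samples pinned at (or next to) full scale indicate the input is clipping.
    pinned = sum(1 for s in samples if abs(s) >= FULL_SCALE - 1)
    if pinned / len(samples) > clip_ratio:
        return "clipping"
    # RMS level normalized to the range 0..1.
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / FULL_SCALE
    if rms < quiet_rms:
        return "too quiet"
    return "ok"


def warn_if_bad(samples):
    """Print a warning to stderr when the mic level looks wrong."""
    status = check_levels(samples)
    if status != "ok":
        print(f"warning: microphone input is {status}", file=sys.stderr)
    return status
```

Running `warn_if_bad` on each recorded clip before queuing it for transcription would at least tell the user why accuracy is poor; adjusting the mixer automatically is platform-specific and is left out here.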
You know this is very important for my work, and I will carefully check it in my spare time. Thank you!
Thank you so much!!!
Hi there!
Thanks for your hard work!
Due to work requirements, I need to transcribe speech into text in real time, and the accuracy requirements are strict.
Is there a simple way to achieve real-time transcription?
Is there corresponding code for it?
Or are there other methods that could achieve it?
Thank you!