Is it possible to get the final results faster in real-time Speech-to-Text? #785
-
I'm working on a subtitles system that overlays real-time transcription on my PC. I'm using a QWidget with a QLabel to display the text. My problem is that I want to avoid visible corrections in the transcription. I've tried using only the interim results, but I get many misspellings without the final results' corrections. Then I tried using only the final results, which works well, but I'm hoping to reduce the delay. Is there any way to get the final results faster, even if I get a higher error rate, or is this just how Deepgram works?
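For context, here's a minimal sketch of my display side, assuming PyQt5; the `on_transcript(text, is_final)` callback name is just illustrative of how I feed streaming results into the widget:

```python
import sys
from PyQt5.QtWidgets import QApplication, QLabel, QWidget
from PyQt5.QtCore import Qt


class SubtitleOverlay(QWidget):
    """Frameless, always-on-top widget showing one line of live subtitles."""

    def __init__(self):
        super().__init__()
        self.setWindowFlags(Qt.FramelessWindowHint | Qt.WindowStaysOnTopHint)
        self.label = QLabel("", self)
        self.label.resize(800, 40)
        self.resize(800, 40)
        self.finalized = ""  # text already confirmed by final results

    def on_transcript(self, text: str, is_final: bool):
        """Called for every streaming result from the transcription client."""
        if is_final:
            # Final results are corrected, so commit them permanently.
            self.finalized += text + " "
            self.label.setText(self.finalized)
        else:
            # Interim results are provisional and may be revised later.
            self.label.setText(self.finalized + text)


if __name__ == "__main__":
    app = QApplication(sys.argv)
    overlay = SubtitleOverlay()
    overlay.show()
    overlay.on_transcript("hello wrold", is_final=False)  # misspelled interim
    overlay.on_transcript("hello world", is_final=True)   # corrected final
    sys.exit(app.exec_())
```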
Replies: 2 comments
-
Thanks for asking your question about Deepgram! If you didn't already include it in your post, please be sure to add as much detail as possible so we can assist you efficiently.
-
TL;DR: This is, in some sense, just how Deepgram works.

Final results are streamed on average once every 3-5 seconds, except when a speaker stops speaking, in which case we may return a result sooner than 3 seconds. The accuracy of final results goes up when the model is given a reasonable amount of audio context (hence the 3+ second chunk durations). It's possible to run inference on much smaller chunks, but accuracy would go down; that is exactly what interim results are: predictions made on smaller chunks. Producing smaller and smaller chunks while maintaining accuracy is a solid research goal, and something we certainly think about, but there's nothing to report on this right now.
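For concreteness, the distinction shows up in the streaming responses as an `is_final` flag alongside `channel.alternatives[0].transcript`. A minimal sketch of a handler that either keeps only final results or shows interims provisionally and overwrites them when the matching final arrives (the `render` callback here is just whatever updates your QLabel, not part of any SDK):

```python
import json

finalized_parts = []  # transcript segments confirmed by final results


def handle_message(raw: str, render, show_interim: bool = True):
    """Handle one streaming response and re-render the subtitle line.

    `render` is whatever updates the UI (e.g. your QLabel's setText).
    """
    msg = json.loads(raw)
    alternatives = msg.get("channel", {}).get("alternatives", [])
    if not alternatives:
        return
    transcript = alternatives[0].get("transcript", "")
    if not transcript:
        return

    if msg.get("is_final"):
        # Final result: accuracy benefits from the ~3-5 s of context,
        # and this text will not be revised again.
        finalized_parts.append(transcript)
        render(" ".join(finalized_parts))
    elif show_interim:
        # Interim result: a lower-accuracy prediction on a smaller chunk,
        # shown provisionally and overwritten by later messages.
        render(" ".join(finalized_parts + [transcript]))
```

Showing interims and overwriting them trades a brief flash of uncorrected text for much lower perceived latency; filtering to finals only (`show_interim=False`) gives the behavior you have today.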