Is it possible to get the final results faster in real-time Speech-to-Text? #785

TL;DR: This is, in some sense, just how Deepgram works.

Final results are streamed on average once every 3-5 seconds; the exception is when a speaker stops speaking, in which case we may return a result sooner than that. The accuracy of final results goes up when the model is given a reasonable amount of audio context (hence the 3+ second chunk durations). It's possible to run inference on much smaller chunks, but accuracy would go down; this is exactly what interim results are: predictions made on smaller chunks. Being able to produce smaller and smaller chunks while maintaining accuracy is a solid research goal, and something we certainly think about, but there's nothing to report …
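To make the interim/final distinction concrete, here is a minimal sketch of how a client might consume such a stream. It assumes responses shaped like Deepgram's streaming JSON (an `is_final` flag and a transcript under `channel.alternatives`); the sample messages themselves are invented for illustration, not real API output:

```python
# Sketch: separating interim from final results in a stream of
# Deepgram-style streaming responses. The response shape assumed here
# (is_final, channel.alternatives[0].transcript) follows Deepgram's
# streaming format; the sample messages below are made up.

def collect_final_transcript(messages):
    """Concatenate transcripts from messages marked is_final.

    Interim results (is_final == False) are lower-accuracy predictions
    made on small audio chunks; final results arrive roughly every
    3-5 seconds (or sooner at the end of an utterance) and supersede
    the interim guesses that preceded them.
    """
    finals = []
    for msg in messages:
        transcript = msg["channel"]["alternatives"][0]["transcript"]
        if msg["is_final"] and transcript:
            finals.append(transcript)
    return " ".join(finals)

# Example stream: two interim predictions superseded by one final.
stream = [
    {"is_final": False,
     "channel": {"alternatives": [{"transcript": "hello"}]}},
    {"is_final": False,
     "channel": {"alternatives": [{"transcript": "hello wor"}]}},
    {"is_final": True,
     "channel": {"alternatives": [{"transcript": "hello world"}]}},
]
print(collect_final_transcript(stream))  # -> hello world
```

A UI that wants lower perceived latency can display interim results as they arrive and then replace them with the corresponding final, keeping only finals in the permanent transcript.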

Answer selected by jpvajda