Is it possible to get the final results faster in real-time Speech-to-Text? #785

TL;DR: This is, in some sense, just how Deepgram works.

Final results are streamed on average once every 3-5 seconds; the exception is when a speaker stops speaking, in which case we may return a result sooner than that. The accuracy of final results goes up when the model is given a reasonable amount of audio context (hence the 3+ second chunk durations). It's possible to run inference on much smaller chunks, but accuracy would go down; this is exactly what interim results are: predictions made on smaller chunks. Being able to produce smaller and smaller chunks while maintaining accuracy is a solid research goal, and something we certainly think about, but there's nothing to report …
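To make the interim/final distinction concrete, here is a minimal sketch of how a client might consume such a stream. It assumes responses shaped like Deepgram's streaming JSON (an `is_final` flag and a transcript under `channel.alternatives`); the sample messages themselves are invented for illustration, not real API output:

```python
# Sketch: separating interim from final results in a stream of
# Deepgram-style streaming responses. The response shape assumed here
# (is_final, channel.alternatives[0].transcript) follows Deepgram's
# streaming format; the sample messages below are made up.

def collect_final_transcript(messages):
    """Concatenate transcripts from messages marked is_final.

    Interim results (is_final == False) are lower-accuracy predictions
    made on small audio chunks; final results arrive roughly every
    3-5 seconds (or sooner at the end of an utterance) and supersede
    the interim guesses that preceded them.
    """
    finals = []
    for msg in messages:
        transcript = msg["channel"]["alternatives"][0]["transcript"]
        if msg["is_final"] and transcript:
            finals.append(transcript)
    return " ".join(finals)

# Example stream: two interim predictions superseded by one final.
stream = [
    {"is_final": False,
     "channel": {"alternatives": [{"transcript": "hello"}]}},
    {"is_final": False,
     "channel": {"alternatives": [{"transcript": "hello wor"}]}},
    {"is_final": True,
     "channel": {"alternatives": [{"transcript": "hello world"}]}},
]
print(collect_final_transcript(stream))  # -> hello world
```

A UI that wants lower perceived latency can display interim results as they arrive and then replace them with the corresponding final, keeping only finals in the permanent transcript.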

Answer selected by jpvajda