Inferencing in real-time #847
Not yet, but this is something I have in mind for demo purposes. |
Perhaps you could have a look at |
@lissyx would there have to be any neural network architectural changes to support real-time inferencing, or would it just be a client/server engineering challenge? |
It all depends on what exactly you want to achieve. "Real" streaming would certainly require changes to the network; the bidirectional recurrent layers force us to send "complete" data. |
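To illustrate the point about bidirectional layers, here is a minimal sketch (not DeepSpeech's actual code) of why they block streaming: the backward pass runs from the last frame to the first, so no output can be produced until the entire utterance has been received.

```python
import numpy as np

def simple_rnn(frames, W, U, h0):
    """Run a plain recurrent pass over a sequence of feature frames."""
    h, outputs = h0, []
    for x in frames:
        h = np.tanh(W @ x + U @ h)   # each step only needs past context
        outputs.append(h)
    return outputs

def bidirectional_rnn(frames, W_f, U_f, W_b, U_b, h0):
    """Forward pass plus a pass over the *reversed* sequence.
    The backward half cannot start until the final frame is known,
    which is why chunk-by-chunk streaming is not possible as-is."""
    fwd = simple_rnn(frames, W_f, U_f, h0)
    bwd = simple_rnn(frames[::-1], W_b, U_b, h0)[::-1]
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

# Tiny usage: 4 frames of 3-dim features, hidden size 2.
rng = np.random.default_rng(0)
frames = [rng.normal(size=3) for _ in range(4)]
W, U, h0 = rng.normal(size=(2, 3)), rng.normal(size=(2, 2)), np.zeros(2)
outs = bidirectional_rnn(frames, W, U, W, U, h0)  # needs all 4 frames up front
```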
@lissyx so with the current network, does sending chunks of audio data at a time make sense? Intuitively, I can see that it might inflate the WER a little because the language model would not have the complete sequence of terms. |
The WER increase will not come just from the language model, but also from the network itself, since it depends on having the entire utterance available. The performance will depend on your training data. If you only train with long utterances (several seconds) and then try to do inference on one-second chunks, it'll probably perform very poorly. |
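For illustration, naive chunked inference looks like the sketch below; `transcribe` is a hypothetical stand-in for whatever model call you use, and each chunk is decoded with no access to the rest of the utterance, which is exactly why accuracy suffers.

```python
import wave
import numpy as np

def naive_chunked_transcribe(wav_path, transcribe, chunk_seconds=1.0):
    """Split a 16 kHz mono WAV into fixed-size chunks and decode each one
    independently. Each call sees only its own chunk, so both the acoustic
    model and the language model lose the surrounding context."""
    with wave.open(wav_path, "rb") as w:
        rate = w.getframerate()
        audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)
    chunk_len = int(rate * chunk_seconds)
    pieces = [audio[i:i + chunk_len] for i in range(0, len(audio), chunk_len)]
    return " ".join(transcribe(chunk, rate) for chunk in pieces)
```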
@reuben that makes sense. Thanks for the clarification. |
Does DeepSpeech (and I suppose this is a feature of CTC) require that the incoming features be fed at word boundaries? What if I construct an online moving-window MFCC calculator and feed in the features without regard to word boundaries? Say my window length is long enough to accommodate 5-6 grams; the first and last gram may be partial because the segmentation is done without regard to word boundaries. Can such a design still infer words? |
We assume incoming features are fed at word boundaries. Performance is further improved if they are at sentence boundaries, due to the language model being trained on sentences. |
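For reference, the moving-window idea from the question above might look like the sketch below, assuming the python_speech_features package is available: windows are cut at fixed times with no knowledge of word boundaries, which is what the answer cautions against.

```python
import numpy as np
from python_speech_features import mfcc  # assumed installed

def windowed_mfcc(samples, rate, window_seconds=2.0, hop_seconds=1.0):
    """Yield MFCC features over a sliding window of raw audio samples.
    Windows start and end at fixed times, so they may cut through words;
    the network was not trained on such fragments."""
    win, hop = int(rate * window_seconds), int(rate * hop_seconds)
    for start in range(0, max(len(samples) - win, 1), hop):
        chunk = samples[start:start + win]
        yield mfcc(chunk, samplerate=rate)  # shape: (num_frames, num_cepstra)
```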
So what would be the recommended steps (in terms of training data and network topology) to build a speech recognition streaming service? |
I started to write a server to test inference on a generated model. It is available here: This is a very first implementation that listens for HTTP POST requests. I plan to add support for WebSockets, provide a sample web app to try it from a browser, and package it in a Docker container. @lissyx: as already discussed, tell me if you are interested in such a project. |
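That server's link was omitted above, but the general idea can be sketched with Flask; `transcribe` here is a hypothetical stand-in for the actual model call, not code from that project.

```python
from flask import Flask, request, jsonify

app = Flask(__name__)

def transcribe(wav_bytes):
    # Hypothetical stand-in: decode the bytes and run them through
    # whatever DeepSpeech model/bindings you have available.
    raise NotImplementedError

@app.route("/stt", methods=["POST"])
def stt():
    """Accept a WAV payload in the POST body and return its transcript."""
    transcript = transcribe(request.get_data())
    return jsonify({"transcript": transcript})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```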
Update: I just published a Dockerfile to easily start the server. I tested it with deepspeech-0.1.0 and the published pre-trained model. |
I wrote a Django-based web app (https://github.com/sci472bmt/django-deepspeech-server) that can record sound from the browser and return its transcription. It's at a very early stage. I am planning to make it WebSocket-based and Google Speech API-compatible, so that I don't have to change much in my other projects apart from the socket URL. I'll try to take it to real-time transcription as soon as possible. |
Can you please elaborate a bit more on how we can do speech-to-text in real time using a bidirectional RNN? As far as I can see, we need to wait until the end of the speech in order to begin decoding... Maybe I misunderstand something; I will be happy to be corrected. |
@alanbekker I have two plans. I am currently doing research into which will work best:
1. This is more like pseudo real time. I'll train my model with small utterances and stream WAV files to the server. Each WAV file will contain the voice from starting time t1; as the speaker keeps speaking, the WAV file streamed to the server will keep growing. I think a well-trained DeepSpeech server can return approximately correct transcriptions, and in the end the server will receive the full sentence as audio, which it can transcribe. This should work at least for short audio (say, 5 seconds). Moreover, we can couple this with silence detection and mark unchanging transcriptions as final to optimize performance (see the sketch below).
2. Another idea is to use one-pass decoding with an RNNLM, as mentioned in this paper: https://pdfs.semanticscholar.org/8ad4/4f5161ad04c71fe052582168bd7a45217d36.pdf |
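A rough sketch of plan 1 above (growing audio buffer, re-transcribed on every update, with unchanged results promoted to final); `transcribe`, the RMS silence threshold, and the reset behaviour are assumptions added for illustration, not details from the thread.

```python
import numpy as np

def pseudo_realtime(chunks, transcribe, rate=16000, silence_rms=300):
    """Plan 1 sketch: append each incoming chunk to one buffer,
    re-transcribe the whole buffer every time, and treat a transcript
    that stops changing across a silent chunk as 'final'."""
    buffer = np.zeros(0, dtype=np.int16)
    last = ""
    for chunk in chunks:                       # chunk: np.int16 samples
        buffer = np.concatenate([buffer, chunk])
        text = transcribe(buffer, rate)        # hypothetical model call
        rms = np.sqrt(np.mean(chunk.astype(np.float64) ** 2))
        if rms < silence_rms and text == last and text:
            yield ("final", text)
            buffer = np.zeros(0, dtype=np.int16)  # start a new utterance
            last = ""
        else:
            yield ("partial", text)
            last = text
```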
Regarding (1), you assume you can train your model on small utterances, but in order to do so you will need to align the small utterances with the corresponding transcriptions (more labeled data is needed). Am I wrong? Could you please explain how (2) is an alternative to (1)? Thanks! |
You are correct that more labeled data is needed for plan 1. That is why I am looking into the possibility of using subtitles. |
@ashwan1 for 2 you could also try https://github.com/inikdom/rnn-speech |
Thanks a lot for the suggestion :) |
@ashwan1 you may also check https://github.com/SeanNaren/deepspeech.pytorch |
@lissyx A follow-up question: how can we use this model to transcribe long videos? A 5-second clip is too short... |
@jenniferzhu How long? Until #1275 is done, the best way would be to cut the audio on silences. |
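One way to cut on silences, as a sketch assuming the pydub package (a choice made here for illustration, not something prescribed in the thread):

```python
from pydub import AudioSegment
from pydub.silence import split_on_silence

def split_long_audio(path, min_silence_len=500, silence_thresh=-40):
    """Split a long recording into utterance-sized segments at pauses,
    so each piece can be fed to the model separately."""
    audio = AudioSegment.from_file(path)
    return split_on_silence(
        audio,
        min_silence_len=min_silence_len,  # ms of silence required to split
        silence_thresh=silence_thresh,    # dBFS below which audio counts as silence
        keep_silence=200,                 # keep a little padding around each segment
    )
```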
@lissyx is it possible for me to test #1275 locally? I really want to build an app that does speech-to-text in "real" time. Can you please also suggest the steps I need to follow to set up and run this branch? Thanks in advance. |
@kausthub Just check out the streaming-inference branch and build with it. |
I couldn't find the steps to build. Sorry if this is a beginner-level question. Thanks in advance. |
@kausthub It's in native_client/README.md: https://github.com/mozilla/DeepSpeech/blob/streaming-inference/native_client/README.md |
Thanks, I will check that out. |
Nothing more to do here. |
This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs. |
Is there streaming server & client code that does the following?
(a) on the client side, continuously generates PCM samples from the mic connected to the PC, sends the samples to the server every, say, 100 ms, and prints out the transcripts from the server as they arrive (I am thinking of a client similar to the Google Speech API streaming Python client example: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_mic.py; see the client sketch below)
(b) on the server side, responds to the client's samples by waiting for enough samples to build up, invokes DeepSpeech, sends the transcript back to the client, and does this continuously as well
thanks,
Buvana
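No such pair is attached in this thread, but a minimal sketch of the client half described in (a) could look like this, assuming the pyaudio and requests packages and a purely hypothetical server endpoint:

```python
import pyaudio
import requests

RATE, CHUNK_MS = 16000, 100
CHUNK = RATE * CHUNK_MS // 1000
SERVER = "http://localhost:8080/stream"   # hypothetical endpoint

def stream_mic():
    """Read 100 ms of 16 kHz mono PCM at a time from the microphone,
    POST each chunk to the server, and print whatever transcript
    (possibly partial) the server returns."""
    pa = pyaudio.PyAudio()
    stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                     input=True, frames_per_buffer=CHUNK)
    try:
        while True:
            data = stream.read(CHUNK)
            resp = requests.post(SERVER, data=data)
            if resp.text:
                print(resp.text)
    finally:
        stream.stop_stream()
        stream.close()
        pa.terminate()

if __name__ == "__main__":
    stream_mic()
```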