
Inferencing in real-time #847

Closed
abuvaneswari opened this issue Sep 22, 2017 · 30 comments

Comments

@abuvaneswari

Is there streaming server & client code that does the following?

(a) on the client side, continuously generates PCM samples from the mic connected to the PC, sends the samples to the server every, say, 100 ms, and prints the transcripts from the server as they arrive (I am thinking of a client similar to the Google Speech API streaming Python client example: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_mic.py)

(b) on the server side, waits for enough of the client's samples to build up, invokes DeepSpeech, sends the transcript back to the client, and does this continuously as well (a rough sketch of the client half follows)
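A minimal sketch of the client half described in (a), assuming PyAudio for capture and a hypothetical server at HOST:PORT that accepts raw 16-bit 16 kHz PCM and writes transcript text back on the same socket. No such server ships with DeepSpeech; the protocol here is invented for illustration.

```python
import select
import socket

import pyaudio

HOST, PORT = "localhost", 8080   # hypothetical server address
RATE = 16000                     # sample rate the models expect
CHUNK = RATE // 10               # 100 ms of audio per send

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
sock = socket.create_connection((HOST, PORT))

try:
    while True:
        # ship 100 ms of raw PCM from the mic to the server
        sock.sendall(stream.read(CHUNK))
        # print whatever transcript bytes the server has sent back so far
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            print(sock.recv(4096).decode("utf-8"), end="", flush=True)
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
    sock.close()
```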

thanks,
Buvana

@lissyx
Collaborator

lissyx commented Sep 22, 2017

Not yet, but this is something I have in mind for demo purposes.

@elpimous

elpimous commented Sep 22, 2017

Perhaps you could have a look at ReSpeaker on GitHub for the recording part. They work on a specific mic array and on different recording types.

@LearnedVector

@lissyx would there have to be any neural-network architectural changes to support real-time inferencing, or would it just be a client/server engineering challenge?

@lissyx
Collaborator

lissyx commented Sep 25, 2017

It all depends on what exactly you want to achieve. "Real" streaming would require changes to the network for sure; the bidirectional recurrent layers force us to send "complete" data.
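An illustrative PyTorch sketch (not DeepSpeech's actual code, which is TensorFlow) of why a bidirectional layer blocks streaming: its backward pass reads the sequence from the end, so no output for any frame exists until the whole utterance has arrived.

```python
import torch
import torch.nn as nn

# A unidirectional RNN can be stepped frame-by-frame: the output at
# time t depends only on frames 0..t, so it suits streaming.
uni = nn.LSTM(input_size=26, hidden_size=64, batch_first=True)

# A bidirectional RNN runs a second pass from the last frame backwards:
# the output at time t also depends on frames t..T-1, so the complete
# utterance must be available before any output can be produced.
bi = nn.LSTM(input_size=26, hidden_size=64, batch_first=True,
             bidirectional=True)

x = torch.randn(1, 100, 26)  # (batch, frames, acoustic features)
out_uni, _ = uni(x)   # could have been computed incrementally
out_bi, _ = bi(x)     # needs all 100 frames up front
print(out_uni.shape, out_bi.shape)  # (1, 100, 64) and (1, 100, 128)
```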

@LearnedVector

@lissyx so with the current network, does sending chunks of audio data at a time make sense? Intuitively, I can see that it might inflate the WER a little, since the language model would not see the complete sequence of terms.

@reuben
Contributor

reuben commented Sep 26, 2017

The WER increase will not come just from the language model but also from the network itself, since it depends on having the entire utterance available. The performance will depend on your training data: if you only train with long utterances (several seconds) and then try to do inference with chunks of one second each, it'll probably perform very poorly.

@LearnedVector

@reuben that makes sense. Thanks for the clarification.

@abuvaneswari
Author

Does DeepSpeech require (and I suppose it's a feature of CTC) that the incoming features be fed at word boundaries? What if I construct an online moving-window MFCC calculator and feed in the features without regard to word boundaries? Let us say that my window length is long enough to accommodate 5-6 grams; the first and last grams may be partial because the segmentation is done without regard to word boundaries. Can such a design still infer words?
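A rough sketch of such a moving-window featurizer, assuming the python_speech_features package (the same MFCC implementation DeepSpeech's own feature pipeline uses); the window and step sizes are illustrative choices, and the window boundaries deliberately ignore word boundaries, which is exactly the open question here.

```python
import numpy as np
from python_speech_features import mfcc

class SlidingMFCC:
    """Buffer PCM samples and emit MFCCs over a sliding window."""

    def __init__(self, rate=16000, window_s=2.0, step_s=0.5):
        self.rate = rate
        self.window = int(rate * window_s)  # ~2 s fits roughly 5-6 words
        self.step = int(rate * step_s)      # slide by 500 ms per call
        self.buffer = np.zeros(0, dtype=np.int16)

    def push(self, samples):
        """Append new PCM samples; return MFCCs per complete window."""
        self.buffer = np.concatenate([self.buffer, samples])
        feats = []
        while len(self.buffer) >= self.window:
            # 13 cepstral coefficients per 25 ms frame, 10 ms hop;
            # the first/last words in the window may be cut mid-phoneme
            feats.append(mfcc(self.buffer[:self.window],
                              samplerate=self.rate, numcep=13))
            self.buffer = self.buffer[self.step:]  # slide, keep overlap
        return feats
```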

@kdavis-mozilla
Contributor

kdavis-mozilla commented Oct 1, 2017

We assume incoming features are fed at word boundaries. Performance is further improved if they are at sentence boundaries, due to the language model being trained on sentences.

@alanbekker

So what would be the recommended steps (in terms of training data and network topology) to build a speech-recognition streaming service?


@MainRo

MainRo commented Nov 16, 2017

I started to write a server to test inference on a generated model. It is available here:
https://github.com/MainRo/deepspeech-server

This is a very first implementation that listens for HTTP POST requests. I plan to add WebSocket support, provide a sample web app to try it from a browser, and package it in a Docker container.

@lissyx: as already discussed, tell me if you are interested in such a project.
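For concreteness, a minimal sketch of an endpoint of this kind (not the deepspeech-server code itself), using Flask and the deepspeech 0.1.0-era Python API; the Model constructor arguments changed in later releases, and the file paths are placeholders.

```python
import io

import numpy as np
import scipy.io.wavfile as wav
from flask import Flask, request
from deepspeech.model import Model  # deepspeech 0.1.0 package layout

# hyperparameters matching the released 0.1.0 model
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
model = Model("output_graph.pb", N_FEATURES, N_CONTEXT,
              "alphabet.txt", BEAM_WIDTH)        # placeholder paths

app = Flask(__name__)

@app.route("/stt", methods=["POST"])
def stt():
    # expects the raw bytes of a 16 kHz mono WAV file as the request body
    fs, audio = wav.read(io.BytesIO(request.data))
    return model.stt(audio.astype(np.int16), fs)

if __name__ == "__main__":
    app.run(port=8080)
```

POSTing a 16 kHz mono WAV as the raw request body to /stt would exercise it.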

@MainRo

MainRo commented Dec 5, 2017

Update: I just published a Dockerfile to easily start the server. I tested it with deepspeech-0.1.0 and the published pre-trained model.
See here: https://github.com/MainRo/docker-deepspeech-server

@ashwan1

ashwan1 commented Jan 6, 2018

I wrote a Django-based web app that can record sound from the browser and return its transcription. It's at a very early stage. I plan to make it WebSocket-based with a Google Speech API-style interface, so that I don't have to change much in my other projects apart from the socket URL.
I'll try to take it to real-time transcription as soon as possible.

@alanbekker

alanbekker commented Jan 7, 2018 via email

@ashwan1

ashwan1 commented Jan 7, 2018

@alanbekker I have two plans and am currently researching which will work best:

  1. This is more like pseudo real time. I'll train my model with small utterances and stream WAV files to the server. Each WAV file will contain the voice from starting time t1; as the speaker continues speaking, the WAV file streamed to the server keeps growing. I think a well-trained DeepSpeech server can return approximately correct transcriptions along the way, and at the end it will have received the full sentence as audio, which it can transcribe. This should work at least for short audio (say, 5 seconds). Moreover, we can couple this with silence detection and mark unchanging transcriptions as final to optimize performance (see the sketch after this list).
  2. Another idea is to use one-pass decoding with an RNNLM, as mentioned in this paper.
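A hedged sketch of plan 1's re-transcribe-and-finalize loop, assuming a loaded model exposing the 0.1.0-era stt(audio, fs) method; the stable-prefix heuristic and chunk size are illustrative choices, not anything DeepSpeech provides.

```python
import numpy as np

def common_prefix(a: str, b: str) -> str:
    """Longest shared prefix of two transcripts."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def pseudo_streaming(chunks, model, rate=16000):
    """chunks: iterable of int16 numpy arrays, e.g. 500 ms each."""
    audio = np.zeros(0, dtype=np.int16)
    previous, final = "", ""
    for chunk in chunks:
        audio = np.concatenate([audio, chunk])
        current = model.stt(audio, rate)  # re-transcribe the whole buffer
        # whatever stopped changing between passes is marked as final
        stable = common_prefix(previous, current)
        if len(stable) > len(final):
            print("final:  ", stable[len(final):])
            final = stable
        print("partial:", current[len(final):])
        previous = current
    return previous
```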

@alanbekker

alanbekker commented Jan 8, 2018 via email

@ashwan1

ashwan1 commented Jan 10, 2018

You are correct that more labeled data is needed for plan 1, so I am looking into the possibility of using subtitles. But I would also like to see the distributed model's performance in such a scenario.
Plan 2 is not exactly an alternative to plan 1; it will require changing the network structure, so it's just another thing I will be trying. But the research for better alternatives continues...
Ultimately, we need to make this real time (more or less like the Google Speech API). Anything that works efficiently will do.

@AMairesse

@ashwan1 for plan 2 you could also try https://github.com/inikdom/rnn-speech
Performance is not great because it still lacks a language model, but inference is unidirectional, so you could easily build a real-time transcription layer with state in order to "see" the transcription evolve while receiving the audio.

@ashwan1

ashwan1 commented Jan 14, 2018

Thanks a lot for the suggestion :)
I will definitely try that.

@AMairesse

@ashwan1 you may also check https://github.com/SeanNaren/deepspeech.pytorch
Performance is way better, even without a language model. The pre-trained networks are bidirectional, but there is support for a unidirectional mode as in the DeepSpeech2 paper.

@jenniferzhu

@lissyx A follow-up question: how can we use this model to transcribe a long video? A 5-second clip is too short...

@lissyx
Collaborator

lissyx commented Apr 4, 2018

@jenniferzhu How long? Until #1275 is done, the best way would be to cut the audio on silences.
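A quick sketch of that silence-cutting approach, assuming pydub for the splitting (not part of DeepSpeech) and a model loaded as in the earlier server sketch; the thresholds are illustrative.

```python
import numpy as np
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("long_recording.wav")  # placeholder path
# normalize to the 16 kHz mono 16-bit format the model expects
audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)

chunks = split_on_silence(audio,
                          min_silence_len=500,   # >= 500 ms of quiet
                          silence_thresh=-40,    # dBFS threshold
                          keep_silence=100)      # pad chunk edges slightly

for chunk in chunks:
    samples = np.array(chunk.get_array_of_samples(), dtype=np.int16)
    print(model.stt(samples, 16000))  # model loaded as in earlier sketch
```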

@kausthub

@lissyx is it possible for me to test #1275 locally? I really want to build an app that does speech-to-text in "real" time. Can you please also suggest the steps I need to follow to set up this branch?

Note:
I have already set up DeepSpeech and understand the main components involved in training, testing, and running it.
I am interested in making this "real-time".

Thanks in advance

@lissyx
Collaborator

lissyx commented Apr 27, 2018

@kausthub Just check out the streaming-inference branch and build with it.

@kausthub

kausthub commented Apr 27, 2018 via email

@lissyx
Collaborator

lissyx commented Apr 27, 2018

@kausthub

kausthub commented Apr 27, 2018 via email

@lissyx
Collaborator

lissyx commented Jul 25, 2018

Nothing more to do here.

@lissyx lissyx closed this as completed Jul 25, 2018
@lock

lock bot commented Jan 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 2, 2019