
Inferencing in real-time #847

Closed
abuvaneswari opened this issue Sep 22, 2017 · 30 comments

Comments

@abuvaneswari

Is there streaming server & client code that does the following?

(a) on the client side, continuously generates PCM samples from the mic connected to the PC, sends the samples to the server every, say, 100 ms, and prints the transcripts from the server as they arrive (I am thinking of a client similar to the Google Speech API streaming Python client example: https://github.com/GoogleCloudPlatform/python-docs-samples/blob/master/speech/cloud-client/transcribe_streaming_mic.py)

(b) on the server side, waits for enough of the client's samples to build up, invokes DeepSpeech, sends the transcript back to the client, and does this continuously as well (a rough sketch of the client half follows)
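A minimal sketch of the client half described in (a), assuming PyAudio for capture and a hypothetical server at HOST:PORT that accepts raw 16-bit 16 kHz PCM and writes transcript text back on the same socket. No such server ships with DeepSpeech; the protocol here is invented for illustration.

```python
import select
import socket

import pyaudio

HOST, PORT = "localhost", 8080   # hypothetical server address
RATE = 16000                     # sample rate the models expect
CHUNK = RATE // 10               # 100 ms of audio per send

pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=CHUNK)
sock = socket.create_connection((HOST, PORT))

try:
    while True:
        # ship 100 ms of raw PCM from the mic to the server
        sock.sendall(stream.read(CHUNK))
        # print whatever transcript bytes the server has sent back so far
        readable, _, _ = select.select([sock], [], [], 0)
        if readable:
            print(sock.recv(4096).decode("utf-8"), end="", flush=True)
finally:
    stream.stop_stream()
    stream.close()
    pa.terminate()
    sock.close()
```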

thanks,
Buvana

@lissyx
Collaborator

lissyx commented Sep 22, 2017

Not yet, but this is something I have in mind for demo purposes.

@elpimous

elpimous commented Sep 22, 2017

Perhaps you could have a look at ReSpeaker on GitHub for the recording part. They work on a specific mic array and on different recording types.

@LearnedVector

@lissyx would there have to be any neural-network architectural changes to support real-time inferencing, or would it just be a client/server engineering challenge?

@lissyx
Collaborator

lissyx commented Sep 25, 2017

It all depends on what exactly you want to achieve. "Real" streaming would require changes to the network for sure; the bidirectional recurrent layers force us to send "complete" data.
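An illustrative PyTorch sketch (not DeepSpeech's actual code, which is TensorFlow) of why a bidirectional layer blocks streaming: its backward pass reads the sequence from the end, so no output for any frame exists until the whole utterance has arrived.

```python
import torch
import torch.nn as nn

# A unidirectional RNN can be stepped frame-by-frame: the output at
# time t depends only on frames 0..t, so it suits streaming.
uni = nn.LSTM(input_size=26, hidden_size=64, batch_first=True)

# A bidirectional RNN runs a second pass from the last frame backwards:
# the output at time t also depends on frames t..T-1, so the complete
# utterance must be available before any output can be produced.
bi = nn.LSTM(input_size=26, hidden_size=64, batch_first=True,
             bidirectional=True)

x = torch.randn(1, 100, 26)  # (batch, frames, acoustic features)
out_uni, _ = uni(x)   # could have been computed incrementally
out_bi, _ = bi(x)     # needs all 100 frames up front
print(out_uni.shape, out_bi.shape)  # (1, 100, 64) and (1, 100, 128)
```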

@LearnedVector

@lissyx so with the current network, does sending chunks of audio data at a time make sense? Intuitively, I can see that it might inflate the WER a little, since the language model would not see the complete sequence of terms.

@reuben
Contributor

reuben commented Sep 26, 2017

The WER increase will not come just from the language model but also from the network itself, since it depends on having the entire utterance available. The performance will depend on your training data: if you only train with long utterances (several seconds) and then try to do inference with chunks of one second each, it'll probably perform very poorly.

@LearnedVector

@reuben that makes sense. Thanks for the clarification.

@abuvaneswari
Author

Does DeepSpeech require (and I suppose it's a feature of CTC) that the incoming features be fed at word boundaries? What if I construct an online moving-window MFCC calculator and feed in the features without regard to word boundaries? Let us say that my window length is long enough to accommodate 5-6 grams; the first and last grams may be partial because the segmentation is done without regard to word boundaries. Can such a design still infer words?
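A rough sketch of such a moving-window featurizer, assuming the python_speech_features package (the same MFCC implementation DeepSpeech's own feature pipeline uses); the window and step sizes are illustrative choices, and the window boundaries deliberately ignore word boundaries, which is exactly the open question here.

```python
import numpy as np
from python_speech_features import mfcc

class SlidingMFCC:
    """Buffer PCM samples and emit MFCCs over a sliding window."""

    def __init__(self, rate=16000, window_s=2.0, step_s=0.5):
        self.rate = rate
        self.window = int(rate * window_s)  # ~2 s fits roughly 5-6 words
        self.step = int(rate * step_s)      # slide by 500 ms per call
        self.buffer = np.zeros(0, dtype=np.int16)

    def push(self, samples):
        """Append new PCM samples; return MFCCs per complete window."""
        self.buffer = np.concatenate([self.buffer, samples])
        feats = []
        while len(self.buffer) >= self.window:
            # 13 cepstral coefficients per 25 ms frame, 10 ms hop;
            # the first/last words in the window may be cut mid-phoneme
            feats.append(mfcc(self.buffer[:self.window],
                              samplerate=self.rate, numcep=13))
            self.buffer = self.buffer[self.step:]  # slide, keep overlap
        return feats
```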

@kdavis-mozilla
Contributor

kdavis-mozilla commented Oct 1, 2017

We assume incoming features are fed at word boundaries. Performance is further improved if they are at sentence boundaries, due to the language model being trained on sentences.

@alanbekker

So what would be the recommended steps (in terms of training data and network topology) to build a speech-recognition streaming service?


@MainRo

MainRo commented Nov 16, 2017

I started to write a server to test inference on a generated model. It is available here:
https://github.com/MainRo/deepspeech-server

This is a very first implementation that listens for HTTP POST requests. I plan to add WebSocket support, provide a sample web app to try it from a browser, and package it in a Docker container.

@lissyx: as already discussed, tell me if you are interested in such a project.
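For concreteness, a minimal sketch of an endpoint of this kind (not the deepspeech-server code itself), using Flask and the deepspeech 0.1.0-era Python API; the Model constructor arguments changed in later releases, and the file paths are placeholders.

```python
import io

import numpy as np
import scipy.io.wavfile as wav
from flask import Flask, request
from deepspeech.model import Model  # deepspeech 0.1.0 package layout

# hyperparameters matching the released 0.1.0 model
N_FEATURES, N_CONTEXT, BEAM_WIDTH = 26, 9, 500
model = Model("output_graph.pb", N_FEATURES, N_CONTEXT,
              "alphabet.txt", BEAM_WIDTH)        # placeholder paths

app = Flask(__name__)

@app.route("/stt", methods=["POST"])
def stt():
    # expects the raw bytes of a 16 kHz mono WAV file as the request body
    fs, audio = wav.read(io.BytesIO(request.data))
    return model.stt(audio.astype(np.int16), fs)

if __name__ == "__main__":
    app.run(port=8080)
```

POSTing a 16 kHz mono WAV as the raw request body to /stt would exercise it.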

@MainRo

MainRo commented Dec 5, 2017

Update: I just published a Dockerfile to easily start the server. I tested it with deepspeech-0.1.0 and the published pre-trained model.
See here: https://github.com/MainRo/docker-deepspeech-server

@ashwan1

ashwan1 commented Jan 6, 2018

I wrote a Django-based web app that can record sound from the browser and return its transcription. It's at a very early stage. I plan to make it WebSocket-based with a Google Speech API-style interface, so that I don't have to change much in my other projects apart from the socket URL.
I'll try to take it to real-time transcription as soon as possible.

@alanbekker

alanbekker commented Jan 7, 2018 via email

@ashwan1

ashwan1 commented Jan 7, 2018

@alanbekker I have two plans and am currently researching which will work best:

  1. This is more like pseudo real time. I'll train my model with small utterances and stream WAV files to the server. Each WAV file will contain the voice from starting time t1; as the speaker continues speaking, the WAV file streamed to the server keeps growing. I think a well-trained DeepSpeech server can return approximately correct transcriptions along the way, and at the end it will have received the full sentence as audio, which it can transcribe. This should work at least for short audio (say, 5 seconds). Moreover, we can couple this with silence detection and mark unchanging transcriptions as final to optimize performance (see the sketch after this list).
  2. Another idea is to use one-pass decoding with an RNNLM, as mentioned in this paper.
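A hedged sketch of plan 1's re-transcribe-and-finalize loop, assuming a loaded model exposing the 0.1.0-era stt(audio, fs) method; the stable-prefix heuristic and chunk size are illustrative choices, not anything DeepSpeech provides.

```python
import numpy as np

def common_prefix(a: str, b: str) -> str:
    """Longest shared prefix of two transcripts."""
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def pseudo_streaming(chunks, model, rate=16000):
    """chunks: iterable of int16 numpy arrays, e.g. 500 ms each."""
    audio = np.zeros(0, dtype=np.int16)
    previous, final = "", ""
    for chunk in chunks:
        audio = np.concatenate([audio, chunk])
        current = model.stt(audio, rate)  # re-transcribe the whole buffer
        # whatever stopped changing between passes is marked as final
        stable = common_prefix(previous, current)
        if len(stable) > len(final):
            print("final:  ", stable[len(final):])
            final = stable
        print("partial:", current[len(final):])
        previous = current
    return previous
```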

@alanbekker

alanbekker commented Jan 8, 2018 via email

@ashwan1

ashwan1 commented Jan 10, 2018

You are correct that more labeled data is needed for plan 1, so I am looking into the possibility of using subtitles. But I would also like to see the distributed model's performance in such a scenario.
Plan 2 is not exactly an alternative to plan 1; it will require changing the network structure, so it's just another thing I will be trying. But the research for better alternatives continues...
Ultimately, we need to make this real time (more or less like the Google Speech API). Anything that works efficiently will do.

@AMairesse

@ashwan1 for plan 2 you could also try https://github.com/inikdom/rnn-speech
Performance is not great because it still lacks a language model, but inference is unidirectional, so you could easily build a real-time transcription layer with state in order to "see" the transcription evolve while receiving the audio.

@ashwan1

ashwan1 commented Jan 14, 2018

Thanks a lot for the suggestion :)
I will definitely try that.

@AMairesse

@ashwan1 you may also check https://github.com/SeanNaren/deepspeech.pytorch
Performance is way better, even without a language model. The pre-trained networks are bidirectional, but there is support for a unidirectional mode as in the DeepSpeech2 paper.

@jenniferzhu

@lissyx A follow-up question: how can we use this model to transcribe a long video? A 5-second clip is too short...

@lissyx
Collaborator

lissyx commented Apr 4, 2018

@jenniferzhu How long? Until #1275 is done, the best way would be to cut the audio on silences.
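A quick sketch of that silence-cutting approach, assuming pydub for the splitting (not part of DeepSpeech) and a model loaded as in the earlier server sketch; the thresholds are illustrative.

```python
import numpy as np
from pydub import AudioSegment
from pydub.silence import split_on_silence

audio = AudioSegment.from_wav("long_recording.wav")  # placeholder path
# normalize to the 16 kHz mono 16-bit format the model expects
audio = audio.set_frame_rate(16000).set_channels(1).set_sample_width(2)

chunks = split_on_silence(audio,
                          min_silence_len=500,   # >= 500 ms of quiet
                          silence_thresh=-40,    # dBFS threshold
                          keep_silence=100)      # pad chunk edges slightly

for chunk in chunks:
    samples = np.array(chunk.get_array_of_samples(), dtype=np.int16)
    print(model.stt(samples, 16000))  # model loaded as in earlier sketch
```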

@kausthub

@lissyx is it possible for me to test #1275 locally? I really want to build an app that does speech-to-text in "real" time. Can you please also suggest the steps I need to follow to set up this branch?

Note:
I have already set up DeepSpeech and understand the main components involved in training, testing, and running it.
I am interested in making this "real-time".

Thanks in advance

@lissyx
Collaborator

lissyx commented Apr 27, 2018

@kausthub Just check out the streaming-inference branch and build with it.

@kausthub

kausthub commented Apr 27, 2018 via email

@lissyx
Collaborator

lissyx commented Apr 27, 2018

@kausthub

kausthub commented Apr 27, 2018 via email

@lissyx
Collaborator

lissyx commented Jul 25, 2018

Nothing more to do here.

@lissyx lissyx closed this as completed Jul 25, 2018
@lock

lock bot commented Jan 2, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Jan 2, 2019