Streaming inference #1275

Closed · wants to merge 18 commits into from

Conversation

@reuben (Contributor) commented Feb 28, 2018

This is the WIP of streaming inference support. We need to decide on a unidirectional architecture before we can merge this PR. Still pending: better documentation of the streaming implementation, and actually testing the AOT model code paths.

At this point I'd appreciate a review of the API changes in deepspeech.h (see the clients and DeepSpeech::Model::stt for example usage).
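To make the intended usage concrete, here is a minimal sketch of the streaming calls discussed in this thread (setupStream, feedAudioContent, finishStream). Only setupStream's signature is quoted later in the thread; the shapes of feedAudioContent and finishStream here are assumptions, not the final API:

#include <cstdio>
#include <cstdlib>
#include "deepspeech.h"

// Sketch only: feedAudioContent()/finishStream() signatures are assumed.
void transcribe(DeepSpeech::Model& model, const short* samples, unsigned int count)
{
  // Reserve 150 timestep frames up front; audio is 16kHz mono.
  DeepSpeech::StreamingState* ctx = model.setupStream(150, 16000);

  // Feed the audio in small chunks, as a microphone callback would.
  const unsigned int chunk = 320; // 20ms at 16kHz
  for (unsigned int i = 0; i < count; i += chunk) {
    unsigned int n = (count - i < chunk) ? (count - i) : chunk;
    model.feedAudioContent(ctx, samples + i, n);
  }

  // Finalize decoding; the caller owns the returned C string.
  char* text = model.finishStream(ctx);
  if (text) {
    printf("%s\n", text);
    free(text);
  }
}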

@reuben (Contributor, Author) commented Feb 28, 2018

Oh yes, also pending are changes to audio.py to make it compute features frame-by-frame instead of on the whole file at once. (That, or figure out why those two modes produce different outputs and fix it.)

@reuben (Contributor, Author) commented Mar 1, 2018

This API should be enough to implement the Web Speech API with some limitations (ordered by decreasing difficulty to solve):

  1. Grammars aren't supported.
  2. Only continuous recognition is supported.
  3. Interim results aren't supported, only a single, final transcription.
  4. maxAlternatives is currently hard-coded to 1.

continuous=false is easy to solve: the Web Speech implementation can apply VAD to the input, finish the inference, and match the results against a grammar.

continuous=true with a grammar is more complicated; it would require changes to the decoding algorithm.

Interim results are easy to support, although expensive (in latency) with our current setup: we can just add a "resultSoFar()" method that decodes the currently accumulated logits (see the sketch below). A streaming implementation of the decoder would make this less costly.

maxAlternatives is the top_paths CTC parameter. Supporting it is easy.
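To sketch the interim-results idea: a hypothetical resultSoFar() (proposed above, not implemented in this PR) could be polled between chunks:

#include <cstdio>
#include <cstdlib>
#include "deepspeech.h"

// Hypothetical: resultSoFar() is only a proposal from the comment above.
void feedAndReport(DeepSpeech::Model& model, DeepSpeech::StreamingState* ctx,
                   const short* chunk, unsigned int n)
{
  model.feedAudioContent(ctx, chunk, n);
  // Decode the logits accumulated so far. With the current non-streaming
  // decoder this re-decodes everything, hence the latency cost noted above.
  char* interim = model.resultSoFar(ctx);
  if (interim) {
    printf("interim: %s\n", interim); // e.g. fire a Web Speech "result" event
    free(interim);
  }
}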

@reuben force-pushed the streaming-inference branch 2 times, most recently from 6b118d9 to ba14779 (March 1, 2018 17:56)
@kdavis-mozilla (Contributor) left a comment

Took a look; seems pretty good. But I didn't go through deepspeech.cc yet. I'll tackle that once it's no longer a WIP.

DeepSpeech.py Outdated
@@ -1681,28 +1673,109 @@ def end(self, session):
' or removing the contents of {0}.'.format(FLAGS.checkpoint_dir))
sys.exit(1)

def create_inference_graph(batch_size=None, use_new_decoder=False):
def create_inference_step_graph(batch_x, seq_length, fw_cell, previous_state, batch_size=1, n_steps=16):
Contributor:

How do we guarantee that the graph defined here is always in sync with that of the method BiRNN?

Contributor Author:

Good question. Suggestions welcome. I'm still thinking about how to solve that without complicating the code more than it's worth.

DeepSpeech.py Outdated
h3 = variable_on_worker_level('h3', [n_hidden_2, n_hidden_3], tf.random_normal_initializer(stddev=FLAGS.h3_stddev))
layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, h3), b3)), FLAGS.relu_clip)

# Now we create the forward and backward LSTM units.
Contributor:

Comment now incorrect

DeepSpeech.py Outdated
layer_3 = tf.minimum(tf.nn.relu(tf.add(tf.matmul(layer_2, h3), b3)), FLAGS.relu_clip)

# Now we create the forward and backward LSTM units.
# Both of which have inputs of length `n_cell_dim` and bias `1.0` for the forget gate of the LSTM.
Contributor:

Comment now incorrect

DeepSpeech.py Outdated
# Output shape: [n_steps, batch_size, n_hidden_6]
return layer_6, output_state

def create_inference_graph(batch_size=None, n_steps=16, use_new_decoder=False):
Contributor:

How will we make sure graph elements, for example use of a BasicLSTMCell, stay consistent across this method and BiRNN?

@@ -26,59 +26,39 @@

using namespace DeepSpeech;

struct ds_result {
typedef struct {
Contributor:

If we're modernizing a bit, why not use a class instead of a struct?

Also, why not use a std::string? (If this string is going over the ABI boundary at some point, ignore this statement.)

Contributor Author:

A class is just a struct where members are private by default. I don't see how either is more modern, but I made them classes anyway.

I don't use an std::string here because that makes it harder to handle the nullptr return case of stt.
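For illustration, the kind of call-site handling this enables (the stt() signature here is an assumption):

#include <cstdio>
#include <cstdlib>
#include "deepspeech.h"

// Sketch with an assumed stt() signature: a nullptr return signals failure
// directly, which a std::string return type can't express as cheaply.
void printTranscription(DeepSpeech::Model& model, const short* buf,
                        unsigned int nSamples, int sampleRate)
{
  char* text = model.stt(buf, nSamples, sampleRate);
  if (!text) {
    fprintf(stderr, "inference failed\n");
    return;
  }
  printf("%s\n", text);
  free(text);
}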


return res;
}

struct ds_audio_buffer {
typedef struct {
Contributor:

Same statements on modernizing

@@ -8,24 +8,14 @@
namespace DeepSpeech
{

class Private;
struct Private;
Contributor:

Why switch from class to struct?

Contributor Author:

Reverted.

class Private;
struct Private;

struct StreamingState;
Contributor:

Why use struct?

Contributor Author:

Reverted.

class Private;
struct Private;

struct StreamingState;

class Model {
private:
Private* mPriv;
Contributor:

Classes that pass over the ABI boundary should have a non-inline virtual destructor as far as I know.

Contributor Author:

I assume you're talking about Model, not Private, as that's an opaque pointer. I made Model's destructor virtual.
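For reference, a minimal sketch of the resulting pattern: an opaque implementation pointer plus an out-of-line virtual destructor, so that deleting a Model through the public header works across the ABI boundary:

// deepspeech.h (sketch of the pattern, not the full header)
namespace DeepSpeech {

class Private; // opaque; defined only inside deepspeech.cc

class Model {
private:
  Private* mPriv;

public:
  Model();
  virtual ~Model(); // non-inline: defined in deepspeech.cc, next to Private
};

} // namespace DeepSpeech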

@reuben force-pushed the streaming-inference branch 4 times, most recently from 28988b8 to b3f86da (March 5, 2018 19:52)
@reuben (Contributor, Author) commented Mar 5, 2018

Tests passing, modulo AOT and production tests, which is expected.

@reuben changed the title from "Streaming inference WIP" to "Streaming inference" (Mar 5, 2018)
@reuben requested review from lissyx and kdavis-mozilla (March 5, 2018 21:00)
DeepSpeech.py Outdated
initializer_nodes='',
variable_names_blacklist='previous_state_c,previous_state_h')

# Load and export as string
Collaborator:

Why do we want to export it as pb and pbtxt?

Contributor Author:

I was going to remove that but forgot. It's there so I can easily load an exported graph into TensorBoard to visualize it.

Collaborator:

Right. Do we need this by default? I fear it's just going to waste disk space and time for most use cases. Would an extra CLI flag defaulting to false be a good deal?

Contributor Author:

I'll make it a standalone tool in bin/graph_binary_to_text.py

### => Trying to be more fine-grained
### Obtained by trial/error process ...
### Use bin/ops_in_graph.py to list all the ops used by a frozen graph.
Collaborator:

👍

]),
includes = ["kenlm", "boost_locale"],
includes = ["kenlm", "boost_locale", "c_speech_features", "kiss_fft130"],
Collaborator:

What about the copts tuning to disable AVX, AVX2 and FMA for performance reasons?

Contributor Author:

Good catch. I'll try to get bazel to do the right thing here.

@lissyx (Collaborator), Mar 6, 2018:

Maybe the reason it was made is no longer valid either? It would be useful to re-evaluate that :)

] + if_native_model([
"//tensorflow/compiler/tf2xla:xla_compiled_cpu_function",
])
+ if_cuda([
])+ if_cuda([
Collaborator:

I'm (happily) surprised you could remove the Slice GPU-specific ops, but it might be a side-effect of the graph change itself?

Contributor Author:

Probably. There are no Slice ops in the streaming graph.

### => Trying to be more fine-grained
### Obtained by trial/error process ...
### Use bin/ops_in_graph.py to list all the ops used by a frozen graph.
### CPU only build libdeepspeech.so from 63M to 36M
Collaborator:

nit: update on the size impact?

Contributor Author:

Will check.

@@ -53,7 +53,7 @@ class KenLMBeamScorer : public tensorflow::ctc::BaseBeamScorer<KenLMBeamState> {
// ExpandState is called when expanding a beam to one of its children.
// Called at most once per child beam. In the simplest case, no state
// expansion is done.
void ExpandState(const KenLMBeamState& from_state, int from_label,
void ExpandState(const KenLMBeamState& from_state, int /*from_label*/,
Collaborator:

Why is this one commented out and not the rest?

Contributor Author:

Because that's the only parameter that isn't used. This was done to silence a warning.

#endif /* BEAM_SEARCH_H */
Collaborator:

nit: blank change because of line ending

Contributor Author:

That was on purpose, to silence a warning about a missing newline at the end of the file.

@@ -26,59 +26,39 @@

using namespace DeepSpeech;

struct ds_result {
typedef struct {
char* string;
double cpu_time_overall;
Collaborator:

For comparison purposes, I'd like us to keep all three measures :-)

util/audio.py Outdated

try:
from deepspeech.utils import audioToInputVector
from deepspeech import audioToInputVector
except ImportError:
import numpy as np
from python_speech_features import mfcc
Collaborator:

nit: we should update the warning message to make it clearer:

DeepSpeech Python bindings could not be loaded, resorting to slower code to compute the audio vectors.
Refer to README.md for instructions on how to install (or build) deepspeech python bindings.

Contributor Author:

Good catch, will do.

* to {@link feedAudioContent()} and {@link finishStream()}.
*
* @param aPreAllocFrames Number of timestep frames to reserve. One timestep
* is equivalent to two window lenghts (50ms), so
@lissyx (Collaborator), Mar 6, 2018:

Is the fact that "one timestep is equivalent to two window lengths" dependent on parameters set during training? We should mention it if that's the case. (And there's a typo in "lenghts" :))

Contributor Author:

It's dependent on the training code, yes. Feature computation for training and inference needs to be consistent.

Collaborator:

Maybe it's worth mentioning, then?

* @return A context pointer that represents the streaming state.
*/
StreamingState* setupStream(unsigned int aPreAllocFrames = 150,
unsigned int aSampleRate = 16000);
Collaborator:

The more I think about it, the more I wonder whether we should have some graph metadata to rely on for this.

Contributor Author:

aPreAllocFrames is just an optimization knob, so not tied to the graph. The sample rate is kind of tied, in that it should be larger than or equal to the sample rate of the training data. (This requirement might go away depending on the result of our band-pass experiments).

const int SAMPLE_RATE = 16000;

//TODO: infer n_steps from model
const int N_STEPS_PER_BATCH = 16;
Collaborator:

Those three TODOs about using data from the model seem related to my previous comment about relying on some graph metadata.

Contributor Author:

Yeah, n_steps and batch_size should definitely be obtained from the model directly. I think they're both easy to do, FWIW, I just haven't gotten to it yet.
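One possible shape for this (a sketch, not code from this PR): read the declared input shape out of the frozen GraphDef at load time instead of hard-coding N_STEPS_PER_BATCH. The node name "input_node" and the [batch_size, n_steps, n_features] layout are assumptions here:

#include <string>

#include "tensorflow/core/framework/graph.pb.h"
#include "tensorflow/core/lib/core/status.h"
#include "tensorflow/core/platform/env.h"

// Sketch: infer n_steps from the model instead of hard-coding it.
int InferNSteps(const std::string& graph_path)
{
  tensorflow::GraphDef graph_def;
  TF_CHECK_OK(tensorflow::ReadBinaryProto(tensorflow::Env::Default(),
                                          graph_path, &graph_def));
  for (const auto& node : graph_def.node()) {
    // Placeholders carry their declared shape as a "shape" attribute.
    if (node.op() == "Placeholder" && node.name() == "input_node") {
      const auto& shape = node.attr().at("shape").shape();
      return static_cast<int>(shape.dim(1).size()); // assuming dim 1 is n_steps
    }
  }
  return -1; // not found; fall back to the hard-coded default
}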


using namespace tensorflow;
using tensorflow::ctc::CTCBeamSearchDecoder;
using tensorflow::ctc::CTCDecoder;

namespace DeepSpeech {

class StreamingState {
public:
std::vector<float> accumulated_logits;
Collaborator:

float: is there any win/loss compared to double, for example?

Contributor Author:

The model outputs a float32 tensor, so there's no gain.


using namespace tensorflow;
using tensorflow::ctc::CTCBeamSearchDecoder;
using tensorflow::ctc::CTCDecoder;

namespace DeepSpeech {

class StreamingState {
Collaborator:

As far as I can see, this class is completely defined here; in deepspeech.h we only have a "class StreamingState;" forward declaration. I think it would be useful to document this properly, since StreamingState is exposed to the public.

Contributor Author:

It's an opaque pointer, so the layout is not exposed to API consumers. Is the documentation in setupStream/feedAudioContent/finishStream not enough?

Collaborator:

You have a point here. The other docs are good; I missed the fact that since it's not in the header, it's not publicly exposed. We had some people on Discourse asking to play with the logits, if I remember correctly. Maybe there's room to expose some of that?

Contributor:

I'd rather expose too little at first than too much and be forced to maintain the "too much" part.

*
* @param n_frames Number of timesteps to deal with
* @param logits Matrix of logits, of dimensions:
* [n_frames][batch_size][num_classes]
Collaborator:

The comment states it is a matrix of [n_frames][batch_size][num_classes] dimensions, but the prototype is std::vector<float>& logits, which seems like a flat vector. Am I missing something?
Also, there is no n_frames parameter either :'(

Contributor Author:

Good catch. I changed this from bare pointers to std::vector and forgot to update the comments.

* @param mfcc batch input data
* @param n_frames number of timesteps in the data
*
* @param[out] output_logits Should be large enough to fit
Collaborator:

nit: should this be @return instead of @param[out]?

Contributor Author:

No, it's an outparam; we append the results of the inference to it.

Collaborator:

Right, hence the void.

@reuben force-pushed the streaming-inference branch from ad47c78 to a53b64c (July 17, 2018 13:49)
@reuben (Contributor, Author) commented Jul 17, 2018

Rebased on top of master and included the changes made recently to the decoder. I think this is good to merge (after a review pass), but before releasing we should deal with some infrastructure bits included here for development:

  1. Temporarily switch to r1.4 for automation builds (f4837a0) -> figure out the situation on the new cluster and update accordingly; train the release model on 1.6/1.7/1.8?
  2. Switch to patched TensorFlow with the BlockLSTM op and kernel (ad7377a) -> this is broken now since the artifacts have expired; I need to merge the patch on the other TF branches and do a proper fix. I'll do that today.

@reuben force-pushed the streaming-inference branch 5 times, most recently from 25a189a to 6fd09bf (July 19, 2018 01:58)
@reuben force-pushed the streaming-inference branch 2 times, most recently from 9d26197 to eb6f5f6 (July 19, 2018 12:51)
@reuben force-pushed the streaming-inference branch from eb6f5f6 to 7b87336 (July 19, 2018 13:26)
@reuben (Contributor, Author) commented Jul 19, 2018

I'm going to close this PR and open a new one for the second round of reviews since this one is so large it takes almost a minute to load on my computer.

@reuben closed this Jul 19, 2018
@jwstolk commented Sep 16, 2018

Is there a new PR yet? Edit: found it; it's probably the 7b87336 mentioned just above.

@lissyx (Collaborator) commented Sep 16, 2018

This work has already been merged.

@lock (bot) commented Jan 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

The lock bot locked and limited conversation to collaborators on Jan 3, 2019.