
Score CTC prefix beams with KenLM #805

Merged: reuben merged 11 commits into master on Sep 13, 2017

Conversation

@reuben (Contributor) commented Aug 31, 2017

Opening a PR to test the TaskCluster setup.

@reuben force-pushed the ctc_decode_score_with_kenlm branch from 9f500ed to abe318e on August 31, 2017 14:28
@kdavis-mozilla requested review from lissyx and kdavis-mozilla and removed those requests on August 31, 2017 14:28
@reuben force-pushed the ctc_decode_score_with_kenlm branch 4 times, most recently from 2f701d3 to 895be68 on September 1, 2017 08:54
@reuben (Contributor, Author) commented Sep 1, 2017

FWIW, all the tests passed: https://tools.taskcluster.net/groups/Vxu6YJ_vR-6GFI-vi0Z0nA

I need to figure out a solution for users in general, probably downloading the appropriate native_client.tar.xz automatically from the bin/run-* scripts and extracting the library so that training can be done without having to set up a TensorFlow build environment.

@reuben force-pushed the ctc_decode_score_with_kenlm branch from 895be68 to e1ca3bc on September 1, 2017 09:50
@reuben (Contributor, Author) commented Sep 1, 2017

@kdavis-mozilla @lissyx I've split the changes into logical chunks where possible. I'll work on the scripts I mentioned above before merging the PR, but the commits that are here are ready to be reviewed.

name = "ctc_decoder_with_kenlm",
srcs = ["beam_search.cc",
"alphabet.h",
"trie_node.h"] +
Collaborator:

nit: the "] +" should be on the next line

"trie_node.h",
"alphabet.h",
] + glob(["kenlm/lm/*.cc", "kenlm/util/*.cc", "kenlm/util/double-conversion/*.cc",
"kenlm/lm/*.hh", "kenlm/util/*.hh", "kenlm/util/double-conversion/*.h"],
Collaborator:

nit: alignment

"kenlm/lm/*.hh", "kenlm/util/*.hh", "kenlm/util/double-conversion/*.h"],
exclude = ["kenlm/*/*test.cc", "kenlm/*/*main.cc"]),
includes = ["kenlm"],
copts = ['-std=c++11'],
Collaborator:

nit: use " instead of '

limitations under the License.
==============================================================================*/

// This test illustrates how to make use of the CTCBeamSearchDecoder using a
Collaborator:

I think those are not needed anymore :)

c->set_output(out_idx++, c->Matrix(batch_size, top_paths));
return tf::Status::OK();
})
.Doc(R"doc(
Collaborator:

I think the doc needs to be updated to include the new parameters

reuben (Author):

I didn't change this line, so GitHub is still showing the comment, but it's been fixed.

OP_REQUIRES_OK(ctx, ctx->GetAttr("top_paths", &top_paths));
decode_helper_.SetTopPaths(top_paths);

// const tf::Tensor* model_tensor;
Collaborator:

Is that a leftover from previous testing?

}

int main(void) {
return generate_trie("/Users/remorais/Development/DeepSpeech/data/alphabet.txt",
Collaborator:

Seems to be a leftover as well :/

ifs.open(vocab_path, std::ifstream::in);

if (!ifs.is_open()) {
std::cout << "unable to open vocabulary" << std::endl;
Collaborator:

Errors should go to stderr.

.tc.training.yml Outdated
apt-get -qq -y install make build-essential libssl-dev zlib1g-dev libbz2-dev libreadline-dev libsqlite3-dev wget curl llvm libncurses5-dev xz-utils tk-dev &&
{{ SYSTEM_ADD_USER }} &&
echo -e "#!/bin/bash\nset -xe\nexport PATH=/home/build-user/bin:$PATH && env && id && wget https://github.com/git-lfs/git-lfs/releases/download/v2.2.1/git-lfs-linux-amd64-2.2.1.tar.gz -O - | tar -C /tmp -zxf - && PREFIX=/home/build-user/ /tmp/git-lfs-2.2.1/install.sh && mkdir ~/DeepSpeech/ && git clone --quiet {{ GITHUB_HEAD_REPO_URL }} ~/DeepSpeech/ds/ && cd ~/DeepSpeech/ds && git checkout --quiet {{ GITHUB_HEAD_SHA }}" > /tmp/clone.sh && chmod +x /tmp/clone.sh &&
{{ SYSTEM_DO_CLONE }} &&
sudo -H -u build-user TENSORFLOW_WHEEL=${TENSORFLOW_WHEEL} /bin/bash /home/build-user/DeepSpeech/ds/tc-train-tests.sh 2.7.13
sudo -H -u build-user TENSORFLOW_WHEEL=${TENSORFLOW_WHEEL} {{ TASK_ENV_VARS }} /bin/bash /home/build-user/DeepSpeech/ds/tc-train-tests.sh 2.7.13
lissyx (Collaborator) commented Sep 1, 2017:

We don't need {{ TASK_ENV_VARS }}, just DEEPSPEECH_ARTIFACTS_ROOT=${DEEPSPEECH_ARTIFACTS_ROOT}

DeepSpeech.py Outdated
@@ -484,7 +538,7 @@ def calculate_mean_edit_distance_and_loss(model_feeder, tower, dropout):
avg_loss = tf.reduce_mean(total_loss)

# Beam search decode the batch
decoded, _ = tf.nn.ctc_beam_search_decoder(logits, batch_seq_len, merge_repeated=False)
decoded, _ = decode_with_lm(logits, batch_seq_len, merge_repeated=False, beam_width=1024)
Collaborator:

Would it make sense to keep the ability to use the vanilla TensorFlow decoder?

reuben (Author):

I don't think anyone will want to use it, and I'd rather avoid forking the execution/maintenance paths and having some part of the code break because it's not being tested.

}

std::ofstream ofs;
ofs.open(trie_path);
Collaborator:

Shouldn't we check that as well, i.e. whether we managed to open the output file?

ofs.open(trie_path);

std::string word;
while (ifs >> word) {
Collaborator:

So, this is going to read the vocabulary file line by line?

reuben (Author):

Word by word, it splits on whitespace.
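(For illustration, a minimal standalone sketch, not the PR's code: extraction into a std::string with >> skips any run of whitespace, including newlines, so each iteration yields one token. The file name here is hypothetical.)

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
      // Given a file containing "hello world\nfoo", this prints three
      // tokens: >> splits on spaces, tabs, and newlines alike.
      std::ifstream ifs("vocab.txt");
      std::string word;
      while (ifs >> word) {
        std::cout << word << '\n';
      }
      return 0;
    }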

}
}

int GetFrequency() {
Collaborator:

That is a bit misleading; a frequency is not exactly a number of occurrences in my mind, or am I missing something?

reuben (Author):

Sure, it's more of a count. I'll rename it (or delete it, since it has no users).

for_each(word.begin(), word.end(), [](char& a) { a = tolower(a); });
lm::WordIndex vocab = GetWordIndex(model, word);
float unigram_score = ScoreWord(model, vocab);
root.Insert(word.c_str(), [&a](char c) {
Collaborator:

nit: For readability, I guess the translator should be completely on a new line:

root.Insert(
    word.c_str(),
    [&a](char c) {

    },
    vocab,
    unigram_score);

TrieNode *child = children[vocabIndex];
if (child == nullptr)
child = children[vocabIndex] = new TrieNode(vocab_size);
child->Insert(word + 1, translator, lm_word, unigram_score);
Collaborator:

Are we playing with pointer arithmetic here with the + 1?

reuben (Author):

Yes.
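(For readers following along, a simplified sketch of the recursion, with the lm_word/unigram_score plumbing of the real trie_node.h elided: word is a const char*, so word + 1 just advances the pointer past the character consumed at this level.)

    // Simplified sketch: one trie level consumes one character of the
    // C string; recursion proceeds on the suffix until the terminator.
    #include <functional>

    struct TrieNode {
      TrieNode(int vocab_size)
        : prefixCount(0), children(new TrieNode*[vocab_size]()) {}

      void Insert(const char* word, std::function<int(char)> translator,
                  int vocab_size) {
        prefixCount++;
        if (*word == '\0') {
          return;  // end of word: every character has been inserted
        }
        int vocabIndex = translator(*word);
        TrieNode*& child = children[vocabIndex];
        if (child == nullptr) {
          child = new TrieNode(vocab_size);
        }
        child->Insert(word + 1, translator, vocab_size);  // recurse on suffix
      }

      int prefixCount;
      TrieNode** children;
    };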

if (wordCharacter != '\0') {
int vocabIndex = translator(wordCharacter);
TrieNode *child = children[vocabIndex];
if (child == nullptr)
Collaborator:

nit: I'm not a big fan of brace-less conditionals; they're often the root of nasty bugs in the future :)

}

static void ReadFromStream(std::istream& is, TrieNode* &obj, int vocab_size) {
int prefixCount;
Collaborator:

Having a local variable in a static member function named the same way as a member of the class is confusing, imho.

Contributor:

But it is eventually the local variable prefixCount.

reuben (Author):

I renamed member variables to have a _ suffix.

Model::State in_state = model.NullContextState();
Model::State out;
lm::FullScoreReturn full_score_return;
full_score_return = model.FullScore(in_state, vocab, out);
Collaborator:

Why is there this out state that we don't use?

Collaborator:

Asking because, according to the comment at f1e859e#diff-7a0b3b152c06fc03b20e63b72c26583aR11 (whose point I don't completely get, so I might just be misunderstanding something there), it seems like out_state has some use.

reuben (Author) commented Sep 1, 2017:

Keeping state is a performance optimization when scoring a sentence word by word; in this function we want to score the word independently of any context, so we throw the state away.
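(A small sketch of the difference, using only the KenLM calls already quoted in this file; model loading is elided. With chained states each FullScore call conditions on the words scored before it; passing NullContextState() every time yields a context-free unigram score, and out can be discarded.)

    #include <string>
    #include "lm/model.hh"  // KenLM

    using Model = lm::ngram::Model;

    // Chained states: "cat" is scored conditioned on "the".
    float ScoreSentenceFragment(const Model& model) {
      Model::State state = model.NullContextState(), out;
      float total = 0.0f;
      total += model.FullScore(state, model.GetVocabulary().Index("the"), out).prob;
      state = out;  // carry the context forward
      total += model.FullScore(state, model.GetVocabulary().Index("cat"), out).prob;
      return total;
    }

    // Fresh null context every call: a pure unigram score; out is thrown away.
    float ScoreWordAlone(const Model& model, const std::string& word) {
      Model::State out;
      return model.FullScore(model.NullContextState(),
                             model.GetVocabulary().Index(word), out).prob;
    }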

Collaborator:

Thanks, maybe worth a comment then?

};

// CTC beam search
class CTCBeamSearchDecoderOp : public tf::OpKernel {
Collaborator:

That class name is already used by TensorFlow's vanilla CTC beam search decoder. Can we rename it to something matching the op, like CTCBeamSearchDecoderWithLM?

reuben (Author):

Technically this is ::CTCBeamSearchDecoderOp rather than ::tensorflow::CTCBeamSearchDecoderOp, but you're right, I'll rename it :)

kdavis-mozilla (Contributor) left a review:

Cool PR!

There's lots of little stuff which I won't go into here.

However, I think the biggest single thing is that the trie needs to support Unicode. Currently it just supports ASCII; see my comments on the code. So it blocks the internationalization work you already did with the alphabet.txt file.

DeepSpeech.py Outdated
top_paths=1, merge_repeated=True):
"""Performs beam search decoding on the logits given in input.

**Note** The `ctc_greedy_decoder` is a special case of the
Contributor:

Should we keep this comment from the original ctc_beam_search_decoder? I think it's not needed.

DeepSpeech.py Outdated
@@ -484,7 +538,7 @@ def calculate_mean_edit_distance_and_loss(model_feeder, tower, dropout):
avg_loss = tf.reduce_mean(total_loss)

# Beam search decode the batch
decoded, _ = tf.nn.ctc_beam_search_decoder(logits, batch_seq_len, merge_repeated=False)
decoded, _ = decode_with_lm(logits, batch_seq_len, merge_repeated=False, beam_width=1024)
Contributor:

Should we pull the beam width out as a command line parameter that defaults to 1024?

exclude = ["kenlm/*/*test.cc", "kenlm/*/*main.cc"]),
includes = ["kenlm"],
copts = ['-std=c++11'],
linkopts = ['-lm'],
Contributor:

Ditto previous nit on ' vs "

@@ -54,6 +54,11 @@ class Alphabet {
return size_;
}

bool IsSpace(unsigned int label) const {
const std::string& str = StringFromLabel(label);
return str.size() == 1 && str[0] == ' ';
Contributor:

There is a complicated world of word boundaries that is not covered here. (For example, see Unicode Standard Annex #29, Unicode Text Segmentation[1].)

I guess it's too complicated to address now, but nonetheless we should eventually have a more general approach here.
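(To make the limitation concrete, a hedged sketch of a slightly more general check, still nowhere near full UAX #29 segmentation; label_string stands in for Alphabet::StringFromLabel(label), and the set of code points is illustrative only.)

    #include <set>
    #include <string>

    // Treat a few known Unicode spaces, UTF-8 encoded, as word
    // boundaries in addition to ASCII ' '.
    bool IsSpaceLabel(const std::string& label_string) {
      static const std::set<std::string> kSpaces{
          " ",         // U+0020 SPACE
          u8"\u00A0",  // U+00A0 NO-BREAK SPACE
          u8"\u3000",  // U+3000 IDEOGRAPHIC SPACE
      };
      return kSpaces.count(label_string) != 0;
    }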

@@ -0,0 +1,165 @@
GNU LESSER GENERAL PUBLIC LICENSE
Contributor:

According to Mozilla's legal advice on this, inclusion of GNU LGPL code[1] into an MPL code base forces us to distribute the "Larger Work" subject to the terms of both licenses, as long as certain conditions are met, the details of which are described here[1].


// TODO replace with OOV unigram prob?
// If we have no valid prefix we assume a very low log probability
float min_unigram_score = -10.0f;
Contributor:

The choice of OOV unigram prob seems reasonable, as it's the minimum unigram score one can expect walking down the trie, assuming the language model was pruned and all words under some minimum probability epsilon were replaced with unknown.

For KenLM the unknown word has WordIndex model->GetVocabulary().NotFound(), so I'd guess one should be able to set this value more systematically.
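(A possible sketch of that, using NotFound() as described plus the FullScore/NullContextState calls already in this file: look up the <unk> unigram once and use it as the floor instead of the hard-coded constant.)

    // Derive the fallback from the model's <unk> entry rather than
    // hard-coding -10.0f; NotFound() is the WordIndex KenLM assigns
    // to out-of-vocabulary words.
    lm::WordIndex unknown = model.GetVocabulary().NotFound();
    Model::State ignored;
    float min_unigram_score =
        model.FullScore(model.NullContextState(), unknown, ignored).prob;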

// TODO try two options
// 1) unigram score added up to language model score
// 2) language model score of (preceding_words + unigram_word)
to_state->score = min_unigram_score + to_state->language_model_score;
Contributor:

Question, as the current character is not a space do

  from_state->language_model_score

and

  to_state->language_model_score

differ?

As far as I can see they do not differ, as there's been no call to ScoreIncompleteWord() or UpdateWithLMScore() between evaluating from_state->language_model_score and to_state->language_model_score.

reuben (Author):

They do not. Until there's a space we accumulate minimum unigram scores; then, when a word boundary is reached, we score the word with the LM and update language_model_score.
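(In rough pseudo-C++: IsSpace, ScoreIncompleteWord, UpdateWithLMScore, min_unigram_score, and the state fields come from this thread; incomplete_word and model_state are assumed field names for illustration.)

    if (!alphabet.IsSpace(to_label)) {
      // Mid-word: extend optimistically with the floor unigram score;
      // language_model_score itself is left untouched.
      to_state->score = min_unigram_score + to_state->language_model_score;
    } else {
      // Word boundary: rescore the now-complete word with the real LM
      // and fold the result into language_model_score.
      Model::State out;
      float lm_score = ScoreIncompleteWord(from_state->model_state,
                                           from_state->incomplete_word, out);
      UpdateWithLMScore(to_state, lm_score);
    }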

float ScoreIncompleteWord(const Model::State& model_state,
const std::string& word,
Model::State& out) const {
lm::FullScoreReturn full_score_return;
Contributor:

As we're doing C++ we could write these lines as

    lm::WordIndex vocab = model->GetVocabulary().Index(word);
    lm::FullScoreReturn full_score_return = model->FullScore(model_state, vocab, out);
    return full_score_return.prob;

OP_REQUIRES_OK(ctx, ctx->GetAttr("top_paths", &top_paths));
decode_helper_.SetTopPaths(top_paths);

// const tf::Tensor* model_tensor;
Contributor:

Could you remove this commented out code?

}

void Compute(tf::OpKernelContext *ctx) override {
const tf::Tensor *inputs;
Contributor:

Why such C-style C++? Introducing all the variables at the start of the function.

reuben (Author):

This code is copied from CTCBeamSearchDecoder in TensorFlow, mostly unmodified; all I change is which beam scorer to use. I can make the code more C++-y, but then if we ever need to uplift changes from there to here it's harder.

Contributor:

OK, understood.

@@ -47,7 +47,7 @@ assert_correct_ldc93s1()
assert_correct_inference "$1" "she had your dark suit in greasy wash water all year"
}

download_material()
download_native_client_files()
Collaborator:

I am stealing part of that because I need to do something similar in #810.

@kdavis-mozilla (Contributor) commented:

@reuben I guess you've seen Bug 1396777.

Do you want to fold those changes into this PR or do a separate PR?

@reuben (Contributor, Author) commented Sep 10, 2017

@lissyx the TC build is failing because the version of GCC/libstdc++ on the builders does not support codecvt. How would I go about upgrading to GCC 5.0/the corresponding libstdc++ version?

Alternatively, if that's too complicated, I can use Boost.Locale, but that's a new dependency I'd rather avoid.

@reuben (Contributor, Author) commented Sep 10, 2017

@kdavis-mozilla I've addressed all review comments. Could you take a second look? In particular, the trie_node.h, beam_search.cc and DeepSpeech.py changes. Thanks!

@lissyx (Collaborator) commented Sep 10, 2017

@reuben You might be able to get #include <codecvt> by installing GCC 5; however:

  • this will require that we also build TensorFlow with it
  • this needs to be checked on OSX as well
  • the ARM RPi3 toolchain is GCC 4.9

There is nothing newer available for RPi3, and switching the toolchain for ARM cross-compilation might be very tricky.

@reuben (Contributor, Author) commented Sep 10, 2017

@lissyx that's what I was afraid of. I remember having trouble building TensorFlow with GCC>4.9, so I think the easier solution is to go with Boost.Locale.

@kdavis-mozilla (Contributor) commented:

@reuben Regarding the review: should I wait until the switch to Boost.Locale?

@reuben (Contributor, Author) commented Sep 11, 2017

@kdavis-mozilla I'll try to upgrade to GCC 5.0 first. But yeah, I'll let you know once it's ready for review.

@reuben force-pushed the ctc_decode_score_with_kenlm branch from 13e08c6 to 7af3b4d on September 11, 2017 19:15
@reuben (Contributor, Author) commented Sep 11, 2017

There's no way to pass the appropriate include path to Bazel from the external world, so I vendored the necessary headers instead in native_client/boost_locale.

kdavis-mozilla (Contributor) left a review:

Just a few little nits; fix them if you want to.

tf.app.flags.DEFINE_string ('lm_binary_path', 'data/lm/lm.binary', 'path to the language model binary file created with KenLM')
tf.app.flags.DEFINE_string ('lm_trie_path', 'data/lm/trie', 'path to the language model trie file created with native_client/generate_trie')
tf.app.flags.DEFINE_integer ('beam_width', 1024, 'beam width used in the CTC decoder when building candidate transcriptions')
tf.app.flags.DEFINE_float ('lm_weight', 2.15, 'the alpha hyperparameter of the CTC decoder. Language Model weight.')
Contributor:

Could you give the reference which defines your α, β, and β'?

For example, the original Deep Speech paper[1] and the Deep Speech 2 paper[2] don't define β'.

reuben (Author):

I don't remember and can't find where I got the beta' terminology from. I'll reword the explanation.
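(For reference, and as an assumption about the intended mapping rather than something stated in this PR: the original Deep Speech paper scores a candidate transcription y using only α and β,

    Q(y) = log P(y | x) + α · log P_lm(y) + β · word_count(y)

so lm_weight plausibly plays the role of α and word_count_weight that of β, while the β' valid-word bonus looks like this PR's own addition.)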

Contributor: (The same request for a reference defining α, β, and β' was left on the word_count_weight and valid_word_count_weight flag definitions.)

}

private:
Model *model;
Alphabet *alphabet;
Model model;
Contributor:

Why do some members have a final '_' and some don't?

reuben (Author):

Oversight! Will fix it.

Contributor: (The same question about the final '_' was left on the alphabet and trieRoot members.)

@reuben force-pushed the ctc_decode_score_with_kenlm branch from c798f31 to 1bfb028 on September 13, 2017 14:43
@reuben force-pushed the ctc_decode_score_with_kenlm branch from 1cc5402 to 4ccab71 on September 13, 2017 18:27
@reuben merged commit 8818874 into master on Sep 13, 2017
@reuben deleted the ctc_decode_score_with_kenlm branch on November 17, 2017
lock bot commented Jan 3, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

lock bot locked and limited conversation to collaborators on Jan 3, 2019