integrating changes to modern rnn chapter (#2241)
* integrating changes to modern rnn chapter

* pushing first wave of edits on lstm notebook

* first wave of edits on GRU implemented and pushed

* rewrote text for deep rnn, stripped fluff, simplified

* first major hack at bidirectional RNN. cut out the irrelevant bits, and the 'wrong application'

* implemented changes MT notebook

* implemented changes to encoder-decoder, made light edit to birnn

* wrapping up modern RNN push
zackchase authored Aug 22, 2022
1 parent 55db13d commit 7c3b31b
Showing 10 changed files with 891 additions and 822 deletions.
268 changes: 151 additions & 117 deletions chapter_recurrent-modern/beam-search.md
# Beam Search
:label:`sec_beam-search`

In :numref:`sec_seq2seq`,
we introduced the encoder-decoder architecture,
and the standard techniques for training them end-to-end.
However, when it came to test-time prediction,
we mentioned only the *greedy* strategy,
where we select at each time step
the token with the highest predicted probability of coming next,
until, at some time step,
we find that we have predicted
the special end-of-sequence "<eos>" token.
In this section, we will begin
by formalizing this *greedy search* strategy
and identifying some problems
that practitioners tend to run into.
Subsequently, we compare this strategy
with two alternatives:
*exhaustive search* (illustrative but not practical)
and *beam search* (the standard method in practice).

Let's begin by setting up our mathematical notation,
borrowing conventions from :numref:`sec_seq2seq`.
At any time step $t'$, the decoder outputs
predictions representing the probability
of each token in the vocabulary
coming next in the sequence
(the likely value of $y_{t'+1}$),
conditioned on the previous tokens
$y_1, \ldots, y_{t'}$ and
the context variable $\mathbf{c}$,
produced by the encoder
to represent the input sequence.
To quantify computational cost,
denote by $\mathcal{Y}$
the output vocabulary
(including the special end-of-sequence token "<eos>"),
whose cardinality $\left|\mathcal{Y}\right|$ is the vocabulary size.
Let's also specify the maximum number of tokens
of an output sequence as $T'$.
Our goal is to search for an ideal output from all
$\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$
possible output sequences.
Note that this slightly overestimates
the number of distinct outputs
because there are no subsequent tokens
after the "<eos>" token occurs.
However, for our purposes,
this number roughly captures
the size of the search space.


## Greedy Search

Consider the simple *greedy search* strategy from :numref:`sec_seq2seq`.
Here, at any time step $t'$,
we simply select the token
with the highest conditional probability
from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}).$$

Once our model outputs "<eos>"
(or we reach the maximum length $T'$),
the output sequence is completed.
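
To make this concrete, here is a minimal sketch of greedy decoding in plain Python. The callable `step_probs`, the `eos_id`, and `max_len` below are hypothetical stand-ins rather than part of any particular library: we only assume that `step_probs(prefix)` returns the predicted distribution over $\mathcal{Y}$ for the next token, with the context variable $\mathbf{c}$ folded in.

```python
def greedy_decode(step_probs, eos_id, max_len):
    """Greedy decoding sketch.

    `step_probs(prefix)` is a hypothetical callable that returns a list of
    predicted probabilities over the vocabulary for the next token, given the
    tokens generated so far (the context variable c is assumed folded in).
    """
    prefix = []
    for _ in range(max_len):
        probs = step_probs(prefix)
        # Pick the single most probable next token (the argmax over Y).
        next_token = max(range(len(probs)), key=lambda i: probs[i])
        prefix.append(next_token)
        if next_token == eos_id:  # stop once "<eos>" has been predicted
            break
    return prefix
```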

This strategy might look reasonable,
and in fact it's not so bad!
Considering how computationally undemanding it is,
you'd be hard pressed to get more bang for your buck.
However, if we put aside efficiency for a minute,
it might seem more reasonable to search
for the *most likely sequence*,
not the sequence of (greedily selected) *most likely tokens*.
It turns out that these two objects can be quite different.
The most likely sequence is the one that maximizes the expression
$\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$.
In our machine translation example,
if the decoder truly recovered the probabilities
of the underlying generative process,
then this would give us the most likely translation.
Unfortunately, there is no guarantee
that greedy search will give us this sequence.

![At each time step, greedy search selects the token with the highest conditional probability.](../img/s2s-prob1.svg)
:label:`fig_s2s-prob1`
Let's illustrate it with an example.
Suppose that there are four tokens
"A", "B", "C", and "<eos>" in the output dictionary.
In :numref:`fig_s2s-prob1`,
the four numbers under each time step represent
the conditional probabilities of generating "A", "B", "C",
and "<eos>" at that time step, respectively.
At each time step, greedy search selects
the token with the highest conditional probability.
Therefore, the output sequence "A", "B", "C", and "<eos>"
will be predicted (:numref:`fig_s2s-prob1`).
The conditional probability of this output sequence
is $0.5\times0.4\times0.4\times0.6 = 0.048$.

![The four numbers under each time step represent
the conditional probabilities of generating "A", "B", "C", and "<eos>" at that time step.
At time step 2, the token "C", which has the second highest conditional probability,
is selected.](../img/s2s-prob2.svg)
:label:`fig_s2s-prob2`


Next, let's look at another example in :numref:`fig_s2s-prob2`.
Unlike in :numref:`fig_s2s-prob1`,
at time step 2 we select the token "C"
in :numref:`fig_s2s-prob2`,
which has the *second* highest conditional probability.
Since the output subsequences at time steps 1 and 2,
on which time step 3 is based,
have changed from "A" and "B" in :numref:`fig_s2s-prob1` to "A" and "C" in :numref:`fig_s2s-prob2`,
have changed from "A" and "B" in :numref:`fig_s2s-prob1`
to "A" and "C" in :numref:`fig_s2s-prob2`,
the conditional probability of each token
at time step 3 has also changed in :numref:`fig_s2s-prob2`.
Suppose that we choose the token "B" at time step 3.
Now time step 4 is conditional on
the output subsequence at the first three time steps
"A", "C", and "B",
which is different from "A", "B", and "C" in :numref:`fig_s2s-prob1`.
Therefore, the conditional probability of generating
each token at time step 4 in :numref:`fig_s2s-prob2`
is also different from that in :numref:`fig_s2s-prob1`.
As a result, the conditional probability of the output sequence
"A", "C", "B", and "<eos>" in :numref:`fig_s2s-prob2`
is $0.5\times0.3 \times0.6\times0.6=0.054$,
which is greater than that of greedy search in :numref:`fig_s2s-prob1`.
In this example, the output sequence "A", "B", "C", and "<eos>"
obtained by greedy search is not the optimal sequence.
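
As a quick sanity check of the arithmetic above, the two sequence probabilities can be recomputed directly from the per-step numbers read off :numref:`fig_s2s-prob1` and :numref:`fig_s2s-prob2`:

```python
# Per-step conditional probabilities read off the two figures.
greedy_path = [0.5, 0.4, 0.4, 0.6]  # "A", "B", "C", "<eos>" in fig_s2s-prob1
other_path = [0.5, 0.3, 0.6, 0.6]   # "A", "C", "B", "<eos>" in fig_s2s-prob2

def sequence_prob(step_probs):
    """Product of the per-step conditional probabilities."""
    prob = 1.0
    for p in step_probs:
        prob *= p
    return prob

print(round(sequence_prob(greedy_path), 3))  # 0.048
print(round(sequence_prob(other_path), 3))   # 0.054, larger than 0.048
```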

## Exhaustive Search

If the goal is to obtain the most likely sequence,
we may consider using *exhaustive search*:
exhaustively enumerate all the possible output sequences
with their conditional probabilities,
and then output the one that scores
the highest predicted probability.


While this would certainly give us what we desire,
it would come at a prohibitive computational cost
of $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$,
exponential in the sequence length and with an enormous
base given by the vocabulary size.
For example, when $|\mathcal{Y}|=10000$ and $T'=10$,
we will need to evaluate $10000^{10} = 10^{40}$ sequences.
These are small numbers compared to real applications
but already beyond the capabilities of any foreseeable computer.
On the other hand, the computational cost of greedy search is
$\mathcal{O}(\left|\mathcal{Y}\right|T')$:
miraculously cheap but far from optimal.
For example, when $|\mathcal{Y}|=10000$ and $T'=10$,
we only need to evaluate $10000\times10=10^5$ sequences.
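
These counts are easy to reproduce with a quick back-of-the-envelope calculation, using the same illustrative values of $|\mathcal{Y}|$ and $T'$ as above:

```python
vocab_size, max_len = 10000, 10

exhaustive = vocab_size ** max_len  # |Y|^T' candidate sequences to score
greedy = vocab_size * max_len       # |Y| * T' per-token evaluations

print(f"exhaustive search: {exhaustive:.1e} sequences")  # 1.0e+40
print(f"greedy search:     {greedy:.1e} evaluations")    # 1.0e+05
```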


## Beam Search

You could view sequence decoding strategies as lying on a spectrum,
with *beam search* striking a compromise
between the efficiency of greedy search
and the optimality of exhaustive search.
The most straightforward version of beam search
is characterized by a single hyperparameter,
the *beam size*, $k$.
At time step 1, we select the $k$ tokens
with the highest predicted probabilities.
Each of them will be the first token of
$k$ candidate output sequences, respectively.
At each subsequent time step,
based on the $k$ candidate output sequences
at the previous time step,
we continue to select $k$ candidate output sequences
with the highest predicted probabilities
from $k\left|\mathcal{Y}\right|$ possible choices.

![The process of beam search (beam size: 2, maximum length of an output sequence: 3). The candidate output sequences are $A$, $C$, $AB$, $CE$, $ABD$, and $CED$.](../img/beam-search.svg)
Let's illustrate how beam search works with an example.
Suppose that the output vocabulary contains only five elements:
$\mathcal{Y} = \{A, B, C, D, E\}$,
where one of them is “<eos>”.
Let the beam size be 2 and
the maximum length of an output sequence be 3.
At time step 1,
suppose that the tokens with the highest conditional probabilities
$P(y_1 \mid \mathbf{c})$ are $A$ and $C$.
At time step 2, for all $y_2 \in \mathcal{Y},$
we compute

$$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$.
Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\ P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c}).$
As a result, we get six candidate output sequences:
(i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.


In the end, we obtain the set of final candidate output sequences
based on these six sequences (e.g., discard portions including and after “<eos>”).
Then we choose, as the output sequence, the one
that attains the highest value of the following score:

$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}\mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:`eq_beam-search-score`

where $L$ is the length of the final candidate sequence
and $\alpha$ is usually set to 0.75.
Since a longer sequence has more logarithmic terms
in the summation of :eqref:`eq_beam-search-score`,
the term $L^\alpha$ in the denominator penalizes
long sequences.
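
Putting the pieces together, below is a minimal sketch of beam search under the same assumptions as the greedy sketch above: `step_log_probs(prefix)` is a hypothetical callable returning the log-probability of every token in $\mathcal{Y}$ coming next given the prefix (with the context variable folded in), and the final candidate is chosen with the length-normalized score of :eqref:`eq_beam-search-score`.

```python
def beam_search(step_log_probs, eos_id, max_len, beam_size, alpha=0.75):
    """Beam search sketch.

    `step_log_probs(prefix)` is a hypothetical callable returning a list with
    log P(y | prefix, c) for every token y in the vocabulary.
    """
    beams = [(0.0, [])]  # (accumulated log-probability, prefix)
    candidates = []      # every prefix that was ever kept on the beam
    for _ in range(max_len):
        expansions = []
        for log_p, prefix in beams:
            if prefix and prefix[-1] == eos_id:
                continue  # do not extend a sequence past "<eos>"
            for token, lp in enumerate(step_log_probs(prefix)):
                expansions.append((log_p + lp, prefix + [token]))
        if not expansions:
            break  # every beam has already ended with "<eos>"
        # Keep the k most probable of the (at most) k * |Y| expansions.
        expansions.sort(key=lambda c: c[0], reverse=True)
        beams = expansions[:beam_size]
        candidates.extend(beams)
    # Rescore all candidates with the length-normalized log-probability.
    def score(candidate):
        log_p, prefix = candidate
        return log_p / len(prefix) ** alpha
    return max(candidates, key=score)[1]
```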

The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$.
This result is in between that of greedy search and that of exhaustive search.
Greedy search can be treated as a special case of beam search
arising when the beam size is set to 1.




## Summary

Sequence searching strategies include
greedy search, exhaustive search, and beam search.
Beam search provides a tradeoff between accuracy and
computational cost via its flexible choice of the beam size.


## Exercises