Neural machine translation by jointly learning to align and translate

Abstract
Neural machine translation (NMT) models often belong to a family of encoder–decoders: an encoder encodes a source sentence into a fixed-length vector from which a decoder generates a translation.
The use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture. We propose to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Introduction
The performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.
We introduce an extension to the encoder–decoder model which learns to align and translate jointly.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
It encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.
➔ This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector.
RNN Encoder–Decoder
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors x = (x1, · · · , xTx), into a vector c.
The most common approach is to use an RNN such that ht = f(xt, ht−1) and c = q({h1, ..., hTx}), where ht ∈ Rn is the hidden state at time t and c is a vector generated from the sequence of hidden states; f and q are some nonlinear functions. For instance, Sutskever et al. (2014) used an LSTM as f and q({h1, ..., hTx}) = hTx.
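A minimal NumPy sketch of this basic encoder, using a plain tanh cell for f and taking c = hTx as in the Sutskever et al. choice of q; the function and parameter names (encode, W_x, W_h, b) are illustrative assumptions, and the actual models use gated hidden units.

```python
import numpy as np

def encode(x_embedded, W_x, W_h, b):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b); returns all states and c = h_Tx."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x_embedded:                    # x_embedded: sequence of input vectors
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    return states, states[-1]                 # c = q({h_1, ..., h_Tx}) = h_Tx
```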
The decoder is often trained to predict the next word yt′ given the context vector c and all the previously predicted words {y1, ..., yt′−1}.
➔ i.e., the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:
p(y) = ∏t=1..Ty p(yt | {y1, ..., yt−1}, c), where y = (y1, ..., yTy).
With an RNN, each conditional probability is modeled as p(yt | {y1, ..., yt-1} , c) = g(yt−1, st, c), where g is a nonlinear, potentially multi-layered, function that outputs the probability of yt, and st is the hidden state of the RNN.
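A hedged sketch of one such decoder step: here g is realized as a single affine layer followed by a softmax over the target vocabulary, whereas the actual model uses a more elaborate output layer; all names (decoder_step, W_y, W_s, W_c, W_out) are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, W_y, W_s, W_c, b, W_out, b_out):
    """s_t = f(s_{t-1}, y_{t-1}, c);  p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    s_t = np.tanh(W_y @ y_prev + W_s @ s_prev + W_c @ c + b)
    p_yt = softmax(W_out @ s_t + b_out)   # distribution over the target vocabulary
    return s_t, p_yt
```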
Learning to align and translate (New model architecture)
Decoder : General description
Each conditional probability is defined as p(yi | y1, ..., yi−1, x) = g(yi−1, si, ci),
where si is an RNN hidden state for time i, computed by si = f(si−1, yi−1, ci).
➔ The probability is conditioned on a distinct context vector ci for each target word yi.
The context vector ci depends on a sequence of annotations (h1, ..., hTx) to which an encoder maps the input sentence.
Each annotation hi contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence.
The context vector ci is computed as a weighted sum of these annotations: ci = Σj αij hj.
The weight αij of each annotation hj is computed by a softmax over alignment scores: αij = exp(eij) / Σk exp(eik), where eij = a(si−1, hj) is an alignment model which scores how well the inputs around position j and the output at position i match.
➔ The score is based on the RNN hidden state si−1 and the j-th annotation hj of the input sentence.
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system.
The alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.
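A minimal NumPy sketch of this attention step, assuming an additive feedforward form for a, eij = vaᵀ tanh(Wa si−1 + Ua hj); the function and parameter names (attend, W_a, U_a, v_a) are mine, not the paper's notation.

```python
import numpy as np

def attend(s_prev, annotations, W_a, U_a, v_a):
    """Return the context vector c_i and the alignment weights alpha_i.

    annotations: array of shape (Tx, 2n), the encoder annotations h_1..h_Tx.
    """
    # e_ij = a(s_{i-1}, h_j): one score per source position
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # alpha_ij = softmax_j(e_ij)
    c_i = (alpha[:, None] * annotations).sum(axis=0)      # c_i = sum_j alpha_ij h_j
    return c_i, alpha
```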
Encoder : Bidirectional RNN for annotating sequences
We would like the annotation of each word to summarize not only the preceding words, but also the following words.
➔ We propose to use a bidirectional RNN (BiRNN, Schuster and Paliwal, 1997).
We obtain an annotation hj for each word xj by concatenating the forward hidden state →hj and the backward one ←hj: hj = [→hj ; ←hj].
The annotation hj contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation hj will be focused on the words around xj.
➔ This sequence of annotations is used later by the decoder and the alignment model to compute the context vector.
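A small sketch of how these annotations could be computed by running a simple tanh RNN cell in both directions and concatenating the states per word; the helper names (rnn, bidirectional_annotations) and the plain cell are illustrative assumptions, since the paper uses gated hidden units.

```python
import numpy as np

def rnn(inputs, W_x, W_h, b):
    """A plain tanh RNN over a sequence of input vectors; returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return states

def bidirectional_annotations(x_embedded, fwd_params, bwd_params):
    """h_j = [forward h_j ; backward h_j] for every source position j."""
    forward = rnn(x_embedded, *fwd_params)
    backward = rnn(x_embedded[::-1], *bwd_params)[::-1]   # reverse back to source order
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```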
Experiment settings
Dataset : the English–French parallel corpora from WMT ’14.
➔ After a usual tokenization, we use a shortlist of 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK]).
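A small sketch of this preprocessing step; the helper names (build_shortlist, map_to_shortlist) and the example tokens are hypothetical.

```python
from collections import Counter

def build_shortlist(tokenized_sentences, size=30000):
    """Keep the `size` most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}

def map_to_shortlist(sentence, shortlist, unk="[UNK]"):
    """Replace every out-of-shortlist token with the special [UNK] token."""
    return [tok if tok in shortlist else unk for tok in sentence]

# e.g. map_to_shortlist(["le", "chat", "zzyzx"], {"le", "chat"}) -> ["le", "chat", "[UNK]"]
```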
Models : RNN Encoder–Decoder (RNNencdec; Cho et al., 2014) and RNNsearch (the proposed model).
➔ We train each model twice: first with sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).
SGD with Adadelta : Each SGD update direction is computed using a minibatch of 80 sentences.
Once a model is trained, we use a beam search to find a translation that approximately maximizes the conditional probability.
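A compact sketch of such a beam search: keep the `beam_size` best partial translations by summed log-probability and extend each one step at a time. The callback step_fn and the default beam size are assumptions for illustration; in the real system the trained decoder supplies the log-probabilities.

```python
import numpy as np

def beam_search(step_fn, eos_id, beam_size=5, max_len=50):
    """step_fn(prefix) -> log p(y_t | prefix, x) over the target vocabulary."""
    beams = [([], 0.0)]                               # (token ids, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((prefix + [int(tok)], score + log_probs[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                 # every surviving hypothesis ended in <eos>
            break
    return max(finished + beams, key=lambda c: c[1])  # highest-scoring hypothesis
```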
Results
BLEU score table : the performance of RNNsearch is as high as that of the conventional phrase-based translation system (Moses).
The performance of RNNencdec dramatically drops as the length of the sentences increases.
Both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more.
Conclusion
We extended the basic encoder–decoder by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word.
→ This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word.
➔ This has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences.
All of the pieces of the translation system, including the alignment mechanism, are jointly trained towards a better log-probability of producing correct translations.