Neural machine translation by jointly learning to align and translate

Abstract
Neural machine translation (NMT) models often belong to a family of encoder–decoders: an encoder encodes a source sentence into a fixed-length vector from which a decoder generates a translation.
The use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture. We propose to extend it by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly.
Introduction
The performance of a basic encoder–decoder deteriorates rapidly as the length of an input sentence increases.
We introduce an extension to the encoder–decoder model which learns to align and translate jointly.
Each time the proposed model generates a word in a translation, it (soft-)searches for a set of positions in a source sentence where the most relevant information is concentrated.
The model then predicts a target word based on the context vectors associated with these source positions and all the previously generated target words.
It encodes the input sentence into a sequence of vectors and chooses a subset of these vectors adaptively while decoding the translation.
➔ This frees a neural translation model from having to squash all the information of a source sentence, regardless of its length, into a fixed-length vector.
RNN Encoder–Decoder
In the Encoder–Decoder framework, an encoder reads the input sentence, a sequence of vectors x = (x1, · · · , xTx), into a vector c.
The most common approach is to use an RNN such that ht = f(xt, ht−1) and c = q({h1, ..., hTx}), where ht ∈ Rn is the hidden state at time t and c is a vector generated from the sequence of hidden states; f and q are some nonlinear functions. For instance, Sutskever et al. (2014) used an LSTM as f and q({h1, ..., hTx}) = hTx.
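A minimal NumPy sketch of this basic encoder, using a plain tanh cell for f and taking c = hTx as in the Sutskever et al. choice of q; the function and parameter names (encode, W_x, W_h, b) are illustrative assumptions, and the actual models use gated hidden units.

```python
import numpy as np

def encode(x_embedded, W_x, W_h, b):
    """h_t = tanh(W_x x_t + W_h h_{t-1} + b); returns all states and c = h_Tx."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in x_embedded:                    # x_embedded: sequence of input vectors
        h = np.tanh(W_x @ x_t + W_h @ h + b)  # h_t = f(x_t, h_{t-1})
        states.append(h)
    return states, states[-1]                 # c = q({h_1, ..., h_Tx}) = h_Tx
```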
The decoder is often trained to predict the next word yt′ given the context vector c and all the previously predicted words {y1, ..., yt′−1}.
➔ i.e., the decoder defines a probability over the translation y by decomposing the joint probability into the ordered conditionals:
p(y) = ∏t=1..Ty p(yt | {y1, ..., yt−1}, c), where y = (y1, ..., yTy).
With an RNN, each conditional probability is modeled as p(yt | {y1, ..., yt-1} , c) = g(yt−1, st, c), where g is a nonlinear, potentially multi-layered, function that outputs the probability of yt, and st is the hidden state of the RNN.
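A hedged sketch of one such decoder step: here g is realized as a single affine layer followed by a softmax over the target vocabulary, whereas the actual model uses a more elaborate output layer; all names (decoder_step, W_y, W_s, W_c, W_out) are assumptions for illustration.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def decoder_step(y_prev, s_prev, c, W_y, W_s, W_c, b, W_out, b_out):
    """s_t = f(s_{t-1}, y_{t-1}, c);  p(y_t | y_<t, c) = g(y_{t-1}, s_t, c)."""
    s_t = np.tanh(W_y @ y_prev + W_s @ s_prev + W_c @ c + b)
    p_yt = softmax(W_out @ s_t + b_out)   # distribution over the target vocabulary
    return s_t, p_yt
```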
Learning to align and translate (New model architecture)
Decoder : General description
Each conditional probability is defined as p(yi | y1, ..., yi−1, x) = g(yi−1, si, ci),
where si is an RNN hidden state for time i, computed by si = f(si−1, yi−1, ci).
➔ The probability is conditioned on a distinct context vector ci for each target word yi.
The context vector ci depends on a sequence of annotations (h1, ..., hTx) to which an encoder maps the input sentence.
Each annotation hi contains information about the whole input sequence with a strong focus on the parts surrounding the i-th word of the input sequence.
The context vector ci is computed as a weighted sum of these annotations: ci = Σj αij hj.
The weight αij of each annotation hj is computed by a softmax over alignment scores: αij = exp(eij) / Σk exp(eik), where eij = a(si−1, hj) is an alignment model which scores how well the inputs around position j and the output at position i match.
➔ The score is based on the RNN hidden state si−1 and the j-th annotation hj of the input sentence.
We parametrize the alignment model a as a feedforward neural network which is jointly trained with all the other components of the proposed system.
The alignment model directly computes a soft alignment, which allows the gradient of the cost function to be backpropagated through. This gradient can be used to train the alignment model as well as the whole translation model jointly.
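A minimal NumPy sketch of this attention step, assuming an additive feedforward form for a, eij = vaᵀ tanh(Wa si−1 + Ua hj); the function and parameter names (attend, W_a, U_a, v_a) are mine, not the paper's notation.

```python
import numpy as np

def attend(s_prev, annotations, W_a, U_a, v_a):
    """Return the context vector c_i and the alignment weights alpha_i.

    annotations: array of shape (Tx, 2n), the encoder annotations h_1..h_Tx.
    """
    # e_ij = a(s_{i-1}, h_j): one score per source position
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h_j) for h_j in annotations])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                                  # alpha_ij = softmax_j(e_ij)
    c_i = (alpha[:, None] * annotations).sum(axis=0)      # c_i = sum_j alpha_ij h_j
    return c_i, alpha
```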
Encoder : Bidirectional RNN for annotating sequences
We would like the annotation of each word to summarize not only the preceding words, but also the following words.
➔ We propose to use a bidirectional RNN (BiRNN, Schuster and Paliwal, 1997).
We obtain an annotation hj for each word xj by concatenating the forward hidden state →hj and the backward one ←hj: hj = [→hj ; ←hj].
The annotation hj contains the summaries of both the preceding words and the following words. Due to the tendency of RNNs to better represent recent inputs, the annotation hj will be focused on the words around xj.
➔ This sequence of annotations is used later by the decoder and the alignment model to compute the context vector.
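A small sketch of how these annotations could be computed by running a simple tanh RNN cell in both directions and concatenating the states per word; the helper names (rnn, bidirectional_annotations) and the plain cell are illustrative assumptions, since the paper uses gated hidden units.

```python
import numpy as np

def rnn(inputs, W_x, W_h, b):
    """A plain tanh RNN over a sequence of input vectors; returns all hidden states."""
    h = np.zeros(W_h.shape[0])
    states = []
    for x_t in inputs:
        h = np.tanh(W_x @ x_t + W_h @ h + b)
        states.append(h)
    return states

def bidirectional_annotations(x_embedded, fwd_params, bwd_params):
    """h_j = [forward h_j ; backward h_j] for every source position j."""
    forward = rnn(x_embedded, *fwd_params)
    backward = rnn(x_embedded[::-1], *bwd_params)[::-1]   # reverse back to source order
    return [np.concatenate([f, b]) for f, b in zip(forward, backward)]
```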
Experiment settings
Dataset : the English–French parallel corpora from WMT ’14.
➔ After a usual tokenization, we use a shortlist of 30,000 most frequent words in each language to train our models. Any word not included in the shortlist is mapped to a special token ([UNK]).
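A small sketch of this preprocessing step; the helper names (build_shortlist, map_to_shortlist) and the example tokens are hypothetical.

```python
from collections import Counter

def build_shortlist(tokenized_sentences, size=30000):
    """Keep the `size` most frequent tokens of a tokenized corpus."""
    counts = Counter(tok for sent in tokenized_sentences for tok in sent)
    return {tok for tok, _ in counts.most_common(size)}

def map_to_shortlist(sentence, shortlist, unk="[UNK]"):
    """Replace every out-of-shortlist token with the special [UNK] token."""
    return [tok if tok in shortlist else unk for tok in sentence]

# e.g. map_to_shortlist(["le", "chat", "zzyzx"], {"le", "chat"}) -> ["le", "chat", "[UNK]"]
```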
Models : RNN Encoder–Decoder (RNNencdec; Cho et al., 2014) and RNNsearch (the proposed model).
➔ We train each model twice: first with sentences of length up to 30 words (RNNencdec-30, RNNsearch-30) and then with sentences of length up to 50 words (RNNencdec-50, RNNsearch-50).
SGD with Adadelta : Each SGD update direction is computed using a minibatch of 80 sentences.
Once a model is trained, we use a beam search to find a translation that approximately maximizes the conditional probability.
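A compact sketch of such a beam search: keep the `beam_size` best partial translations by summed log-probability and extend each one step at a time. The callback step_fn and the default beam size are assumptions for illustration; in the real system the trained decoder supplies the log-probabilities.

```python
import numpy as np

def beam_search(step_fn, eos_id, beam_size=5, max_len=50):
    """step_fn(prefix) -> log p(y_t | prefix, x) over the target vocabulary."""
    beams = [([], 0.0)]                               # (token ids, summed log-prob)
    finished = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            log_probs = step_fn(prefix)
            for tok in np.argsort(log_probs)[-beam_size:]:
                candidates.append((prefix + [int(tok)], score + log_probs[tok]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, score in candidates[:beam_size]:
            (finished if prefix[-1] == eos_id else beams).append((prefix, score))
        if not beams:                                 # every surviving hypothesis ended in <eos>
            break
    return max(finished + beams, key=lambda c: c[1])  # highest-scoring hypothesis
```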
Results
BLEU score table : the performance of RNNsearch is as high as that of the conventional phrase-based translation system (Moses).
The performance of RNNencdec dramatically drops as the length of the sentences increases.
Both RNNsearch-30 and RNNsearch-50 are more robust to the length of the sentences. RNNsearch-50, especially, shows no performance deterioration even with sentences of length 50 or more.
Conclusion
We extended the basic encoder–decoder by letting a model (soft-)search for a set of input words, or their annotations computed by an encoder, when generating each target word.
→ This frees the model from having to encode a whole source sentence into a fixed-length vector, and also lets the model focus only on information relevant to the generation of the next target word.
➔ This has a major positive impact on the ability of the neural machine translation system to yield good results on longer sentences.
All of the pieces of the translation system, including the alignment mechanism, are jointly trained towards a better log-probability of producing correct translations.