Neural Machine Translation By Jointly Learning to Align and Translate Explained
In order to understand attention, we will begin by revisiting the soft-attention mechanism, one of the earliest and most influential forms of attention in neural sequence models. In “Neural Machine Translation by Jointly Learning to Align and Translate” (ICLR 2015), Bahdanau and colleagues observed that forcing an entire input sentence into a single fixed-length vector creates a bottleneck, especially for long or complex sequences. Rather than squash all information into one summary vector, their model produces a separate annotation vector for each source word and then, at each decoding step, computes a tailored context by softly selecting and combining just the relevant annotations. In this way, the decoder need only attend to the parts of the input that matter most for generating the next target word, allowing the system to capture rich, long-range dependencies without overwhelming a single fixed representation. We will cover all of this in detail, along with the architecture presented in the paper.
A normal encoder-decoder architecture would encode the whole input sequence into a single fixed-length vector, like so:
Instead, the proposed model encodes the input into a sequence of vectors and chooses only the relevant subset of vectors while decoding the translation (hence the title of the paper). This allows the model to deal better with long sentences because it doesn’t have to squash all the information into a fixed-length vector.
The traditional approach: Encoder-Decoder with a fixed-size vector
In a normal RNN, the hidden state at time $t$ is computed as

$$h_t = f(x_t, h_{t-1})$$

Here, $x_t$ is the input at time $t$ and $f$ is some nonlinear function (anything from a simple $\tanh$ unit to an LSTM).

In a “traditional” RNN Encoder-Decoder architecture, the encoder would use these hidden states $(h_1, \dots, h_{T_x})$ to build a single context vector

$$c = q(\{h_1, \dots, h_{T_x}\})$$

Where $q$ is again some nonlinear function; a common choice is simply to keep the last hidden state, i.e. $c = h_{T_x}$.
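To make this concrete, here is a minimal sketch in NumPy of such an encoder. The function name, the plain $\tanh$ recurrence, and the parameter shapes are illustrative assumptions, not the paper's exact setup; the point is only that the inputs are folded step by step into hidden states, and the fixed context is a single vector taken from them.

```python
import numpy as np

def encode_fixed(xs, W_x, W_h):
    """xs: list of input vectors; returns all hidden states and the fixed context c."""
    h = np.zeros(W_h.shape[0])
    hs = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)   # h_t = f(x_t, h_{t-1})
        hs.append(h)
    return hs, hs[-1]                    # c = q({h_1, ..., h_T}) = h_T here

# e.g. 5 input vectors of size 3, hidden size 4:
hs, c = encode_fixed([np.random.randn(3) for _ in range(5)],
                     0.1 * np.random.randn(4, 3), 0.1 * np.random.randn(4, 4))
```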
We can then draw the encoder architecture as something like this:
As we’ve said, creating a single fixed-length vector that squashes all the information from the hidden states risks losing important information, especially in long sequences.
Then, the decoder architecture would use this vector $c$, together with all the previously generated words, to predict the next target word:

$$p(y_t \mid \{y_1, \dots, y_{t-1}\}, c) = g(y_{t-1}, s_t, c)$$

Again, $g$ is a nonlinear function, and $s_t$ is the hidden state of the decoder RNN. Note that the same fixed $c$ is used at every decoding step.
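For contrast, a sketch of one step of this fixed-context decoder might look like the following (again with hypothetical names and a plain $\tanh$ update rather than the gated units used in practice). Note how the same $c$ is fed in at every step.

```python
import numpy as np

def decode_step_fixed(y_prev, s_prev, c, W_y, W_s, W_c, W_out):
    """One decoding step: s_t = f(y_{t-1}, s_{t-1}, c), then a softmax over the vocabulary."""
    s = np.tanh(W_y @ y_prev + W_s @ s_prev + W_c @ c)   # the *same* c at every step
    logits = W_out @ s
    p = np.exp(logits - logits.max())
    return p / p.sum(), s                                # p(y_t | y_<t, c) and the new state
```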
Soft attention: a variable context vector
Instead of passing a single vector $c$, the encoder now produces a sequence of annotation vectors, one per source word, and the decoder computes a distinct context vector $c_i$ for every target word $y_i$ it generates.
Encoder
The encoder consists of a Bidirectional Recurrent Neural Network (BiRNN). As the name suggests, a forward recurrent neural network reads the input sequence in order (from $x_1$ to $x_{T_x}$) and produces forward hidden states $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_{T_x})$, while a backward RNN reads it in reverse (from $x_{T_x}$ to $x_1$) and produces backward hidden states $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_{T_x})$.
As in a normal RNN, the hidden state at time $t$ depends on the current input and the previous hidden state, e.g. $\overrightarrow{h}_t = f(x_t, \overrightarrow{h}_{t-1})$ for the forward direction and $\overleftarrow{h}_t = f(x_t, \overleftarrow{h}_{t+1})$ for the backward one. Here, $f$ is the same kind of nonlinear recurrent function as before.
As I mentioned, the decoder is going to receive a variable context vector (a different $c_i$ for each output word). To build it, we first form an annotation $h_j$ for each source word $x_j$ by concatenating the corresponding forward and backward hidden states: $h_j = \left[\overrightarrow{h}_j^{\top} ; \overleftarrow{h}_j^{\top}\right]^{\top}$.
In this way, we obtain for each word an annotation that contains summaries of both the preceding words and the following words.
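A rough sketch of how these annotations could be computed, assuming a simple $\tanh$ RNN for each direction (the `rnn` and `annotations` helpers and their parameter names are my own, not the paper's):

```python
import numpy as np

def rnn(xs, W_x, W_h):
    """Run a simple tanh RNN over a list of input vectors; return all hidden states."""
    h, hs = np.zeros(W_h.shape[0]), []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h)
        hs.append(h)
    return hs

def annotations(xs, fwd, bwd):
    """fwd/bwd are (W_x, W_h) parameter pairs for the forward and backward RNNs."""
    h_fwd = rnn(xs, *fwd)               # reads x_1 .. x_Tx
    h_bwd = rnn(xs[::-1], *bwd)[::-1]   # reads x_Tx .. x_1, then re-aligned to 1 .. Tx
    # h_j = [forward h_j ; backward h_j]
    return [np.concatenate([f, b]) for f, b in zip(h_fwd, h_bwd)]
```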
Now, how is the context vector $c_i$ computed? It is simply a weighted sum of the annotations:

$$c_i = \sum_{j=1}^{T_x} \alpha_{ij} h_j$$

where the weight of each annotation is

$$\alpha_{ij} = \frac{\exp(e_{ij})}{\sum_{k=1}^{T_x} \exp(e_{ik})}, \qquad e_{ij} = a(s_{i-1}, h_j)$$

and $e_{ij}$ is an alignment score that measures how well the inputs around position $j$ match the output at position $i$, given the previous decoder state $s_{i-1}$.
You may have realized that here we’re just applying a good old softmax to the alignment scores $e_{ij}$: the resulting weights $\alpha_{ij}$ are positive and sum to one, and can be read as how strongly the target word $y_i$ is aligned to the source word $x_j$.
In practice this “alignment model” $a$ is a single-hidden-layer MLP: it linearly projects the previous decoder state $s_{i-1}$ and the annotation $h_j$, adds them, passes the result through a $\tanh$, and projects it down to a scalar, $e_{ij} = v_a^{\top} \tanh(W_a s_{i-1} + U_a h_j)$. It is trained jointly with the rest of the network.
To recap everything: we calculate a scalar score $e_{ij}$ for every pair of previous decoder state $s_{i-1}$ and annotation $h_j$, normalize those scores with a softmax to get the weights $\alpha_{ij}$, and take the weighted sum of the annotations to obtain the context vector $c_i$ (the sketch below walks through exactly these steps).
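Here is a small NumPy sketch of that whole attention step, using the MLP alignment model from above. The function name and parameter names loosely follow the paper's symbols but are otherwise my own choices.

```python
import numpy as np

def context_vector(s_prev, hs, W_a, U_a, v_a):
    """Compute c_i and the attention weights alpha_ij from the previous decoder state."""
    # e_ij = v_a^T tanh(W_a s_{i-1} + U_a h_j): the single-hidden-layer MLP alignment model
    e = np.array([v_a @ np.tanh(W_a @ s_prev + U_a @ h) for h in hs])
    alpha = np.exp(e - e.max())
    alpha /= alpha.sum()                          # softmax over the source positions
    c = sum(a * h for a, h in zip(alpha, hs))     # c_i = sum_j alpha_ij * h_j
    return c, alpha
```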
Roughly, this is what calculating $c_i$ looks like:
If we remove the lines from the decoder hidden state so the diagram is cleaner, this is how it looks for all the context vectors $c_i$:
Decoder
Finally, with the variable context vector $c_i$, the decoder defines the probability of the next target word as

$$p(y_i \mid \{y_1, \dots, y_{i-1}\}, \mathbf{x}) = g(y_{i-1}, s_i, c_i)$$

Similarly, each decoder hidden state can be calculated as:

$$s_i = f(s_{i-1}, y_{i-1}, c_i)$$
In the paper, $f$ is a gated recurrent unit (a GRU-like “gated hidden unit”), so the context vector $c_i$ influences the gates as well as the candidate state. Then, $g$ is implemented as a deep output layer: a single maxout hidden layer combines the decoder state, the embedding of the previous word, and $c_i$, and a softmax over the target vocabulary produces the word probabilities. The whole model (encoder, alignment model, and decoder) is trained jointly by maximizing the log-likelihood of the correct translations.
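Putting it together, a single decoding step could be sketched as below. This reuses the `context_vector` helper from the earlier sketch and, for readability, swaps the paper's gated unit and maxout output layer for a plain $\tanh$ update and an ordinary softmax, so it is an approximation of the architecture rather than a faithful reimplementation.

```python
import numpy as np

def decode_step(y_prev, s_prev, hs, params):
    """One attention-based decoder step (tanh/softmax stand-ins for the paper's GRU + maxout)."""
    W_a, U_a, v_a, W_y, W_s, W_c, W_out = params
    c_i, alpha = context_vector(s_prev, hs, W_a, U_a, v_a)   # attention step from the sketch above
    s_i = np.tanh(W_y @ y_prev + W_s @ s_prev + W_c @ c_i)   # s_i = f(s_{i-1}, y_{i-1}, c_i)
    logits = W_out @ np.concatenate([s_i, y_prev, c_i])      # g(y_{i-1}, s_i, c_i)
    p = np.exp(logits - logits.max())
    return p / p.sum(), s_i, alpha                           # word distribution, new state, alignment
```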
That’s pretty much it. Thanks for reading!