Effective Approaches to Attention-based Neural Machine Translation

1. This paper examines two simple and effective classes of attentional mechanisms: a global approach that always attends to all source words, and a local one that only looks at a subset of source words at a time.

2. The global approach attends to all source words, while the local one considers only a subset of source words at a time. The former resembles the model of Bahdanau et al. (2015) but is architecturally simpler. The latter can be viewed as an interesting blend of the hard and soft attention models proposed by Xu et al. (2015): it is computationally less expensive than the global model or soft attention; at the same time, unlike hard attention, local attention is differentiable, making it easier to implement and train.

3. Neural Machine Translation

4. Common to these two types of models is that, at each time step t in the decoding phase, both approaches first take as input the hidden state h_t at the top layer of a stacking LSTM. The goal is then to derive a context vector c_t that captures relevant source-side information to help predict the current target word y_t.

5. Global Attention

The idea of a global attentional model is to consider all the hidden states of the encoder when deriving the context vector c_t. In this model type, a variable-length alignment vector a_t, whose size equals the number of time steps on the source side, is derived by comparing the current target hidden state h_t with each source hidden state \overline{h_s}:

a_t(s) = align(h_t, \overline{h_s}) = softmax( score(h_t, \overline{h_s}) )

The score function comes in three variants:

score(h_t, \overline{h_s}) = h_t^T \overline{h_s} (dot), h_t^T W_a \overline{h_s} (general), or v_a^T tanh( W_a [h_t ; \overline{h_s}] ) (concat)
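As a minimal sketch, the three score variants (dot, general, concat) can be written as follows. All dimensions and weights here are toy stand-ins for illustration, not the paper's trained parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                   # hidden size (toy)
h_t = rng.standard_normal(d)            # current target hidden state
h_s = rng.standard_normal(d)            # one source hidden state
W_a = rng.standard_normal((d, d))       # "general" weight matrix
W_cat = rng.standard_normal((d, 2 * d)) # "concat" weight matrix
v_a = rng.standard_normal(d)            # "concat" projection vector

def score_dot(h_t, h_s):
    # dot: h_t^T h_s
    return h_t @ h_s

def score_general(h_t, h_s, W_a):
    # general: h_t^T W_a h_s
    return h_t @ W_a @ h_s

def score_concat(h_t, h_s, W_cat, v_a):
    # concat: v_a^T tanh(W_a [h_t; h_s])
    return v_a @ np.tanh(W_cat @ np.concatenate([h_t, h_s]))
```

Each function maps a (target state, source state) pair to a single scalar score; the softmax over these scores across source positions gives a_t.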

6. Global attentional model

At each time step t, the model infers a variable-length alignment weight vector a_t based on the current target state h_t and all source states \overline{h_s}. A global context vector c_t is then computed as the weighted average, according to a_t, over all the source states.

Given the alignment vector as weights, the context vector c_t is computed as the weighted average over all the source hidden states.
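The whole global-attention step can be sketched in a few lines; the dot-product score is used here for brevity, and all values are toy stand-ins:

```python
import numpy as np

def global_attention(h_t, H_s):
    """Softmax over scores of h_t against every source state,
    then a weighted average of the source states as the context vector."""
    scores = H_s @ h_t                  # (S,) one dot score per source state
    e = np.exp(scores - scores.max())
    a_t = e / e.sum()                   # alignment weights, sum to 1
    c_t = a_t @ H_s                     # context vector, shape (d,)
    return a_t, c_t

rng = np.random.default_rng(0)
H_s = rng.standard_normal((6, 4))       # 6 source states, hidden size 4
h_t = rng.standard_normal(4)
a_t, c_t = global_attention(h_t, H_s)
```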

7. Comparison to (Bahdanau et al., 2015): First, we simply use the hidden states at the top LSTM layers in both the encoder and decoder, as illustrated in Figure 2. Bahdanau et al. (2015), on the other hand, use the concatenation of the forward and backward source hidden states in their bi-directional encoder, and the target hidden states in their non-stacking uni-directional decoder. Second, our computation path is simpler, whereas Bahdanau et al. (2015) also need to consider the previous hidden state h_{t-1}. Lastly, Bahdanau et al. (2015) only experimented with one alignment function, concat; we show later that the other alternatives are better.

8. Local Attention

In concrete detail, the model first generates an aligned position p_t for each target word at time t. The context vector c_t is then derived as a weighted average over the set of source hidden states within the window [p_t − D, p_t + D], where D is empirically selected.

We consider two variants of this model:

Monotonic alignment (local-m) – we simply set p_t = t, assuming that source and target sequences are roughly monotonically aligned. The alignment vector a_t is then defined according to the softmax equation above.

Predictive alignment (local-p) – instead of assuming monotonic alignments, our model predicts an aligned position:

p_t = S · sigmoid( v_p^T tanh( W_p h_t ) )

where W_p and v_p are model parameters and S is the source sentence length, so p_t ∈ [0, S]. To favor alignment points near p_t, the alignment weights are multiplied by a Gaussian centered at p_t:

a_t(s) = align(h_t, \overline{h_s}) · exp( −(s − p_t)^2 / (2σ^2) )

with the standard deviation empirically set to σ = D/2.
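A sketch of the local-p variant, using a dot-product score inside the window and toy stand-in weights for W_p and v_p:

```python
import numpy as np

def local_p_attention(h_t, H_s, W_p, v_p, D):
    """Predict p_t, attend only within [p_t - D, p_t + D], and damp the
    alignment weights with a Gaussian centered at p_t (sigma = D/2)."""
    S = H_s.shape[0]
    p_t = S / (1.0 + np.exp(-(v_p @ np.tanh(W_p @ h_t))))  # S * sigmoid(...)
    lo = max(0, int(np.floor(p_t - D)))
    hi = min(S, int(np.ceil(p_t + D)) + 1)
    window = H_s[lo:hi]
    scores = window @ h_t                                  # dot score in window
    e = np.exp(scores - scores.max())
    a = e / e.sum()                                        # softmax in window
    s = np.arange(lo, hi)
    sigma = D / 2.0
    a = a * np.exp(-((s - p_t) ** 2) / (2 * sigma ** 2))   # Gaussian damping
    c_t = a @ window                                       # context vector
    return p_t, c_t

rng = np.random.default_rng(0)
d, S, D = 4, 10, 2
H_s = rng.standard_normal((S, d))
h_t = rng.standard_normal(d)
W_p = rng.standard_normal((d, d))
v_p = rng.standard_normal(d)
p_t, c_t = local_p_attention(h_t, H_s, W_p, v_p, D)
```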


9. Comparison to (Gregor et al., 2015): they propose a selective attention mechanism, very similar to our local attention, for the image generation task. Their approach allows the model to select an image patch of varying location and zoom. We, instead, use the same "zoom" for all target positions, which greatly simplifies the formulation and still achieves good performance.

(I don't understand this well.)

10. Input-feeding Approach

In our proposed global and local approaches, the attentional decisions are made independently, which is suboptimal. In standard MT, by contrast, a coverage set is often maintained during the translation process to keep track of which source words have been translated. Likewise, in attentional NMT, alignment decisions should be made jointly, taking past alignment information into account. To address this, we propose an input-feeding approach in which attentional vectors \widetilde{h_t} are concatenated with the inputs at the next time steps, as illustrated below. The effects of having such connections are two-fold: (a) we hope to make the model fully aware of previous alignment choices, and (b) we create a very deep network spanning both horizontally and vertically.
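A toy decoding loop illustrating input feeding: the previous attentional vector \widetilde{h_t} is concatenated with the current input embedding before each recurrent step. A single tanh layer stands in for the stacking LSTM, and all weights are random stand-ins:

```python
import numpy as np

rng = np.random.default_rng(0)
d, S, T = 4, 6, 3                      # hidden size, source len, target len
H_s = rng.standard_normal((S, d))      # source hidden states
W_r = rng.standard_normal((d, 3 * d))  # toy recurrent cell weights
W_c = rng.standard_normal((d, 2 * d))  # attentional layer weights
embeds = rng.standard_normal((T, d))   # target input embeddings

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

h = np.zeros(d)                        # recurrent state
h_tilde = np.zeros(d)                  # previous attentional vector
for t in range(T):
    x = np.concatenate([embeds[t], h_tilde])            # input feeding
    h = np.tanh(W_r @ np.concatenate([x, h]))           # one recurrent step
    a_t = softmax(H_s @ h)                              # global alignment weights
    c_t = a_t @ H_s                                     # context vector
    h_tilde = np.tanh(W_c @ np.concatenate([c_t, h]))   # attentional vector
```

Feeding h_tilde back in at the next step is what lets the network condition each alignment decision on the previous ones.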




In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder architecture for machine translation [4]. Their model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to transform that state vector into the target language. Their work can be regarded as the birth of neural machine translation (NMT), an approach that uses deep neural networks to learn the mapping between natural languages. NMT's non-linear mapping differs from the linear SMT models, and it describes semantic equivalence through the state vector connecting the encoder and decoder. Moreover, the RNN was expected to capture the information behind arbitrarily long sentences, thereby solving the so-called "long distance reordering" problem.