<Addressing the Rare Word Problem in Neural Machine Translation>

1. A significant weakness in conventional NMT systems is their inability to correctly translate very rare words: end-to-end NMTs tend to have relatively small vocabularies with a single unk symbol that represents every possible out-of-vocabulary (OOV) word.

2. We train an NMT system on data that is augmented by the output of a word alignment algorithm, allowing the NMT system to emit, for each OOV word in the target sentence, the position of its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates every OOV word using a dictionary.

3. Motivated by the strengths of standard phrase-based systems, we propose and implement a novel approach to address the rare word problem of NMTs. Our approach annotates the training corpus with explicit alignment information that enables the NMT system to emit, for each OOV word, a "pointer" to its corresponding word in the source sentence. This information is later utilized in a post-processing step that translates the OOV words using a dictionary, or with the identity translation if no dictionary entry is found.

4. We propose to address the rare word problem by training the NMT system to track the origins of the unknown words in the target sentences. If we knew the source word responsible for each unknown target word, we could introduce a post-processing step that would replace each unk in the system’s output with a translation of its source word, using either a dictionary or the identity translation.
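This post-processing step can be sketched as follows. The function and data layout here are illustrative assumptions, not the paper's actual implementation: `alignments` holds, for each target token, the source word it points to (or `None`), and `dictionary` is the word-translation table built from the aligned corpus.

```python
def postprocess(target_tokens, alignments, dictionary):
    """Replace each <unk> in the NMT output with the dictionary
    translation of its aligned source word, falling back to the
    identity translation when no dictionary entry exists."""
    out = []
    for tok, src_word in zip(target_tokens, alignments):
        if tok == "<unk>" and src_word is not None:
            # Dictionary lookup; identity translation as the fallback.
            out.append(dictionary.get(src_word, src_word))
        else:
            out.append(tok)
    return out
```

For example, given the output `["the", "<unk>", "is", "beautiful"]` where the unk is aligned to the French word "portique", the step emits "portico" from the dictionary; a name like "Zürich" with no dictionary entry is copied through unchanged.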

5. We treat the NMT system as a black box and train it on a corpus annotated by one of the models below. First, the alignments are produced with an unsupervised aligner (the Berkeley aligner). Next, we use the alignment links to construct a word dictionary that will be used for the word translations in the post-processing step.

6. There are three annotation models: the Copyable Model, the Positional All Model (PosAll), and the Positional Unknown Model (PosUnk).

Copyable Model: unknown words in the source sentence are numbered unk_1, …, unk_n from left to right, and each unknown target word copies the token unk_i of the source word it is aligned to. Unknown target words that are aligned to known source words (or are unaligned) become unk_∅, so the model is limited by its inability to translate unknown target words that are aligned to known words in the source sentence.
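The copying scheme above can be sketched as follows. Names and the token spellings (`unk1`, `unk_null`) are hypothetical; `align` maps each target position to its aligned source position.

```python
def annotate_copyable(src, tgt, align, vocab_src, vocab_tgt):
    """Copyable model: number unknown source words unk1..unkN left to
    right; an unknown target word copies the index of its aligned
    unknown source word, otherwise it becomes the untranslatable
    null token unk_null."""
    src_out, idx_of, n = [], {}, 0
    for i, w in enumerate(src):
        if w in vocab_src:
            src_out.append(w)
        else:
            n += 1
            idx_of[i] = n            # remember which unk index this position got
            src_out.append(f"unk{n}")
    tgt_out = []
    for j, w in enumerate(tgt):
        if w in vocab_tgt:
            tgt_out.append(w)
        else:
            i = align.get(j)         # aligned source position, if any
            tgt_out.append(f"unk{idx_of[i]}" if i in idx_of else "unk_null")
    return src_out, tgt_out
```

With source "un portique magnifique" and target "a beautiful portico", where only "portique"/"portico" are OOV and aligned to each other, the source becomes `un unk1 magnifique` and the target `a beautiful unk1`; an OOV target word aligned to a known source word would instead surface as `unk_null`.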

Positional All Model (PosAll): all unknown words share a single universal unk token, and a positional token p_d is inserted after every word in the target sentence, where d is the relative distance to the aligned source word (a null token marks unaligned words). This encodes the complete alignment, but it doubles the length of the target sentence and makes learning more difficult.
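A sketch of the PosAll annotation, under the same illustrative conventions as before (the clipping bound `max_d` and token spellings `p_1`, `p_n` are assumptions):

```python
def annotate_posall(src, tgt, align, vocab_src, vocab_tgt, max_d=7):
    """PosAll model: every OOV word (source and target) becomes one
    universal <unk>; after *every* target word a positional token p_d
    is inserted, with d = src_pos - tgt_pos clipped to [-max_d, max_d],
    or p_n for unaligned words.  The target sequence doubles in length."""
    src_out = [w if w in vocab_src else "<unk>" for w in src]
    tgt_out = []
    for j, w in enumerate(tgt):
        tgt_out.append(w if w in vocab_tgt else "<unk>")
        if j in align:
            d = max(-max_d, min(max_d, align[j] - j))
            tgt_out.append(f"p_{d}")
        else:
            tgt_out.append("p_n")
    return src_out, tgt_out
```

On the running "un portique magnifique" example the three-word target grows to six tokens, which is exactly the length penalty the note above points out.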

Positional Unknown Model (PosUnk): only the unknown words in the target sentence are annotated; each is replaced by a single token unkpos_d, where d encodes the relative position of its aligned source word (unkpos_∅ if unaligned). This keeps the target length unchanged while still pointing each OOV word back to its source word.
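The PosUnk annotation can be sketched in the same style (again, `max_d` and the `unkpos_*` spellings are illustrative assumptions, not the paper's exact tokens):

```python
def annotate_posunk(src, tgt, align, vocab_src, vocab_tgt, max_d=7):
    """PosUnk model: only unknown *target* words are annotated.  An OOV
    target word at position j aligned to source position i becomes
    unkpos_d with d = i - j clipped to [-max_d, max_d]; unaligned OOVs
    become unkpos_null.  Source OOVs all become a plain <unk>."""
    src_out = [w if w in vocab_src else "<unk>" for w in src]
    tgt_out = []
    for j, w in enumerate(tgt):
        if w in vocab_tgt:
            tgt_out.append(w)       # known words pass through untouched
        elif j in align:
            d = max(-max_d, min(max_d, align[j] - j))
            tgt_out.append(f"unkpos_{d}")
        else:
            tgt_out.append("unkpos_null")
    return src_out, tgt_out
```

Unlike PosAll, the annotated target keeps its original length: on the running example only the OOV word "portico" changes, becoming `unkpos_-1` because its aligned source word sits one position to its left.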




In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder architecture for machine translation [4]. Their model uses a convolutional neural network (CNN) to encode a given source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to convert that state vector into the target language. Their work can be regarded as the birth of neural machine translation (NMT), an approach that uses deep neural networks to learn the mapping between natural languages. NMT's nonlinear mapping differs from the linear SMT models, and it describes semantic equivalence through the state vector that connects the encoder and decoder. Moreover, the RNN should in principle be able to capture the information behind arbitrarily long sentences, thereby solving the so-called "long distance reordering" problem.