# Paper Notes (1)

*Neural Machine Translation and Sequence-to-sequence Models: A Tutorial*

1. Machine translation is a special case of seq2seq learning, but the most important one: its techniques can easily be applied to other seq2seq tasks, and it can also learn from those tasks in turn.

2. A translation system consists of three components: Modeling, Learning, and Search.

3. n-gram

3.1 Word-by-word Computation of Probabilities

3.2 Count-based n-gram Language Models

However, long phrases rarely (or never) appear in the corpus, so we introduce n-grams, which approximate the probability of a word using only its previous n − 1 words.
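A concrete sketch of a count-based bigram (n = 2) model, assuming a hypothetical toy corpus and pure maximum-likelihood estimates with no smoothing:

```python
from collections import defaultdict

# Count-based bigram model: P(e_t | e_{t-1}) = c(e_{t-1}, e_t) / c(e_{t-1}).
def train_bigram(corpus):
    pair_counts = defaultdict(int)     # c(e_{t-1}, e_t)
    context_counts = defaultdict(int)  # c(e_{t-1})
    for sentence in corpus:
        tokens = ["<s>"] + sentence + ["</s>"]
        for prev, cur in zip(tokens, tokens[1:]):
            pair_counts[(prev, cur)] += 1
            context_counts[prev] += 1

    def prob(cur, prev):
        if context_counts[prev] == 0:
            return 0.0                 # unseen context: exactly what smoothing fixes
        return pair_counts[(prev, cur)] / context_counts[prev]

    return prob

corpus = [["a", "b"], ["a", "c"], ["a", "b"]]
p = train_bigram(corpus)               # e.g. P(b | a) = 2/3
```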

But some n-grams still never appear in the corpus even when n is small, so we need smoothing.
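For example, interpolation smoothing mixes the maximum-likelihood bigram estimate with a unigram fallback, where α is the interpolation coefficient:

$$P(e_t \mid e_{t-1}) = (1-\alpha)\, P_{\mathrm{ML}}(e_t \mid e_{t-1}) + \alpha\, P_{\mathrm{ML}}(e_t)$$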

There are many smoothing techniques, such as: Context-dependent smoothing coefficients, Back-off, and Modified distributions. See the paper for reference.

3.3 Evaluation of Language Models

As with typical machine learning tasks, language models need training data, development (validation) data, and test data.

We usually use the log-likelihood of the test set as a measure:

$$\log P(\mathcal{E}_{\mathrm{test}}) = \sum_{E \in \mathcal{E}_{\mathrm{test}}} \log P(E)$$

Another common measure of language model accuracy is perplexity:

$$\mathrm{ppl}(\mathcal{E}_{\mathrm{test}}) = e^{-\log P(\mathcal{E}_{\mathrm{test}})/\mathrm{length}(\mathcal{E}_{\mathrm{test}})}$$

An intuitive explanation of the perplexity is “how confused is the model about its decision?” More accurately, it expresses the value “if we randomly picked words from the probability distribution calculated by the language model at each time step, on average how many words would it have to pick to get the correct one?”
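A minimal sketch of both measures, assuming we already have the per-word probabilities the model assigned to a test corpus:

```python
import math

# Log-likelihood: sum of log probabilities of each test word.
def log_likelihood(probs):
    return sum(math.log(p) for p in probs)

# Perplexity: exp of the negative average log-likelihood per word.
# Lower is better; a perfect model has perplexity 1.
def perplexity(probs):
    return math.exp(-log_likelihood(probs) / len(probs))

# A model assigning probability 0.25 to every word has perplexity 4:
# it is "as confused" as picking uniformly among 4 words.
ppl = perplexity([0.25] * 10)
```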

3.4 Handling Unknown Words

We usually assume that unknown words follow a uniform distribution.

3.5 Further Reading (long-distance dependencies, large-scale training, ...)

3.6 Exercise

4. Log-linear Language Models (trained with a loss function)

4.1 Compute a score for each candidate word, then apply the softmax...
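A rough sketch of this step; the toy weight matrix W, bias b, and feature vector x below are made up for illustration:

```python
import math

# Softmax: turn a vector of scores into a probability distribution.
def softmax(scores):
    m = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Log-linear LM step: score each vocabulary word with a linear function
# of the context features x, then softmax into probabilities.
def predict(W, b, x):
    scores = [sum(w_i * x_i for w_i, x_i in zip(row, x)) + b_j
              for row, b_j in zip(W, b)]
    return softmax(scores)

# Toy example: 2-word vocabulary, 2-dimensional context features.
probs = predict([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], [1.0, 0.0])
```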

4.2 Learning Model Parameters

The negative log-likelihood loss is insightful and reasonable: when P is close to 1, the loss is close to 0, and when P is close to 0, the loss approaches infinity (I think a minus sign is missing in the formula here).
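A quick numeric check of this behavior:

```python
import math

# Negative log-likelihood loss for the correct word's probability p.
def nll(p):
    return -math.log(p)

# nll(1.0) is 0; nll of a probability near 0 grows without bound.
```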

There are also a few things to consider to ensure that training remains stable:

1. Adjusting the learning rate: learning rate decay.

2. Early stopping: find the best time to stop training to prevent overfitting.

3. Shuffling training order: prevent ordering bias (the news example).
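The first two points can be sketched together in one loop; the development-set losses below are hypothetical stand-ins for real evaluation:

```python
# Learning rate decay plus early stopping over training epochs.
def train(initial_lr, decay, patience, dev_losses):
    best, best_epoch = float("inf"), -1
    lr = initial_lr
    for epoch, loss in enumerate(dev_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            break                      # early stopping: dev loss stopped improving
        lr *= decay                    # decay the learning rate each epoch
    return best_epoch, lr

best_epoch, lr = train(1.0, 0.5, 2, [3.0, 2.0, 2.5, 2.6, 2.7])
```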

And some optimization methods:

SGD with momentum: straightforward; the update accumulates an exponentially decaying average of past gradients, which smooths noisy updates.
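A one-step sketch of the momentum update; the lr and mu values are illustrative:

```python
# SGD with momentum: the velocity accumulates past gradients,
# and the parameter moves by the velocity each step.
def sgd_momentum_step(param, velocity, grad, lr=0.1, mu=0.9):
    velocity = mu * velocity - lr * grad
    return param + velocity, velocity

p, v = sgd_momentum_step(1.0, 0.0, grad=2.0)   # v = -0.2, p = 0.8
```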

AdaGrad: AdaGrad focuses on the fact that some parameters are updated much more frequently than others. For example, in the model above, columns of the weight matrix W corresponding to infrequent context words will only be updated a few times for every pass through the corpus, while the bias b will be updated on every training example. Based on this, AdaGrad dynamically adjusts the training rate η for each parameter individually, with frequently updated (and presumably more stable) parameters such as b getting smaller updates, and infrequently updated parameters such as W getting larger updates.
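A sketch of the AdaGrad update for a vector of parameters; the gradient values are illustrative. Note how a large and a small gradient both end up producing updates of comparable magnitude:

```python
import math

# AdaGrad: each parameter keeps a running sum of squared gradients;
# frequently/strongly updated parameters get smaller effective rates.
def adagrad_step(params, sq_sums, grads, lr=0.1, eps=1e-8):
    new_params, new_sums = [], []
    for p, s, g in zip(params, sq_sums, grads):
        s = s + g * g
        p = p - lr * g / (math.sqrt(s) + eps)
        new_params.append(p)
        new_sums.append(s)
    return new_params, new_sums

# First parameter sees a large gradient, second a small one;
# AdaGrad scales both updates toward roughly lr in magnitude.
params, sums = adagrad_step([0.0, 0.0], [0.0, 0.0], grads=[10.0, 0.1])
```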

Adam: Adam is another method that computes learning rates for each parameter. It does so by keeping track of exponentially decaying averages of the mean and variance of past gradients, incorporating ideas similar to both momentum and AdaGrad.
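A sketch of a single Adam step for one scalar parameter, using the default hyperparameters from the Adam paper:

```python
import math

# Adam: exponentially decaying averages of the gradient (m, like momentum)
# and its square (v, like AdaGrad), with bias correction because both
# averages start at zero. t is the 1-based step count.
def adam_step(p, m, v, g, t, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    p = p - lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v

# On the very first step the bias-corrected update is close to lr in size,
# regardless of the gradient's raw magnitude.
p, m, v = adam_step(0.0, 0.0, 0.0, g=5.0, t=1)
```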

4.3 Derivatives for Log-linear Models

The answer to the above two questions is subtle: both derivatives involve one-hot vectors.

Why? We need to refer to matrix calculus (the Wikipedia article on matrix calculus is a good reference).

I proved the first one, and it is correct; I think the second one can be shown similarly.
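The one-hot structure can also be checked numerically: for softmax with negative log-likelihood, the gradient of the loss with respect to the scores is (probabilities − onehot(label)). A finite-difference sketch:

```python
import math

def softmax(scores):
    m = max(scores)                    # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

# Analytic gradient of -log softmax(scores)[label] w.r.t. scores:
# p - onehot(label).
def analytic_grad(scores, label):
    p = softmax(scores)
    return [p_i - (1.0 if i == label else 0.0) for i, p_i in enumerate(p)]

# Numerical gradient via forward finite differences, for comparison.
def numeric_grad(scores, label, h=1e-6):
    def loss(s):
        return -math.log(softmax(s)[label])
    grads = []
    for i in range(len(scores)):
        bumped = list(scores)
        bumped[i] += h
        grads.append((loss(bumped) - loss(scores)) / h)
    return grads

scores, label = [1.0, 2.0, 0.5], 1
ga = analytic_grad(scores, label)
gn = numeric_grad(scores, label)       # should closely match ga
```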

4.4 Other Features for Language Modeling

Context word features, Context class, Context suffix features, Bag-of-words features...

4.5 Further Reading

Whole-sentence language models, Discriminative language models...

4.6 Exercise

In 2013, Nal Kalchbrenner and Phil Blunsom proposed a novel end-to-end encoder-decoder architecture for machine translation. Their model uses a convolutional neural network (CNN) to encode a given piece of source text into a continuous vector, and then uses a recurrent neural network (RNN) as the decoder to transform that state vector into the target language. Their work can be seen as the birth of neural machine translation (NMT), an approach that uses deep neural networks to learn the mapping between natural languages. NMT's nonlinear mapping differs from the linear SMT models, and it describes semantic equivalence through the state vector that connects the encoder and decoder. Moreover, the RNN should in principle be able to capture the information behind arbitrarily long sentences, thereby addressing the so-called "long distance reordering" problem.