Author: Zhao Tianqi (赵天琪)

Paper Notes (1)

<Neural Machine Translation and Sequence-to-sequence Models: A Tutorial>

1. Machine translation is a special case of seq2seq learning, and arguably the most important one, because its methods can easily be applied to other tasks and it can in turn borrow methods from other tasks.

2. A translation system involves three problems: modeling, learning, and search.

3. n-gram

3.1 Word-by-word Computation of Probabilities

3.2 Count-based n-gram Language Models

However, long phrases rarely or never appear in the corpus, so we introduce the n-gram model, which approximates the probability of a word by conditioning only on the previous n-1 words.
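For reference, the standard quantities behind 3.1 and 3.2 can be written as below. The notation is an assumption chosen to follow common usage and may differ from the paper's: E is a sentence of T words, e_t is its t-th word, e_{T+1} is an end-of-sentence symbol, and c(·) is a corpus count.

```latex
% Chain rule: sentence probability as a product of word probabilities
P(E) = \prod_{t=1}^{T+1} P(e_t \mid e_1^{t-1})

% n-gram approximation: condition only on the previous n-1 words
P(e_t \mid e_1^{t-1}) \approx P(e_t \mid e_{t-n+1}^{t-1})

% Maximum-likelihood estimate from corpus counts
P_{ML}(e_t \mid e_{t-n+1}^{t-1}) = \frac{c(e_{t-n+1}^{t})}{c(e_{t-n+1}^{t-1})}
```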

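Even when n is small, some n-grams still never appear in the corpus and would receive zero probability, so we need smoothing. One simple form is interpolation with a lower-order distribution:

$$P(e_t \mid e_{t-1}) = (1 - \alpha) P_{ML}(e_t \mid e_{t-1}) + \alpha P_{ML}(e_t)$$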

There are many smoothing techniques, such as context-dependent smoothing coefficients, back-off, and modified distributions. See the paper for details.
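A minimal sketch of a count-based bigram model with interpolation smoothing may make this concrete. The function names, the toy corpus, and the value of `alpha` are all assumptions made for illustration; they are not taken from the paper.

```python
from collections import Counter

def train_bigram(sentences):
    """Count unigrams, bigrams, and bigram contexts; each sentence is a list of words."""
    unigrams, bigrams, contexts = Counter(), Counter(), Counter()
    for words in sentences:
        padded = ["<s>"] + words + ["</s>"]
        for prev, cur in zip(padded, padded[1:]):
            unigrams[cur] += 1
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return unigrams, bigrams, contexts

def interpolated_prob(prev, cur, unigrams, bigrams, contexts, alpha=0.1):
    """P(cur | prev) = (1 - alpha) * P_ML(cur | prev) + alpha * P_ML(cur)."""
    total = sum(unigrams.values())
    p_uni = unigrams[cur] / total
    # An unseen context contributes zero bigram probability.
    p_bi = bigrams[(prev, cur)] / contexts[prev] if contexts[prev] else 0.0
    return (1 - alpha) * p_bi + alpha * p_uni

corpus = [["the", "cat", "sat"], ["the", "dog", "sat"]]  # toy data
uni, bi, ctx = train_bigram(corpus)
seen = interpolated_prob("the", "cat", uni, bi, ctx)    # bigram occurs in the corpus
unseen = interpolated_prob("cat", "dog", uni, bi, ctx)  # bigram never occurs
```

The unseen bigram still gets a small non-zero probability from the unigram term, which is the point of smoothing.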

3.3 Evaluation of Language Models

As is usual in machine learning, language models need training data, development (validation) data, and test data.

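We usually use the log-likelihood of the test set as a measure:

$$\log P(\mathcal{E}_{test}; \theta) = \sum_{E \in \mathcal{E}_{test}} \log P(E; \theta)$$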

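Another common measure of language model accuracy is perplexity, the exponential of the negative per-word log-likelihood:

$$ppl(\mathcal{E}_{test}; \theta) = \exp\left(-\frac{\log P(\mathcal{E}_{test}; \theta)}{\mathrm{length}(\mathcal{E}_{test})}\right)$$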

An intuitive explanation of the perplexity is “how confused is the model about its decision?” More accurately, it expresses the value “if we randomly picked words from the probability distribution calculated by the language model at each time step, on average how many words would it have to pick to get the correct one?”
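Both measures are easy to compute once the model's probability for each test word is known. The sketch below assumes those per-word probabilities are already available as a list; the function names are hypothetical.

```python
import math

def log_likelihood(word_probs):
    """Sum of log P(e_t | context) over every word in the test set."""
    return sum(math.log(p) for p in word_probs)

def perplexity(word_probs):
    """Exponential of the negative per-word log-likelihood."""
    return math.exp(-log_likelihood(word_probs) / len(word_probs))

# A model that gives every correct word probability 0.25 behaves like a
# uniform choice among 4 words.
uniform = [0.25] * 8
ppl = perplexity(uniform)
```

Here the perplexity comes out as 4, which matches the "how many words would it have to pick" intuition: the model is as uncertain as a fair four-way choice.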

3.4 Handling Unknown Words

We usually assume that unknown words follow a uniform distribution.
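One way to use this assumption, sketched here under stated assumptions, is to reserve a small probability mass for unknown words and spread it uniformly over an assumed vocabulary size. The names `alpha_unk` and `n_unk` and their values are hypothetical.

```python
def prob_with_unknowns(word, known_probs, alpha_unk=0.01, n_unk=1_000_000):
    """Interpolate a known-word distribution with a uniform unknown-word one.

    alpha_unk (mass reserved for unknown words) and n_unk (assumed size of
    the full vocabulary) are hypothetical values chosen for illustration.
    """
    p_known = known_probs.get(word, 0.0)
    return (1 - alpha_unk) * p_known + alpha_unk * (1.0 / n_unk)

known = {"the": 0.6, "cat": 0.4}  # toy distribution over known words
p_seen = prob_with_unknowns("the", known)
p_unseen = prob_with_unknowns("zebra", known)  # never seen in training
```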

3.5 Further Reading (long distance, large scale...)

3.6 Exercise

4. Log-linear Language Models (use a loss function)

4.1 Calculate scores, then apply the softmax...
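As a rough sketch of this step: each context word selects a column of a weight matrix W, the selected columns are added to a bias b to give one score per vocabulary word, and the softmax turns the scores into probabilities. The storage layout (`W[j]` holds the column for context feature j) and the toy numbers are assumptions for illustration.

```python
import math

def scores(context_ids, W, b):
    """s = b + sum of the columns of W selected by the context features.

    W[j] is the score vector (one entry per vocabulary word) for feature j.
    """
    s = list(b)
    for j in context_ids:
        for k, w in enumerate(W[j]):
            s[k] += w
    return s

def softmax(s):
    """Normalise scores into a probability distribution."""
    m = max(s)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in s]
    z = sum(exps)
    return [e / z for e in exps]

W = [[1.0, 0.0, -1.0], [0.5, 0.5, 0.0]]  # 2 context features, 3 vocabulary words
b = [0.0, 0.1, 0.2]
s = scores([0, 1], W, b)
p = softmax(s)
```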

4.2

The loss is intuitive and reasonable: when P is close to 1, the loss is close to 0. When P is close to 0, log P goes to negative infinity, so the loss must carry a minus sign, -log P, which then goes to positive infinity (I think a minus sign should be added here).
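For reference, the loss under discussion is usually written as the negative log-likelihood of the training data. The notation below is an assumption following common usage: θ denotes the parameters and the context is the previous n-1 words.

```latex
% Per-word loss: negative log-likelihood of the correct next word
\ell(e_{t-n+1}^{t}, \theta) = -\log P(e_t \mid e_{t-n+1}^{t-1}; \theta)

% Corpus loss: sum over every word of every training sentence
\ell(\mathcal{E}_{train}, \theta) = \sum_{E \in \mathcal{E}_{train}} \sum_{t=1}^{|E|+1} \ell(e_{t-n+1}^{t}, \theta)
```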

There are also a few things to consider to ensure that training remains stable:

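1. Adjusting the learning rate: use learning rate decay.

2. Early stopping: stop training at the point that performs best on the development data, to prevent overfitting.

3. Shuffling the training order: prevents the model from being biased by the order of the data (the news example in the paper).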
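The three points above can be sketched together in one small training loop. This is a toy example that fits y = w * x with SGD, not the language model itself; the decay rule (halve the rate whenever the development loss fails to improve), the `patience` value, and all other names are assumptions for illustration.

```python
import random

def train(train_data, dev_data, epochs=50, lr=0.1, decay=0.5, patience=3, seed=0):
    """Fit y = w * x by SGD with shuffling, learning-rate decay, and early stopping."""
    rng = random.Random(seed)
    data = list(train_data)
    w, best_w, best_dev, bad_epochs = 0.0, 0.0, float("inf"), 0
    for _ in range(epochs):
        rng.shuffle(data)                  # 3. shuffle the training order
        for x, y in data:
            w -= lr * 2 * (w * x - y) * x  # SGD step on the squared error
        dev_loss = sum((w * x - y) ** 2 for x, y in dev_data) / len(dev_data)
        if dev_loss < best_dev:
            best_w, best_dev, bad_epochs = w, dev_loss, 0
        else:
            lr *= decay                    # 1. decay when dev loss stops improving
            bad_epochs += 1
            if bad_epochs >= patience:     # 2. early stopping
                break
    return best_w                          # parameters from the best dev epoch

train_data = [(x / 10, 3 * x / 10) for x in range(1, 11)]  # toy data with true w = 3
dev_data = [(0.25, 0.75), (0.75, 2.25)]
w = train(train_data, dev_data)
```

Returning the parameters from the best development epoch, rather than the last one, is what makes early stopping effective.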

And some optimization methods:

SGD with momentum: keeps an exponentially decaying average of past gradients and updates along it, which smooths out noisy steps.
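A minimal sketch of one common form of the momentum update, applied to the toy objective f(w) = w^2. The hyperparameter values and the function name are assumptions for illustration.

```python
def momentum_step(w, v, grad, lr=0.1, mu=0.9):
    """v <- mu * v + grad;  w <- w - lr * v."""
    v = mu * v + grad
    return w - lr * v, v

# Minimise f(w) = w^2, whose gradient is 2w.
w, v = 5.0, 0.0
for _ in range(200):
    w, v = momentum_step(w, v, 2 * w)
```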

AdaGrad: AdaGrad focuses on the fact that some parameters are updated much more frequently than others. For example, in the model above, columns of the weight matrix W corresponding to infrequent context words will only be updated a few times for every pass through the corpus, while the bias b will be updated on every training example. Based on this, AdaGrad dynamically adjusts the learning rate η for each parameter individually, with frequently updated (and presumably more stable) parameters such as b getting smaller updates, and infrequently updated parameters such as W getting larger updates.
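The per-parameter behaviour can be seen in a short sketch. Here parameter 0 plays the role of a frequently updated parameter (like b) and parameter 1 an infrequently updated one (like a rare column of W); the gradients are made-up values, and the names and hyperparameters are assumptions.

```python
import math

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """Per-parameter update: w_i <- w_i - lr / sqrt(G_i + eps) * g_i,
    where G_i accumulates the squared gradients of parameter i."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr / math.sqrt(accum[i] + eps) * g
    return params, accum

params, accum = [0.0, 0.0], [0.0, 0.0]
last_steps = [0.0, 0.0]
for step in range(10):
    # Parameter 0 receives a gradient every step; parameter 1 only on the last.
    grads = [1.0, 1.0 if step == 9 else 0.0]
    before = list(params)
    adagrad_step(params, grads, accum)
    last_steps = [abs(a - b) for a, b in zip(params, before)]
```

On the final step both parameters see the same gradient, but the frequently updated one moves by about lr / sqrt(10) while the rarely updated one moves by about lr.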

Adam: Adam is another method that computes a learning rate for each parameter. It does so by keeping track of exponentially decaying averages of past gradients and past squared gradients (estimates of their mean and variance), incorporating ideas similar to both momentum and AdaGrad.
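A sketch of the Adam update for a single scalar parameter, again on the toy objective f(w) = w^2. The hyperparameter values follow commonly used defaults except for the learning rate, and the function name is an assumption.

```python
import math

def adam_step(w, m, v, grad, t, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1 - beta1) * grad         # decaying average of gradients
    v = beta2 * v + (1 - beta2) * grad * grad  # decaying average of squared gradients
    m_hat = m / (1 - beta1 ** t)               # correct the bias toward zero
    v_hat = v / (1 - beta2 ** t)
    return w - lr * m_hat / (math.sqrt(v_hat) + eps), m, v

# Minimise f(w) = w^2, whose gradient is 2w.
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 501):
    w, m, v = adam_step(w, m, v, 2 * w, t)
```

Because the update divides by the root of the squared-gradient average, the very first step has size close to `lr` whatever the scale of the gradient.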

4.3 Derivatives for Log-linear Models

The answers to the two questions above are subtle: both involve a one-hot vector.

Why? We need to refer to matrix calculus. References:

"Matrix calculus" (Wikipedia)

"Derivative of Softmax loss function"

"Common matrix derivative formulas in machine learning" (程序园)

I worked through the first one and confirmed that it is correct. I think the second follows by a similar argument.
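A short derivation shows where the one-hot vector comes from. The notation is an assumption following common usage: s is the score vector, p = softmax(s), e_t is the index of the correct word, x_j is the j-th context word, and the loss is the negative log-likelihood.

```latex
% Scores, probabilities, and loss
s = \sum_{j} W_{\cdot, x_j} + b, \qquad
p_i = \frac{\exp(s_i)}{\sum_k \exp(s_k)}, \qquad
\ell = -\log p_{e_t} = -s_{e_t} + \log \sum_k \exp(s_k)

% Derivative with respect to each score
\frac{\partial \ell}{\partial s_i} = p_i - \mathbb{1}[i = e_t]
\quad\Longrightarrow\quad
\frac{d\ell}{ds} = p - \mathrm{onehot}(e_t)

% s depends on b and on each selected column of W through the identity map
\frac{d\ell}{db} = p - \mathrm{onehot}(e_t), \qquad
\frac{d\ell}{dW_{\cdot, x_j}} = p - \mathrm{onehot}(e_t)
```

The second result has the same form as the first because b and each selected column of W enter the score as plain additive terms.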

4.4 Other Features for Language Modeling

Context word features, Context class, Context suffix features, Bag-of-words features...

4.5 Further Reading

Whole-sentence language models, Discriminative language models...

4.6 Exercise

BlueCatの窝