AI Research Society · Beginner Group | The Master Algorithm: Chapter 3 Review and Chapter 4 Preview

In recent years, the field of artificial intelligence has made breakthrough leaps, drawing strong interest from students and practitioners across many technical fields. On the road into artificial intelligence, The Master Algorithm is arguably essential reading, and its importance needs little elaboration. The author, Pedro Domingos, may appear to give only a broad-brush introduction to the mainstream schools of machine learning, yet he touches on nearly every important application, existing or yet to come. The book both lets beginners survey machine learning from a macro perspective and plants countless leads that let the motivated reader dig deeper into specific technical problems, making it a rare guiding introduction to the field. Its witty, humorous style also makes it a pleasure to read.


With this book as our vehicle, 机器之心's "AI Research Society · Beginner Group" will soon officially open (see how to join)! We invite all beginners interested in artificial intelligence and machine learning to join us and, through reading and discussing The Master Algorithm, gain a broad and comprehensive understanding of the history and technical principles of artificial intelligence.



Chapter #3 Review

【Chapter Summary】

As mentioned in the previous chapter, there are five tribes of machine learning, and Symbolism is one of them. The Symbolists' core belief is that all intelligence can be reduced to the manipulation of symbols. Like everything, Symbolism has two sides:

Advantages:

  • It is simple: it doesn't require us to actually figure out how evolution or the brain works, and it avoids mathematical complexities.

  • Inverse deduction has a crucial property that makes incorporating knowledge quite easy.

Shortcomings:

  • It is easy to get lost in a vast number of  possible inductions unless we stay close to the initial knowledge.

  • Inverse deduction is easily confused by noise.

  • Real concepts can seldom be concisely defined by a set of rules: they require weighing and accumulating weak evidence until a clear picture emerges.

What is induction?

Example I

Socrates is human.
.....?.....
Therefore Socrates is mortal.

By inverse deduction, we can induce the missing rule:

If Socrates is human, then he’s mortal.

Going a step further, we can apply Newton's principle and generalize the rule to everyone:

All humans are mortal.

Example II: Using induction to figure out gene regulation

If the temperature is high, gene A is expressed.
.....?.....
If gene C is expressed, gene D is not.

If we knew the first and third rules but not the second, and we had microarray data where B and D were not expressed at a high temperature, we could induce the second rule by inverse deduction:

If gene A is expressed and gene B is not, gene C is expressed.

Once we have that rule (which can itself be verified by a microarray experiment), we can use it as the basis for further inductive inferences.
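The inverse-deduction step above can be sketched as a search for a missing rule that, chained with the known rules, explains the observation. This is only a toy illustration, not the book's algorithm: the fact names, rule encoding, and brute-force search are all invented here.

```python
from itertools import combinations

def forward_chain(facts, rules):
    """Apply if-then rules until no new facts can be derived."""
    facts = set(facts)
    changed = True
    while changed:
        changed = False
        for antecedents, consequent in rules:
            if antecedents <= facts and consequent not in facts:
                facts.add(consequent)
                changed = True
    return facts

# The two known rules from the text, encoded as (antecedents, consequent).
rule1 = ({"high_temp"}, "A_expressed")
rule3 = ({"C_expressed"}, "D_not_expressed")

# Hypothetical microarray observation: at high temperature, B is not
# expressed -- and we want to explain why D ends up not expressed.
observation = {"high_temp", "B_not_expressed"}
target = "D_not_expressed"

# Search candidate rules of the form "if <some facts> then C is expressed"
# and keep those that, chained with rules 1 and 3, explain the target.
literals = ["A_expressed", "B_not_expressed", "high_temp"]
candidates = []
for r in range(1, len(literals) + 1):
    for ants in combinations(literals, r):
        candidate = (set(ants), "C_expressed")
        if target in forward_chain(observation, [rule1, rule3, candidate]):
            candidates.append(candidate)

# Several candidates explain this single observation; the rule induced in
# the text is among them (more data would be needed to single it out).
print(({"A_expressed", "B_not_expressed"}, "C_expressed") in candidates)  # True
```

Note that a single observation is consistent with several candidate rules; this is exactly why induction needs more data (or prior knowledge) to pin down the right one.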

Now we encounter a problem: inverse deduction is very computationally intensive, which makes it hard to scale, especially to massive data sets. Thus the symbolists' algorithm of choice is decision tree induction. The book explains decision trees in detail with vivid case studies. A decision tree starts at the root; each node asks about the value of one attribute, and depending on the answer, we follow one branch or another. When we arrive at a leaf, we read off the predicted concept:

[Figure: a decision tree predicting party affiliation from stances on tax cuts and abortion]

If you’re for cutting taxes and pro-life, you’re a Republican.
If you’re against cutting taxes, you’re a Democrat.

Attributes of a decision tree:

  • Each path from the root to a leaf corresponds to a rule.

  • You cannot belong to more than one class, nor to none of them.

(Note that sets of concepts with this property are called sets of classes, and an algorithm that predicts them is a classifier.)

How to learn a decision tree:

  • Firstly, we pick an attribute to test at the root.

  • Then we focus on examples which went down each branch and pick the next test for those. (For example, we check whether tax-cutters are pro-life or pro-choice.)

  • We repeat this for each new node we induce until all the examples in a branch have the same class, at which point we label that branch with the class.
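The three steps above can be sketched as a recursive procedure. This is a minimal illustration with an invented toy electorate dataset; it uses a simple majority-error impurity where a real learner would typically use information gain.

```python
from collections import Counter

def impurity(examples):
    """Fraction of examples outside the majority class."""
    counts = Counter(e["label"] for e in examples)
    return 1 - counts.most_common(1)[0][1] / len(examples)

def grow_tree(examples, attributes):
    labels = {e["label"] for e in examples}
    if len(labels) == 1 or not attributes:
        # pure branch (or nothing left to test): label it with the majority
        return Counter(e["label"] for e in examples).most_common(1)[0][0]
    def split_score(attr):  # weighted impurity of the branches under attr
        return sum(impurity(branch) * len(branch) / len(examples)
                   for v in {e[attr] for e in examples}
                   for branch in [[e for e in examples if e[attr] == v]])
    best = min(attributes, key=split_score)
    rest = [a for a in attributes if a != best]
    return {(best, v): grow_tree([e for e in examples if e[best] == v], rest)
            for v in {e[best] for e in examples}}

def classify(tree, example):
    while isinstance(tree, dict):
        (attr, _), _ = next(iter(tree.items()))
        tree = tree[(attr, example[attr])]
    return tree

data = [
    {"taxes": "cut",  "abortion": "pro-life",   "label": "Republican"},
    {"taxes": "cut",  "abortion": "pro-choice", "label": "Independent"},
    {"taxes": "keep", "abortion": "pro-life",   "label": "Democrat"},
    {"taxes": "keep", "abortion": "pro-choice", "label": "Democrat"},
]
tree = grow_tree(data, ["taxes", "abortion"])
print(classify(tree, {"taxes": "cut", "abortion": "pro-life"}))  # Republican
```

Each path from the root of the resulting tree to a leaf reads off as one of the rules quoted above.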

Stop here and think about this question: how can we verify the logic and generalization of a decision tree? How can we avoid overfitting on limited data? Regardless of the size of the data, without generalization you won't even get off the ground. In machine learning, preconceived notions (assumptions) are indispensable. All in all, our goal is to write the simplest program possible that keeps updating itself from the data until all the knowledge in the data has been learned.

Here is a summary of two learning rules mentioned in the book:

Method I
   Starting with restrictive assumptions and gradually relaxing them when they fail to explain the data is a typical approach in machine learning. Take learning which matches are good as an example:

  • Start by assuming that all matches are good.

  • Then try excluding all matches that lack a certain attribute, repeating this for each attribute, and choose the attribute whose test excludes the most bad matches and the fewest good ones.

  • Keep adding attributes in turn in the same way.

  • Once all the bad matches are excluded, you’re done: now you have a definition of the concept that includes only the positive matches and excludes all the negative ones.
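Method I can be sketched as a greedy general-to-specific search. The "match" dataset and attribute names below are invented for illustration, not taken from the book.

```python
matches = [
    {"likes_movies": True,  "same_city": True,  "good": True},
    {"likes_movies": True,  "same_city": True,  "good": True},
    {"likes_movies": True,  "same_city": False, "good": False},
    {"likes_movies": False, "same_city": True,  "good": False},
]
conditions = ["likes_movies", "same_city"]

rule = []                    # start by assuming every match is good
covered = list(matches)
while any(not m["good"] for m in covered):
    # score a condition by (bad matches excluded) - (good matches excluded)
    def score(cond):
        excluded = [m for m in covered if not m[cond]]
        return (sum(1 for m in excluded if not m["good"])
                - sum(1 for m in excluded if m["good"]))
    best = max(conditions, key=score)
    conditions.remove(best)
    rule.append(best)        # specialize: require this attribute too
    covered = [m for m in covered if m[best]]

print(rule)                              # ['likes_movies', 'same_city']
print(all(m["good"] for m in covered))   # True: only good matches remain
```

The loop stops exactly when all bad matches have been excluded, which is the stopping condition stated above.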

Method II
   “Divide and Conquer”

  • After we learn each rule, we discard the positive examples that it accounts for.

  • The next rule tries to account for as many of the remaining positive examples as possible, and so on until all are accounted for.

Sometimes, while searching for a single rule, the algorithm keeps several hypotheses around (not just one) and, at each step, extends them in all possible ways.
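The divide-and-conquer procedure can be sketched as sequential covering: greedily learn one rule (as in Method I), discard the positive examples it accounts for, and repeat. All data and helper names below are invented for illustration.

```python
def learn_one_rule(positives, negatives, conditions):
    """Greedily add conditions until no negative example is covered."""
    rule, avail = [], list(conditions)
    pos, neg = list(positives), list(negatives)
    while neg and avail:
        # prefer the condition excluding many negatives and few positives
        best = max(avail, key=lambda c: sum(1 for e in neg if not e[c])
                                      - sum(1 for e in pos if not e[c]))
        avail.remove(best)
        rule.append(best)
        pos = [e for e in pos if e[best]]
        neg = [e for e in neg if e[best]]
    return rule

def sequential_covering(positives, negatives, conditions):
    rules, remaining = [], list(positives)
    while remaining:
        rule = learn_one_rule(remaining, negatives, conditions)
        rules.append(rule)
        # discard the positive examples this rule accounts for
        remaining = [e for e in remaining if not all(e[c] for c in rule)]
    return rules

# two positive examples that need two different rules to cover them
positives = [{"a": True, "b": False}, {"a": False, "b": True}]
negatives = [{"a": False, "b": False}]
conditions = ["a", "b"]
print(sequential_covering(positives, negatives, conditions))  # [['a'], ['b']]
```

Each learned rule covers some of the remaining positives while excluding the negatives, so the rule set grows until every positive example is accounted for.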

In this chapter, the author introduces the following concepts:

[Rationalist] The senses deceive people while logical reasoning is the only sure path to knowledge.
   e.g. Pundits, lawyers, and mathematicians are rationalists.

[Empiricist] All reasoning is fallible, and knowledge must come from observation and experimentation.
   e.g. Journalists, doctors, and scientists are empiricists.

[Hume's Problem of Induction] Induction is the process of inferring the unobserved from the observed. Hume's problem of induction is that the inductive method cannot be rationally justified: Hume argues that there is neither an inductive nor a deductive justification for induction, because either would fall into circular reasoning by assuming the uniformity of nature, which is precisely what needs to be proved.

[Overfitting] A learner finds a pattern in the data that does not hold in the real world, i.e., the pattern does not fit new data. Some possible causes of overfitting: 1) noise (errors in the data, unpredictable random events, etc.); 2) too many candidate hypotheses and not enough training data to tell them apart.
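The definition above can be demonstrated in a few lines with synthetic data: a learner that simply memorizes its training set scores perfectly on pure noise, yet is at chance on fresh noise, because the "pattern" it found does not exist in the world.

```python
import random
random.seed(0)

def make_noise(n):
    # inputs and labels are independent coin flips: there is no real pattern
    return [((random.random(), random.random()), random.randint(0, 1))
            for _ in range(n)]

train, test = make_noise(200), make_noise(200)

memory = {x: y for x, y in train}     # "learn" by memorizing every example
predict = lambda x: memory.get(x, 0)  # default guess for unseen inputs

train_acc = sum(predict(x) == y for x, y in train) / len(train)
test_acc = sum(predict(x) == y for x, y in test) / len(test)
print(train_acc)  # 1.0 -- the noise is "learned" perfectly
print(test_acc)   # roughly 0.5 -- nothing generalizes to new data
```

The gap between training and test accuracy is the signature of overfitting.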


Week 1 Q & A Collection

  • What is overfitting?

    • See chapter summary: “A learner finds a pattern in the data that is unsuitable to the real world, i.e., this pattern does not fit the new data.”

  • What are some ways to solve or reduce overfitting? (Three come from this chapter; the rest are common remedies in practice.)

    • “Use statistical significance tests to make sure the patterns we’re seeing are really there.”

    • Prefer simpler hypotheses, e.g. via the "divide and conquer" method; compare Occam's razor.

    • Try to obtain a larger/more representative dataset

    • Regularization during pre-processing

    • Cross Validation during evaluation

    • Support Vector Machines (SVM) during classification/regression

  • What is induction? Give an example.

    • See chapter summary

  • Develop a decision tree with your own example.

    • Omitted here.

    • Please share your design in the WeChat group for discussion.
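As a concrete illustration of the cross-validation remedy listed above, here is a hand-rolled k-fold sketch on an invented toy dataset (the helper names are mine, not from the book):

```python
def k_fold_accuracy(data, train_fn, predict_fn, k=5):
    folds = [data[i::k] for i in range(k)]   # round-robin split into k folds
    accs = []
    for i in range(k):
        test = folds[i]
        train = [e for j, f in enumerate(folds) if j != i for e in f]
        model = train_fn(train)
        accs.append(sum(predict_fn(model, x) == y for x, y in test) / len(test))
    return sum(accs) / k

def train_majority(train):
    # toy learner: always predict the majority class seen in training
    ones = sum(y for _, y in train)
    return 1 if ones * 2 >= len(train) else 0

data = [(x, 1 if x >= 4 else 0) for x in range(10)]
acc = k_fold_accuracy(data, train_majority, lambda model, x: model)
print(acc)  # 0.6: an honest estimate of this crude learner's accuracy
```

Because every example is scored only while held out of training, the averaged accuracy estimates generalization rather than memorization.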

Chapter #4 Preview

【Chapter Summary】

Last century, Hebb's rule and a growing understanding of how the brain works naturally sparked developments in modeling the learning process along connectionist lines, one of which was the perceptron model. It rose as a mathematically unimpeachable idea but fell with disastrous effects in application. Later, a striking analogy between the brain and spin glasses led to the first algorithm to address the credit-assignment problem, which was then refined by the Boltzmann machine. The perceptron model came back to the stage with the S-curve (sigmoid) function and the backpropagation algorithm, bringing waves of success in a variety of fields.
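As a concrete taste of the perceptron mentioned above, here is a minimal sketch of the classic perceptron learning rule on a toy linearly separable problem (logical AND); the data and learning-rate values are illustrative, not from the book.

```python
def perceptron_train(data, n_inputs, lr=0.1, epochs=50):
    w = [0.0] * n_inputs
    b = 0.0                              # bias plays the role of -threshold
    for _ in range(epochs):
        for x, target in data:
            fired = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            error = target - fired       # 0 when correct, +/-1 when wrong
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error
    return w, b

# a linearly separable toy problem: logical AND
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = perceptron_train(data, 2)
print([1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
       for x, _ in data])  # [0, 0, 0, 1]
```

The same rule famously fails on problems that are not linearly separable, such as XOR, which is precisely the limitation behind the perceptron's fall discussed in the chapter.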

【Important Sections】

  • The Rise and Fall of the Perceptron

    • In the mid-twentieth century, Cornell psychologist Frank Rosenblatt invented the perceptron. This section introduces the basic idea of the perceptron, how it corresponds to a neuron, the role of weights, how the weights are learned, and the limitations of the perceptron model at the time.

  • Physicist Makes Brain out of Glass:

    • In 1982, Caltech physicist John Hopfield noticed an analogy between the brain and spin glasses, which inspired a new type of neural network that evolves over time toward minimum-energy states serving as memories. The Boltzmann machine then refined this approach by replacing deterministic neurons with probabilistic ones, helping to address the credit-assignment problem.

  • Climbing Mountains in Hyperspace:

    • Replacing the perceptron's step function with an S curve yielded an important and powerful algorithm: backpropagation. Although it can get stuck in local minima rather than finding the global error minimum, most of the time a local minimum solves the problem well enough.

  • Deeper into the Brain

    • One significant advance by the connectionists is the autoencoder: a multilayer perceptron whose output is the same as its input. With some clever tricks, variants such as the sparse autoencoder and stacks of sparse autoencoders have come to play important roles in applications. Finally, the author explores how progress in modeling intelligence may improve our understanding of how the brain works.
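As a taste of the chapter's central algorithm, here is a minimal backpropagation sketch for a tiny two-neuron sigmoid network (one hidden neuron, one output neuron, with invented weights). The analytic chain-rule gradient is checked against a numerical finite-difference estimate, a standard sanity check for backprop implementations.

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def forward(w1, w2, x):
    h = sigmoid(w1 * x)          # hidden activation
    y = sigmoid(w2 * h)          # output activation
    return h, y

def loss(w1, w2, x, target):
    _, y = forward(w1, w2, x)
    return 0.5 * (y - target) ** 2

def backprop(w1, w2, x, target):
    h, y = forward(w1, w2, x)
    # chain rule, layer by layer, using sigmoid'(z) = s * (1 - s)
    dy = (y - target) * y * (1 - y)      # dLoss / d(w2 * h)
    dw2 = dy * h
    dh = dy * w2                         # dLoss / dh
    dw1 = dh * h * (1 - h) * x
    return dw1, dw2

w1, w2, x, t = 0.5, -0.3, 1.0, 1.0
dw1, dw2 = backprop(w1, w2, x, t)

eps = 1e-6   # numerical check of the analytic gradients
num_dw1 = (loss(w1 + eps, w2, x, t) - loss(w1 - eps, w2, x, t)) / (2 * eps)
num_dw2 = (loss(w1, w2 + eps, x, t) - loss(w1, w2 - eps, x, t)) / (2 * eps)
print(abs(dw1 - num_dw1) < 1e-8, abs(dw2 - num_dw2) < 1e-8)  # True True
```

Gradient descent then nudges each weight against its gradient; the smooth S curve is what makes these derivatives exist at all, unlike the perceptron's step function.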

【Key Concepts】

  • Hebb's Rule

  • Perceptron Model

  • Boltzmann Machine

  • Backpropagation

  • Autoencoder

【Quiz】

  1. How does the original perceptron model correspond to the neuron model?

  2. What is the core of the Boltzmann machine?

  3. Why does the author state that a local minimum is sufficient most of the time in backpropagation?

  4. What is the trick for sparse autoencoder?

  5. Why are we still far away from building an artificial brain?
