# 研学社 · 入门组 | 第六期：初入贝叶斯网络

## 贝叶斯定理 • 托马斯贝叶斯：这位牧师第一次提出了对概率的新思考方式
• Pierre-Simon de Laplace：首次从贝叶斯观点出发发展出贝叶斯定理的法国人 • 朴素贝叶斯分类器可以表达为原因→ 效果图模型，就如上图所示。
• 朴素假设：给定分类标签，所有的特征都是条件独立的。比如说，p(X1|Y) 与 p(X2|Y) 是相互独立的。

• 运行时间复杂度 = O(CD)。C= 类型数， D=特征数
• 一个有足够数据去估测的过于简单模型比一个数据不足的完美模型更好
• 优势：快速；避免了过拟合；这个简单模型经验上标校友联表现优良，即使朴素假设并不实际
• 成对连接

• Markov 假定（错误但有用）一个事件的概率在文本的每个位置都是一样的 • 隐马尔可夫模型（HMM）：在一个隐藏状态中假定马尔可夫过程
• Speech Recognition(Siri)语音识别（Siri）hidden states: written wordsobservations: sounds spoken to Sirithe goal is to infer the words from the sounds 隐藏状态：写下来的文字 观察：说给 Siri 的话 目的是从声音中推断出文字 其它应用：计算生物学，词性标记

### Bayesian Network and its applications (Everything is connected, but not directly)

• 贝叶斯网络（Judea Pearl）是一个非常复杂的相关性随机变量网络，其中每个变量仅直接和其他很少的几个变量相关。
• 朴素贝叶斯、马尔可夫链，和隐马尔可夫模型是贝叶斯网络的几种特例。
• 例子：报警器。
• 你房子里装的报警器会因为盗贼试图入侵而激发，也会被地震激发如果警报器响了，邻居 Bob 或 Claire 会电话通知你 • 警报响了以后，Bob会根据盗窃或地震打电话。对于一个已有的警报，Bob 的电话与盗窃和地震是条件独立的。当他发现警报响起时，Bob 打电话通知的事件是与盗窃和地震条件独立的。若没有独立的结构，我们需要了解 2^5 种可能性。用这个结构，我们只需要 1+1+4+2+2 = 10 种可能性。
• 应用：需要领域知识辨识出图像的结构！！
• 生物：一个给定细胞中基因是如何互相调控的广告：选择放在网络上的广告游戏：给玩家评分，基于类似的技能匹配玩家

• 推理问题即在没有构建出完整概率表的情况下，如何计算一个特定的概率
• 在很多案例中，我们可以做到这点，且避免成指数放大
• 环路信念传播：
• 图（graph）包括了循环。我们假设图没有循环，仅仅是不停的往复传播概率，直到收敛。但它有可能得出一个错误的答案，或根本就不收敛。
• 马尔可夫链蒙特卡尔理论
• 设计一个收敛到贝叶斯网络分布的马尔可夫链。需要经过一系列步骤。使用一个建议分布 Q (通常是易处理的）逼近于复杂的真实（通常很棘手而且是高维的）数据分布。优势：一个好用的马尔可夫链会收敛到 s 稳态分布。劣势：很难收敛，并且会导致坏结果。

p(hypo|data) * p(data) = p(data|hypo) * p(hypo)

p(hypo|data) = p(data|hypo) * p(hypo)

• 频率学派：最大似然估计（MLE）：在进行推论时，我们只关心似然度，并选择给出所最大化 p(data|hypo) 的假设作为预测。
• 贝叶斯学派：最大后验概率（MLE）：我们也需要把先验 p(hypo) 纳入计算，不仅是似然度，还要选择给出最大 p(data|hypo) * p(hypo) 的假设作为预测。
• 如果我们认为所有假设都服从均匀分布，那么MAP = MLE 。
• 计算 MAP 需要先计算 p(data)。然而，实际上，p(data) 由高维度特征构成，因此， p(data) 很难精确计算。我们只能用数值法粗略估算它的下确界或上确界。除了计算之外，MAP经常引发数据分布的偏差，即 MAP 容易过拟合。适当的选择适合给定问题的方式永远是很重要的。
• MLE 的劣势：如果到目前时间还没有发生（可能性＝0），那么根据 MLE 它将来也永远不会出现。

• 马尔可夫网络是一组有着相关权重的特征，其定义了一个概率分布
• 它确实是一个无向图模型 • 应用：图像分割（把每个像素看作一个结点）

1. 有两种统计学家，一种是频率学派，认为频率就是概率。另一种是贝叶斯学派的，用新数据更新先验概率，以得出后验概率。
2. 贝叶斯网络是有向图模型，背后的定律就是贝叶斯理论。
3. 贝叶斯网络推论可以使精确或近似值。
4. MLE 和 MAP 是完全不同的估计方法。
5. 马尔可夫随机场是一种无向图模型。

1. 什么是贝叶斯定理？
2. P(A|B) = P(A) P(B|A) / P(B)“贝叶斯”定理只是一个说明了你在看到新证据是会更新相信程度的简单规律：如果证据和假设一致，假设的可能性就会提高，反之则降低
3. 什么是朴素假设，它在朴素贝叶斯分类器中扮演了什么角色？
4. “朴素假设：所有特征都是条件独立于给定类别标签的。”角色：它是朴素贝叶斯分类器的基础
5. 隐马尔可夫模型（HMM） 和马尔可夫模型之间的差别在哪里？
6. HMM 是马尔可夫模型中的一类，有着未被观察的（或部分被观察的）系统状态
7. 为何领域知识对与图像模型的构建和推论很重要？
8. “应用：需要领域知识辨识图的结构！”

## 第七章总结

［章节小结］

［重要的部分］

• 最近邻：面部识别和划定国界中的应用
• K-最近邻（KNN）：基础推荐系统中的应用
• 解决升维问题
• 支持向量机（SVM）
• 类比算法的主要子问题

［关键性概念］

• 历史
• 1951年两个伯克利统计学家写的一篇晦涩的技术报告，是发展的开端，后来出现了最近邻算法。到了 2000年代，支持向量机被开发了出来。
• “最近邻”的应用：基于数据点的简单分类：数据集越大越好
• Facebook 面部识别：最近邻有能力在整个Facebook数据库的标记照片中找到最相似的图片，在它 “最近邻” 的基础上。如果数据库中的图像与 Jane 刚刚上传的最相近，那么这正面孔就属于 Jane。Positiville vs. Negapolis：最近邻能通过计算两个国家首都的中心点找到两国边界。为获取更高的精确度，可以通过更多的附近城市重新定义位置。边界是以数据点位置和距离测量的形式揭示的，唯一的成本是查询时间。
• “ k- 最近邻 ” 应用：分类测试案例的方法是寻找它的多个 k 最近邻，并让它们 ”投票选举” 哪一个比单一最近邻更稳健。
• 贝叶斯推荐系统：现在意见一致的人很可能将来也会保持一致。有类似评分习惯的用户能被用作小组中其他人的参考。窍门：一个让最近邻法更有效的简单方式，删除所有由邻正确分类的样本案例，因为只有 ”边界线” 数据点是必须的。
• 怀疑论：最近邻能知道概念间的真实边界吗？
• 1967 年 Tom Cover 和 Peter Har 证明，只要有足够的数据，最近邻最差也只比最佳分类器的错误率高两倍。
• 维度问题
• 低维度上，最近邻很起作用，但高维度却不行。最近邻通常会对更高维度的非相关属性感到困惑。而且在更高的维度上，边界会崩溃维度问题是机器学习中第二糟糕的问题，排在过拟合之后。解决方案：对于最近邻，我们可以抛弃所有信息增益低于一定阈值的属性，只测量退化空间中的相似性。或者我们可以把属性选择”包围”在学习者四周，用一个保留删除属性的登山式搜索，只要它不会破坏最近邻在留存数据上的准确性。
• 支持向量机 (SVM)
• 历史：由 Vladimir Vapnik 在贝尔实验室所发明看起来像是加权 k－最近邻：这两个类别的边界是由一组有权重的案例集定义的，同时还有一个相似性度量。它被称为支持向量，因为他们是“支持”划分边界的向量。用 SVM 能得到一个非常光滑的边界。如何学习 SVM：我们选择支持向量和他们的权重。 SVM 找到适应于正面和反面地雷之间最快的那条蛇。SVM 在给定样本下，最大化其间的间隔，也就是在一定条件下最小化权重，由于精确值是任意的，这可能就是我们要找的那个。SVM vs. 感知机：SVM 作为多层感知机创造性的取得了很好的效果，这几年来被精巧地用于数字识别。 SVM 可以被看成是感知机的一般化，因为当你使用一种特别近似测量（向量点积）方法时得到的就是类别间的超平面边界。但SVM 相比多层感知机有一个主要的优势：权重有一个单一最佳条件，而不是很多的局部条件，所以更容易进行可靠地学习。一个 SVM 选择的支持向量越少，生成的效果就越好。一个 SVM 的期望误差最多也只是支持向量案例的一部分。维度增加时，这部分的大小也会增加。最终，SVM 还是免不了受到维度的限制，但它比绝大多数其他的方法都更好一些。直线：边沿如何弯曲并不重要，他们永远都是直线（或超平面）。
• 一个类推的两个主要子问题
• 1）搞清楚两个事情有多相似。2）确定从他们的相似性中推断出其他哪些东西。类推最简洁的技巧是跨问题领域学习。通过结构映射，类推可以做很多事情。结构映射有两个描述，寻求某些部分和关系间的连贯一致性，然后，基于这种一致性，从一个案例中转化出更深一层的特性。最后，根据 Douglas Hofstadter 的观点，没有类比推理不能解决的问题。
• 符号学家和类推学家之间的争论
• 在认知科学界，符号学家和类推学家之间之间一直存在着长久的争论。符号学家会指出一些他们能建模但类推学家无法做到的事情；而类推学家会想出解决方法，再找到一些他们能建模但单符号学家不能的事情，然后不断循环往复。Domingos 设计出了一个叫做 RISE 的算法，向着主算法（Master Algorithm）的形成迈进了一步，它能结合符号化和类比化推理。

### 【Quiz】

1. k-最近邻是如何基于最近邻演化的？
2. 为什么维度的提高会给最近邻算法带来问题？
3. 解释一下 SVM 的机制。

## Chapter #6 Review

### Bayes' theorem

The theorem that runs the world — Bayes' theorem Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.

### History of Bayes' theorem:

• Tomas Bayes: a preacher who first describes a new way to think about chance.
• Pierre-Simon de Laplace: a French who first codifies Bayes' insights to Bayes' theorem

Humans, it turns out, are not very good at Bayesian inference, at least when verbal reasoning is involved. A problem is that we tend to neglect the cause’s prior probability.

HIV example: If you get a positive test result for HIV, and the test only gives 1% false positives. At first sight, it seems like your chance of having AIDS is now 99%. Let's apply Bayes' theorem here, p(HIV|positive) = p(HIV) × p(positive|HIV) / p(positive) = 0.003 × 0.99 / 0.01 = 0.297 (here we assume p(HIV) — the possibility of get HIV in the general population — is 0.003 in US; p(positive)

is the probability that the test comes out positive whether or not you have AIDS, let's say 0.01). So given a positive test result, the probability of getting HIV is only 0.297.

Frequentists: the probability is a frequency. They estimate probability by counting how often the corresponding events occur

Bayesians: the probability is a subjective degree of belief. They state that you should update your prior beliefs with new evidence to obtain your posterior beliefs.

### Naive Bayes Classifier (All models are wrong, but some are useful.) — George Box . (BoxCox) • The Naive Bayes classifier can be expressed as a cause → effects graphical model, like the figure shown above.
• Naive Assumption: all features are conditionally independent given the class label. For example, p(X1|Y) is independent with p(X2|Y).

P(X1, X2|Y) = p(X1|Y) * p(X2|Y) by independence

• Runtime complexity = O(CD). C=the number of classes, D=the number of features
• An oversimplified model that you have enough data to estimate is better than a perfect one with insufficient data.
• Advantages: fast; immune to overfitting; this simple model performs well empirically, even though the naive assumption is not realistic.
• Pairwise connection

### Markov Chain & HMM (From Eugene Onegin to Siri)

• Markov assumed (wrongly but usefully) that the probability of an event is the same at every position in the text. • Hidden Markov model: assume a Markov process within the hidden state.
• Speech Recognition(Siri)hidden states: written wordsobservations: sounds spoken to Sirithe goal is to infer the words from the sounds • Other application: computational biology, part-of-speech tagging

### Bayesian Network and its applications (Everything is connected, but not directly)

• Bayesian Network(Judea Pearl) is a very complex network of dependencies among random variables, provided that each variable only directly depends on a few others.
• Naive Bayes, Markov Chain, and Hidden Markov Model are special cases of Bayesian Networks
• Example: Burglar alarms.
• The alarm in your house can be triggered by a burglar attempting to break in, or an earthquake.Neighbor Bob or Neighbor Claire would call you if the alarm sounds. • Bob calling depends on Burglary and Earthquake, but only through Alarm. Bob’s call is conditionally independent of the burglary and an earthquake given Alarm. When Bob observes that the alarm is sound, the event Bob calls is conditionally independent with the burglary and Earthquake.Without independent structure, we need to learn 2^5 probabilities. With this structure, we only need 1+1+4+2+2 = 10 probabilities. (Avoid the curse of dimensionality)
• Applications: Need domain knowledge to identify the structure of the graph!
• Biology: how genes regulate each other in a given cell.Ad: choosing ads to place on the web.Game: rate players and match players based on similar skills.

### Inference (The inference problem)

• The inference problem is how to compute a specific probability without building a full table of probability.
• In many cases, we can do this and avoid the exponential blowup
• Loopy belief propagation:
• The graph contains loops.we pretend that the graph has no loops and just keep propagating probabilities back and forth until they converge.But it can converge to a wrong answer, or not converge at all.
• Markov Chain Monte Carlo:
• Design a Markov chain that converges to the distribution of our Bayesian network. Need to take a sequence of steps.Use a proposal distribution Q (often tractable) to approximate the complex (often intractable and high-dimension) real data distribution P.Advantage: A well-behaved Markov chain converges to s stable distribution.Disadvantage: It is hard to converge and can result in a bad result; burn-in phase.

### Maximum A Posteriori & Maximum Likelihood Estimation ( Learning the Bayesian way)

p(hypo|data) * p(data) = p(data|hypo) * p(hypo)

We can ignore p(data) because it is the same for all hypotheses.

p(hypo|data) = p(data|hypo) * p(hypo)

prior: p(hypo)
likelihood: p(data|hypo)
posterior: p(data|hypo) * p(hypo)
• Frequentist: Maximum Likelihood Estimation (MLE): when doing inference, we only care about the likelihood, and choose the hypothesis that gives the maximum of p(data|hypo) as our prediction.
• Bayesian: Maximum A Posteriori (MAP): we also need to take the prior p(hypo) into account, not just likelihood, and choose the hypothesis that given the maximum p(data|hypo) * p(hypo) as our prediction.
• If we assume that all hypotheses are uniformly distributed: MAP = MLE
• Calculating MAP needs to compute p(data) first. However, practically, p(data) consists of high dimension features, hence, p(data) is hard to compute explicitly. Often, we only could use numeric methods to poorly estimate its lower or higher boundary. Besides the computational consideration, MAP often incurs the bias of data distribution, i.e. MAP tends to overfit. It is always important to properly choosing the methods that suit the given problem.
• Disadvantage of MLE: if the event never occurs so far (likelihood = 0), it will also never show up in the future based on MLE.

### Markov network / Markov random field (Markov weights the evidence)

• A Markov network is a set of features with corresponding weights, which all together define a probability distribution. (p.171)
• It is indeed an undirected graphical model

[Image: https://quip.com/-/blob/KbdAAAsU1Hh/y9n31veW589ysNSGgQGPEQ]

• Application: image segmentation (view every pixel as a node)

## Reflection & Highlights:

1. There are two kinds of statisticians. One is frequentist, who regards the frequency as probability. The other is Bayesian, who updates the prior probability with new data to get the posterior probability.
2. Bayesian Networks are directed graphical model, and the underlying rule is Bayes' theorem.
3. Bayesian network inference could be exact or approximate.
4. MLE and MAP are completely different estimation methods.
5. Markov random field is an undirected graphical model.

## Week 6 Q & A Collection

1. What is Bayes' theorem?
2. P(A|B) = P(A) P(B|A) / P(B)“Bayes’ theorem is just a simple rule for updating your degree of belief in a hypothesis when you receive new evidence: if the evidence is consistent with the hypothesis, the probability of the hypothesis goes up; if not, it goes down.”
3. What is Naive assumption, and what role does it play in Naive Bayes classifier?
4. “Naive Assumption: all features are conditionally independent given the class label.” Role: It is the foundation of Naive Bayes classifier
5. What is the difference between HMM and Markov model?
6. HMM is one type of Markov models with unobserved (or partially observable) system state
7. Why is domain knowledge important in building and inferencing graphical models?
8. “Applications: Need domain knowledge to identify the structure of the graph!”

## Chapter #7 Preview

### 【Chapter Summary】

This chapter describes the “analogizers” tribe of machine learning, and pinpoints the algorithm methods that identify similarities. The author begins with a fun example of a culprit who committed crimes using great impersonation skills. On the same train of logic, algorithms work well based on some simple comparisons. Analogy was the spark that ignited many historical advancements in science. Analogical reasoning inspired many intellectual developments in machine learning. This branch of learning is the least cohesive out of the five tribes. This chapter introduces 1) nearest neighbor and k-nearest neighbor; 2) SVM; 3) general analogical methods.

### 【Important Sections】

• Nearest-Neighbour: application in facial recognition and drawing nation borders
• K-nearest Neighbour (KNN): application in basic recommendation system
• Solving the problem of rising dimensions
• Support Vector Machine (SVM)
• The main subproblems of analogical algorithms

### 【Key Concepts】

• History
• The development started slowly with an obscure technical report written in 1951 by two Berkeley statisticians, coning what was later called the nearest neighbor algorithm. By 2000s, support vector machine is developed.
• Application of “Nearest-neighbor”: simple classification based on data points: the bigger the database, the better.
• Facebook facial recognition: nearest neighbor has the ability to find the picture most similar to it in Facebook's entire database of labeled photo based on its “nearest neighbors”. If the image in the database is most similar to the one Jane just uploaded, then the face is Jane's. Positiville vs. Negapolis: nearest neighbor can find borders between two countries by calculating the midpoint of their capitals. To achieve higher exactitude, the border can be refined by the location of more nearby cities. The boarder is implicit in terms of the locations of the data points and the distance measurements, and the only cost is query time.
• Application of “k-nearest-neighbor”: test example is classified by finding its k nearest neighbors and letting them “vote” which is more robust than a single nearest neighbor.
• Basic recommendation system: people who are in agreement right now are likely to keep the agreement in the future. Users with similar rating habits can be used as the reference to others in the group.A trick: A simple way to make nearest-neighbor more efficient is to delete all the examples that are correctly classified by their neighbors because only “borderline” data points are necessary.
• Skepticism: can nearest neighbor learn the true borders between concepts?
• 1967 Tom Cover and Peter Hart proved that given enough data, nearest-neighbor is at worst only twice as error-prone as the best imaginable classifier.
• Problem of Dimensionality
• In low dimension, nearest-neighbor works well, but not otherwise. Nearest-neighbor is often confused by irrelevant attributes of higher dimensions. Also in a higher dimension, borders can break down. The problem of dimensionality is the second worst problem in machine learning, after overfitting. Solution: for nearest-neighbor, we can discard all the attributes whose information gain is below some threshold and then only measure similarity in the reduced space. Or we can “wrap” the attribute selection around the learner itself, with a hill-climbing searching that keeps deleting attributes as long as it doesn't hurt the nearest-neighbor's accuracy on held-out data.
• Support Vector Machine (SVM)
• History: invented by Vladimir Vapnik at Bell Labs. Looks like weighted k-nearest-neighbor: the frontier between two classes is defined by a set of examples with weights, together with a similarity measure. It is called support vectors because they are vectors “support” the demarcation border. With SVM, we can have a very smooth border.How to Learn SVM: we choose the support vectors and their weights. The SVM finds the fattest snake that fits between the positive and negative landmines. The SVM minimizes the weights under the constraint that all examples have a given margin, which could be one since the precise value is arbitrary. SVM vs. Perceptron: SVMs did well out of the box as multilayer perceptrons that had been carefully crafted for digital recognition over the years. SVMs can be seen as a generalization of the perceptron, because a hyperplane boundary between classes is what you get when you use a particularly similar measure (the dot product between vectors). But SVMs have a major advantage compared to multilayer perceptrons: the weights have a single optimum instead of many local ones and so learning them reliably is much easier. The fewer support vectors an SVM selects, the better it generalizes. The expected error rate of an SVM is at most the fraction of examples that are support vectors. As the number of dimensions goes up, the fraction goes up too. In the end, SVM is not immune to the curse of dimensionality, it is more resistant than most others. Straight Lines: it doesn't matter how curvy the frontiers are, they are always straight lines (or hyperplanes).
• Two Main Subproblems of Analogy:
• 1) Figuring out how similar two things are 2) Deciding what else to infer from their similaritiesAnalogizer's neatest trick is learning across problem domains. Analogizes can do a lot of things using structure mapping. Structure mapping takes two descriptions, find a coherent correspondence between some of the parts and relationships, and then, based on correspondence, transfer further properties from one. Finally, according to Douglas Hofstadter there is nothing that analogical reasoning can't solve.
• Debate Between Symbolists and Analogizer
• Cognitive science has seen a long-running debate between symbolists and analogizers. Symbolists point to something they can model that analogizers can't; then analogizers figure out how to do it, come up with something they can model that symbolists can't, and the cycle repeats. Domingos designed an algorithm called RISE that was a step toward the Master Algorithm because it combined symbolic and analogical reasoning.

### 【Quiz】

1. How does k-nearest neighbor improve based on nearest-neighbor?
2. Why would rising dimensionality create problems for nearest-neighbor algorithms?
3. Explain the mechanism of SVM. 