# Bayesian Learning

Most of the information presented here has been taken from Chapter 6 of the book Machine Learning by Tom M. Mitchell. Also A Tutorial on Learning with Bayesian Networks was consulted.

## Contents

## Introduction

Bayesian reasoning provides a probabilistic approach to inference. Bayesian learning methods are based on the assumption that inferences can be made by reasoning about the probability distribution of the hypotheses together with the observed data. Bayes theorem lies at the heart of Bayesian learning approach.

### Bayes Theorem

Bayes theorem expresses the conditional probability (also called posterior probability) of a hypothesis h given data D in terms of the prior probability of h, prior probability of D, and the conditional probability of D given h.

Bayes Theorem: <math> P(h|D) = {P(D|h)P(h) \over P(D)} </math>

As implied by the formula,<math>P(h|D)</math> increases with <math>P(h)</math> and <math>P(D|h)</math> while decreases with <math>P(D)</math>.

Given a set of candidate hypotheses H, the maximally probable hypothesis h<math>\in</math>H given data set D, is called *maximum a posteriori* (MAP) hypothesis. We calculate the MAP hypothesis using Bayes theorem to obtain the posterior probability of each candidate hypothesis and find the maximal one.

<math> {h_{MAP}} \equiv {argmax_{h \in H} P(h|D)}={argmax_{h \in H} {P(D|h)P(h) \over P(D)}}= {argmax_{h \in H} {P(D|h)P(h)}}</math> ( P(D) being independent of h is dropped)

If every hypothesis in H is considered to have equal aprior probability then we can drop the P(h) term from the above formula. P(D|h) is often called the likelihood of the data D given h, and the hypothesis that maximizes P(D|h) is called maximum likelihood (ML) hypothesis,<math>h_{ML}</math>.

<math>h_{ML} \equiv {argmax_{h \in H} {P(D|h)}} </math>

### Example

Suppose there are two hypotheses 1) a patient has cancer and 2) does not have cancer. Let P(cancer) =0.008 and <math>P(\neg cancer)</math>=.992. Let the data of a cancer test have two outcomes <math>\oplus</math>(positive) and <math>\circleddash</math> (negative).Let <math>P(\oplus|cancer)</math> = 0.98, <math>P(\circleddash|cancer)</math> = 0.02, <math> P(\oplus|\neg cancer)</math> =0.03 ,<math>P(\circleddash|\neg cancer)</math> =0.97.

Suppose a patient's test result came positive. What should be the diagnosis (hypothesis)?

Using Bayes rule we calculate <math>h_{MAP}</math> as follows:

<math>P(cancer)P(\oplus|cancer)</math> = .98*.008 = 0.0078

<math>P(\neg cancer)P(\oplus|\neg cancer)</math> = 0.03*.992 = 0.0298

Thus <math>h_{MAP}</math> = <math>\neg cancer</math>. So the patient will be diagnosed with not having cancer.

## Brute-force MAP learning algorithm

Using Bayes theorem we can come up with a straight forward brute force learning algorithm, that calculates the probability for each possible hypothesis, then outputs the most probable one.

**BRUTE-FORCE MAP LEARNING ALGORITHM:**

1. For each hypothesis h in H, calculate the posterior probability

<math>P(h|D)= { {P(D|h)P(h)}\over P(D)}</math>

2. Output the hypothesis <math>h_{MAP}</math> with the highest posterior probability

<math>h_{MAP}= { argmax_{h \in H} P(h|D)}</math>

One major problem of this algorithm is that, in a large hypothesis space, the amount of computation to be done, blows up.

## EM Algorithm (Li He)

EM (expectation-maximization) algorithm is an algorithm for calculating maximum likelihood estimation for model parameters when laten variables are present in the model. The EM algorithm was explained and given its name in a classic 1977 paper by Arthur Dempster, Nan Laird, and Donald Rubin.

Given a statistical model consisting of a set <math>\mathbf{X}</math> of observed data, a set of unobserved latent data or missing values <math>\mathbf{Z}</math>, and a vector of unknown parameters <math>\boldsymbol\theta</math>, along with a likelihood function <math>L(\boldsymbol\theta; \mathbf{X}, \mathbf{Z}) = p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta)</math>, the maximum likelihood estimate (MLE) of the unknown parameters is determined by the marginal likelihood of the observed data

<math> L(\boldsymbol\theta; \mathbf{X}) = p(\mathbf{X}|\boldsymbol\theta) = \sum_{\mathbf{Z}} p(\mathbf{X}, \mathbf{Z}|\boldsymbol\theta) </math>

However, this quantity is often intractable.

The EM algorithm seeks to find the MLE of the marginal likelihood by iteratively applying the following two steps:

- { Expectation step (E-step)}: Calculate the expected value of the log likelihood function, with respect to the conditional distribution of <math>\mathbf{Z}</math> given <math>\mathbf{X}</math> under the current estimate of the parameters <math>\boldsymbol\theta^{(t)}</math>:

<math> Q(\boldsymbol\theta|\boldsymbol\theta^{(t)}) = \operatorname{E}_{\mathbf{Z}|\mathbf{X},\boldsymbol\theta^{(t)}}\left[ \log L (\boldsymbol\theta;\mathbf{X},\mathbf{Z}) \right] </math>

- { Maximization step (M-step)}: Find the parameter that maximizes this quantity:

<math> \boldsymbol{\theta}^{(t+1)} = \operatorname{argmax}_{\boldsymbol{\theta}} Q(\boldsymbol{\theta}|\boldsymbol{\theta^{(t)}}) </math>

A typical application of EM algorithm is to estimate parameters for Gaussian mixture models, where each data point is generated by one of several Gaussian distributions but which Gaussian distribution generates which data point is unknown. Note that in typical models to which EM is applied:

- The observed data points <math>\mathbf{X}</math> may be discrete (taking values in a finite or countably infinite set) or continuous (taking values in an uncountably infinite set). There may in fact be a vector of observations associated with each data point.
- The missing values (aka latent variables) <math>\mathbf{Z}</math> are discrete, drawn from a fixed number of values, and there is one latent variable per observed data point.
- The parameters are continuous, and are of two kinds: Parameters that are associated with all data points, and parameters associated with a particular value of a latent variable.

## Bayesian Classifiers

Instead of coming up with the most probable hypothesis to the training data and then use it to classify a new instance, Bayesian techniques can be used to directly come up with a classification of the new instance. The following discusses the different Bayesian classifiers available.

### Bayes Optimal Classifier

Given several hypotheses in the hypothesis space H, the most probable classification of the new instance can be obtained by combining the prediction of the all the hypotheses weighted by their posterior probabilities. If the classification of a new instance can take a value <math>v_j </math> from any set V, then <math>P(v_j|D) = \sum_{h \in H} P(v_j|h_i)P(h_i|D) </math>.

Bayes optimal classifier states that the optimal classification is the value <math>v_j </math> for which <math>P(v_j|D) </math> is max.

**Bayes optimal classification:** <math>argmax_{v_j \in V}\sum_{h \in H} P(v_j|h_i)P(h_i|D) </math>

#### Example

Let there be 3 hypotheses in the hypothesis space H, <math>h_1</math> , <math>h_2</math> and <math>h_3</math>. Their posterior probabilities are 0.4, 0.3 and 0.3 respectively.Let the set of possible classifications be V = {+,-}. A new instance x is classified by <math>h_1</math> as positive, and negative by the other two.

Therefore

<math>\sum_{h \in H} P(+|h_i)P(h_i|D) = 0.4</math>

<math>\sum_{h \in H} P(-|h_i)P(h_i|D) = 0.6</math>

and

<math>argmax_{v_j} \sum_{h \in H} P(v_j|h_i)P(h_i|D) = - </math>

This classifier obtains the best performance, given any training data. However as it calculates for every hypothesis in H, its computation cost is very high.

### Markov Chain Monte Carlo Methods(Alvaro Pinto)

When is not possible to calculated the conjugated a priori distribution and the posterior probability is complex, it is possible to use a computer to generated the posterior distribution. The are a variety of methods to make this simulation, which are known as the common name of Markov chain Monte Carlo methods. Here the Gibbs sampling is presented which is specially suitable to estimated the distribution in classification methods.

### Gibbs Algorithm

Gibbs algorithm is a less optimal method which randomly selects a hypothesis(say h) from H according to the posterior probability distribution of hypotheses and uses h to predict the classification of a new instance.Under certain conditions the misclassification error of Gibbs algorithm is at worst twice that of Bayes optimal classifier, which is pretty decent.

#### (Alvaro Pinto)

The Gibbs sampling is appropriate if we want to obtain the join distribution when is easy to sample the conditional distributions.

Lets assume that we want to to sample the join distribution of two random variables variables <math>f(x,y)</math>, and assume that the consitional probabilities are <math>f(x \vert y) </math> and <math>f(y \vert x)</math>.

The method is implemented as follow:

- We have to arbitrary fix a initial value <math>y^{(0)}</math> and obtain a random value for <math> x </math> with this distribution <math>f(x \vert y^{(0)}) </math>. Where <math> x^{(0)}</math> is this value.
- Then, we have to obtain a random value for y with the distribution <math>f(y \vert x^{(0)}) </math>. Where <math> y^{(1)}</math> in this value.
- Then, we have to come back to 1 with <math> y^{(1)}</math> instead of <math> y^{(0)}</math>, and alternated between 1 and 2 to obtained pairs of values <math> ( x^{i},y^{i} )</math>, for <math>i= 1,.....,N</math>

We have to demonstrated that for a sufficient number of <math> N </math> which is big enough, the pair <math> ( x^{N},y^{N} )</math> is a random value with the join distribution <math>f(x,y)</math>

An important problem in to investigated the converged os the sequence. We can demonstrated that under certain general circumstances the algorithm converge, but this converge might required an enormous number of iterations for some problems.

### Naive Bayes Classifier

Naive Bayes classifier is applied to learning tasks where each instance x is described as a conjunction of attribute values <math><a_1,a_2,....a_n></math> and the target function f(x) can take any value from a set V. The idea is to calculate the most probable value, <math>v_{MAP}</math> for x from V.

<math>v_{MAP} = argmax_{v_j \in V}P(v_j|a_1,a_2,....a_n)</math>

Using Bayes theorem: <math>v_{MAP} = argmax_{v_j \in V}{P(a_1,a_2,....a_n|v_j)P(v_j) \over P(a_1,a_2,....a_n)} = argmax_{v_j \in V}{P(a_1,a_2,....a_n|v_j)P(v_j)}</math>

Naive Bayes classifier assumes that the attribute values are conditional independent given the target value. That is <math>P(a_1,a_2,....a_n|v_j) = \prod_i P(a_i|v_j)</math>
**Naive Bayes Classification:**

<math>v_{NB} = argmax_{v_j \in V}P(v_j)\prod_i P(a_i|v_j)</math> where <math>v_{NB}</math> denotes the target value outputted by the classifier.

#### Example

Let an instance x be described by two attributes A and B. Let the possible values of A be {<math>{a_1, a_2}</math>} and that of B be {<math>{b_1, b_2}</math>}. Let the set of classifications be V = {+,-}. Let P(+) = 0.6 and P(-) = 0.4. Let <math>P(a_1|+) = 0.3,P(a_1|-) = 0.7,P(a_2|+) = 0.5,P(a_2|-) = 0.5,P(b_1|+) = 0.2,P(b_1|-) = 0.8,P(b_2|+) = 0.1,P(b_2|-) = 0.9</math>. So the classification of a new instance <math>x = <a_1,b_1></math> will be given by

<math>P(+)P(a_1|+)P(b_1|+) = 0.6*0.3*0.2 = 0.036</math>

<math>P(-)P(a_1|-)P(b_1|-) = 0.4*0.7*0.0.8 =0.224 </math>

So the classification of x is -.

A drawback of the Naive Bayes approach is that if any of the individual probabilities in the numerator amount to zero then that term dominates the Bayes classifier. To avoid this difficulty a Bayesian approach using m-estimate of probability is adopted.

**m-estimate of probability:** <math>{n_c + mp} \over {n + m}</math>

where <math>n_c</math> is the number of occurences of a particular attribute value, n is the number of occurences of the specified target value, p is the prior estimate of the probability of the attribute value and m is a constant called *equivalent sample size* which determines how heavily to weigh p relative to the observed data. Typically in absence of prior information p is assumed to be uniform distribution. If m=0, the m-estimate is equivalent to the simple fraction <math>n_c \over n</math> else the fraction will be combined with p according to weight m.

Naive Bayes approach is different from the above mentioned classifiers in the fact that it does not search the hypotheses space. It simply counts the frequency of various data combinations within the training set to classify instances.

## Bayesian Networks

Bayesian networks describe the joint probability distribution of a set of variables(typically the attributes) by specifying a set of conditional independence assumptions as well as a set of conditional probabilities. In a Bayesian network each variable is represented as a node and edges denote that a variable is conditionally independent of its non-descendants given its immediate predecessors. If there is directed path from variable Y to X, then X is a descendant of Y. For each variable there is a conditional probability table describing its probability distribution given its parents.

The joint probability for any assignment of values <math><y_1,y_2......y_n></math> to the tuple of network variables <math><Y_1,Y_2.......Y_n></math> is given by:

<math>P(y_1.....,y_n) = \prod_{i=1}^n P(y_i|Parents(Y_i))</math>

where <math>Parents(Y_i)</math> denotes the set of immediate predecessors of <math>Y_i</math> in the network.

The above figure illustrates a Bayesian network. Each node is a variable. The arcs denote the relationship between the variables. For example variable Campfire is conditionally independent of its non-descendants Lightning and Thunder. The table given is the conditional probability table for Campfire given its parents Storm and BusTourGroup.

### Inference

The computation of a probability of interest given a model is known as probabilistic inference.Bayesian networks can be used for probabilistic inference. If values for all of the other variables in the network are known then the inference procedure is pretty simple. However it might be the case that the values of all variables might not be available. A brute force method of trying all possible combination might be tried but that blows up the computation time exponentially. In general the inference problem is NP-Hard.

Research has been done to develop probabilistic inference algorithms for Bayesian networks with discrete variables.For example, one algorithm reverses arcs in the network structure until the answer to the given probabilistic query can be read directly from the graph[Howard and Matheson(1981), Olmsted(1983), and Shachter(1988)]. In this algorithm, each arc reversal corresponds to an application of Bayes theorem. Another algorithm is a message-passing scheme that updates the probability distributions for each node in a Bayesian network in response to observations of one or more variables[Pearl '86]. Another algorithm first transforms the Bayesian network into a tree where each node in the tree corresponds to a subset of variables. The algorithm then exploits several mathematical properties of this tree to perform probabilistic inference[Lauritzen and Spiegelhalter(1988), Jensen et al.(1990), and Dawid(1992)]. Exact inference methods are known to work for some networks. Monte Carlo methods provide approximate solution by simulating the network randomly.

### Learning Bayesian Networks

Learning Bayesian Networks from the training data is an active area of research. There can be several variants of the problem. If both the structure and all the values of the variables are available then learning reduces to simply counting and calculating probability. However if the structure and only part of the variables are known then the problem becomes analogous to learning in an artificial neural network. The available training techniques like Gradient Ascent can be applied. This process searches through a hypotheses space that corresponds to the set of all possible tables of the probability tables. The gradient ascent rule maximizes <math>P(D|h)</math> by following the gradient of lnP(D|h) with respect to parameters that define the conditional probability tables of the network. The search in the hypothesis space corresponds to searching for the maximum likelihood hypothesis for the table entries. Other approaches involve Expectation Maximization etc.

If the structure of the network is not known then various search approaches are used, like a Bayesian scoring metric to chose among alternative network, a heuristic search(greedy) algorithm called <math>K_2</math> for learning the structure when data is fully observable etc. Constrained based approaches have also been developed which infer independence and dependence relationships from the data,to construct the network.

### Applications of Bayesian Networks

(Adam Voshall) Bayesian networks have been applied broadly across many fields, ranging from testing software reliability and hardware fault tolerance to intelligent tutoring software for academics [Neapolitan(2004)]. One specific implementation of Bayesian networks for medical diagnosis is NasoNet, which is specifically designed to help diagnose nasopharyngeal cancer (cancer of the nasal passages) and determine the stage of the tumor in order to help physicians prescribe the most appropriate treatment [Galan, et al.(2002)]. This implementation uses the symptoms that the patient presents with along with the results of the examination and diagnostic tests to build the network. The network itself is modeled using the temporal noisy OR gate model, in which each attribute is assigned a binary value that can change with time. This approach allows the network to model the evolution of the cancer, increasing it's accuracy compared to non-temporal approaches. This modeling is especially useful in determining the stage of the tumor and how it is most likely to spread, though, given the complexity and uncertainty of cancer metastasis, still remains challenging. To diagnose or predict events, the algorithm had the best results when comparing the events with the highest posterior probabilities. Overall, the NasoNet algorithm does extremely well in diagnosing or predicting events, giving the correct diagnosis or prediction roughly 90% of the time.

## Conclusion

Bayesian learning methods are important to machine learning for two reasons. Firstly they provide the basis for probabilistic learning methods using prior probabilities of different hypotheses and probability of observing data given a hypothesis. Secondly they can be used to study and analyze other learning algorithms that do not explicitly handle probabilities for hypotheses. They help to understand and characterize the operations of many algorithms in Machine Learning. Bayesian learning methods are an active area of research and have been used widely for solving a wide range of problems

## References

[1] Mitchell, Tom. Machine Learning. Boston, MA: WCB/McGraw-Hill, 1997. 154-197. Print.

[2] A Tutorial on Learning with Bayesian Networks

[3] http://en.wikipedia.org/wiki/Bayesian_inference

[4] Expectation-maximization algorithm, http://en.wikipedia.org/wiki/EM_algorithm

[5] Neapolitan, R.E. Learning Bayesian Networks. Upper River Saddle, NJ: Prentice hall, 2004.639-645. Print.

[6] Galan, S.F, Aguado, F., Diez, F.J., and Mira, J. (2002) NasoNet, modeling the spread of nasopharyngeal cancer with networks of probabilistic events in discrete time. Artif Intell Med. 2002 Jul;25(3):247-64

[7] Daniel Pena , "Analisis de Datos multivariantes" McGrawn Hill, Spain 2002.