# Bayesian Networks

## Bayesian Networks

### A. Darwiche

#### 11.1 Introduction

A Bayesian network is a tool for modeling and reasoning with uncertain beliefs. A Bayesian network consists of two parts: a qualitative component in the form of a directed acyclic graph (DAG), and a quantitative component in the form conditional probabilities; see Fig. 11.1. Intuitively, the DAG of a Bayesian network explicates variables of interest (DAG nodes) and the direct influences among them (DAG edges). The conditional probabilities of a Bayesian network quantify the dependencies between variables and their parents in the DAG. Formally though, a Bayesian network is interpreted as specifying a unique probability distribution over its variables. Hence, the network can be viewed as a factored (compact) representation of an exponentiallysized probability distribution. The formal syntax and semantics of Bayesian networks will be discussed in Section 11.2.

The power of Bayesian networks as a representational tool stems both from this ability to represent large probability distributions compactly, and the availability of inference algorithms that can answer queries about these distributions without necessarily constructing them explicitly. Exact inference algorithms will be discussed in Section 11.3 and approximate inference algorithms will be discussed in Section 11.4.

Bayesian networks can be constructed in a variety of ways, depending on the application at hand and the available information. In particular, one can construct Bayesian networks using traditional knowledge engineering sessions with domain experts, by automatically synthesizing them from high level specifications, or by learning them from data. The construction of Bayesian networks will be discussed in Section 11.5.

There are two interpretations of a Bayesian network structure, a standard interpretation in terms of probabilistic independence and a stronger interpretation in terms of causality. According to the stronger interpretation, the Bayesian network specifies a family of probability distributions, each resulting from applying an intervention to the situation of interest. These causal Bayesian networks lead to additional types of queries, and require more specialized algorithms for computing them. Causal Bayesian networks will be discussed in Section 11.6.

A ΘA true 0.6 false 0.4 A B ΘB|A true true 0.2 true false 0.8 false true 0.75 false false 0.25 A C ΘC|A true true 0.8 true false 0.2 false true 0.1 false false 0.9 B C D ΘD|B,C true true true 0.95 true true false 0.05 true false true 0.9 true false false 0.1 false true true 0.8 false true false 0.2 false false true 0 false false false 1 C E ΘE|C true true 0.7 true false 0.3 false true 0 false false 1

Figure 11.1: A Bayesian network over five propositional variables. A table is associated with each node in the network, containing conditional probabilities of that node given its parents.

#### 11.2 Syntax and Semantics of Bayesian Networks

We will discuss the syntax and semantics of Bayesian networks in this section, starting with some notational conventions.

##### 11.2.1 Notational Conventions

We will denote variables by upper-case letters (A) and their values by lower-case letters (a). Sets of variables will be denoted by bold-face upper-case letters (**A**) and their instantiations by bold-face lower-case letters (**a**). For variable A and value a, we will often write a instead of A = a and, hence, Pr(a) instead of Pr(A = a) for the probability of A = a. For a variable A with values true and false, we may use A or a to denote A = true and ¬A or a to denote A = false. Therefore, Pr(A), Pr(A = true) and Pr(a) all represent the same probability in this case. Similarly, Pr(¬A), Pr(A = false) and Pr(a) all represent the same probability.

Table 11.1. A probability distribution Pr(.) and the result of conditioning it on evidence Alarm, Pr(.|Alarm)

##### 11.2.2 Probabilistic Beliefs

The semantics of Bayesian networks is given in terms of probability distributions and is founded on the notion of probabilistic independence. We review both of these notions in this section.

L et X_{1},...,X_{n }be a set of variables, where each variable X_{i }has a finite number of values x_{i}. Every instantiation x_{1},...,x_{n }of these variables will be called a *possible world*, denoted by ω, with the set of all possible worlds denoted by Ω. A *probability distribution *Pr over variables X_{1},...,X_{n }is a mapping from the set of worlds Ω induced by variables X_{1},...,X_{n }into the interval [0,1], such that 1; see Table 11.1. An *event *η is a set of worlds. A probability distribution Pr assigns a probability in [0,1] to each event η as follows: Pr.

E vents are typically denoted by *propositional sentences*, which are defined inductively as follows. A sentence is either primitive, having the form X = x, or complex, having the form ¬α, α ∨ β, α ∧ β, where α and β are sentences. A propositional sentence α denotes the event *Mods*(α), defined as follows: *Mods*(X = x) is the set

X

(β) β) event can be denoted by the sentence Burglary ∨ Earthquake

If some event β is observed and does not have a probability of 0 according to the current distribution Pr, the distribution is updated to a new distribution, denoted Pr(.|β), using *Bayes conditioning*:

Pr . (11.1)

Bayes conditioning follows from two commitments: worlds that contradict evidence β must have zero probabilities, and worlds that are consistent with β must maintain their relative probabilities.^{68}^{ }Table 11.1 depicts the result of conditioning the given distribution on evidence Alarm = true, which initially has a probability of 0.2442.

When evidence β is accommodated, the belief is some event α may remain the same. We say in this case that α is independent of β. More generally, event α is independent of event β given event γ iff

Pr(α|β ∧ γ) = Pr(α|γ) or Pr(β ∧ γ) = 0. (11.2)

We can also generalize the definition of independence to variables. In particular, we will say that variables **X **are independent of variables **Y **given variables **Z**, written

I(**X**,**Z**,**Y**), iff

Pr(**x**|**y**,**z**) = Pr(**x**|**z**) or Pr(**y**,**z**) = 0

for all instantiations **x**, **y**, **z **of variables **X**, **Y **and **Z**. Hence, the statement I(**X**,**Z**,**Y**) is a compact representation of an exponential number of independence statements of the form given in (11.2).

Probabilistic independence satisfies some interesting properties known as the graphoid axioms [130], which can be summarized as follows:

I(**X**,**Z**,**Y**) iff I(**Y**,**Z**,**X**)

I(**X**,**Z**,**Y**) & I(**X**,**ZW**,**Y**) iff I(**X**,**Z**,**YW**).

The first axiom is called Symmetry, and the second axiom is usually broken down into three axioms called decomposition, contraction and weak union; see [130] for details.

We will discuss the syntax and semantics of Bayesian networks next, showing the key role that independence plays in the representational power of these networks.

##### 11.2.3 Bayesian Networks

A *Bayesian network *over variables **X **is a pair (G,Θ), where

- G is a directed acyclic graph over variables
**X**; - Θ is a set of conditional probability tables (CPTs), one CPT Θ
_{X}|**U**for each variable X and its parents**U**in G. The CPT Θ_{X}|**U**maps each instantiation x**u**to a probability θ_{x}|**u**such that

We will refer to the probability θ_{x}|**u **as a *parameter *of the Bayesian network, and to the set of CPTs Θ as a *parametrization *of the DAG G.

A Bayesian network over variables **X **specifies a unique probability distributions over its variables, defined as follows [130]:

def θx_{|}**u**:x**u**∼**x**

Pr(**x**) ^{= }# θ_{x}|**u**, (11.3)

where ∼ represents the compatibility relationship among variable instantiations; hence, x**u **∼ **x **means that instantiations x**u **and **x **agree on the values of their common variables. In the Bayesian network of Fig. 11.1, Eq. (11.3) gives:

Pr(a,b,c,d,e) = θ_{e}|cθ_{d}|b,cθ_{c}|aθ_{b}|aθ_{a}, where a,b,c,d,e are values of variables A,B,C,D,E, respectively.

The distribution given by Eq. (11.3) follows from a particular interpretation of the structure and parameters of a Bayesian network (G,Θ). In particular:

*Parameters:*Each parameter θ_{x}|**u**is interpreted as the conditional probability of x given**u**, Pr(x|**u**).*Structure:*Each variable X is assumed to be independent of its nondescendants**Z**given its parents**U**: I(X,**U**,**Z**).^{69}

The above interpretation is satisfied by a unique probability distribution, the one given in Eq. (11.3).

##### 11.2.4 Structured Representations of CPTs

The size of a CPT Θ_{X}|**U **in a Bayesian network is exponential in the number of parents **U**. In general, if every variable can take up to d values, and has at most k parents, the size of any CPT is bounded by O(d^{k}+^{1}). Moreover, if we have n network variables, the total number of Bayesian network parameters is bounded by O(nd^{k}+^{1}). This number is usually quite reasonable as long as the number of parents per variable is relatively small. If number of parents **U **for variable X is large, the Bayesian network representation looses its main advantage as a compact representation of probability distributions, unless one employs a more structured representation for network parameters than CPTs.

The solutions to the problem of large CPTs fall in one of two categories. First, we may assume that the parents **U **interact with their child X according to a specific model, which allows us to specify the CPT Θ_{X}|**U **using a smaller number of parameters (than exponential in the number of parents **U**). One of the most popular examples of this approach is the *noisy-or model *of interaction and its generalizations [130, 77, 161, 51]. In its simplest form, this model assumes that variables have binary values true/false, that each parent U ∈ **U **being true is sufficient to make X true, except if some exception α_{U }materializes. By assuming that exceptions α_{U }are independent, one can induce the CPT Θ_{X}|**U **using only the probabilities of these exceptions. Hence, the CPT for X can be specified using a number of parameters which is linear in the number of parents **U**, instead of being exponential in the number of these parents.

The second approach for dealing with large CPTs is to appeal to nontabular representations of network parameters that exploit the *local structure *in network CPTs. In broad terms, local structure refers to the existence of nonsystematic redundancy in the probabilities appearing in a CPT. Local structure typically occurs in the form of *determinism*, where the CPT parameters take extreme values (0,1). Another form of local structure is *context-specific independence *(*CSI*) [15], where the distribution for X can sometimes be determined by only a subset of its parents **U**. Rules [136, 134] and decision trees (and graphs) [61, 80] are among the more common structured representations of CPTs.

##### 11.2.5 Reasoning about Independence

We have seen earlier how the structure of a Bayesian network is interpreted as declaring a number of independence statements. We have also seen how probabilistic independence satisfies the graphoid axioms. When applying these axioms to the independencies declared by a Bayesian network structure, one can derive new independencies.

In fact, any independence statement derived this way can be read off the Bayesian network structure using a graphical criterion known as *d-separation *[166, 35, 64]. In particular, we say that variables **X **are d-separated from variables **Y **by variables **Z **if every (undirected) path from a node in **X **to a node in **Y **is blocked by **Z**. A path is blocked by **Z **if it has a *sequential *or *divergent *node in **Z**, or if it has a *convergent *node that is not in **Z **nor any of its descendants are in **Z**. Whether a node Z ∈ **Z **is sequential, divergent, or convergent depends on the way it appears on the path: → Z → is sequential, ← Z → is divergent, and → Z ← is convergent. There are a number of important facts about the d-separation test. First, it can be implemented in polynomial time. Second, it is sound and complete with respect to the graphoid axioms. That is, **X **and **Y **are d-separated by **Z **in DAG G if and only if the graphoid axioms can be used to show that **X **and **Y **are independent given **Z**.

There are secondary structures that one can build from a Bayesian network which can also be used to derive independence statements that hold in the distribution induced by the network. In particular, the *moral graph *G_{m }of a Bayesian network is an undirected graph obtained by adding an undirected edge between any two nodes that share a common child in DAG G, and then dropping the directionality of edges. If variables **X **and **Y **are separated by variables **Z **in moral graph G_{m}, we also have that **X **and **Y **are independent given **Z **in any distribution induced by the corresponding Bayesian network.

Another secondary structure that can be used to derive independence statements for a Bayesian network is the jointree [109]. This is a tree of clusters, where each cluster is a set of variables in the Bayesian network, with two conditions. First, every family (a node and its parents) in the Bayesian network must appear in some cluster. Second, if a variable appears in two clusters, it must also appear in every cluster on the path between them; see Fig. 11.4. Given a jointree for a Bayesian network (G,Θ), any two clusters are independent given any cluster on the path connecting them [130]. One can usually build multiple jointrees for a given Bayesian network, each revealing different types of independence information. In general, the smaller the clusters of a jointree, the more independence information it reveals. Jointrees play an important role in exact inference algorithms as we shall discuss later.

##### 11.2.6 Dynamic Bayesian Networks

The *dynamic Bayesian network *(*DBN*) is a Bayesian network with a particular structure that deserves special attention [44, 119]. In particular, in a DBN, nodes are partitioned into *slices*, 0,1,...,t, corresponding to different time points. Each slice has the same set of nodes and the same set of inter-slice edges, except possibly for the first slice which may have different edges. Moreover, intra-slice edges can only cross from nodes in slice t to nodes in a following slice t + 1. Because of their recurrent structure, DBNs are usually specified using two slices only for t and t + 1; see Fig. 11.2.

By restricting the structure of a DBN further at each time slice, one obtains more specialized types of networks, some of which are common enough to be studied outside the framework of Bayesian networks. Fig. 11.3 depicts one such restriction, known as a *Hidden Markov Model *[160]. Here, variables S_{i }typically represent unobservable states of a dynamic system, and variables O_{i }represent observable sensors

Figure 11.2: Two Bayesian network structures for a digital circuit. The one on the right is a DBN, representing the state of the circuit at two times steps. Here, variables A,...,E represent the state of wires in the circuit, while variables X,Y,Z represent the health of corresponding gates.

Figure 11.3: A Bayesian network structure corresponding to a Hidden Markov Model.

that may provide information on the corresponding system state. HMMs are usually studied as a special purpose model, and are equipped with three algorithms, known as the *forward–backward*, *Viterbi *and *Baum–Welch *algorithms (see [138] for a description of these algorithms and example applications of HMMs). These are all special cases of Bayesian network algorithms that we discuss in later sections.

Given the recurrent and potentially unbounded structure of DBNs (their size grows with time), they present particular challenges and also special opportunities for inference algorithms. They also admit a more refined class of queries than general Bayesian networks. Hence, it is not uncommon to use specialized inference algorithms for DBNs, instead of applying general purpose algorithms that one may use for arbitrary Bayesian networks. We will see examples of such algorithms in the following sections.

#### 11.3 Exact Inference

Given a Bayesian (G,Θ) over variables **X**, which induces a probability distribution Pr, one can pose a number of fundamental queries with respect to the distribution Pr:

*Most Probable Explanation*(*MPE*): What’s the most likely instantiation of network variables**X**, given some evidence**e**?

*MPE*(**e**) = argmaxPr(**x**,**e**).

**x**

*Probability of Evidence*(*PR*): What’s the probability of evidence**e**, Pr(**e**)? Related to this query is*Posterior Marginals*: What’s the conditional probability Pr(X|**e**) for every variable X in the network^{70}?*Maximum a Posteriori Hypothesis*(*MAP*): What’s the most likely instantiation of some network variables**M**, given some evidence**e**?

*MAP*(**e**,**M**) = argmaxPr(**m**,**e**).

**m**

These problems are all difficult. In particular, the decision version of MPE, PR, and MAP, are known to be NP-complete, PP-complete and NP^{PP}-complete, respectively [32, 158, 145, 123]. We will discuss exact algorithms for answering these queries in this section, and then discuss approximate algorithms in Section 11.4. We start in Section 11.3.1 with a class of algorithms known as *structure-based *as their complexity is only a function of the network topology. We then discuss in Section 11.3.2 refinements of these algorithms that can exploit local structure in network parameters, leading to a complexity which is both a function of network topology and parameters. Section 11.3.3 discusses a class of algorithms based on search, specialized for MAP and MPE problems. Section 11.3.4 discusses an orthogonal class of methods for *compiling *Bayesian networks, and Section 11.3.5 discusses the technique of reducing exact probabilistic reasoning to logical inference.

It should noted here that by *evidence*, we mean a variable instantiation **e **of some network variables **E**. In general, one can define evidence as an arbitrary event α, yet most of the algorithms we shall discuss assume the more specific interpretation of evidence. These algorithms can be extended to handle more general notions of evidence as discussed in Section 11.3.6, which discusses a variety of additional extensions to inference algorithms.

##### 11.3.1 Structure-Based Algorithms

When discussing inference algorithms, it is quite helpful to view the distribution induced by a Bayesian network as a product of *factors*, where a factor f(**X**) is simply a mapping from instantiations **x **of variables **X **to real numbers. Hence, each CPT Θ_{X}|**U **of a Bayesian network is a factor over variables X**U**; see Fig. 11.1. The product of two factors f(**X**) and f(**Y**) is another factor over variables **Z **= **X **∪ **Y**: f(**z**) = f(**x**)f(**y**) where **z **∼ **x **and **z **∼ **y**.^{71}^{ }The distribution induced by a Bayesian network (G,Θ) can then be expressed as a product of its CPTs (factors) and the inference problem in Bayesian networks can then be formulated as follows. We are given a function f(**X**) (i.e., probability distribution) expressed as a product of factors f_{1}(**X**_{1}),...,f_{n}(**X**_{n}) and our goal is to answer questions about the function f(**X**) without necessarily computing the explicit product of these factors.

We will next describe three computational paradigms for exact inference in Bayesian networks, which share the same computational guarantees. In particular, all methods can solve the PR and MPE problems in time and space which is exponential only in the network *treewidth *[8, 144]. Moreover, all can solve the MAP problem exponential only in the network *constrained treewidth *[123]. Treewidth (and constrained treewidth) are functions of the network topology, measuring the extent to which a network resembles a tree. A more formal definition will be given later.

###### Inference by variable elimination

The first inference paradigm we shall discuss is based on the influential concept of variable elimination [153, 181, 45]. Given a function f(**X**) in factored form,

, and some corresponding query, the method will eliminate a variable X from this function to produce another function, while ensuring that the new function is as good as the old function as far as answering the query of interest. The idea is then to keep eliminating variables one at a time, until we can extract the answer we want from the result. The key insight here is that when eliminating a variable, we will only need to multiply factors that mention the eliminated variable. The order in which variables are eliminated is therefore important as far as complexity is concerned, as it dictates the extent to which the function can be kept in factored form.

The specific method for eliminating a variable depends on the query at hand. In particular, if the goal is to solve PR, then we eliminate variables by *summing *them out. If we are solving the MPE problem, we eliminate variables by *maxing *them out. If we are solving MAP, we will have to perform both types of elimination. To sum out a variable X from factor f(**X**) is to produce another factor over variables **Y **= **X**−X, denoted . To max out variable X is similar:

( max_{X }f)(**y**) = max_{x }f(**y**,x). Note that summing out variables is commutative and so is maxing out variables. However, summing out and maxing out do not commute. For a Bayesian network (G,Θ) over variables **X**, map variables **M**, and some evidence **e**, inference by variable elimination is then a process of evaluating the following expressions:

^{•}• *MPE*: max**X**$X ΘX|**U**λX.

*PR*: | .

• *MAP*: max .

Here, λ_{X }is a factor over variable X, called an *evidence indicator*, used to capture evidence **e**: λ_{X}(x) = 1 if x is consistent with evidence **e **and λ_{X}(x) = 0 otherwise. Evaluating the above expressions leads to computing the probability of MPE, the probability of evidence, and the probability of MAP, respectively. Some extra bookkeeping allows one to recover the identity of MPE and MAP [130, 45].

As mentioned earlier, the order in which variables are eliminated is critical for the complexity of variable elimination algorithms. In fact, one can define the width of an elimination order as one smaller than the size of the largest factor constructed during the elimination process, where the size of a factor is the number of variables over which it is defined. One can then show that variable elimination has a complexity which is exponential only in the width of used elimination order. In fact, the treewidth of a Bayesian network can be defined as the width of its best elimination order. Hence, the time and space complexity of variable elimination is bounded by O(nexp(w)), where n is the number of network variables (also number of initial factors), and w is

Figure 11.4: A Bayesian network (left) and a corresponding jointree (right), with the network factors and evidence indicators assigned to jointree clusters.

the width of used elimination order [45]. Note that w is lower bounded by the network treewidth. Moreover, computing an optimal elimination order and network treewidth are both known to be NP-hard [9].

Since summing out and maxing out do not commute, we must max out variables **M **last when computing MAP. This means that not all variable orders are legitimate; only those in which variables **M **come last are. The **M**-constrained treewidth of a Bayesian network can then be defined as the width of its best elimination order having variables **M **last in the order. Solving MAP using variable elimination is then exponential in the constrained treewidth [123].

###### Inference by tree clustering

Tree clustering is another algorithm for exact inference, which is also known as the jointree algorithm [89, 105, 157]. There are different ways for deriving the jointree algorithm, one of which treats the algorithm as a refined way of applying variable elimination.

The idea is to organize the given set of factors into a tree structure, using a jointree for the given Bayesian network. Fig. 11.4 depicts a Bayesian network, a corresponding jointree, and assignment of the factors to the jointree clusters. We can then use the jointree structure to control the process of variable elimination as follows. We pick a leaf cluster **C**_{i }(having a single neighbor **C**_{j}) in the jointree and then eliminate variables that appear in that cluster but in no other jointree cluster. Given the jointree properties, these variables are nothing but **C**_{i }\**C**_{j}. Moreover, eliminating these variables requires that we compute the product of all factors assigned to cluster **C**_{i }and then eliminate **C**_{i }\ **C**_{j }from the resulting factor. The result of this elimination is usually viewed as a message sent from cluster **C**_{i }to cluster **C**_{j}. By the time we eliminate every cluster but one, we would have projected the factored function on the variables of that cluster (called the root). The basic insight of the jointree algorithm is that by choosing different roots, we can project the factored function on every cluster in the jointree. Moreover, some of the work we do in performing the elimination process towards one root (saved as messages) can be reused when eliminating towards another root. In fact, the amount of work that can be reused is such that we can project the function f on all clusters in the jointree with time and space bounded by O(nexp(w)), where n is the number of jointree clusters and w is the width of given jointree (size of its largest cluster minus 1). This is indeed the main advantage of the jointree algorithm over the basic variable elimination algorithm, which would need O(n^{2 }exp(w)) time and space to obtain the same result. Interesting enough, if a network has treewidth w, then it must have a jointree whose largest cluster has size w + 1. In fact, every jointree for the network must have some cluster of size w +1. Hence, another definition for the treewidth of a Bayesian network is as the width of its best jointree (the one with the smallest maximum cluster).^{72}

The classical description of a jointree algorithm is as follows (e.g., [83]). We first construct a jointree for the given Bayesian network; assign each network CPT Θ_{X}|**U **to a cluster that contains X**U**; and then assign each evidence indicator λ_{X }to a cluster that contains X. Fig. 11.4 provides an example of this process. Given evidence **e**, a jointree algorithm starts by setting evidence indicators according to given evidence. A cluster is then selected as the root and message propagation proceeds in two phases, inward and outward. In the *inward phase*, messages are passed toward the root. In the *outward phase*, messages are passed away from the root. The inward phase is also known as the *collect *or *pull *phase, and the outward phase is known as the *distribute *or *push *phase. Cluster i sends a message to cluster j only when it has received messages from all its other neighbors k. A message from cluster i to cluster j is a factor M_{ij }defined as follows:

,

where Φ_{i }is the product of factors and evidence indicators assigned to cluster i. Once message propagation is finished, we have the following for each cluster i in the jointree:

Pr(**C**_{i},**e**) = Φ_{i }#M_{k,i}.

k

Hence, we can compute the joint marginal for any subset of variables that is included in a cluster.

The above description corresponds to a version of the jointree algorithm known as the Shenoy–Shafer architecture [157]. Another popular version of the algorithm is the Hugin architecture [89]. The two versions differ in their space and time complexity on arbitrary jointrees [106]. The jointree algorithm is quite versatile allowing even more architectures (e.g., [122]), more complex types of queries (e.g., [91, 143, 34]), including MAP and MPE, and a framework for time space tradeoffs [47].

###### Inference by conditioning

A third class of exact inference algorithms is based on the concept of *conditioning *[129, 130, 39, 81, 162, 152, 37, 52]. The key concept here is that if we know the value of a variable X in a Bayesian network, then we can remove edges outgoing from X, modify the CPTs for children of X, and then perform inference equivalently on the simplified network. If the value of variable X is not known, we can still exploit this idea by doing a case analysis on variable X, hence, instead of computing Pr(**e**), we compute . This idea of conditioning can be exploited in different ways. The first exploitation of this idea was in the context of loop-cutset conditioning [129, 130, 11]. A loop-cutset for a Bayesian network is a set of variables **C **such that removing edges outgoing from **C **will render the network a polytree: one in which we have a single (undirected) path between any two nodes. Inference on polytree networks can indeed be performed in time and space linear in their size [129]. Hence, by using the concept of conditioning, performing case analysis on a loop-cutset **C**, one can reduce the query Pr(**e**) into a set of queries _{c }, each of which can be answered in linear time and space using the polytree algorithm.

This algorithm has linear space complexity as one needs to only save modest information across the different cases. This is a very attractive feature compared to algorithms based on elimination. The bottleneck for loop-cutset conditioning, however, is the size of cutset **C **since the time complexity of the algorithm is exponential in this set. One can indeed construct networks which have a bounded treewidth, leading to linear time complexity by elimination algorithms, yet an unbounded loop-cutset. A number of improvements have been proposed on loop-cutset conditioning (e.g., [39, 81, 162, 152, 37, 52]), yet only *recursive conditioning *[39] and its variants [10, 46] have a treewidth-based complexity similar to elimination algorithms.

The basic idea behind recursive conditioning is to identify a cutset **C **that is not necessarily a loop-cutset, but that can decompose a network N in two (or more) subnetworks, say, N ^{l}_{c }and N ^{r}_{c }with corresponding distributions Pr^{l}_{c }and Pr^{r}_{c }for each instantiation **c **of cutset **C**. In this case, we can write

Pr(**e**) ), **c c**

where **e**^{l}/**c**^{l }and **e**^{r}/**c**^{r }are parts of evidence/cutset pertaining to networks N ^{l }and N ^{r}, respectively. The subqueries Pr and Pr can then be solved using the

same technique, recursively, by finding cutsets for the corresponding subnetworks N ^{l}_{c }and N ^{r}_{c}. This algorithm is typically driven by a structure known as a *dtree*, which is a binary tree with its leaves corresponding to the network CPTs. Each dtree provides a complete recursive decomposition over the corresponding network, with a cutset for each level of the decomposition [39].

Given a dtree where each internal node T has children T ^{l }and T ^{r}, and each leaf node has a CPT associated with it, recursive conditioning can then compute the probability of evidence **e **as follows:

), T is an internal node with cutset **C**

*rc*(T,**e**) %;

u, T is a leaf node with CPT Θ_{X} | U. |

Note that similar to loop-cutset conditioning, the above algorithm also has a linear space complexity which is better than the space complexity of elimination algorithms. Moreover, if the Bayesian network has treewidth w, there is then a dtree which is both balanced and has cutsets whose sizes are bounded by w + 1. This means that the above algorithm can run in O(nexp(w logn)) time and O space. This is worse than the time complexity of elimination algorithms, due to the logn factor, where n is the number of network nodes.

A careful analysis of the above algorithm, however, reveals that it may make identical recursive calls in different parts of the recursion tree. By caching the value of a recursive call *rc*(T,.), one can avoid evaluating the same recursive call multiple times. In fact, if a network has a treewidth w, one can always construct a dtree on which caching will reduce the running time from O(nexp(w logn)) to O(nexp(w)), while bounding the space complexity by O(nexp(w)), which is identical to the complexity of elimination algorithms. In principle, one can cache as many results as available memory would allow, leading to a framework for trading off time and space [3], where space complexity ranges from O to O(nexp(w)), and time complexity ranges from O(nexp(w logn)) to O(nexp(w)). Recursive conditioning can also be used to compute multiple marginals [4], in addition to MAP and MPE queries [38], within the same complexity discussed above.

We note here that the quality of a variable elimination order, a jointree and a dtree can all be measured in terms of the notion of *width*, which is lower bounded by the network treewidth. Moreover, the complexity of algorithms based on these structures are all exponential only in the width of used structure. Polynomial time algorithms exists for converting between any of these structures, while preserving the corresponding width, showing the equivalence of these methods with regards to their computational complexity in terms of treewidth [42].

##### 11.3.2 Inference with Local (Parametric) Structure

The computational complexity bounds given for elimination, clustering and conditioning algorithms are based on the network topology, as captured by the notions of treewidth and constrained treewidth. There are two interesting aspects of these complexity bounds. First, they are independent of the particular parameters used to quantify Bayesian networks. Second, they are both best-case and worst-case bounds for the specific statements given for elimination and conditioning algorithms.

Given these results, only networks with reasonable treewidth are accessible to these structure-based algorithms. One can provide refinements of both elimination/clustering and conditioning algorithms, however, that exploit the parametric structure of a Bayesian network, allowing them to solve some networks whose treewidth can be quite large.

For elimination algorithms, the key is to adopt nontabular representations of factors as initially suggested by [182] and developed further by other works (e.g., [134, 50, 80, 120]). Recall that a factor f(**X**) over variables **X **is a mapping from instantiations **x **of variables **X **to real numbers. The standard statements of elimination algorithms assume that a factor f(**X**) is represented by a table that has one row of each instantiation **x**. Hence, the size of factor f(**X**) is always exponential in the number of variables in **X**. This also dictates the complexity of factor operations, including multiplication, summation and maximization. In the presence of parametric structure, one can afford to use more structured representations of factors that need not be exponential in the variables over which they are defined. In fact, one can use any factor representation as long as they provide corresponding implementations of the factor operations of multiplication, summing out, and maxing out, which are used in the context of elimination algorithms. One of the more effective structured representations of factors is the *algebraic decision diagram *(*ADD*) [139, 80], which provides efficient implementations of these operations.

In the context of conditioning algorithms, local structure can be exploited at multiple levels. First, when considering the cases **c **of a cutset **C**, one can skip a case **c **if it is logically inconsistent with the logical constraints implied by the network parameters. This inconsistency can be detected by some efficient logic propagation techniques that run in the background of conditioning algorithms [2]. Second, one does not always need to instantiate all cutset variables before a network is disconnected or converted into a polytree, as some partial cutset instantiations may have the same effect if we have context-specific independence [15, 25]. Third, local structure in the form of equal network parameters within the same CPT will reduce the number of distinct subproblems that need to be solved by recursive conditioning, allowing caching to be much more effective [25]. Considering various experimental results reported in recent years, it appears that conditioning algorithms have been more effective in exploiting local structure, especially determinism, as compared to algorithms based on variable eliminating (and, hence, clustering).

Network preprocessing can also be quite effective in the presence of local structure, especially determinism, and is orthogonal to the algorithms used afterwards. For example, preprocessing has proven quite effective and critical for networks corresponding to genetic linkage analysis, allowing exact inference on networks with very high treewidth [2, 54, 55, 49]. A fundamental form of preprocessing is CPT decomposition, in which one decomposes a CPT with local structure (e.g., [73]) into a series of CPTs by introducing auxiliary variables [53, 167]. This decomposition can reduce the treewidth of given network, allowing inference to be performed much more efficiently. The problem of finding an optimal CPT decomposition corresponds to the problem of determining tensor rank [150], which is NP-hard [82]. Closed form solutions are known, however, for CPTs with a particular local structure [150].

##### 11.3.3 Solving MAP and MPE by Search

MAP and MPE queries are conceptually different from PR queries as they correspond to optimization problems whose outcome is a variable instantiation instead of a probability. These queries admit a very effective class of algorithms based on branch and bound search. For MPE, the search tree includes a leaf for each instantiation **x **of nonevidence variables **X**, whose probability can be computed quite efficiently given Eq. (11.3). Hence, the key to the success of these search algorithms is the use of evaluation functions that can be applied to internal nodes in the search tree, which correspond to partial variable instantiations **i**, to upper bound the probability of any completion **x **of instantiation **i**. Using such an evaluation function, one can possibly prune part of the search space, therefore, solving MPE without necessarily examining the space of all variable instantiations. The most successful evaluation functions are based on relaxations of the variable elimination algorithm, allowing one to eliminate a variable without necessarily multiplying all factors that include the variable [95, 110]. These relaxations lead to a spectrum of evaluation functions, that can trade accuracy with efficiency.

A similar idea can be applied to solving MAP, with a notable distinction. In MAP, the search tree will be over the space of instantiations of a subset **M **of network variables. Moreover, each leaf node in the search tree will correspond to an instantiation **m **in this case. Computing the probability of a partial instantiation **m **requires a PR query

A θA true 0.5 false 0.5 A B θB|A true true 1 true false 0 false true 0 false false 1 A C θC|A true true 0.8 true false 0.2 false true 0.2 false false 0.8

Figure 11.5: A Bayesian network.

though, which itself can be exponential in the network treewidth. Therefore, the success of search-based algorithms for MAP depends on both the efficient evaluation of leaf nodes in the search tree, and on evaluation functions for computing upper bounds on the completion of partial variable instantiations [123, 121]. The most successful evaluation function for MAP is based on a relaxation of the variable elimination algorithm for computing MAP, allowing one to use any variable order instead of insisting on a constrained variable order [121].

##### 11.3.4 Compiling Bayesian Networks

The probability distribution induced by a Bayesian network can be compiled into an *arithmetic circuit*, allowing various probabilistic queries to be answered in time linear in the compiled circuit size [41]. The compilation time can be amortized over many online queries, which can lead to extremely efficient online inference [25, 27]. Compiling Bayesian networks is especially effective in the presence of local structure, as the exploitation of local structure tends to incur some overhead that may not be justifiable in the context of standard algorithms when the local structure is not excessive. In the context of compilation, this overhead is incurred only once in the offline compilation phase.

To expose the semantics of this compilation process, we first observe that the probability distribution induced by a Bayesian network, as given by Eq. (11.3), can be expressed in a more general form:

= | , (11.4)

**x **λ_{x}:x ∼**x **θ_{x}|**u**:x**u**∼**x**

where λ_{x }is called an evidence indicator variable (we have one indicator λ_{x }for each variable X and value x). This form is known as the *network polynomial *and represents the distribution as follows. Given any evidence **e**, let f(**e**) denotes the value of polynomial f with each indicator variable λ_{x }set to 1 if x is consistent with evidence **e **and set to 0 otherwise. It then follows that f(**e**) is the probability of evidence **e**. Following is the polynomial for the network in Fig. 11.5:

f = λaλbλcθaθb|aθc|a + λaλbλc¯θaθb|aθc¯|a + ···λa¯λ_{b}¯λc¯θa¯θ_{b}¯|¯aθc¯|¯a.

Figure 11.6: Two circuits for the Bayesian network in Fig. 11.5.

The network polynomial has an exponential number of terms, but can be factored and represented more compactly using an arithmetic circuit, which is a rooted, directed acyclic graph whose leaf nodes are labeled with evidence indicators and network parameters, and internal nodes are labeled with multiplication and addition operations. The size of an arithmetic circuit is measured by the number of edges that it contains. Fig. 11.6 depicts an arithmetic circuit for the above network polynomial. This arithmetic circuit is therefore a compilation of corresponding Bayesian network as it can be used to compute the probability of any evidence **e **by evaluating the circuit while setting the indicators to 1/0 depending on their consistency with evidence **e**. In fact, the partial derivatives of this circuit with respect to indicators λ_{x }and parameters θ_{x}|**u **can all be computed in a single second pass on the circuit. Moreover, the values of these derivatives can be used to immediately answer various probabilistic queries, including the marginals over networks variables and families [41]. Hence, for a given evidence, one can compute the probability of evidence and posterior marginals on all network variables and families in two passes on the arithmetic circuit.

One can compile a Bayesian network using exact algorithms based on elimination [26] or conditioning [25], by replacing their addition and multiplication operations by corresponding operations for building the circuit. In fact, for jointree algorithms, the arithmetic circuit can be generated directly from the jointree structure [124]. One can also generate these compilations by reducing the problem to logical inference as discussed in the following section. If structure-based versions of elimination and conditioning algorithms are used to compile Bayesian networks, the size of compiled arithmetic circuits will be exponential in the network treewidth in the best case. If one uses versions that exploit parametric structure, the resulting compilation may not be lower bounded by treewidth [25, 27]. Fig. 11.6 depicts two arithmetic circuits for the same network, the one on the right taking advantage of network parameters and is therefore smaller than the one on the left, which is valid for any value of network parameters.

##### 11.3.5 Inference by Reduction to Logic

One of the more effective approaches for exact probabilistic inference in the presence of local structure, especially determinism, is based on reducing the problem to one of

A ΘA a1 0.1 a2 0.9 A B ΘB|A a1 b1 0.1 a1 b2 0.9 a2 b1 0.2 a2 b2 0.8 A C ΘC|A a1 c1 0.1 a1 c2 0.9 a2 c1 0.2 a2 c2 0.8

Figure 11.7: The CPTs of Bayesian network with two edges A → B and A → C.

logical inference. The key technique is to encode the Bayesian network as a propositional theory in conjunctive normal form (CNF) and then apply algorithms for model counting [147] or knowledge compilation to the resulting CNF [40]. The encoding can be done in multiple ways [40, 147], yet we focus on one particular encoding [40] in this section to illustrate the reduction technique.

We will now discuss the CNF encoding for the Bayesian network in Fig. 11.7. We first define the CNF variables which are in one-to-one correspondence with evidence indicators and network parameters as defined in Section 11.3.4, but treated as propositional variables in this case. The CNF is then obtained by processing network variables and CPTs, writing corresponding clauses as follows:

V ariable¬λa1 ∨ ¬λa2

Variable¬λb1 ∨ ¬λb2

Variable¬λc1 ∨ ¬λc2

CPT for

CPT forb_{1}|a_{1 }λa ∧ λb ⇔ θb |a_{1 }λ ∧ λb_{1 }⇔ θb_{1}|a_{2 }⇔ | 2 CPT for_{1 1 }λa ∧ λc ⇔ θc |a_{1 }a ∧ c ⇔ c_{1}|a_{2 }2 ⇔ 2| 2

The clauses for variables are simply asserting that exactly one evidence indicator must be true. The clauses for CPTs are establishing an equivalence between each network parameter and its corresponding indicators. This resulting CNF has two important properties. First, its size is linear in the network size. Second, its models are in one-toone correspondence with the instantiations of network variables. Table 11.2 illustrates the variable instantiations and corresponding CNF models for the previous example.

We can now either apply a model counter to the CNF queries [147], or compile the CNF to obtain an arithmetic circuit for the Bayesian network [40]. If we want to apply a model counter to the CNF, we must first assign weights to the CNF variables (hence, we will be performing weighted model counting). All literals of the form λ_{x}, ¬λ_{x }and ¬θ_{x}|**u **get weight 1, while literals of the form θ_{x}|**u **get a weight equal to the value of parameter θ_{x}|**u **as defined by the Bayesian network; see Table 11.2. To compute the probability of any event α, all we need to do then is computed the weighted model count of .

This reduction of probabilistic inference to logical inference is currently the most effective technique for exploiting certain types of parametric structure, including determinism and parameter equality. It also provides a very effective framework for exploiting evidence computationally and for accommodating general types evidence [25, 24, 147, 27].

Table 11.2. Illustrating the models and corresponding weights of a CNF encoding a Bayesian network

Network instantiation | CNF model | ωi sets these CNF vars to true and all others to false | Model weight |

a1b1c1 | ω0 | λa λb λc θa θb a θc a | |

a1b2c2 a2b1c1 | ω 1 ω2 ω3 ω4 ω5 ω6 ω7 | | c |a 2 2| 1 2| λa2 | · · = 0.9 · 0.8 · 0.8 = 0.576 |

##### 11.3.6 Additional Inference Techniques

We discuss in this section some additional inference techniques which can be crucial in certain circumstances.

First, all of the methods discussed earlier are immediately applicable to DBNs. However, the specific, recurrent structure of these networks calls for some special attention. For example, PR queries can be further refined depending on the location of evidence and query variables within the network structure, leading to specialized queries, such as *monitoring*. Here, the evidence is restricted to network slices t = 0, ...,t = i and the query variables are restricted to slice t = i. In such a case, and by using restricted elimination orders, one can perform inference in space which is better than linear in the network size [13, 97, 12]. This is important for DBNs as a linear space complexity can be unpractical if we have too many slices.

Second, depending on the given evidence and query variables, a network can potentially be pruned before inference is performed. In particular, one can always remove edges outgoing from evidence variables [156]. One can also remove leaf nodes in the network as long as they do not correspond to evidence or query variables [155]. This process of node removal can be repeated, possibly simplifying the network structure considerably. More sophisticated pruning techniques are also possible [107].

T hird, we have so far considered only simple evidence corresponding to the instantiation **e **of some variables **E**. If evidence corresponds to a general event α, we can add an auxiliary node X_{α }to the network, making it a child of all variables **U **appearing in α, setting the CPT Θ_{Xα}|**U **based on α, and asserting evidence on X_{α }[130]. A more effective solution to this problem can be achieved in the context of approaches that reduce the problem to logical inference. Here, we can simply add the event α to the encoded CNF before we apply logical inference [147, 24]. Another type of evidence we did not consider is *soft evidence*. This can be specified in two forms. We can declare that the evidence changes the probability of some variable X from Pr(X) to Pr(X). Or we can assert that the new evidence on X changes its odds by a given factor k, known as the Bayes factor: O. Both types of evidence can be handled by adding an auxiliary child X_{e }for node X, setting its CPT Θ_{Xe}|X depending on the strength of soft evidence, and finally simulating the soft evidence by hard evidence on X_{e }[130, 22].

#### 11.4 Approximate Inference

All exact inference algorithms we have discussed for PR have a complexity which is exponential in the network treewidth. Approximate inference algorithms are generally not sensitive to treewidth, however, and can be quite efficient regardless of the network topology. The issue with these methods is related to the quality of answers the compute, which for some algorithms is quite related to the amount of time budgeted by the algorithm. We discuss two major classes of approximate inference algorithms in this section. The first and more classical class is based on sampling. The second and more recent class of methods can be understood in terms of a reduction to optimization problems. We note, however, that none of these algorithms offer general guarantees on the quality of approximations they produce, which is not surprising since the problem of approximating inference to any desired precision is known to be NP-hard [36].

##### 11.4.1 Inference by Stochastic Sampling

Sampling from a probability distribution Pr(**X**) is a process of generating complete instantiations **x**_{1},...,**x**_{n }of variables **X**. A key property of a sampling process is its *consistency: *generating samples **x **with a frequency that converges to their probability Pr(**x**) as the number of samples approaches infinity. By generating such consistent samples, one can approximate the probability of some event α, Pr(α), in terms of the fractions of samples that satisfy . This approximated probability will then converge to the true probability as the number of samples reaches infinity. Hence, the precision of sampling methods will generally increase with the number of samples, where the complexity of generating a sample is linear in the size of the network, and is usually only weakly dependent on its topology.

Indeed, one can easily generate consistent samples from a distribution Pr that is induced by a Bayesian network (G,Θ), using time that is linear in the network size to generate each sample. This can be done by visiting the network nodes in topological order, parents before children, choosing a value for each node X by sampling from the distribution Pr(X|**u**) = Θ_{X}|**u**, where **u **is the chosen values for X’s parents **U**. The key question with sampling methods is therefore related to the speed of convergence (as opposed to the speed of generating samples), which is usually affected by two major factors: the query at hand (whether it has a low probability) and the specific network parameters (whether they are extreme).

C onsider, for example, approximating the query Pr by approximating Pr(α,**e**) and Pr(**e**) and then computing | = Pr(α,**e**)/Pr(**e**) according to the above sampling method, known as *logic sampling *[76]. If the evidence **e **has a low probability, the fraction of samples that satisfy **e **(and α,**e **for that matter) will be small, decreasing exponentially in the number of variables instantiated by evidence **e**, and correspondingly increasing the convergence time. The fundamental problem here is that we are generating samples based on the original distribution Pr(**X**), where we ideally want to generate samples based on the posterior distribution Pr(**X**|**e**), which can be shown to be the optimal choice in a precise sense [28]. The problem, however, is that Pr(**X**|**e**) is not readily available to sample from. Hence, more sophisticated approaches for sampling attempt to sample from distributions that are meant to be close to Pr(**X**|**e**), possibly changing the sampling distribution (also known as an importance function) as the sampling process proceeds and more information is gained. This includes the methods of *likelihood weighting *[154, 63], *self-importance sampling *[154], *heuristic importance *[154], *adaptive importance sampling *[28], and *evidence pre-propagation importance sampling *(EPIS-BN) algorithm [179]. Likelihood weighing is perhaps the simplest of these methods. It works by generating samples that are guaranteed to be consistent with evidence **e**, by avoiding to sample values for variables **E**, always setting them to **e **instead. It also assigns a weight of $θ_{e}_{|}**u**:e_{u}∼**x **θ_{e}|**u **to each sample **x**.

Likelihood weighing will then use these weighted samples for approximating the probabilities of events. The current state of the art for sampling in Bayesian networks is probably the EPIS-BN algorithm, which estimates the optimal importance function using belief propagation (see Section 11.4.2) and then proceeds with sampling.

Another class of sampling methods is based on *Markov Chain Monte Carlo *(*MCMC*) simulation [23, 128]. Procedurally, samples in MCMC are generated by first starting with a random sample **x**_{0 }that is consistent with evidence **e**. A sample **x**_{i }is then generated based on sample **x**_{i}−1 by choosing a new value of some nonevidence variable X by sampling from the distribution Pr(X|**x**_{i }− X). This means that samples **x**_{i }and **x**_{i}+1 will disagree on at most one variable. It also means that the sampling distribution is potentially changed after each sample is generated. MCMC approximations will converge to the true probabilities if the network parameters are strictly positive, yet the algorithm is known to suffer from convergence problems in case the network parameters are extreme. Moreover, the sampling distribution of MCMC will convergence to the optimal one if the network parameters satisfy some (ergodic) properties [178].

One specialized class of sampling methods, known as *particle filtering*, deserves particular attention at it applies to DBNs [93]. In this class, one generates *particles *instead of *samples*, where a particle is an instantiation of the variables at a given time slice t. One starts by a set of n particles for the initial time slice t = 0, and then moves forward generating particles **x**^{t }for time t based on the particles generated for time t − 1. In particular, for each particle **x**^{t}−^{1}, we sample a particle **x **based on the distributions Pr(X^{t}|**x**^{t}−^{1}), in a fashion similar to logic sampling. The particles for time t can then be used to approximate the probabilities of events corresponding to that slice. As with other sampling algorithms, particle filtering needs to deal with the problem of unlikely evidence, a problem that is more exaggerated in the context of DBNs as the evidence pertaining to slices t > i is generally not available when we generate particles for times t i. One simple approach for addressing this problem is to *resample *the particles for time t based on the extent to which they are compatible with the evidence **e**^{t }at time t. In particular, we regenerate n particles for time t from the original set based on the weight Pr(**e**^{t}|**x**^{t}) assigned to each particle **x**^{t}. The family of particle filtering algorithms include other proposals for addressing this problem.

##### 11.4.2 Inference as Optimization

The second class of approximate inference algorithms for PR can be understood in terms of reducing the problem of inference to one of optimization. This class includes *belief propagation *(e.g., [130, 117, 56, 176]) and *variational *methods (e.g., [92, 85]).

Given a Bayesian network which induces a distribution Pr, variational methods work by formulating approximate inference as an optimization problem. For example, say we are interested in searching for an approximate distribution Pr which is more well behaved computationally than Pr. In particular, if Pr is induced by a Bayesian network N which has a manageable treewidth. Typically, one starts by choosing the network N which has a high treewidth, then Pr could possibly be induced by another

structure of network N to meet certain computational constraints and then search

for a parametrization of N that minimizes the KL-divergence between the original distribution Pr and the approximate one :

*K L * | . (w|**e**)

Ideally, we want parameters of network N that minimize this KL-divergence, while possibly satisfying additional constraints. Often, we can simply set to zero the paran iterative search for parameters that solve the resulting system of equations. Note tial derivatives of *KL*(Pr(.|**e**),Pr(.|**e**)) with respect to the parameters, and perform

that the KL-divergence is not symmetric. In fact, one would probably want to minimize *KL* | | instead, but this is not typically done due to computational considerations (see [57, 114] for approaches using this divergence, based on local optimizations).

One of the simplest variational approaches is to choose a completely disconnected

network N , leading to what is known as a *mean-field *approximation [72]. Other variational approaches typically assume a particular structure of the approximate model, such as chains [67], trees [57, 114], disconnected subnetworks [149, 72, 175], or just tractable substructures in general [173, 65]. These methods are typically phrased in the more general setting of graphical models (which includes other representational schemes, such as Markov Networks), but can typically be adapted to Bayesian net-

works as well. We should note here that the choice of approximate network N

should at least permit one to evaluate the KL-divergence between N and the original network N efficiently. As mentioned earlier, such approaches seek minima of the KL-divergence, but typically search for parameters where the partial derivatives of the KL-divergence are zero, i.e., parameters that are stationary points of the KLdivergence. In this sense, variational approaches can reduce the problem of inference to one of optimization. Note that methods identifying stationary points, while convenient, only approximate the optimization problem since stationary points do not necessarily represent minima of the KL-divergence, and even when they do, they do not necessarily represent global minima.

Methods based on belief propagation [130, 117, 56] are similar in the sense that they also can be understood as solving an optimization problem. However, this understanding is more recent and comes as an after fact of having discovered the first belief propagation algorithm, known as loopy belief propagation or iterative belief propafactored form:

gation (IBP). In IBP, the approximate distribution Pr is assumed to have a particular

**e** )

=, (11.5) |

X∈**X**U∈**U **Pr(U|**e**)

where U ∈ **U **are parents of the node X in the original Bayesian network N . This form allows one to decompose the KL-divergence between the original and approximate distributions as follows:

*KL* * *| |

logPr(**e**).

This decomposition of the KL-divergence has important properties. First, the term Pr(**e**) does not depend on the approximate distribution and can be ignored in the optimization process. Second, all other terms are expressed as a function of the approximate marginals | | , in addition to the original network parameters θ_{x}|**u**. In fact, IBP can be interpreted as searching for values of these approximate marginals that correspond to stationary points of the KL-divergence: ones that set to zero the partial derivatives of the divergence with respect to these marginals (under certain constraints). There is a key difference between the variational approaches based on searching for parameters of approximate networks and those based on searching for approximate marginals: The computed marginals may not actually correspond to any particular distribution as the optimization problem solved does not include enough constraints to ensure the global coherence of these marginals (only node marginals are consistent, e.g.,

The quality of approximations found by IBP depends on the extent to which the original distribution can indeed be expressed as given in (11.5). If the original network N has a polytree structure, the original distribution can be expressed as given in (11.5) and the stationary point obtained by IBP corresponds to exact marginals. In fact, the form given in (11.5) is not the only one that allows one to set up an optimization problem as given above. In particular, any factored form that has the structure:

Pr

Pr || , (11.6)

_{S }Pr( )

where **C **and **S **are sets of variables, will permit a similar decomposition of the KLdivergence in terms of marginals | | . This leads to a more general framework for approximate inference, known as *generalized belief propagation *[176]. Note, however, that this more general optimization problem is exponential in the sizes of sets **C **and **S**. In fact, any distribution induced by a Bayesian network N can be expressed in the above form, if the sets **C **and **S **correspond to the clusters and separators of a jointree for network N [130]. In that case, the stationary point of the optimization problem will correspond to exact marginals, yet the size of the optimization problem will be at least exponential in the network treewidth. The form in (11.6) can therefore be viewed as allowing one to trade the complexity of approximate inference with the quality of computed approximations, with IBP and jointree factorizations being two extreme cases on this spectrum. Methods for exploring this spectrum include joingraphs (which generalize jointrees) [1, 48], region graphs [176, 169, 170], and partially ordered sets (or posets) [111], which are structured methods for generating factorizations with interesting properties.

The above optimization perspective on belief propagation algorithms is only meant to expose the semantics behind these methods. In general, belief propagation algorithms do not set up an explicit optimization problem as discussed above. Instead, they operate by passing messages in a Bayesian network (as is done by IBP), a joingraph, or some other structure such as a region graph. For example, in a Bayesian network, the message sent from a node X to its neighbor Y is based on the messages that node X receives from its other neighbors Z = Y. Messages are typically initialized according to some fixed strategy, and then propagated according to some message passing schedule. For example, one may update messages in parallel or sequentially [168, 164]. Additional techniques are used to fine tune the propagation method, including message dampening [117, 78]. When message propagation converges (if it does), the computed marginals are known to correspond to stationary points of the KL-divergence as discussed above [176, 79]. There are methods that seek to optimize the divergence directly, but they may be slow to converge [180, 171, 94, 174].

Statistical physics happens to be the source of inspiration for many of these methods and perspectives. In particular, we can reformulate the optimization of the KLdivergence in terms of optimizing a *variational free energy *that approximates a free energy (e.g., in thermodynamics). The free energy approximation corresponding to IBP and Eq. (11.5) is often referred to as the *Bethe free energy *[176]. Other free energy approximations in physics that improve on, or generalize, the Bethe free energy have indeed lent themselves to generalizing belief propagation. Among them is the *Kikuchi free energy *[177], which led to *region-based free energy *approximations for generalized belief propagation algorithms [176].

#### 11.5 Constructing Bayesian Networks

Bayesian networks can be constructed in a variety of methods. Traditionally, Bayesian networks have been constructed by knowledge engineers in collaboration with domain experts, mostly in the domain of medical diagnosis. In more recent applications, Bayesian networks are typically synthesized from high level specifications, or learned from data. We will review each of these approaches in the following sections.

##### 11.5.1 Knowledge Engineering

The construction of Bayesian networks using traditional knowledge engineering techniques has been most prevalent in medical reasoning, which also constitute some of the first significant applications of Bayesian networks to real-world problems. Some of the notable examples in this regard include: The Quick Medical Reference (QMR) model [113] which was later reformulated as a Bayesian network model [159] that covers more than 600 diseases and 4000 symptoms; the CPCS-PM network [137, 125], which simulates patient scenarios in the medical field of hepatobiliary disease; and the MUNIN model for diagnosing neuromuscular disorders from data acquired by electromyographic (EMG) examinations [7, 5, 6], which covers 8 nerves and 6 muscles.

The construction of Bayesian networks using traditional knowledge engineering techniques has been recently made more effective through progress on the subject of *sensitivity analysis*: a form of analysis which focuses on understanding the relationship between local network parameters and global conclusions drawn from the network [102, 18, 90, 98, 19–21]. These results have lead to the creation of efficient sensitivity analysis tools which allow experts to assess the significance of network parameters, and to easily isolate problematic parameters when obtaining counterintuitive results to posed queries.

##### 11.5.2 High-Level Specifications

The manual construction of large Bayesian networks can be laborious and error-prone. In many domains, however, these networks tend to exhibit regular and repetitive structures, with the regularities manifesting themselves both at the level of individual CPTs and at the level of network structure. We have already seen in Section 11.2.4 how regularities in a CPT can reduce the specification of a large CPT to the specification of a few parameters. A similar situation can arise in the specification of a whole Bayesian network, allowing one to synthesize a large Bayesian network automatically from a compact, high-level specification that encodes probabilistic dependencies among network nodes, in addition to network parameters.

This general *knowledge-based model construction *paradigm [172] has given rise to many concrete high-level specification frameworks, with a variety of representation styles. All of these frameworks afford a certain degree of modularity, thus facilitating the adaptation of existing specifications to changing domains. A further benefit of high-level specifications lies in the fact that the smaller number of parameters they contain can often be learned from empirical data with higher accuracy than the larger number of parameters found in the full Bayesian network [59, 96]. We next describe some fundamental paradigms for high-level representation languages, where we distinguish between two main paradigms: template-based and programming-based. It must be acknowledged, however, that this simple distinction is hardly adequate to account for the whole variety of existing representation languages.

###### Template-based representations

The prototypical example of template-based representations is the dynamic Bayesian network described in Section 11.2.6. In this case, one specifies a DBN having an arbitrary number of slices using only two templates: one for the initial time slice, and one for all subsequent slices. By further specifying the number of required slices t, a Bayesian network of arbitrary size can be compiled from the given templates and temporal horizon t.

One can similarly specify other types of large Bayesian networks that are composed of identical, recurring segments. In general, the template-based approach requires two components for specifying a Bayesian network: a set of network templates whose instantiation leads to network segments, and a specification of which segments to generate and how to connect them together. Fig. 11.8 depicts three templates from the domain of genetics, involving two classes of variables: genotypes (*gt*) and phenotypes (*pt*). Each template contains nodes of two kinds: nodes representing random variables that are created by instantiating the template (solid circles, annotated with CPTs), and nodes for input variables (dashed circles). Given these templates, together with a pedigree which enumerates particular individuals with their parental relationships, one can then generate a concrete Bayesian network by instantiating one genotype template and one phenotype template for each individual, and then connecting the resulting segments depending on the pedigree structure. The particular genotype template instantiated for an individual will depend on whether the individual is a founder (has no parents) in the pedigree.

The most basic type of template-based representations, such as the one in Fig. 11.8, is quite rigid as all generated segments will have exactly the same structure. More

gt(X) | |||

gt(p1(X)) | gt(p2(X)) | AA Aa | aa |

AA | AA | 1.0 0.0 | 0.0 |

AA | Aa | 0.5 0.5 | 0.0 |

... | ... | ... ... | ... |

aa | aa | 0.0 0.0 | 1.0 |

gt(X)

AA Aa aa

0.49 0.42 0.09

*pt*(X)gt(X) affected not affected AA 0.0 1.0 Aa 0.0 1.0 aa 1.0 0.0

Figure 11.8: Templates for specifying a Bayesian network in the domain of genetics. The templates assume three possible genotypes (AA, Aa, aa) and two possible phenotypes (affected, not affected).

sophisticated template-based representations add flexibility to the specification in various ways. *Network fragments *[103] allow nodes in a template to have an unspecified number of parents. The CPT for such nodes must then be specified by generic rules. *Object oriented Bayesian networks *[99] introduce abstract classes of network templates that are defined by their interface with other templates. *Probabilistic relational models *enhance the template approach with elements of relational database concepts [59, 66], by allowing one to define probabilities conditional on aggregates of the values of an unspecified number of parents. For example, one might include nodes *life_expectancy*(X) and *age_at_death*(X) into a template for individuals X, and condition the distribution of *life_expectancy*(X) on the average value of the nodes *age_at_death*(Y) for all ancestors Y of X.

###### Programming-based representations

Frameworks in this group contain some of the earliest high-level representation languages. They use procedural or declarative specifications, which are not as directly connected to graphical representations as template-based representations. Many are based on logic programming languages [17, 132, 71, 118, 96]; others resemble functional programming [86] or deductive database [69] languages. Compared to templatebased approaches, programming-based representations can sometimes allow more modular and intuitive representations of high-level probabilistic knowledge. On the other hand, the compilation of the Bayesian network from the high-level specification

Table 11.3. A probabilistic Horn clause specification

*alarm*(X) ← *burglary*(X): 0.95 *alarm*(X) ← *quake*(Y), *lives*_*in*(X,Y): 0.8 *call*(X,Z) ← *alarm*(X), *neighbor*(X,Z): 0.7 *call*(X,Z) ← *prankster*(Z), *neighbor*(X,Z): 0.1

comb(*alarm*): noisy-or comb(*call*): noisy-or

Table 11.4. CPT for ground atom *alarm*(Holmes)

*burglary*(Holmes) *quake*(LA) *lives_in*(Holmes, LA) *quake*(SF) *lives_in*(Holmes, SF) *alarm*(Holmes)

is usually not as straightforward, and part of the semantics of the specification can be hidden in the details of the compilation process.

Table 11.3 shows a basic version of a representation based on probabilistic Horn clauses [71]. The logical atoms *alarm*(X), *burglary*(X),... represent generic random variables. Ground instances of these atoms, e.g., *alarm*(Holmes), *alarm*(Watson), become the actual nodes in the constructed Bayesian network. Each clause in the probabilistic rule base is a partial specification of the CPT for (ground instances of) the atom in the head of the clause. The second clause in Table 11.3, for example, stipulates that *alarm*(X) depends on variables *quake*(Y) and *lives*_*in*(X,Y). The parameters associated with the clauses, together with the *combination rules *associated with each relation determine how a full CPT is to be constructed for a ground atom. Table 11.4 depicts part of the CPT constructed for *alarm*(Holmes) when Table 11.3 is instantiated over a domain containing an individual Holmes and two cities LA and SF. The basic probabilistic Horn clause paradigm illustrated in Table 11.3 can be extended and modified in many ways; see for example *Context-sensitive probabilistic knowledge bases *[118] and *Relational Bayesian networks *[86].

Specifications such as the one in Table 11.3 need not necessarily be seen as highlevel specifications of Bayesian networks. Provided the representation language is equipped with a well-defined probabilistic semantics that is not defined procedurally in terms of the compilation process, such high-level specifications are also stand-alone probabilistic knowledge representation languages. It is not surprising, therefore, that some closely related representation languages have been developed which were not intended as high-level Bayesian network specifications [148, 116, 135, 140].

###### Inference

Inference on Bayesian networks generated from high-level specifications can be performed using standard inference algorithms discussed earlier. Note, however, that the

Figure 11.9: A Bayesian network structure for medical diagnosis.

Table 11.5. A data set for learning the structure in Fig. 11.9

Case | Cold? | Flu? | Tonsillitis? | Chilling? | Body ache? | Sore throat? | Fever? |

1 | true | false | ? | true | false | false | false |

2 | false | true | false | true | true | false | true |

3 | ? | ? | true | false | ? | true | false |

. . . | . . . | . . . | . . . | . . . | . . . | . . . | . . . |

generated networks can be very large and very connected (large treewidth), and therefore often pose particular challenges to inference algorithms. As an example, observe that the size of the CPT for *alarm*(Holmes) in Table 11.4 grows exponentially in the number of cities in the domain. Approximate inference techniques, as described in Section 11.4, are therefore particularly important for Bayesian networks generated from high-level specifications. One can also optimize some of these algorithms, such as sampling methods, for Bayesian networks compiled from these specifications [126]. It should also be noted that such Bayesian networks can sometimes be rich with local structure, allowing exact inference even when the network treewidth is quite high [27].

Exact inference algorithms that operate directly on high-level specifications have also been investigated. Theoretical complexity results show that in the worst case one cannot hope to obtain more efficient algorithms than standard exact inference on the compiled network [87]. This does not, however, preclude the possibility that high-level inference methods can be developed that are more efficient for particular applications and particular queries [133, 43].

##### 11.5.3 Learning Bayesian Networks

A Bayesian network over variables X_{1},...,X_{n }can be learned from a data set over these variables, which is a table with each row representing a partial instantiation of variables X_{1},...,X_{n}. Table 11.5 depicts an example data set for the network in Fig. 11.9.

Each row in the above table represents a medical case of a particular patient, where ? indicates the unavailability of corresponding data for that patient. It is typically assumed that when variables have missing values, one cannot conclude anything from that fact that the values are missing (e.g., a patient did not take an X-ray because the X-ray happened to be unavailable that day) [108].

There are two orthogonal dimensions that define the process of learning a Bayesian network from data: the task for which the Bayesian network will be used, and the amount of information available to the learning process. The first dimension decides the criteria by which we judge the quality of a learned network, that is, it decides the objective function that the learning process will need to optimize. This dimension calls for distinguishing between learning *generative *versus *discriminative *Bayesian networks. To make this distinction more concrete, consider again the data set shown in Table 11.5. A good *generative *Bayesian network is one that correctly models all of the correlations among the variables. This model could be used to accurately answer any query, such as the correlations between *Chilling*? and *BodyAche*?, as well as whether a patient has *Tonsilitis *given any other (partial) description of that patient. On the other hand, a discriminative Bayesian network is one that is intended to be used only as a classifier: determining the value of one particular variable, called the *class variable*, given the values of some other variables, called the *attributes *or *features*. When learning a discriminative network, we will therefore optimize the classification power of the learned network, without necessarily insisting on the global quality of the distribution it induces. Hence, the answers that the network may generate for other types of queries may not be meaningful. This section will focus on generative learning of networks; for information on discriminative learning of networks, see [84, 70]. The second dimension calls for distinguishing between four cases:

*Known network structure, complete data*. Here, the goal is only to learn the parameters Θ of a Bayesian network as the structure G is given as input to the learning process. Moreover, the given data is complete in the sense that each row in the data set provides a value for each network variable.*Known network structure, incomplete data*. This is similar to the above case, but some of the rows may not have values for some of the network variables; see Table 11.5.*Unknown network structure, complete data*. The goal here is to learn both the network structure and parameters, from complete data.*Unknown network structure, incomplete data*. This is similar to Case 3 above, but where the data is incomplete.

In the following discussion, we will only consider the learning of Bayesian networks in which CPTs have tabular representations, but see [60] for results on learning networks with structured CPT representations.

###### Learning network parameters

W e will now consider the task of learning Bayesian networks whose structure is already known and then discuss the case of unknown structure. Suppose that we have a complete data set D over variables **X **= . The first observation here is to view this data set as defining a probability distribution Pr over these variables, where *count*(**x**,D)/|D| is simply the percentage of rows in D that contain the instantiation **x**. Suppose now that we have a Bayesian network structure G and our goal is to learn the parameters Θ of this network given the data set D. This is done by choosing parameters so that the network (G,Θ) will induce a distribution Pr_{Θ }that is as close to Pr as possible, according to the KL-divergence. That is, the goal is to minimize:

**x**

**x x**

∈

Note that parameters which maximize the above quantity will also maximize the probability of data,A number of observations are in order about this method of learning. First, there is$d∈D Pr_{Θ}(d) and are known as *maximum likelihood parameters*. a unique set of parameters Θ = {θ_{x}|**u**} that satisfy the above property, defined as follows: θ_{x}|**u **= *count*(x**u**,D)*/count*(**u**,D) (e.g., see [115]). Second, this method may have problems when the data set does not contain enough cases, leading possibly to *count*(**u**,D) = 0 and a division by zero. This is usually handled by using (something like) a Laplacian correction; using, say

D , (11.8)

| + **u**,

where |X| is the number of values for variable X. We will refer to these parameters as Θ(ˆ G,D) from now on.

When the data is incomplete, the situation is not as simple for a number of reasons. First, we may have multiple sets of maximum likelihood parameters. Second, the two most commonly used methods that search for such parameters are not optimal, and both can be computationally intensive. Both methods are based on observing, from Eq. (11.7), that we are trying to optimize a function g(Θ) of the network parameters. The first method tries to optimize this function using standard gradient ascent techniques [146]. That is, we first compute the gradient which happens to have the following form:

∂g

, (11.9)

x|**u **_{d}∈_{D }x|

and then use it to drive a gradient ascent procedure that attempts to find a local maxima of the function g. This method will start with some initial parameter Θ^{0}, leading to an initial Bayesian network (G,Θ^{0}) with distribution Pr^{0}_{Θ}. It will then use Eq. (11.9)

rametersto compute the gradientΘ , with corresponding network∂g/∂θx|**u**(Θ0), which is then used to find the next set of pa-1) and distribution Pr1. The process

1 (G,Θ

^{6}We are treating a data set as a multi-set, which can include repeated elements.

continues, computing a new set of parameters Θ^{i }based on the previous set Θ^{i}−^{1}, until some convergence criteria is satisfied. Standard techniques of gradient ascent all are applicable in this case, including conjugate gradient, line search and random restarts [14].

A more commonly used method in this case is the *expectation maximization *(*EM*) algorithm [104, 112], which works as follows. The method starts with some initial parameters Θ^{0}, leading to an initial distribution Pr^{0}_{Θ}. It then uses the distribution to complete the data set D as follows. If d is a row in D for which some variable values are missing, the algorithm will (conceptually) consider every completion d of this row and assign it a weight of Pr | . The algorithm will then pretend as if it had a complete (but weighted) data set, and use the method for complete data to compute a new set of parameters Θ^{1}, leading to a new distribution Pr^{1}_{Θ}. This process continues, computing a new set of parameters Θ^{i }based on the previous set Θ^{i}−^{1}, until some convergence criteria is satisfied. This method has a number of interesting properties.

First, the value of parameters at iteration i have the following closed from:

θx^{i}|**u ** | ,

which has the same complexity as the gradient ascent method (see Eq. (11.9)). Second, the probability of the data set is guaranteed to never decrease after each iteration of the method. There are many techniques to make this algorithm even more efficient; see [112].

###### Learning network structure

We now turn to the problem of learning a network *structure *(as well as the associated parameters), given complete data. As this task is NP-hard in general [30], the main algorithms are iterative, starting with a single structure (perhaps the empty graph), and incrementally modifying this structure, until reaching some termination condition. There are two main classes of algorithms, score-based and independence-based.

As the name suggests, the algorithms based on independence will basically run a set of independence tests, between perhaps every pair of currently-unconnected nodes in the current graph, to see if the data set supports the claim that they are independent given the rest of the graph structure; see [68, 127].

Score-based algorithms will typically employ local search, although systematic search has been used in some cases too (e.g., [165]). Local search algorithms will evaluate the current structure, as well as every structure formed by some simple modification—such as adding one addition arc, or deleting one existing arc, or changing the direction of one arc [29]—and climb to the new structure with the highest score. One plausible score is based on favoring structures that lead to higher probability of the data:

(d). (11.10)

Unfortunately, this does not always work. To understand why, consider the simpler problem of fitting a polynomial to some pairs of real numbers. If we do not fix the degree of the polynomial, we would probably end up fitting the data perfectly by

Figure 11.10: Modeling intervention on causal networks.

selecting a high degree polynomial. Even though this may lead to a perfect fit over the given data points, the learned polynomial may not generalize the data well, and so do poorly at labeling other novel data points. The same phenomena, called *overfitting *[141], shows up in learning Bayesian networks, as it means we would favor a fully connected network, as clearly this complete graph would maximize the probability of data due to its large set of parameters (maximal degrees of freedom). To deal with this overfitting problem, other scoring functions are used, many explicitly including a penalty term for complex structure. This includes the Minimum Description Length (MDL) score [142, 62, 101, 163], the Akaike Information Criterion (AIC) score [16], and the “Bayesian Dirichlet, equal” (BDe) [33, 75, 74]. For example, the MDL score is given by:

*MDL*

where m is the size of data set D and k(G) is the number of independent network parameters (this also corresponds to the Bayesian Information Criterion (BIC) [151]). Each of these scores is asymptotically correct in that it will identify the correct structures in the limit as the data increases.

The above discussion has focused on learning arbitrary network structures. There are also efficient algorithms for computing the optimal structures, for some restricted class of structures, including trees [31] and polytrees [131].

If the data is incomplete, learning structures becomes much more complicated as we have two nested optimization problems: one for choosing the structure, which can again be accomplished by either greedy or optimal search, and one for choosing the parameters for a given structure, which can be accomplished using methods like EM [75]. One can improve the double search problem by using techniques such as *structural EM *[58], which uses particular data structures that allow computational results to be used across the different iterations of the algorithm.

#### 11.6 Causality and Intervention

The directed nature of Bayesian networks can be used to provide causal semantics for these networks, based on the notion of *intervention *[127], leading to models that not only represent probability distributions, but also permit one to induce new probability distributions that result from intervention. In particular, a *causal network*, intuitively speaking, is a Bayesian network with the added property that the parents of each node are its direct causes. For example, *Cold *→ *HeadAche *is a causal network whereas *HeadAche *→ *Cold *is not, even though both networks are equally capable of representing any joint distribution on the two variables. Causal networks can be used to compute the result of intervention as illustrated in Fig. 11.10. In this example, we want to compute the probability distribution that results from having set the value of variable D by *intervention*, as opposed to having observed the value of D. This can be done by deactivating the current causal mechanism for D—by disconnecting D from its direct causes A and B—and then conditioning the modified causal model on the set value of D. Note how different this process is from the classical operation of *Bayes conditioning *(Eq. (11.1)), which is appropriate for modeling observations but not immediately for intervention. For example, intervening on variable D in Fig. 11.10 would have no effect on the probability associated with F, while measurements taken on variable D would affect the probability associated with F.^{73}^{ }Causal networks are more properly defined, then, as Bayesian networks in which each parents–child family represents a stable causal mechanism. These mechanisms may be reconfigured locally by interventions, but remain invariant to other observations and manipulations.

Causal networks and their semantics based on intervention can then be used to answer additional types of queries that are beyond the scope of general Bayesian networks. This includes determining the truth of counterfactual sentences of the form α → β | γ, which read: “Given that we have observed γ, if α were true, then β would have been true”. The counterfactual antecedent α consists of a conjunction of value assignments to variables that are forced to hold true by external intervention. Typically, to justify being called “counterfactual”, α conflicts with γ. The truth (or probability) of a counterfactual conditional α → β | γ requires a causal model. For example, the probability that “the patient would be alive had he not taken the drug” cannot be computed from the information provided in a Bayesian network, but requires a functional causal networks, where each variable is functionally determined by its parents (plus noise factors). This more refined specification allows one to assign unique probabilities to all counterfactual statements. Other types of queries that have been formulated with respect to functional causal networks include ones for distinguishing between direct and indirect causes and for determining the sufficiency and necessity aspects of causation [127].

#### Acknowledgements

Marek Druzdzel contributed to Section 11.4.1, Arthur Choi contributed to Section 11.4.2, Manfred Jaeger contributed to Section 11.5.2, Russ Greiner contributed to Section 11.5.3, and Judea Pearl contributed to Section 11.6. Mark Chavira, Arthur Choi, Rina Dechter, and David Poole provided valuable comments on different versions of this chapter.

#### Bibliography

- S.M. Aji and R.J. McEliece. The generalized distributive law and free energy minimization. In
*Proceedings of the 39th Allerton Conference on Communication, Control and Computing*, pages 672–681, 2001. - D. Allen and A. Darwiche. New advances in inference by recursive conditioning. In
*Proceedings of the Conference on Uncertainty in Artificial Intelligence*, pages 2–10, 2003. - D. Allen and A. Darwiche. Optimal time–space tradeoff in probabilistic inference. In
*Proc. International Joint Conference on Artificial Intelligence (IJCAI)*, pages 969–975, 2003. - D. Allen and A. Darwiche. Advances in Bayesian networks. In
*Studies in Fuzziness and Soft Computing*, vol. 146, pages 39–55. Springer-Verlag, New York, 2004 (chapter*Optimal Time–Space Tradeoff in Probabilistic Inference*). - S. Andreassen, F.V. Jensen, S.K. Andersen, B. Falck, U. Kjærulff, M. Woldbye, A.R. Sorensen, A. Rosenfalck, and F. Jensen. MUNIN—an expert EMG assistant. In J.E. Desmedt, editor.
*Computer-Aided Electromyography and Expert Systems*. Elsevier Science Publishers, Amsterdam, 1989 (Chapter 21). - S. Andreassen, M. Suojanen, B. Falck, and K.G. Olesen. Improving the diagnostic performance of MUNIN by remodelling of the diseases. In
*Proceedings of the 8th Conference on AI in Medicine in Europe*, pages 167–176. SpringerVerlag, 2001. - S. Andreassen, M. Woldbye, B. Falck, and S.K. Andersen. Munin—a causal probabilistic network for interpretation of electromyographic findings. In J. McDermott, editor.
*Proceedings of the 10th International Joint Conference on Artificial Intelligence (IJCAI-87)*, pages 366–372. Morgan Kaufmann Publishers, 1987. - S.A. Arnborg. Efficient algorithms for combinatorial problems on graphs with bounded decomposability—a survey.
*BIT*, 25:2–23, 1985. - S. Arnborg, D.G. Corneil, and A. Proskurowski. Complexity of finding embeddings in a k-tree.
*SIAM J. Algebraic and Discrete Methods*, 8:277–284, 1987. - F. Bacchus, S. Dalmao, and T. Pitassi. Value elimination: Bayesian inference via backtracking search. In
*Proceedings of the 19th Annual Conference on Uncertainty in Artificial Intelligence (UAI-03)*, pages 20–28. Morgan Kaufmann Publishers, San Francisco, CA, 2003. - A. Becker, R. Bar-Yehuda, and D. Geiger. Random algorithms for the loop cutset problem. In
*Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI)*, 1999. - J. Bilmes and C. Bartels. Triangulating dynamic graphical models. In
*Uncertainty in Artificial Intelligence: Proceedings of the Nineteenth Conference*, pages 47–56, 2003. - J. Binder, K. Murphy, and S. Russell. Space-efficient inference in dynamic probabilistic networks. In
*Proc. International Joint Conference on Artificial Intelligence (IJCAI)*, 1997. - C. Bishop. Neural Networks for Pattern Recognition. Oxford University Press, Oxford, 1998.
- C. Boutilier, N. Friedman, M. Goldszmidt, and D. Koller. Context-specific independence in Bayesian networks. In
*Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 115–123, 1996. - H. Bozdogan. Model selection and Akaike’s Information Criterion (AIC): the general theory and its analytical extensions.
*Psychometrica*, 52:345–370, 1987. - J.S. Breese. Construction of belief and decision networks.
*Computational Intelligence*, 8(4):624–647, 1992. - E. Castillo, J.M. Gutiérrez, and A.S. Hadi. Sensitivity analysis in discrete Bayesian networks.
*IEEE Transactions on Systems, Man, and Cybernetics*, 27:412–423, 1997. - H. Chan and A. Darwiche. When do numbers really matter?
*Journal of Artificial Intelligence Research*, 17:265–287, 2002. - H. Chan and A. Darwiche. Sensitivity analysis in Bayesian networks: From single to multiple parameters. In
*Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 67–75. AUAI Press, Arlington, VA, 2004. - H. Chan and A. Darwiche. A distance measure for bounding probabilistic belief change.
*International Journal of Approximate Reasoning*, 38:149–174, 2005. - H. Chan and A. Darwiche. On the revision of probabilistic beliefs using uncertain evidence.
*Artificial Intelligence*, 163:67–90, 2005. - M.R. Chavez and G.F. Cooper. A randomized approximation algorithm for probabilistic inference on Bayesian belief networks.
*Networks*, 20(5):661–685, 1990. - M. Chavira, D. Allen, and A. Darwiche. Exploiting evidence in probabilistic inference. In
*Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 112–119, 2005. - M. Chavira and A. Darwiche. Compiling Bayesian networks with local structure. In
*Proceedings of the 19th International Joint Conference on Artificial Intelligence (IJCAI)*, pages 1306–1312, 2005. - M. Chavira and A. Darwiche. Compiling Bayesian networks using variable elimination. In
*Proceedings of the 20th International Joint Conference on Artificial Intelligence (IJCAI)*, pages 2443–2449, 2007. - M. Chavira, A. Darwiche, and M. Jaeger. Compiling relational Bayesian networks for exact inference.
*International Journal of Approximate Reasoning*, 42(1–2):4–20, May 2006. - J. Cheng and M.J. Druzdzel. BN-AIS: An adaptive importance sampling algorithm for evidential reasoning in large Bayesian networks.
*Journal of Artificial Intelligence Research*, 13:155–188, 2000. - D.M. Chickering. Optimal structure identification with greedy search.
*JMLR*, 2002. - D.M. Chickering and D. Heckerman. Large-sample learning of Bayesian networks is NP-hard.
*JMLR*, 2004. - C.K. Chow and C.N. Lui. Approximating discrete probability distributions with dependence trees.
*IEEE Transactions on Information Theory*, 14(3):462–467, 1968. - F.G. Cooper. The computational complexity of probabilistic inference using Bayesian belief networks.
*Artificial Intelligence*, 42(2–3):393–405, 1990. - G. Cooper and E. Herskovits. A Bayesian method for the induction of probabilistic networks from data.
*MLJ*, 9:309–347, 1992. - R. Cowell, A. Dawid, S. Lauritzen, and D. Spiegelhalter.
*Probabilistic Networks and Expert Systems*. Springer, 1999. - T. Verma, D. Geiger, and J. Pearl. d-separation: from theorems to algorithms. In
*Proceedings of the Sixth Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 139–148, 1990. - P. Dagum and M. Luby. Approximating probabilistic inference in Bayesian belief networks is NP-hard.
*Artificial Intelligence*, 60(1):141–153, 1993. - A. Darwiche. Conditioning algorithms for exact and approximate inference in causal networks. In
*Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 99–107, 1995. - A. Darwiche. Any-space probabilistic inference. In
*Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 133–142, 2000. - A. Darwiche. Recursive conditioning.
*Artificial Intelligence*, 126(1–2):5–41, 2001. - A. Darwiche. A logical approach to factoring belief networks. In
*Proceedings of KR*, pages 409–420, 2002. - A. Darwiche. A differential approach to inference in Bayesian networks.
*Journal of the ACM*, 50(3):280–305, 2003. - A. Darwiche and M. Hopkins. Using recursive decomposition to construct elimination orders, jointrees and dtrees. In
*Trends in Artificial Intelligence*,*Lecture Notes in Artificial Intelligence*, vol. 2143, pages 180–191. Springer-Verlag, 2001. - R. de Salvo Braz, E. Amir, and D. Roth. Lifted first-order probabilistic inference. In
*Proceedings of the Nineteenth Int. Joint Conf. on Artificial Intelligence (IJCAI-05)*, pages 1319–1325, 2005. - T. Dean and K. Kanazawa. A model for reasoning about persistence and causation.
*Computational Intelligence*, 5(3):142–150, 1989. - R. Dechter. Bucket elimination: A unifying framework for reasoning.
*Artificial Intelligence*, 113:41–85, 1999. - R. Dechter and R. Mateescu. Mixtures of deterministic-probabilistic networks and their and/or search space. In
*Proceedings of the Twentieth Conference on Uncertainty in Artificial Intelligence (UAI’04)*, pages 120–129, 2004. - R. Dechter and Y. El Fattah. Topological parameters for time-space tradeoff.
*Artificial Intelligence*, 125(1–2):93–118, 2001. - R. Dechter, K. Kask, and R. Mateescu. Iterative join-graph propagation. In
*Proceedings of the Conference on Uncertainty in Artificial Intelligence*, pages 128– 136, 2002. - R. Dechter and D. Larkin. Hybrid processing of beliefs and constraints. In
*Uncertainty in Artificial Intelligence: Proceedings of the Seventeenth Conference (UAI-2001)*, pages 112–119. Morgan Kaufmann Publishers, San Francisco, CA, 2001. - R. Dechter and D. Larkin. Bayesian inference in the presence of determinism. In C.M. Bishop and B.J. Frey, editors, In
*Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics*, Key West, FL, 2003. - F.J. Díez. Parameter adjustment in Bayesian networks: the generalized noisyor gate. In
*Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence (UAI)*, 1993. - F.J. Díez. Local conditioning in Bayesian networks.
*Artificial Intelligence*, 87(1):1–20, 1996. - F.J. Díez and S.F. Galán. An efficient factorization for the noisy MAX.
*International Journal of Intelligent Systems*, 18:165–177, 2003. - M. Fishelson and D. Geiger. Exact genetic linkage computations for general pedigrees.
*Bioinformatics*, 18(1):189–198, 2002. - M. Fishelson and D. Geiger. Optimizing exact genetic linkage computations. In
*RECOMB’03*, 2003. - B.J. Frey and D.J.C. MacKay. A revolution: Belief propagation in graphs with cycles. In
*NIPS*, pages 479–485, 1997. - B.J. Frey, R. Patrascu, T. Jaakkola, and J. Moran. Sequentially fitting “inclusive” trees for inference in noisy-or networks. In
*NIPS*, pages 493–499, 2000. - N. Friedman. The Bayesian structural EM algorithm. In
*Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI)*, 1998. - N. Friedman, L. Getoor, D. Koller, and A. Pfeffer. Learning probabilistic relational models. In
*Proceedings of the 16th International Joint Conference on Artificial Intelligence (IJCAI-99)*, 1999. - N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In
*Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI)*, 1996. - N. Friedman and M. Goldszmidt. Learning Bayesian networks with local structure. In
*Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 252–262, 1996. - N. Friedman and Z. Yakhini. On the sample complexity of learning Bayesian networks. In
*Proceedings of the 12th Conference on Uncertainty in Artificial Intelligence (UAI)*, 1996. - R. Fung and K.-C. Chang. Weighing and integrating evidence for stochastic simulation in Bayesian networks. In M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, editors.
*Uncertainty in Artificial Intelligence*, vol. 5, pages 209– 219. Elsevier Science Publishing Company, Inc., New York, NY, 1989. - D. Geiger, T. Verma, and J. Pearl. Identifying independence in Bayesian networks.
*Networks*:507–534, 1990. - D. Geiger and C. Meek. Structured variational inference procedures and their realizations. In
*Proceedings of Tenth International Workshop on Artificial Intelligence and Statistics*. The Society for Artificial Intelligence and Statistics, The Barbados, January 2005. - L. Getoor, N. Friedman, D. Koller, and B. Taskar. Learning probabilistic models of relational structure. In
*Proceedings of the 18th International Conference on Machine Learning*, pages 170–177, 2001. - Z. Ghahramani and M.I. Jordan. Factorial hidden Markov models.
*Machine Learning*, 29(2–3):245–273, 1997. - C. Glymour, R. Scheines, P. Spirtes, and K. Kelly.
*Discovering Causal Structure*. Academic Press, Inc., London, 1987. - R.P. Goldman and E. Charniak. Dynamic construction of belief networks. In P.P. Bonissone, M. Henrion, L.N. Kanal, and J.F. Lemmer, editors,
*Uncertainty in Artificial Intelligence*, vol. 6, pages 171–184, Elsevier Science, 1991. - Y. Guo and R. Greiner. Discriminative model selection for belief net structures. In
*Twentieth National Conference on Artificial Intelligence (AAAI-05)*, pages 770–776, Pittsburgh, July 2005. - P. Haddawy. Generating Bayesian networks from probability logic knowledge bases. In
*Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI-94)*, pages 262–269, 1994. - M. Haft, R. Hofmann, and V. Tresp. Model-independent mean-field theory as a local method for approximate propagation of information.
*Network: Computation in Neural Systems*, 10:93–105, 1999. - D. Heckerman. Causal independence for knowledge acquisition and inference. In D. Heckerman and A. Mamdani, editors,
*Proc. of the Ninth Conf. on Uncertainty in AI*, pages 122–127, 1993. - D. Heckerman, D. Geiger, and D. Chickering. Learning Bayesian networks: The combination of knowledge and statistical data.
*Machine Learning*, 20:197–243, 1995. - D.E. Heckerman. A tutorial on learning with Bayesian networks. In M.I. Jordan, editor.
*Learning in Graphical Models*. MIT Press, 1998. - M. Henrion. Propagating uncertainty in Bayesian networks by probabilistic logic sampling. In
*Uncertainty in Artificial Intelligence*, vol. 2, pages 149–163. Elsevier Science Publishing Company, Inc., New York, NY, 1988. - M. Henrion. Some practical issues in constructing belief networks. In L.N. Kanal, T.S. Levitt, and J.F. Lemmer, editors.
*Uncertainty in Artificial Intelligence*, vol. 3, pages 161–173. Elsevier Science Publishers B.V., North-Holland, 1989. - T. Heskes. Stable fixed points of loopy belief propagation are local minima of the Bethe free energy. In
*NIPS*, pages 343–350, 2002. - T. Heskes. On the uniqueness of loopy belief propagation fixed points.
*Neural Computation*, 16(11):2379–2413, 2004. - J. Hoey, R. St-Aubin, A. Hu, and G. Boutilier. SPUDD: Stochastic planning using decision diagrams. In
*Proceedings of the 15th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 279–288, 1999. - E.J. Horvitz, H.J. Suermondt, and G.F. Cooper. Bounded conditioning: Flexible inference for decisions under scarce resources. In
*Proceedings of Conference on Uncertainty in Artificial Intelligence, Windsor, ON*, pages 182–193. Association for Uncertainty in Artificial Intelligence, Mountain View, CA, August 1989. - J. Hastad. Tensor rank is NP-complete.
*Journal of Algorithms*, 11:644–654, 1990. - C. Huang and A. Darwiche. Inference in belief networks: A procedural guide.
*International Journal of Approximate Reasoning*, 15(3):225–263, 1996. - I. Inza, P. Larranaga, J. Lozano, and J. Pena.
*Machine Learning Journal*, 59, June 2005 (Special Issue: Probabilistic Graphical Models for Classification). - T. Jaakkola.
*Advanced Mean Field Methods—Theory and Practice*. MIT Press, 2000 (chapter*Tutorial on Variational Approximation Methods*). - M. Jaeger. Relational Bayesian networks. In D. Geiger and P.P. Shenoy, editors.
*Proceedings of the 13th Conference of Uncertainty in Artificial Intelligence (UAI-13)*, pages 266–273. Morgan Kaufmann, Providence, USA, 1997. - M. Jaeger. On the complexity of inference about probabilistic relational models.
*Artificial Intelligence*, 117:297–308, 2000. - R. Jeffrey.
*The Logic of Decision*. McGraw-Hill, New York, 1965. - F.V. Jensen, S.L. Lauritzen, and K.G. Olesen. Bayesian updating in recursive graphical models by local computation.
*Computational Statistics Quarterly*, 4:269–282, 1990. - F.V. Jensen. Gradient descent training of Bayesian networks. In
*Proceedings of the Fifth European Conference on Symbolic and Quantitative Approaches to Reasoning with Uncertainty (ECSQARU)*, pages 5–9, 1999. - F.V. Jensen.
*Bayesian Networks and Decision Graphs*. Springer-Verlag, 2001. - M.I. Jordan, Z. Ghahramani, T. Jaakkola, and L.K. Saul. An introduction to variational methods for graphical models.
*Machine Learning*, 37(2):183–233, 1999. - K. Kanazawa, D. Koller, and S.J. Russell. Stochastic simulation algorithms for dynamic probabilistic networks. In
*Uncertainty in Artificial Intelligence: Proceedings of the Eleventh Conference*, pages 346–351, 1995. - H.J. Kappen and W. Wiegerinck. Novel iteration schemes for the cluster variation method. In
*NIPS*, pages 415–422, 2001. - K. Kask and R. Dechter. A general scheme for automatic generation of search heuristics from specification dependencies.
*Artificial Intelligence*, 129:91–131, 2001. - K. Kersting and L. De Raedt. Towards combining inductive logic programming and Bayesian networks. In
*Proceedings of the Eleventh International Conference on Inductive Logic Programming (ILP-2001)*,*Springer Lecture Notes in AI*, vol. 2157. Springer, 2001. - U. Kjaerulff. A computational scheme for reasoning in dynamic probabilistic networks. In
*Uncertainty in Artificial Intelligence: Proceedings of the Eight Conference*, pages 121–129, 1992. - U. Kjaerulff and L.C. van der Gaag. Making sensitivity analysis computationally efficient. In
*Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI)*, 2000. - D. Koller and A. Pfeffer. Object-oriented Bayesian networks. In
*Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–97)*, pages 302–313. Morgan Kaufmann Publishers, San Francisco, CA, 1997. - S. Kullback and R.A. Leibler. On information and sufficiency.
*Annals of Mathematical Statistics*, 22:79–86, 1951. - W. Lam and F. Bacchus. Learning Bayesian belief networks: An approach based on the MDL principle.
*Computation Intelligence*, 10(4):269–293, 1994. - K.B. Laskey. Sensitivity analysis for probability assessments in Bayesian networks.
*IEEE Transactions on Systems, Man, and Cybernetics*, 25:901–909, 1995. - K.B. Laskey and S.M. Mahoney. Network fragments: Representing knowledge for constructing probabilistic models. In
*Proceedings of the Thirteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI–97)*, pages 334–341. San Morgan Kaufmann Publishers, Francisco, CA, 1997. - S.L. Lauritzen. The EM algorithm for graphical association models with missing data.
*Computational Statistics and Data Analysis*, 19:191–201, 1995. - S.L. Lauritzen and D.J. Spiegelhalter. Local computations with probabilities on graphical structures and their application to expert systems.
*Journal of Royal Statistics Society, Series B*, 50(2):157–224, 1988. - V. Lepar and P.P. Shenoy. A comparison of Lauritzen–Spiegelhalter, Hugin, and Shenoy–Shafer architectures for computing marginals of probability distributions. In
*Proceedings of the Fourteenth Annual Conference on Uncertainty in Artificial Intelligence (UAI-98)*, pages 328–337. Morgan Kaufmann Publishers, San Francisco, CA, 1998. - Y. Lin and M. Druzdzel. Computational advantages of relevance reasoning in Bayesian belief networks. In
*Proceedings of the 13th Annual Conference on Uncertainty in Artificial Intelligence (UAI-97)*, pages 342–350, 1997. - J.A. Little and D.B. Rubin.
*Statistical Analysis with Missing Data*. Wiley, New York, 1987. - D. Maier.
*The Theory of Relational Databases*. Computer Science Press, Rockville, MD, 1983. - R. Marinescu and R. Dechter. And/or branch-and-bound for graphical models. In
*Proceedings of International Joint Conference on Artificial Intelligence (IJCAI)*, 2005. - R.J. McEliece and M. Yildirim. Belief propagation on partially ordered sets. In J. Rosenthal and D.S. Gilliam, editors,
*Mathematical Systems Theory in Biology, Communications, Computation and Finance*. - G.J. McLachlan and T. Krishnan.
*The EM Algorithm and Extensions*.*Wiley Series in Probability and Statistics. Applied Probability and Statistics*. Wiley, New York, 1997. - R.A. Miller, F.E. Fasarie, and J.D. Myers. Quick medical reference (QMR) for diagnostic assistance.
*Medical Computing*, 3:34–48, 1986. - T.P. Minka and Y.(A.) Qi. Tree-structured approximations by expectation propagation. In
*Proceedings of the Annual Conference on Neural Information Processing Systems*, 2003. - T.M. Mitchell.
*Machine Learning*. McGraw-Hill, 1997. - S. Muggleton. Stochastic logic programs. In L. de Raedt, editor.
*Advances in Inductive Logic Programming*, pages 254–264. IOS Press, 1996. - K.P. Murphy, Y. Weiss, and M.I. Jordan. Loopy belief propagation for approximate inference: An empirical study. In
*Proceedings of the Conference on Uncertainty in Artificial Intelligence*, pages 467–475, 1999. - L. Ngo and P. Haddawy. Answering queries from context-sensitive probabilistic knowledge bases.
*Theoretical Computer Science*, 171:147–177, 1997. - A. Nicholson and J.M. Brady. The data association problem when monitoring robot vehicles using dynamic belief networks. In
*10th European Conference on Artificial Intelligence Proceedings*, pages 689–693, 1992. - T. Nielsen, P. Wuillemin, F. Jensen, and U. Kjaerulff. Using ROBDDs for inference in Bayesian networks with troubleshooting as an example. In
*Proceedings of the 16th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 426–435, 2000. - J.D. Park and A. Darwiche. Solving MAP exactly using systematic search. In
*Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI–03)*, pages 459–468. Morgan Kaufmann Publishers, San Francisco, CA, 2003. - J. Park and A. Darwiche. Morphing the Hugin and Shenoy–Shafer architectures. In
*Trends in Artificial Intelligence*,*Lecture Notes in AI*, vol. 2711, pages 149– 160. Springer-Verlag, 2003. - J. Park and A. Darwiche. Complexity results and approximation strategies for map explanations.
*Journal of Artificial Intelligence Research*, 21:101–133, 2004. - J. Park and A. Darwiche. A differential semantics for jointree algorithms.
*Artificial Intelligence*, 156:197–216, 2004. - R.C. Parker and R.A. Miller. Using causal knowledge to create simulated patient cases: The CPCS project as an extension of Internist-1. In
*Proceedings of the Eleventh Annual Symposium on Computer Applications in Medical Care*, pages 473–480. IEEE Comp. Soc. Press, 1987. - H. Pasula and S. Russell. Approximate inference for first-order probabilistic languages. In
*Proceedings of IJCAI-01*, pages 741–748, 2001. - J. Pearl.
*Causality: Models, Reasoning, and Inference*. Cambridge University Press, New York, 2000. - J. Pearl. Evidential reasoning using stochastic simulation of causal models.
*Artificial Intelligence*, 32(2):245–257, 1987. - J. Pearl. Fusion, propagation and structuring in belief networks.
*Artificial Intelligence*, 29(3):241–288, 1986. - J. Pearl.
*Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference*. Morgan Kaufmann Publishers, Inc., San Mateo, CA, 1988. - J. Pearl.
*Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference*. Morgan Kaufmann, San Mateo, CA, 1988. - D. Poole. Probabilistic horn abduction and Bayesian networks.
*Artificial Intelligence*, 64:81–129, 1993. - D. Poole. First-order probabilistic inference. In
*Proceedings of International Joint Conference on Artificial Intelligence (IJCAI)*, 2003. - D. Poole and N.L. Zhang. Exploiting contextual independence in probabilistic inference.
*Journal of Artificial Intelligence*, 18:263–313, 2003. - D. Poole. The independent choice logic for modelling multiple agents under uncertainty.
*Artificial Intelligence*, 94(1–2):7–56, 1997. - D. Poole. Context-specific approximation in probabilistic inference. In
*Proceedings of the 14th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 447–454, 1998. - M. Pradhan, G. Provan, B. Middleton, and M. Henrion. Knowledge engineering for large belief networks. In
*Uncertainty in Artificial Intelligence: Proceedings of the Tenth Conference (UAI-94)*, pages 484–490. Morgan Kaufmann Publishers, San Francisco, CA, 1994. - A. Krogh, R. Durbin, S. Eddy, and G. Mitchison.
*Biological Sequence Analysis: Probabilistic Models of Proteins and Nucleic Acids*. Cambridge University Press, 1998. - R.I. Bahar, E.A. Frohm, C.M. Gaona, G.D. Hachtel, E. Macii, A. Pardo, and F. Somenzi. Algebraic decision diagrams and their applications. In
*IEEE /ACM International Conference on CAD*, pages 188–191. IEEE Computer Society Press, Santa Clara, CA, 1993. - M. Richardson and P. Domingos. Markov logic networks.
*Machine Learning*, 62(1–2):107–136, 2006. - B. Ripley.
*Pattern Recognition and Neural Networks*. Cambridge University Press, Cambridge, UK, 1996. - J. Rissanen.
*Stochastic Complexity in Statistical Inquiry*. World Scientific, 1989. - S.L. Lauritzen, D.J. Spiegelhalter, R.G. Cowell, and A.P. Dawid.
*Probabilistic Networks and Expert Systems*. Springer, 1999. - N. Robertson and P.D. Seymour. Graph minors II: Algorithmic aspects of treewidth.
*J. Algorithms*, 7:309–322, 1986. - D. Roth.
*On the hardness of approximate reasoning*.*Artificial Intelligence*, 82(1–2):273–302, April 1996. - S. Russell, J. Binder, D. Koller, and K. Kanazawa. Local learning in probabilistic networks with hidden variables. In
*Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 1146–1152, 1995. - T. Sang, P. Beame, and H. Kautz. Solving Bayesian networks by weighted model counting. In
*Proceedings of the Twentieth National Conference on Artificial Intelligence (AAAI-05)*, vol. 1, pages 475–482. AAAI Press, 2005. - T. Sato. A statistical learning method for logic programs with distribution semantics. In
*Proceedings of the 12th International Conference on Logic Programming (ICLP’95)*, pages 715–729, 1995. - L.K. Saul and M.I. Jordan. Exploiting tractable substructures in intractable networks. In
*NIPS*, pages 486–492, 1995. - P. Savicky and J. Vomlel. Tensor rank-one decomposition of probability tables. In
*Proceedings of the Eleventh Conference on Information Processing and Management of Uncertainty in Knowledge-based Systems (IPMU)*, pages 2292– 2299, 2006. - G. Schwartz. Estimating the dimension of a model.
*Annals of Statistics*, 6:461– 464, 1978. - R. Shachter, S.K. Andersen, and P. Szolovits. Global Conditioning for Probabilistic Inference in Belief Networks. In
*Proc. Tenth Conference on Uncertainty in AI*, pages 514–522, Seattle WA, 1994. - R. Shachter, B.D. D’Ambrosio, and B. del Favero. Symbolic probabilistic inference in belief networks. In
*Proc. Conf. on Uncertainty in AI*, pages 126–131, 1990. - R.D. Shachter and M.A. Peot. Simulation approaches to general probabilistic inference on belief networks. In M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, editors.
*Uncertainty in Artificial Intelligence*, vol. 5, pages 221– 231. Elsevier Science Publishing Company, Inc., New York, NY, 1989. - R. Shachter. Evaluating influence diagrams.
*Operations Research*, 34(6):871– 882, 1986. - R. Shachter. Evidence absorption and propagation through evidence reversals. In M. Henrion, R.D. Shachter, L.N. Kanal, and J.F. Lemmer, editors,
*Uncertainty in Artificial Intelligence*, vol. 5, pages 173–189, Elsvier Science, 1990. - P.P. Shenoy and G. Shafer. Propagating belief functions with local computations.
*IEEE Expert*, 1(3):43–52, 1986. - S.E. Shimony. Finding MAPs for belief networks is NP-hard.
*Artificial Intelligence*, 68:399–410, 1994. - M. Shwe, B. Middleton, D. Heckerman, M. Henrion, E. Horvitz, H. Lehmann, and G. Cooper. Probabilistic diagnosis using a reformulation of the INTERNIST-1/QMR knowledge base I. The probabilistic model and inference algorithms.
*Methods of Information in Medicine*, 30:241–255, 1991. - P. Smyth, D. Heckerman, and M.I. Jordan. Probabilistic independence networks for hidden Markov probability models.
*Neural Computation*, 9(2):227–269, 1997. - S. Srinivas. A generalization of the noisy-or model. In
*Proceedings of the Ninth Conference on Uncertainty in Artificial Intelligence (UAI)*, 1993. - H.J. Suermondt, G.F. Cooper, and D.E. Heckerman. A combination of cutset conditioning with clique-tree propagation in the Pathfinder system. In
*Proceedings of the 6th Annual Conference on Uncertainty in Artificial Intelligence (UAI-91)*, pages 245–253. Elsvier Science, New York, NY, 1991. - J. Suzuki. Learning Bayesian belief networks based on the MDL principle: An efficient algorithm using the branch and bound technique.
*Annals of Statistics*, 6, 1978. - M.F. Tappen and W.T. Freeman. Comparison of graph cuts with belief propagation for stereo, using identical MRF parameters. In
*ICCV*, pages 900–907, 2003. - J. Tian. A branch-and-bound algorithm for MDL learning Bayesian networks. In C. Boutilier and M. Goldszmidt, editors,
*Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence*, Stanford, CA, pages 580–588, 2000. - T. Verma and J. Pearl. Causal networks: Semantics and expressiveness. In
*Proceedings of the 4th Workshop on Uncertainty in AI*, pages 352–359, Minneapolis, MN, 1988. - J. Vomlel. Exploiting functional dependence in Bayesian network inference. In
*Proceedings of the Eighteenth Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 528–535. Morgan Kaufmann Publishers, 2002. - M.J. Wainwright, T. Jaakkola, and A.S. Willsky. Tree-based reparameterization for approximate inference on loopy graphs. In
*Proceedings of the Annual Conference on Neural Information Processing Systems*, pages 1001–1008, 2001. - M. Welling. On the choice of regions for generalized belief propagation. In
*Proceedings of the Conference on Uncertainty in Artificial Intelligence*, page 585. AUAI Press, Arlington, VA, 2004. - M. Welling, T.P. Minka, and Y.W. Teh. Structured region graphs: morphing EP into GBP. In
*Proceedings of the Conference on Uncertainty in Artificial Intelligence*, 2005. - M. Welling and Y.W. Teh. Belief optimization for binary networks: A stable alternative to loopy belief propagation. In
*Proceedings of the Conference on Uncertainty in Artificial Intelligence*, pages 554–561, 2001. - M.P. Wellman, J.S. Breese, and R.P. Goldman. From knowledge bases to decision models.
*The Knowledge Engineering Review*, 7(1):35–53, 1992. - W. Wiegerinck. Variational approximations between mean field theory and the junction tree algorithm. In
*UAI*, pages 626–633, 2000. - W. Wiegerinck and T. Heskes. Fractional belief propagation. In
*NIPS*, pages 438–445, 2002. - E.P. Xing, M.I. Jordan, and S.J. Russell. A generalized mean field algorithm for variational inference in exponential families. In
*UAI*, pages 583–591, 2003. - J. Yedidia, W. Freeman, and Y. Weiss. Constructing free-energy approximations and generalized belief propagation algorithms.
*IEEE Transactions on Information Theory*, 51(7):2282–2312, 2005. - J.S. Yedidia, W.T. Freeman, and Y. Weiss. Understanding belief propagation and its generalizations. Technical Report TR-2001-022, MERL, 2001. Available online at http://www.merl.com/publications/TR2001-022/.
- J. York. Use of the Gibbs sampler in expert systems.
*Artificial Intelligence*, 56(1):115–130, 1992, http://dx.doi.org/10.1016/0004-3702(92)90066-7. - C. Yuan and M.J. Druzdzel. An importance sampling algorithm based on evidence pre-propagation. In
*Proceedings of the 19th Conference on Uncertainty in Artificial Intelligence (UAI-03)*, pages 624–631. Morgan Kaufmann Publishers, San Francisco, CA, 2003. - A.L. Yuille. Cccp algorithms to minimize the Bethe and Kikuchi free energies: Convergent alternatives to belief propagation.
*Neural Computation*, 14(7):1691–1722, 2002. - N.L. Zhang and D. Poole. A simple approach to Bayesian network computations. In
*Proceedings of the Tenth Conference on Uncertainty in Artificial Intelligence (UAI)*, pages 171–178, 1994. - N.L. Zhang and D. Poole. Exploiting causal independence in Bayesian network inference.
*Journal of Artificial Intelligence Research*, 5:301–328, 1996.

Several figures in this chapter are from Modeling and Reasoning with Bayesian Networks, published by Cambridge University Press, copyright Adnan Darwiche 2008, reprinted with permission.

This page intentionally left blank