A Bayesian Approach to Pathway Analysis by Integrating Gene–Gene Functional Directions and Microarray Data

Yifang Zhao; Ming-Hui Chen; Baikang Pei; David Rowe; Dong-Guk Shin; Wangang Xie; Fang Yu; Lynn Kuo

doi:10.1007/s12561-011-9046-1

Author manuscript; available in PMC: 2013 Mar 9.

Published in final edited form as: Stat Biosci. 2011 Dec 29;4(1):105–131. doi: 10.1007/s12561-011-9046-1

PMCID: PMC3592971 NIHMSID: NIHMS392341 PMID: 23482678

Abstract

Many statistical methods have been developed to screen for differentially expressed genes associated with specific phenotypes in microarray data. However, it remains a major challenge to synthesize the observed expression patterns with abundant biological knowledge for a more complete understanding of the biological functions among genes. Various methods including clustering analysis on genes, neural networks, Bayesian networks and pathway analysis have been developed toward this goal. In most of these procedures, the activation and inhibition relationships among genes have hardly been utilized in the modeling steps. We propose two novel Bayesian models to integrate the microarray data with the putative pathway structures obtained from the KEGG database and the directional gene–gene interactions in the medical literature. We define the symmetric Kullback–Leibler divergence of a pathway, and use it to identify the pathway(s) most supported by the microarray data. A Markov chain Monte Carlo sampling algorithm is given for posterior computation in the hierarchical model. The proposed method is shown to select the most supported pathway in an illustrative example. Finally, we apply the methodology to a real microarray data set to understand the gene expression profile of the osteoblast lineage at defined stages of differentiation. We observe that our method correctly identifies the pathways that are reported to play essential roles in modulating bone mass.

Keywords: Bayesian belief network, Bayesian model selection, KEGG pathways, Microarray data, Prior construction, Symmetric Kullback–Leibler divergence

1 Introduction

Genome informatics was born to cope with the vast amount of data generated by the genomic studies, in particular, to support experimental projects. The challenges for post-genome informatics are on synthesis of biological knowledge from genomic information toward understanding of general principles of life. So post-genome informatics has to be coupled with systematic experiments in functional genomics. However, the coupling is in a different direction where informatics plays a dominant role in designing experiments and prediction.

High-throughput gene analysis technology such as cDNA microarray and oligonucleotide arrays has enabled parallel analysis of thousands of genes simultaneously. Numerous statistical methods have been developed to screen for differentially expressed genes, either up- or down-regulated, in these experiments. While these projects rapidly determine gene catalogs for an increasing number of organisms, functional annotation of individual genes is still largely incomplete. It would be essential to have knowledge on coregulated genes and their interactions. Consequently, various methods have been developed toward these goals. The methods include clustering analysis on genes, neural network, Bayesian network (BN), and pathway analysis. In this paper, we will focus on pathway and Bayesian network approaches.

There are multiple sources of knowledge on pathway and gene interaction. Kyoto Encyclopedia of Genes and Genomes (KEGG) database [17] was initiated by Japanese human genome program in 1995 to link genomic information with higher order functional information by computerizing current knowledge on cellular processes and by standardizing gene annotations. These databases are often called meta-data, which means data about data. KEGG consists of three databases: PATHWAY for representing higher order functions in terms of the network of interacting molecules, GENES for the collection of gene catalogs for all completely sequenced genomes and some partial genomes, and LIGAND for the collection of chemical compounds in the cell, enzyme molecules and enzymatic reactions. A pathway is a collection of graphical diagrams of interacting molecules obtained from many years of intensive biomedical research representing the present knowledge on various cellular or physiologic functions. It is supposed to be a computer representation of the biological system, so it can be used as part of the systems biology approach.

In addition to KEGG, such databases also include the gene ontology (GO) database (Gene Ontology Consortium, 2001) and BioCarta (www.biocarta.com). We focus on the KEGG pathway database because it contains the directional relationship (activation or inhibition) between genes, which is extremely useful in the systems biology approach. Moreover, it provides a rich set of possible structures on the gene-to-gene relationships. The KEGG pathway can be expanded to a general pathway database to include recently developed and published pathways.

Another valuable source of biological knowledge is the gene-to-gene activation or inhibition knowledge aggregated from past experiments or literature search. We deposit these directional relationships among genes in a database called PrimeDB. So from this database, we can search for evidence of gene-to-gene activation or inhibition measured by the number of journals reporting these interactions.

Our goal is to investigate gene to gene interactions by integrating the following three components: the structure information of putative pathways available from pathway networks, the gene relations uncovered by literature mining as in PrimeDB and the microarray gene expression data. The first two components can be obtained before the microarray experiments, so we consider them as prior information. We will describe how we could revise our prior opinion on pathway after seeing the microarray results using Bayesian methods. Moreover, we develop methods for ranking pathways in terms of their degree of agreement with the microarray data. Figure 1 provides a schematic summary of the database integration.

Fig. 1 — Flowgram of data sets integration

Current statistical methods on pathway activities are mostly in the area of gene set enrichment analysis (GSEA), where the set of regulated genes in a pathway is compared to the regulated gene set in the microarray study to determine whether the set is particularly enriched by the pathway. Curtis et al. [3] have provided a table of software, annotation, and statistical methods used in each software package. In general, Gene Ontology (GO), GenMapp, KEGG, and BioCarta have been used for the annotations. Efron and Tibshirani [4] and Newton et al. [23] have provided improved methods for gene enrichment analysis. However, all these methods are restricted to counting methods where the number of regulated genes is counted in each pathway. They have not incorporated the putative information on activation and inhibition relationships given in the KEGG database and PrimeDB.

There are also several Bayesian papers studying pathways or networks. Friedman et al. [7] propose an adaptive iterative search algorithm for an optimal Bayesian network (BN) in which the search is restricted to the most promising candidate parents of each gene based on some local statistics (such as correlation). Their learning algorithm uses neither prior biological knowledge nor constraints. Hartemink et al. [10] extend the BN by adding edge annotation, which allows representation of additional information about the dependence relationships among genes. Sachs et al. [24] outline modeling the cell signaling pathways using BN. Sebastiani et al. [25] show the application of BN to the analysis of various types of genomic data including genomic markers and gene expression data. They also introduce the Generalized Gamma Networks to depict the possibly nonlinear parent–child dependencies. Werhli and Husmeier [29] use Bayesian networks to reconstruct gene regulatory networks by integrating microarray data and multiple sources of prior knowledge such as KEGG pathways and promoter motifs. The prior probability of a network is modeled by a Gibbs distribution, in which each source is encoded by a separate energy function. Shen and West [26] develop probability pathway annotation (PROPA) to match outcomes in gene expression to multiple biological pathway gene sets from curated databases. Monni and Li [22] show how to utilize the prior genetic pathway and network information in the analysis of genomic data in order to obtain a more interpretable list of genes that are associated with the genotypes. Ellis and Wong [5] have examined computational algorithms for determining BN structures from experimental data. However, as far as we know, the activation and inhibition relationships among genes have hardly been utilized in the modeling steps of these procedures, except by Ellis and Wong.

In this paper, we consider the pathways given in KEGG as possible models. Each pathway is a weighted graph that includes the activation and inhibition binary relationships. Given our microarray data, we are interested in knowing which pathway, or set of pathways, agrees best with our microarray experiment results. In the framework of model selection, we are essentially asking which pathway is most supported by the microarray data. Instead of the single best one, we can also select a few pathways that are most supported by the data.

Our approach considers the results of microarray analysis as data. So we start with the selected regulated (significant) gene list from microarray analysis. The outcome for each selected gene is modeled as a discrete random variable taking values of 1 and 0, representing up-regulated and down-regulated, respectively. Then we consider a set of putative pathways (say 80, for example) from KEGG that needs to be studied. We first modify them slightly so that each pathway can be considered as a BN with a directed acyclic graph. Then for each pathway structure in the set, we write down the local prior distribution for each node in the pathway. The hyperparameters in the prior represent not only the propensities for a gene to have an activation or an inhibition effect on other genes, but also the strength of this prior belief. They may have a big influence on the ranking of pathways, so their formulation is guided by PrimeDB, which is being constructed by the informatics group among the authors. We propose two possible solutions to the choice of hyperparameters: (A) Use the prior information obtained from PrimeDB to formulate good estimates of these hyperparameters, and then plug in these estimates. (B) Treat these hyperparameters as random variables to build another layer of a hierarchical model, sensibly guided by PrimeDB, to achieve more robust results. We will develop both methods and examine their effects. We update these local distributions from KEGG and PrimeDB information, conditioning on the data, using Bayes' theorem. Then we rank the pathways by using the symmetric Kullback–Leibler divergence [19]. The most likely pathway has the smallest symmetric Kullback–Leibler divergence between the prior and the posterior distributions. We also extend our methodology to include all the genes in the microarray as data, as described in Sect. 5.

The rest of this paper is organized as follows: Sect. 2 describes the simple Bayesian model, which uses PrimeDB to specify the prior directly, and defines the symmetric Kullback–Leibler divergence to measure pathway activities; Sect. 3 extends this divergence measure to the multilevel model, in which the second-level prior governs our prior belief on the activation or inhibition effect aggregated by PrimeDB, and proposes a Markov chain Monte Carlo (MCMC) algorithm for computing posterior estimates; Sect. 4 demonstrates our method with a simple example; and Sect. 5 evaluates model performance through an osteoblast microarray study. We conclude the paper with a brief discussion in Sect. 6.

2 Simple Bayesian Model Using PrimeDB Directly

The easiest way to represent a network is to use a graph, which is a collection of vertices (nodes) and edges that connect vertices. The vertices can be genes, transcription factors, proteins, ligands, etc. All vertices need not be connected in a graph. A directed graph has one-way edges (arcs) that can represent an irreversible molecular reaction. In a weighted graph, weights (costs) are assigned to the edges, for example to distinguish between activation and inhibition in a signal transduction pathway.

We will first treat a pathway as a BN. A BN comprises two components: a directed acyclic graph (DAG) and a probability distribution. A DAG is a directed graph with no path that starts and ends at the same node. The nodes in the DAG depict stochastic variables, so a node can represent the outcome of a gene after a microarray experiment. The arcs in the DAG display directed dependencies among variables that are quantified by conditional probability distributions. Lack of an arc between two nodes indicates conditional independence. Heckerman [11] provides an excellent tutorial on the BN. We highlight the key points here.

Let Y = (Y₁, . . . , Y_Q) denote the outcomes of Q nodes in a BN. A BN consists of:

a network structure S that encodes a set of conditional independence assertions about the variables in Y;
a set of local probability distributions associated with each node.

In our application of BN to pathways, S comprises not only the set of conditional independence assertions among genes, but also the activation and inhibition effects among them. Note that in the model derivation, we focus on pathways in which each gene has a single parent, either an activator or an inhibitor. We also restrict our attention to a binary BN, where each variable y_i takes on only two values, with y_i = 1 and y_i = 0 representing gene i being up- or down-regulated, respectively. In Sect. 5, we demonstrate that our model is readily extended to pathways that contain equivalently expressed genes and/or genes with multiple parents.

2.1 Notations and Transition Probabilities

We classify a gene in a pathway into three categories: (1) with no parents, (2) with an activator parent, and (3) with an inhibitor parent. In particular, we have the following notations for the three categories:

$\bar{Pa} = \{ i : i \in \{1, 2, \dots, Q\} \text{ such that } g_{i} \text{ has no parents} \}$ : the index set of genes without parents.
$A = \{ i : i \in \{1, 2, \dots, Q\} \setminus \bar{Pa} \text{ and } g_{i} \text{ is activated by its parent} \}$ : the index set of genes with activator parents in a pathway of Q genes.
$I = \{ i : i \in \{1, 2, \dots, Q\} \setminus \bar{Pa} \text{ and } g_{i} \text{ is inhibited by its parent} \}$ : the index set of genes with inhibitor parents in a pathway of Q genes.
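
As a concrete illustration of this partition, the sketch below classifies the genes of a small made-up single-parent pathway. The gene names, the layout of the `edges` dictionary, and the function name `partition_genes` are illustrative assumptions and are not taken from KEGG or PrimeDB.

```python
def partition_genes(genes, edges):
    """Split genes into (no_parent, activated, inhibited) index sets.

    genes: iterable of gene identifiers.
    edges: dict mapping a child gene to a (parent, effect) pair, where
           effect is "activation" or "inhibition" (hypothetical layout).
    Each gene is assumed to have at most one parent.
    """
    no_parent, activated, inhibited = set(), set(), set()
    for g in genes:
        if g not in edges:
            no_parent.add(g)          # belongs to Pa-bar
            continue
        _parent, effect = edges[g]
        if effect == "activation":
            activated.add(g)          # belongs to A
        elif effect == "inhibition":
            inhibited.add(g)          # belongs to I
        else:
            raise ValueError(f"unknown effect for gene {g!r}: {effect!r}")
    return no_parent, activated, inhibited


# Toy pathway: g1 activates g2, g2 inhibits g3, g1 activates g4.
genes = ["g1", "g2", "g3", "g4"]
edges = {
    "g2": ("g1", "activation"),
    "g3": ("g2", "inhibition"),
    "g4": ("g1", "activation"),
}
pa_bar, A, I = partition_genes(genes, edges)
```

The three sets are disjoint and together cover all Q genes, which is what the later factorization of the likelihood relies on.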

We use ${\overset{‒}{θ}}_{i}$ to denote the probability of gene i being up-regulated, for $i \in \bar{Pa}$ . Because we assume a gene can be either up- or down-regulated, a Bernoulli distribution with probability ${\overset{‒}{θ}}_{i}$ suffices to model this outcome. We use ${\overset{‒}{θ}}_{s} = \{{\overset{‒}{θ}}_{i} : i \in \bar{Pa}\}$ to describe the set of initial states of a pathway, that is, the states of the gene(s) without parents.

For genes with parents, we need to define their transition probabilities. Let Y_i denote the outcome for g_i, whose parent is g_j. We use the symbol ∪ for being up-regulated and ∩ for being down-regulated. Then we define the transition probabilities for the connected genes as in Tables 1 and 2. That is, if g_j activates g_i, then we assume $\Pr (Y_{i} = \cup ∣ Y_{j} = \cup) = \Pr (Y_{i} = \cap ∣ Y_{j} = \cap) = θ_{i ∣ j}$ . Consequently, $\Pr (Y_{i} = \cap ∣ Y_{j} = \cup) = \Pr (Y_{i} = \cup ∣ Y_{j} = \cap) = 1 - θ_{i ∣ j}$ . If g_j inhibits g_i, then we assume $\Pr (Y_{i} = \cap ∣ Y_{j} = \cup) = \Pr (Y_{i} = \cup ∣ Y_{j} = \cap) = ϕ_{i ∣ j}$ . Consequently, $\Pr (Y_{i} = \cup ∣ Y_{j} = \cup) = \Pr (Y_{i} = \cap ∣ Y_{j} = \cap) = 1 - ϕ_{i ∣ j}$ . Observe that θ_i|j represents an activation effect and ϕ_i|j an inhibition effect from gene j to gene i. The transition probabilities thus define the local distribution of each node (gene) in the pathway, which is a collection of Bernoulli distributions. Observe that if we let ϕ_i|j = 1 – θ_i|j, then we can collapse Tables 1 and 2 into one table.

Table 1.

Transition probabilities for g_j to activate g_i

Table 2.

Transition probabilities for g_j to inhibit g_i
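
The transition probabilities described for Tables 1 and 2 can be written out as 2 × 2 conditional probability tables. The following is a minimal sketch under the coding 1 = up-regulated (∪) and 0 = down-regulated (∩); the function name `transition_matrix` and the nested-dictionary layout are illustrative assumptions.

```python
def transition_matrix(prob, relation):
    """Return Pr(Y_i = child | Y_j = parent) as table[parent][child].

    prob is theta_{i|j} for relation == "activation" and phi_{i|j} for
    relation == "inhibition". States: 1 = up-regulated, 0 = down-regulated.
    """
    if not 0.0 <= prob <= 1.0:
        raise ValueError("prob must lie in [0, 1]")
    if relation == "activation":
        agree = prob            # child copies the parent's state
    elif relation == "inhibition":
        agree = 1.0 - prob      # child takes the opposite state with prob phi
    else:
        raise ValueError(f"unknown relation: {relation!r}")
    return {
        parent: {child: (agree if child == parent else 1.0 - agree)
                 for child in (0, 1)}
        for parent in (0, 1)
    }


activate = transition_matrix(0.8, "activation")   # theta_{i|j} = 0.8
inhibit = transition_matrix(0.8, "inhibition")    # phi_{i|j} = 0.8
```

Calling `transition_matrix(phi, "inhibition")` gives the same table as `transition_matrix(1 - phi, "activation")`, which is the collapse of the two tables noted in the text.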

Let $θ_{s} = ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s})$ be the parameter vector of pathway S, where ${\overset{‒}{θ}}_{s} = \{{\overset{‒}{θ}}_{i} : i \in \bar{Pa}\}$ is the vector of up-regulation probabilities for genes with no parents, and $θ_{s}^{*} = \{θ_{i ∣ j} : i \in A\}$ and $ϕ_{s} = \{ϕ_{i ∣ j} : i \in I\}$ are the vectors of transition probabilities for genes being activated and inhibited, respectively. Let D denote the data, which are assumed to be a random sample from the joint distribution of Y. D consists of (1) n, the total number of microarray experiments being analyzed; (2) n̄_i, the number of the n experiments in which gene i is up-regulated, for each $i \in \bar{Pa}$ ; and (3) n_i|j, which for $i \in A$ is the number of concordant pairs (( $\cap, \cap$ ) or ( $\cup, \cup$ )) among the n experiments for the ordered pair of parent gene j and gene i, and for $i \in I$ is the number of discordant pairs (( $\cap, \cup$ ) or ( $\cup, \cap$ )) for the pair of gene j and gene i. By the local Markov property of the BN, which says that each node is independent of its non-descendants given its parent nodes, the likelihood function L(θ_s|D) of a given pathway S is the product of the local distributions over all the genes in it. It is given as

L (θ_{s} ∣ D) = \prod_{i \in \bar{Pa}} {\overset{‒}{θ}}_{i}^{{\overset{‒}{n}}_{i}} {(1 - {\overset{‒}{θ}}_{i})}^{n - {\overset{‒}{n}}_{i}} \prod_{i \in A} θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}} \prod_{i \in I} ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}} .
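
To make the structure of this likelihood concrete, here is a minimal sketch that evaluates ln L(θ_s|D) from the counts. The function names and the `(probability, count)` tuple layout are illustrative assumptions; probabilities are assumed to lie strictly between 0 and 1 so that the logarithms are finite.

```python
import math


def bernoulli_log_term(p, k, n):
    """Return k*ln(p) + (n - k)*ln(1 - p) for 0 < p < 1 and 0 <= k <= n."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    if not 0 <= k <= n:
        raise ValueError("count must satisfy 0 <= k <= n")
    return k * math.log(p) + (n - k) * math.log(1.0 - p)


def pathway_log_likelihood(n, roots, activated, inhibited):
    """Log-likelihood of a single-parent pathway.

    roots:     list of (theta_bar_i, n_bar_i) for genes without parents.
    activated: list of (theta_ij, n_ij) with n_ij = concordant pairs.
    inhibited: list of (phi_ij, n_ij) with n_ij = discordant pairs.
    """
    terms = list(roots) + list(activated) + list(inhibited)
    return sum(bernoulli_log_term(p, k, n) for p, k in terms)


# Made-up counts for a three-gene pathway observed in n = 10 experiments.
log_lik = pathway_log_likelihood(
    n=10,
    roots=[(0.5, 4)],
    activated=[(0.8, 9)],
    inhibited=[(0.7, 6)],
)
```

All three gene types contribute the same Bernoulli-type term; only the meaning of the count (up-regulated, concordant, or discordant) differs.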

2.2 Prior Elicitation and the Posterior Distributions

Suppose that in PrimeDB there are a_i|j journal articles citing that g_j activates g_i and b_i|j journal articles citing that g_j inhibits g_i. To incorporate this prior information, we first consider the simple Bayesian model, which uses PrimeDB directly. We assume θ_i|j ~ βe(a_i|j, b_i|j) for the activation effect, and ϕ_i|j ~ βe(b_i|j, a_i|j) for the inhibition effect, where βe(a, b) denotes the beta distribution with parameters a and b. If PrimeDB does not provide information on the initial state, we can use a vague prior for it, for example, ${\overset{‒}{θ}}_{i} \sim β e (1, 1)$ . Assuming that the parameters are mutually independent over i, the joint prior distribution can be written as:

\begin{matrix} π (θ_{s} ∣ S) & = π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s} ∣ S) = π ({\overset{‒}{θ}}_{s} ∣ S) π (θ_{s}^{*} ∣ S) π (ϕ_{s} ∣ S) \\ = \prod_{i \in \bar{Pa}} π ({\overset{‒}{θ}}_{i} ∣ S) \prod_{i \in A} π (θ_{i ∣ j} ∣ S) \prod_{i \in I} π (ϕ_{i ∣ j} ∣ S) . \end{matrix}

(2.1)

Assuming there are no missing data, i.e., for each node in the BN we observe some data, the posterior distribution π(θ_s|D, S) is then given as:

\begin{matrix} π (θ_{s} ∣ D, S) & = π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s} ∣ D, S) \\ = \prod_{i \in \bar{Pa}} π ({\overset{‒}{θ}}_{i} ∣ D, S) \prod_{i \in A} π (θ_{i ∣ j} ∣ D, S) \prod_{i \in I} π (ϕ_{i ∣ j} ∣ D, S) . \end{matrix}

(2.2)
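
Because each local likelihood term is of Bernoulli form and each prior is a beta distribution, every factor in (2.2) is again a beta distribution: the posterior of θ_i|j is βe(a_i|j + n_i|j, b_i|j + n − n_i|j), which is the density that appears in the integrals of Sect. 2.3. A minimal sketch of this update follows; the function names and the article counts are made up for illustration.

```python
def beta_posterior(prior_a, prior_b, count, n):
    """Conjugate update: Be(prior_a, prior_b) prior plus `count` successes
    out of n trials gives Be(prior_a + count, prior_b + n - count)."""
    if not 0 <= count <= n:
        raise ValueError("count must satisfy 0 <= count <= n")
    return prior_a + count, prior_b + (n - count)


def beta_mean(a, b):
    """Mean of a Be(a, b) distribution."""
    return a / (a + b)


n = 10  # number of microarray experiments (made-up)

# Activated gene: 3 articles report activation, 1 reports inhibition;
# 7 of the 10 parent-child pairs are concordant.
post_theta = beta_posterior(3, 1, 7, n)

# Inhibited gene: 1 activation article, 4 inhibition articles, so the
# prior for phi is Be(b, a) = Be(4, 1); 8 of the 10 pairs are discordant.
post_phi = beta_posterior(4, 1, 8, n)

# Gene without parents: vague Be(1, 1) prior; up-regulated in 6 experiments.
post_root = beta_posterior(1, 1, 6, n)
```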

2.3 Selection Criterion of Supported Pathways: Symmetric Kullback–Leibler Divergence

We propose to use the symmetric Kullback–Leibler divergence [19] to select the best pathway. The smaller the symmetric Kullback–Leibler divergence between the prior and posterior distributions of a given pathway, the more strongly the pathway is supported by the data.

The Kullback–Leibler (KL) divergence is conventionally used to measure the difference between two densities. The KL divergence of the probability distribution f₁(y) from f₂(y) is defined as

K L (f_{1}, f_{2}) = \int \ln (\frac{f_{1} (y)}{f_{2} (y)}) f_{1} (y) d y .

We highlight the key properties of KL divergence as follows. First, the KL divergence is not a distance because it is asymmetric, i.e., KL(f₁, f₂) ≠ KL(f₂, f₁), and it does not satisfy the triangle inequality. Second, using Jensen's inequality, it can be shown that the KL divergence is nonnegative if f₂ is a proper density, and equals zero if and only if f₁ = f₂. Third, the KL divergence measures how much information f₂ carries about f₁, if f₁ is considered the “true” distribution of the data.

For more intuitive interpretation, we adopt the definition of the symmetric KL divergence introduced by Kullback and Leibler (1951)

S K L (f_{1}, f_{2}) = K L (f_{1}, f_{2}) + K L (f_{2}, f_{1}) .
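
The asymmetry of KL and the symmetry of SKL can be checked numerically. The sketch below uses a midpoint rule on (0, 1) with two arbitrarily chosen beta densities; the helper names `beta_pdf`, `kl`, and `skl` and the parameter values are assumptions for illustration only.

```python
import math


def beta_pdf(x, a, b):
    """Density of Be(a, b) at 0 < x < 1, computed on the log scale."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


def kl(f1, f2, steps=20000):
    """Midpoint-rule approximation of KL(f1, f2) for densities on (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h
        p, q = f1(y), f2(y)
        if p > 0.0:
            total += p * math.log(p / q) * h
    return total


def skl(f1, f2, steps=20000):
    """Symmetric KL divergence: KL(f1, f2) + KL(f2, f1)."""
    return kl(f1, f2, steps) + kl(f2, f1, steps)


def f1(y):
    return beta_pdf(y, 2.0, 5.0)


def f2(y):
    return beta_pdf(y, 4.0, 3.0)


kl_12 = kl(f1, f2)
kl_21 = kl(f2, f1)
skl_12 = skl(f1, f2)
```

The two directed divergences differ, while `skl(f1, f2)` and `skl(f2, f1)` agree by construction.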

Let us first define the symmetric KL divergence of gene i in the simple model as:

\begin{matrix} S K L (π (γ_{i} ∣ S), π (γ_{i} ∣ D, S)) & ≔ \int \ln [\frac{π (γ_{i} ∣ S)}{π (γ_{i} ∣ D, S)}] π (γ_{i} ∣ S) d γ_{i} + \int \ln [\frac{π (γ_{i} ∣ D, S)}{π (γ_{i} ∣ S)}] π (γ_{i} ∣ D, S) d γ_{i} \\ = \int [\ln p (y_{i} ∣ p a_{i}, γ_{i}, S)] π (γ_{i} ∣ D, S) d γ_{i} - \int [\ln p (y_{i} ∣ p a_{i}, γ_{i}, S)] π (γ_{i} ∣ S) d γ_{i}, \end{matrix}

(2.3)

where

γ_{i} = \begin{cases} {\overset{‒}{θ}}_{i} & \text{if } i \in \bar{Pa}, \\ θ_{i ∣ j} & \text{if } i \in A, \\ ϕ_{i ∣ j} & \text{if } i \in I, \end{cases}

and p(y_i|pa_i, γ_i, S) is the local probability distribution for gene i, and pa_i denotes the configuration of its parent. Note that in our single-parent pathways, pa_i is either an empty set or a singleton set containing the parent gene j. Equation (2.3) is an immediate consequence of the fact that π(γ_i|S) is a proper prior, so its normalizing constant is 1.

Because different pathways contain different numbers of genes, we correct for the different dimensions by taking the geometric mean (the 1/Q power) of the ratio of prior to posterior densities, which amounts to averaging the symmetric KL divergences of the individual genes. We thus define the symmetric KL divergence of a pathway S with Q genes as

S K L (S) = \int \ln {[\frac{π (θ_{s} ∣ S)}{π (θ_{s} ∣ D, S)}]}^{1 ∕ Q} π (θ_{s} ∣ S) d θ_{s} + \int \ln {[\frac{π (θ_{s} ∣ D, S)}{π (θ_{s} ∣ S)}]}^{1 ∕ Q} π (θ_{s} ∣ D, S) d θ_{s} .

(2.4)

Let π*(θ_s|S) be the kernel density of the prior, and let C₀(S) and C_D(S) be the normalizing constants of prior and posterior distributions, respectively. We can write

π (θ_{s} ∣ S) = \frac{π^{*} (θ_{s} ∣ S)}{C_{0} (S)},

and

π (θ_{s} ∣ D, S) = \frac{L (θ_{s} ∣ D) π^{*} (θ_{s} ∣ S)}{C_{D} (S)} .

Then (2.4) becomes

S K L (S) = \frac{1}{Q} (\int \ln [L (θ_{s} ∣ D)] π (θ_{s} ∣ D, S) d θ_{s} - \int \ln [L (θ_{s} ∣ D)] π (θ_{s} ∣ S) d θ_{s}),

(2.5)

after canceling out ln(C_D(S)/C₀(S)) and ln(C₀(S)/C_D(S)) in the evaluation. This makes the definition of the symmetric KL divergence more attractive, because the computation of C_D(S)/C₀(S) can be very expensive.
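
The cancellation of the normalizing constants can be verified numerically for a single gene (Q = 1). The sketch below compares the definition of the symmetric KL divergence with the shortcut form in (2.5), using a βe(3, 2) prior and made-up counts; the helper names and the midpoint-rule integrator are assumptions for illustration.

```python
import math


def beta_pdf(x, a, b):
    """Density of Be(a, b) at 0 < x < 1, computed on the log scale."""
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)


def integrate(func, steps=20000):
    """Midpoint rule on (0, 1); never evaluates func at the endpoints."""
    h = 1.0 / steps
    return sum(func((k + 0.5) * h) for k in range(steps)) * h


a, b = 3.0, 2.0      # prior Be(a, b), made-up values
n, k = 10, 7         # k concordant pairs out of n experiments, made-up


def prior(t):
    return beta_pdf(t, a, b)


def post(t):
    return beta_pdf(t, a + k, b + n - k)


def log_lik(t):
    return k * math.log(t) + (n - k) * math.log(1 - t)


# Definition: KL(prior, posterior) + KL(posterior, prior).
skl_definition = (integrate(lambda t: math.log(prior(t) / post(t)) * prior(t))
                  + integrate(lambda t: math.log(post(t) / prior(t)) * post(t)))

# Shortcut (2.5) with Q = 1: E_post[ln L] - E_prior[ln L].
skl_shortcut = (integrate(lambda t: log_lik(t) * post(t))
                - integrate(lambda t: log_lik(t) * prior(t)))
```

The two quantities agree up to quadrature error, and the shortcut never needs the ratio C_D(S)/C₀(S).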

By the local Markov property of the BN, the likelihood function, L(θ_s|D), is the product of the local probability distributions. Moreover, as shown in (2.1) and (2.2), the prior distribution of pathway S can also be decomposed into the product of the local prior distributions of its component genes, and so can the joint posterior distribution. Consequently, we have

\begin{matrix} S K L (S) & = \frac{1}{Q} \sum_{i = 1}^{Q} (\int [\ln p (y_{i} ∣ p a_{i}, γ_{i}, S)] π (γ_{i} ∣ D, S) d γ_{i} π (θ_{s_{(- γ_{i})}} ∣ D, S) d θ_{s_{(- γ_{i})}} - \int [\ln p (y_{i} ∣ p a_{i}, γ_{i}, S)] π (γ_{i} ∣ S) d γ_{i} π (θ_{s_{(- γ_{i})}} ∣ S) d θ_{s_{(- γ_{i})}}) \\ & = \frac{1}{Q} \sum_{i = 1}^{Q} (\int \ln p (y_{i} ∣ p a_{i}, γ_{i}, S) (π (γ_{i} ∣ D, S) - π (γ_{i} ∣ S)) d γ_{i}) \\ & = \frac{1}{Q} \sum_{i = 1}^{Q} S K L (π (γ_{i} ∣ S), π (γ_{i} ∣ D, S)) . \end{matrix}

Here θ_{s_{(–γ_i)}} denotes the transition probability vector of pathway S without gene i's transition probability γ_i. Note

\int π (θ_{s_{(- γ_{i})}} ∣ S) d θ_{s_{(- γ_{i})}} = 1 \quad \text{and} \quad \int π (θ_{s_{(- γ_{i})}} ∣ D, S) d θ_{s_{(- γ_{i})}} = 1 .

Hence, the symmetric KL divergence of a pathway is the average of the symmetric KL divergences of its component genes. This has a sensible interpretation. When a pathway is supported by the microarray data, we expect that, on average, the discrepancy between the local conditional distributions of its genes and their prior distributions will be small. The discrepancy between the local conditional distributions and the posterior distributions will then be small as well, because the prior is part of the posterior.

Since the log-likelihood breaks into the sum of three parts, namely the log local distributions of the initial states, of the activated genes, and of the inhibited genes, we have

S K L (S) = \frac{1}{Q} (I_{1} + I_{2} + I_{3}),

where

\begin{matrix} I_{1} & = \sum_{i \in \bar{Pa}} \left\{ \int_{0}^{1} [\ln {\overset{‒}{θ}}_{i}^{{\overset{‒}{n}}_{i}} {(1 - {\overset{‒}{θ}}_{i})}^{n - {\overset{‒}{n}}_{i}}] \frac{1}{B (1 + {\overset{‒}{n}}_{i}, 1 + n - {\overset{‒}{n}}_{i})} {\overset{‒}{θ}}_{i}^{{\overset{‒}{n}}_{i}} {(1 - {\overset{‒}{θ}}_{i})}^{n - {\overset{‒}{n}}_{i}} d {\overset{‒}{θ}}_{i} - \int_{0}^{1} [\ln {\overset{‒}{θ}}_{i}^{{\overset{‒}{n}}_{i}} {(1 - {\overset{‒}{θ}}_{i})}^{n - {\overset{‒}{n}}_{i}}] d {\overset{‒}{θ}}_{i} \right\}, \\ I_{2} & = \sum_{i \in A} \left\{ \int_{0}^{1} [\ln θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{θ_{i ∣ j}^{a_{i ∣ j} + n_{i ∣ j} - 1} {(1 - θ_{i ∣ j})}^{b_{i ∣ j} + n - n_{i ∣ j} - 1}}{B (a_{i ∣ j} + n_{i ∣ j}, b_{i ∣ j} + n - n_{i ∣ j})} d θ_{i ∣ j} - \int_{0}^{1} [\ln θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{θ_{i ∣ j}^{a_{i ∣ j} - 1} {(1 - θ_{i ∣ j})}^{b_{i ∣ j} - 1}}{B (a_{i ∣ j}, b_{i ∣ j})} d θ_{i ∣ j} \right\}, \\ I_{3} & = \sum_{i \in I} \left\{ \int_{0}^{1} [\ln ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{ϕ_{i ∣ j}^{b_{i ∣ j} + n_{i ∣ j} - 1} {(1 - ϕ_{i ∣ j})}^{a_{i ∣ j} + n - n_{i ∣ j} - 1}}{B (b_{i ∣ j} + n_{i ∣ j}, a_{i ∣ j} + n - n_{i ∣ j})} d ϕ_{i ∣ j} - \int_{0}^{1} [\ln ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{ϕ_{i ∣ j}^{b_{i ∣ j} - 1} {(1 - ϕ_{i ∣ j})}^{a_{i ∣ j} - 1}}{B (b_{i ∣ j}, a_{i ∣ j})} d ϕ_{i ∣ j} \right\}, \end{matrix}

and $B (z, w) = \frac{Γ (z) Γ (w)}{Γ (z + w)}$ . To evaluate I₁, I₂ and I₃, we will make use of the following result.

Proposition 1 If Z ~ βe(α, β), with α, β > 0, then

\int_{0}^{1} [\ln z^{n_{1}} {(1 - z)}^{n_{2}}] \frac{1}{B (α, β)} z^{α - 1} {(1 - z)}^{β - 1} d z = n_{1} ψ (α) + n_{2} ψ (β) - (n_{1} + n_{2}) ψ (α + β),

where ψ(α) is the standard digamma function defined as

ψ (α) = \frac{d}{d α} \ln Γ (α) = \frac{Γ^{'} (α)}{Γ (α)} .
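
Proposition 1 can be checked numerically. The sketch below compares a midpoint-rule approximation of the left-hand side with the digamma expression on the right. The Python standard library has no digamma function, so the helper `digamma` here is an assumed implementation based on the recurrence ψ(x) = ψ(x + 1) − 1/x and a truncated asymptotic series; the parameter values are arbitrary.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 10.0:                 # shift the argument upward
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # ln x - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def beta_pdf(z, alpha, beta):
    """Density of Be(alpha, beta) at 0 < z < 1."""
    log_norm = math.lgamma(alpha) + math.lgamma(beta) - math.lgamma(alpha + beta)
    return math.exp((alpha - 1) * math.log(z) + (beta - 1) * math.log(1 - z) - log_norm)


def lhs(n1, n2, alpha, beta, steps=20000):
    """Midpoint-rule value of the integral in Proposition 1."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        z = (k + 0.5) * h
        total += (n1 * math.log(z) + n2 * math.log(1 - z)) * beta_pdf(z, alpha, beta) * h
    return total


def rhs(n1, n2, alpha, beta):
    """Closed form: n1*psi(alpha) + n2*psi(beta) - (n1 + n2)*psi(alpha + beta)."""
    return n1 * digamma(alpha) + n2 * digamma(beta) - (n1 + n2) * digamma(alpha + beta)


numeric = lhs(4, 6, 3.5, 2.0)
closed_form = rhs(4, 6, 3.5, 2.0)
```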

It is straightforward to verify the proposition by interchanging the order of integration (with respect to z) and differentiation (with respect to α, β, or α + β). Consequently, we have

I_{1} = \sum_{i \in \bar{Pa}} ({\overset{‒}{n}}_{i} ψ ({\overset{‒}{n}}_{i}) + (n - {\overset{‒}{n}}_{i}) ψ (n - {\overset{‒}{n}}_{i}) - n ψ (n + 2) + n + 2),

(2.6)

I_{2} = \sum_{i \in A} \left\{ n_{i ∣ j} [ψ (a_{i ∣ j} + n_{i ∣ j}) - ψ (a_{i ∣ j})] + (n - n_{i ∣ j}) [ψ (b_{i ∣ j} + n - n_{i ∣ j}) - ψ (b_{i ∣ j})] - n [ψ (a_{i ∣ j} + b_{i ∣ j} + n) - ψ (a_{i ∣ j} + b_{i ∣ j})] \right\} .

(2.7)

I₃ is similar to I₂ except a_i|j is replaced by b_i|j, and vice versa.

Hence, in the simple model, the computation of SKL(S) boils down to evaluating a sum of differences of digamma functions, divided by the number of genes in the pathway.

Averaging over the Q genes attempts to adjust for the size of the pathways. Although larger pathways are less sensitive to a few extreme SKL scores at the gene level, the averaged score does not necessarily favor larger pathways.
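
The closed forms (2.6) and (2.7) can be sketched in a few lines. In the code below, `digamma` is an assumed helper (recurrence plus asymptotic series), and the function names and input layout are illustrative. The term for genes without parents is computed from Proposition 1 with a βe(1, 1) prior, which by ψ(x + 1) = ψ(x) + 1/x coincides with (2.6) whenever 0 < n̄_i < n and also stays finite at the boundary counts.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    if x <= 0:
        raise ValueError("x must be positive")
    result = 0.0
    while x < 10.0:
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    result += math.log(x) - 0.5 / x - inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def skl_beta(prior_a, prior_b, count, n):
    """Gene-level SKL for a Be(prior_a, prior_b) prior: the summand of (2.7)."""
    rest = n - count
    return (count * (digamma(prior_a + count) - digamma(prior_a))
            + rest * (digamma(prior_b + rest) - digamma(prior_b))
            - n * (digamma(prior_a + prior_b + n) - digamma(prior_a + prior_b)))


def skl_pathway(n, roots, activated, inhibited):
    """SKL(S) = (I1 + I2 + I3) / Q in the simple model.

    roots:     list of n_bar_i (prior Be(1, 1)).
    activated: list of (a_ij, b_ij, n_ij), prior Be(a_ij, b_ij).
    inhibited: list of (a_ij, b_ij, n_ij), prior Be(b_ij, a_ij).
    """
    i1 = sum(skl_beta(1.0, 1.0, n_bar, n) for n_bar in roots)
    i2 = sum(skl_beta(a, b, c, n) for a, b, c in activated)
    i3 = sum(skl_beta(b, a, c, n) for a, b, c in inhibited)
    q = len(roots) + len(activated) + len(inhibited)
    return (i1 + i2 + i3) / q


# Made-up example: literature strongly suggests activation (a = 5, b = 1).
supported = skl_beta(5, 1, 9, 10)     # 9 of 10 pairs concordant
unsupported = skl_beta(5, 1, 2, 10)   # only 2 of 10 pairs concordant
score = skl_pathway(10, roots=[6], activated=[(5, 1, 9)], inhibited=[(1, 4, 8)])
```

In this toy setting the data that agree with the literature-based prior give the smaller gene-level divergence, which is the behavior the ranking criterion relies on.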

3 Extension of the Symmetric KL Divergence to the Multilevel Model

3.1 Multilevel Model Guided by PrimeDB

An extra level of the hierarchical Bayesian model can be constructed to allow information sharing among genes of the same type, with one second-level parameter for activation and another for inhibition. We define the first-level prior distribution as follows: for all i and j, θ_i|j|θ ~ βe(a_i|jθ, b_i|jθ), independent over all i and j given θ. Similarly, ϕ_i|j|ϕ ~ βe(b_i|jϕ, a_i|jϕ), independent given ϕ. The second-level prior $θ \sim E (μ)$ is independent of $ϕ \sim E (ν)$ with known μ and ν, where $E (μ)$ denotes an exponential distribution with mean μ. It is also possible to allow unknown μ and ν and to build a third level that shares information between the two types of gene; here we consider only two levels with μ and ν known. Note that the first-level specification yields E(θ_i|j|θ) = a_i|j/(a_i|j + b_i|j), the same mean as in the simple Bayesian model. However, Var(θ_i|j|θ) = a_i|jb_i|j/[(a_i|j + b_i|j)²(a_i|jθ + b_i|jθ + 1)], so the hierarchical model adds one more parameter θ that controls the strength of our prior belief in PrimeDB: the bigger the θ, the smaller the prior variance and the stronger the belief in PrimeDB. Note that when θ = ϕ = 1, the distributions of the transition probabilities conditional on θ and ϕ are the same as in the simple model; choosing μ = ν = 1 centers the second-level priors at this value. All genes with an activation effect suggested by KEGG share a common factor θ that can be learned from the data and PrimeDB. Similar considerations apply to the inhibition parameters.
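
The role of θ as a strength-of-belief parameter can be seen from the first two moments of βe(a_i|jθ, b_i|jθ). The following is a minimal sketch with made-up article counts; the function names are assumptions for illustration.

```python
def beta_mean_var(alpha, beta):
    """Mean and variance of a Be(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    var = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, var


def first_level_moments(a, b, theta):
    """Moments of theta_{i|j} given the scale factor theta: Be(a*theta, b*theta)."""
    return beta_mean_var(a * theta, b * theta)


a, b = 3.0, 1.0   # made-up counts: 3 activation articles, 1 inhibition article
moments = {theta: first_level_moments(a, b, theta) for theta in (0.5, 1.0, 4.0)}
```

The mean stays at a/(a + b) for every θ, while the variance shrinks as θ grows, so a large θ expresses strong trust in the literature counts.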

3.2 Symmetric KL Divergence for Multilevel Model

We first extend the definitions of SKL(π(γ_i|S), π(γ_i|D, S)) and SKL(S) in the multilevel model. Including the hyperparameters that govern our belief on the activation or inhibition effect from PrimeDB, the parameter vector of pathway S becomes $θ_{s} = ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s}, θ, ϕ)$ . Let

γ_{i} = \begin{cases} {\overset{‒}{θ}}_{i} & \text{if } i \in \bar{Pa}, \\ (θ_{i ∣ j}, θ) & \text{if } i \in A, \\ (ϕ_{i ∣ j}, ϕ) & \text{if } i \in I . \end{cases}

To generalize the definition of SKL(S) to the multilevel model, it is easy to see that we need to modify (2.5) to

S K L (S) = \frac{1}{Q} (\int [\ln L (θ_{s} ∣ D)] π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ D, S) d θ_{s} - \int [\ln L (θ_{s} ∣ D)] π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S) d θ_{s}) .

(3.1)

Following the same logic in deriving SKL(S) in the simple model, we can verify that in the multilevel model,

S K L (S) = \frac{1}{Q} \sum_{i = 1}^{Q} S K L (π (γ_{i} ∣ S), π (γ_{i} ∣ D, S)) .

Now, the joint prior distribution can be collapsed as

\begin{matrix} π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S) & = π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s} ∣ θ, ϕ, S) π (θ, ϕ ∣ S) \\ = π ({\overset{‒}{θ}}_{s} ∣ S) π (θ_{s}^{*} ∣ θ, S) π (θ ∣ S) π (ϕ_{s} ∣ ϕ, S) π (ϕ ∣ S) \end{matrix}

(3.2)

= {\prod_{i \in \bar{Pa}} π ({\overset{‒}{θ}}_{i} ∣ S) \prod_{i \in A} π (θ_{i ∣ j} ∣ θ, S) \prod_{i \in I} π (ϕ_{i ∣ j} ∣ ϕ, S)} π (θ ∣ S) π (ϕ ∣ S),

(3.3)

where (3.2) follows from the facts that ${\overset{‒}{θ}}_{s}$ does not depend on θ or ϕ, $θ_{s}^{*} ∣ θ$ is independent of ϕ, ϕ_s|ϕ is independent of θ, and θ and ϕ are independent. The assumptions that θ_i|j|θ is independent over i and j for $i \in A$ , and that ϕ_i|j|ϕ is independent over i and j for $i \in I$ , yield (3.3).

Similarly, the joint posterior distribution can be collapsed as

π ({\overset{‒}{θ}}_{s}, θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ D, S) = (\prod_{i \in \bar{Pa}} π ({\overset{‒}{θ}}_{i} ∣ D, S) \prod_{i \in A} π (θ_{i ∣ j} ∣ θ, D, S) \prod_{i \in I} π (ϕ_{i ∣ j} ∣ ϕ, D, S)) π (θ, ϕ ∣ D, S) .

Substituting the specific forms of log likelihood function, the collapsed prior and posterior distributions into (3.1), we can rewrite SKL(S) as a sum of symmetric KL divergence for the genes without parents, activated, and inhibited:

S K L (S) = \frac{1}{Q} (I_{1}^{'} + I_{2}^{'} + I_{3}^{'}),

(3.4)

where

\begin{matrix} I_{1}^{'} & = \sum_{i \in \bar{Pa}} ({\overset{‒}{n}}_{i} ψ ({\overset{‒}{n}}_{i}) + (n - {\overset{‒}{n}}_{i}) ψ (n - {\overset{‒}{n}}_{i}) - n ψ (n + 2) + n + 2), \\ I_{2}^{'} & = \sum_{i \in A} \left\{ \int g_{i A}^{*} (θ) π (θ, ϕ ∣ D, S) d θ d ϕ - \int g_{i A} (θ) π (θ ∣ S) d θ \right\}, \end{matrix}

(3.5)

I_{3}^{'} = \sum_{i \in I} \int g_{i I}^{*} (ϕ) π (θ, ϕ ∣ D, S) d θ d ϕ - \int g_{i I} (ϕ) π (ϕ ∣ S) d ϕ,

(3.6)

with

\begin{matrix} g_{i A}^{*} (θ) & = \int [\ln θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{θ_{i ∣ j}^{a_{i ∣ j} θ + n_{i ∣ j} - 1} {(1 - θ_{i ∣ j})}^{b_{i ∣ j} θ + n - n_{i ∣ j} - 1}}{B (a_{i ∣ j} θ + n_{i ∣ j}, b_{i ∣ j} θ + n - n_{i ∣ j})} d θ_{i ∣ j} \\ = n_{i ∣ j} ψ (a_{i ∣ j} θ + n_{i ∣ j}) + (n - n_{i ∣ j}) ψ (b_{i ∣ j} θ + n - n_{i ∣ j}) - n ψ (a_{i ∣ j} θ + b_{i ∣ j} θ + n), \end{matrix}

(3.7)

\begin{matrix} g_{i A} (θ) & = \int [\ln θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{θ_{i ∣ j}^{a_{i ∣ j} θ - 1} {(1 - θ_{i ∣ j})}^{b_{i ∣ j} θ - 1}}{B (a_{i ∣ j} θ, b_{i ∣ j} θ)} d θ_{i ∣ j} \\ = n_{i ∣ j} ψ (a_{i ∣ j} θ) + (n - n_{i ∣ j}) ψ (b_{i ∣ j} θ) - n ψ (a_{i ∣ j} θ + b_{i ∣ j} θ), \end{matrix}

(3.8)

\begin{matrix} g_{i I}^{*} (ϕ) & = \int [\ln ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{ϕ_{i ∣ j}^{b_{i ∣ j} ϕ + n_{i ∣ j} - 1} {(1 - ϕ_{i ∣ j})}^{a_{i ∣ j} ϕ + n - n_{i ∣ j} - 1}}{B (b_{i ∣ j} ϕ + n_{i ∣ j}, a_{i ∣ j} ϕ + n - n_{i ∣ j})} d ϕ_{i ∣ j} \\ = n_{i ∣ j} ψ (b_{i ∣ j} ϕ + n_{i ∣ j}) + (n - n_{i ∣ j}) ψ (a_{i ∣ j} ϕ + n - n_{i ∣ j}) - n ψ (b_{i ∣ j} ϕ + a_{i ∣ j} ϕ + n), \end{matrix}

(3.9)

and

\begin{matrix} g_{i I} (ϕ) & = \int [\ln ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}}] \frac{ϕ_{i ∣ j}^{b_{i ∣ j} ϕ - 1} {(1 - ϕ_{i ∣ j})}^{a_{i ∣ j} ϕ - 1}}{B (b_{i ∣ j} ϕ, a_{i ∣ j} ϕ)} d ϕ_{i ∣ j} \\ = n_{i ∣ j} ψ (b_{i ∣ j} ϕ) + (n - n_{i ∣ j}) ψ (a_{i ∣ j} ϕ) - n ψ (a_{i ∣ j} ϕ + b_{i ∣ j} ϕ) . \end{matrix}

(3.10)

Equations (3.7)–(3.10) are direct results of Proposition 1. It is easy to observe that I₁′ equals the I₁ given in the simple model, because ${\overset{‒}{θ}}_{i}$ does not depend on θ or ϕ.

Next, we derive the optimal form for numerically evaluating

\sum_{i \in A} \int g_{i A}^{*} (θ) π (θ, ϕ ∣ D, S) d θ d ϕ and \sum_{i \in I} \int g_{i I}^{*} (ϕ) π (θ, ϕ ∣ D, S) d θ d ϕ .

Notice that

\begin{matrix} π (θ, ϕ ∣ D, S) & = \int π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ D, S) d θ_{s}^{*} d ϕ_{s} \\ = \int \frac{L (θ_{s}^{*}, ϕ_{s} ∣ D) π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S)}{c^{*}} d θ_{s}^{*} d ϕ_{s}, \end{matrix}

(3.11)

where c* is the normalizing constant for $π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ D, S)$ , the prior $π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S)$ is a proper density, and,

L (θ_{s}^{*}, ϕ_{s} ∣ D) = \prod_{i \in A} θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}} \prod_{i \in I} ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}} .

Now we can write

c^{*} = \int L (θ_{s}^{*}, ϕ_{s} ∣ D) π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S) d θ_{s}^{*} d ϕ_{s} d θ d ϕ,

and, plugging it into (3.11), it follows that

\begin{matrix} π (θ, ϕ ∣ D, S) & = \frac{\int L (θ_{s}^{*}, ϕ_{s} ∣ D) π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S) d θ_{s}^{*} d ϕ_{s}}{\int L (θ_{s}^{*}, ϕ_{s} ∣ D) π (θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ S) d θ_{s}^{*} d ϕ_{s} d θ d ϕ} \\ = \frac{h (θ, ϕ) μ e^{- μ θ} ν e^{- ν ϕ}}{\int h (θ, ϕ) μ e^{- μ θ} ν e^{- ν ϕ} d θ d ϕ}, \end{matrix}

(3.12)

where

\begin{matrix} h (θ, ϕ) & = \int L (θ_{s}^{*}, ϕ_{s} ∣ D) π (θ_{s}^{*}, ϕ_{s} ∣ θ, ϕ, S) d θ_{s}^{*} d ϕ_{s} \\ = \prod_{i \in A} \int θ_{i ∣ j}^{n_{i ∣ j}} {(1 - θ_{i ∣ j})}^{n - n_{i ∣ j}} \frac{1}{B (a_{i ∣ j} θ, b_{i ∣ j} θ)} θ_{i ∣ j}^{a_{i ∣ j} θ - 1} {(1 - θ_{i ∣ j})}^{b_{i ∣ j} θ - 1} d θ_{i ∣ j} \times \prod_{i \in I} \int ϕ_{i ∣ j}^{n_{i ∣ j}} {(1 - ϕ_{i ∣ j})}^{n - n_{i ∣ j}} \frac{1}{B (b_{i ∣ j} ϕ, a_{i ∣ j} ϕ)} ϕ_{i ∣ j}^{b_{i ∣ j} ϕ - 1} {(1 - ϕ_{i ∣ j})}^{a_{i ∣ j} ϕ - 1} d ϕ_{i ∣ j} \\ = \prod_{i \in A} \frac{B (a_{i ∣ j} θ + n_{i ∣ j}, b_{i ∣ j} θ + n - n_{i ∣ j})}{B (a_{i ∣ j} θ, b_{i ∣ j} θ)} \prod_{i \in I} \frac{B (b_{i ∣ j} ϕ + n_{i ∣ j}, a_{i ∣ j} ϕ + n - n_{i ∣ j})}{B (b_{i ∣ j} ϕ, a_{i ∣ j} ϕ)} . \end{matrix}

(3.13)

So the first term of I₂′ in (3.5) can be expressed as

\begin{matrix} \sum_{i \in A} \int g_{i A}^{*} (θ) π (θ, ϕ ∣ D, S) d θ d ϕ & = \sum_{i \in A} \frac{\int g_{i A}^{*} (θ) h (θ, ϕ) μ e^{- μ θ} ν e^{- ν ϕ} d θ d ϕ}{\int h (θ, ϕ) μ e^{- μ θ} ν e^{- ν ϕ} d θ d ϕ} \\ = \sum_{i \in A} \frac{\int g_{i A}^{*} (θ) h_{1} (θ) μ e^{- μ θ} d θ}{\int h_{1} (θ) μ e^{- μ θ} d θ} . \end{matrix}

Likewise, the first term of I₃′ in (3.6) can be written as

\sum_{i \in I} \int g_{i I}^{*} (ϕ) π (θ, ϕ ∣ D, S) d θ d ϕ = \sum_{i \in I} \frac{\int g_{i I}^{*} (ϕ) h_{2} (ϕ) ν e^{- ν ϕ} d ϕ}{\int h_{2} (ϕ) ν e^{- ν ϕ} d ϕ} .

Both reductions use the factorization h(θ, ϕ) = h₁(θ) h₂(ϕ) implied by (3.13), where h₁(θ) is the product over $i \in A$ and h₂(ϕ) is the product over $i \in I$ ; the integral over the other hyperparameter then cancels between the numerator and the denominator.

Therefore, when we use Monte Carlo methods to numerically evaluate the integrals in I₂′ and I₃′, we sample from the prior distribution only, rather than sample from both prior and posterior distribution of θ and ϕ. This greatly improves the efficiency of computing KL divergence.

We now summarize the algorithm for calculating the symmetric KL divergence in the multilevel model:

Step 1. Draw θ^(t) from $E (μ)$ , and independently draw ϕ^(t) from $E (ν)$ , for t = 1, . . ., N.
Step 2. Approximate I₂′ by
${\hat{I}}_{2}^{'} = \sum_{i \in A} \left\{ \frac{\sum_{t = 1}^{N} g_{i A}^{*} (θ^{(t)}) h_{1} (θ^{(t)})}{\sum_{t = 1}^{N} h_{1} (θ^{(t)})} - \frac{1}{N} \sum_{t = 1}^{N} g_{i A} (θ^{(t)}) \right\},$
and I₃′ by
${\hat{I}}_{3}^{'} = \sum_{i \in I} \left\{ \frac{\sum_{t = 1}^{N} g_{i I}^{*} (ϕ^{(t)}) h_{2} (ϕ^{(t)})}{\sum_{t = 1}^{N} h_{2} (ϕ^{(t)})} - \frac{1}{N} \sum_{t = 1}^{N} g_{i I} (ϕ^{(t)}) \right\},$
where $g_{i A}^{*} (θ)$ , $g_{i A} (θ)$ , $g_{i I}^{*} (ϕ)$ , and $g_{i I} (ϕ)$ are given in (3.7), (3.8), (3.9), and (3.10), respectively, and
$\begin{matrix} h_{1} (θ) & = \prod_{i \in A} \frac{B (a_{i ∣ j} θ + n_{i ∣ j}, b_{i ∣ j} θ + n - n_{i ∣ j})}{B (a_{i ∣ j} θ, b_{i ∣ j} θ)}, \\ h_{2} (ϕ) & = \prod_{i \in I} \frac{B (b_{i ∣ j} ϕ + n_{i ∣ j}, a_{i ∣ j} ϕ + n - n_{i ∣ j})}{B (b_{i ∣ j} ϕ, a_{i ∣ j} ϕ)} . \end{matrix}$
Step 3. Calculate I₁′ directly as
$I_{1}^{'} = \sum_{i \in \bar{Pa}} ({\overset{‒}{n}}_{i} ψ ({\overset{‒}{n}}_{i}) + (n - {\overset{‒}{n}}_{i}) ψ (n - {\overset{‒}{n}}_{i}) - n ψ (n + 2) + n + 2) .$
Step 4. Compute $S K L (S) = \frac{1}{Q} (I_{1}^{'} + {\hat{I}}_{2}^{'} + {\hat{I}}_{3}^{'})$ .
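
The steps above can be sketched in Python as follows. This is a minimal illustration, not the authors' R program: the counts are those of the toy pathway S₁ from Section 4, the function and variable names (`g_post`, `g_prior`, `h`, `group_term`, `root_term`) are hypothetical, the digamma function is approximated with a standard recurrence and asymptotic series because the Python standard library has none, and I₁′ is written in the algebraically equivalent form that uses ψ(x + 1) = ψ(x) + 1/x so that zero counts cause no trouble.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series


def log_beta(p, q):
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


# Toy pathway S1. Each child is (alpha, beta, n_conc, n_disc), where the
# conditional prior is Beta(alpha * t, beta * t): (a, b) for activated
# genes and (b, a) for inhibited genes.
ACTIVATED = [(7, 3, 1, 2), (8, 1, 2, 1)]
INHIBITED = [(3, 3, 2, 1), (9, 6, 2, 1)]
ROOTS = [(2, 1)]  # (n_bar, n - n_bar) for genes without parents


def g_post(gene, t):
    """Posterior expectation of the log likelihood, as in (3.7) and (3.9)."""
    al, be, nc, nd = gene
    return (nc * digamma(al * t + nc) + nd * digamma(be * t + nd)
            - (nc + nd) * digamma((al + be) * t + nc + nd))


def g_prior(gene, t):
    """Prior expectation of the log likelihood, as in (3.8) and (3.10)."""
    al, be, nc, nd = gene
    return (nc * digamma(al * t) + nd * digamma(be * t)
            - (nc + nd) * digamma((al + be) * t))


def h(genes, t):
    """Product of beta-function ratios: h1 for the set A, h2 for the set I."""
    return math.exp(sum(log_beta(al * t + nc, be * t + nd) - log_beta(al * t, be * t)
                        for al, be, nc, nd in genes))


def group_term(genes, rate, n_draws, rng):
    """Estimate I2' (or I3') from draws of the exponential prior only."""
    draws = []
    while len(draws) < n_draws:
        t = rng.expovariate(rate)
        if t > 0.0:  # guard against an exact zero draw
            draws.append(t)
    weights = [h(genes, t) for t in draws]
    weight_sum = sum(weights)
    total = 0.0
    for gene in genes:
        post = sum(w * g_post(gene, t) for w, t in zip(weights, draws)) / weight_sum
        prior = sum(g_prior(gene, t) for t in draws) / n_draws
        total += post - prior
    return total


def root_term(roots):
    """I1', rewritten with psi(x + 1) = psi(x) + 1/x."""
    return sum(nb * digamma(nb + 1) + nd * digamma(nd + 1)
               - (nb + nd) * digamma(nb + nd + 2) + nb + nd for nb, nd in roots)


rng = random.Random(2011)  # arbitrary seed, chosen only for repeatability
i1 = root_term(ROOTS)
i2 = group_term(ACTIVATED, rate=1.0, n_draws=20000, rng=rng)
i3 = group_term(INHIBITED, rate=1.0, n_draws=20000, rng=rng)
skl = (i1 + i2 + i3) / (len(ROOTS) + len(ACTIVATED) + len(INHIBITED))
print(f"I1' = {i1:.4f}, estimated SKL(S1) = {skl:.4f}")
```

Because ψ(x) behaves like -1/x near zero, draws of θ or ϕ very close to zero can dominate the prior averages of g_iA and g_iI, so the printed estimate may vary noticeably with the seed and with N.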

Remark It is natural to generate two Monte Carlo (MC) samples from π(θ|S) to approximate $r_{i 1} = \frac{\int g_{i A}^{*} (θ) h_{1} (θ) μ e^{- μ θ} d θ}{\int h_{1} (θ) μ e^{- μ θ} d θ}$ for $i \in A$ , so that one sample is used for computing $\int g_{i A}^{*} (θ) h_{1} (θ) μ e^{- μ θ} d θ$ , while the other for $\int h_{1} (θ) μ e^{- μ θ} d θ$ . However, we generate only one MC sample from π(θ|S) to compute r_i1. Chen et al. [2] pointed out that the use of two MC samples in obtaining the MC estimate of r_i1 may not necessarily be more efficient than the use of just one MC sample. They showed that the latter actually reduces the asymptotic variance of the estimate.

3.3 MCMC Algorithm for Sampling from the Posterior Distributions

To update the unknown parameters, we first note that the probabilities of the initial states, ( ${\overset{‒}{θ}}_{i} : i \in \bar{Pa}$ ), do not depend on the hyperparameters θ and ϕ, so their posterior distributions are βe(1 + n̄_i, 1 + n – n̄_i). We then employ the Metropolis [21] within Gibbs sampling algorithm to update the transition probabilities and hyperparameters ( $θ_{s}^{*}, ϕ_{s}, θ, ϕ$ ). Chen et al. [1] provide more details on the algorithm. Using the collapsing technique for the Gibbs sampler proposed by Liu [20], we have

\begin{matrix} [θ_{s}^{*}, ϕ_{s}, θ, ϕ ∣ D, S] & = [θ_{s}^{*}, ϕ_{s} ∣ θ, ϕ, D, S] [θ, ϕ ∣ D, S] \\ = [θ_{s}^{*} ∣ θ, D, S] [ϕ_{s} ∣ ϕ, D, S] [θ ∣ D, S] [ϕ ∣ D, S] . \end{matrix}

The last step results from conditional independence. Therefore, given the hyperparameters θ and ϕ and data, we update the transition probabilities ( $θ_{s}^{*}, ϕ_{s}$ ) among genes by sampling from the beta distributions. From (3.12) and (3.13), we know that π(θ|D, S) is proportional to h₁(θ)μe^–μθ, and π(ϕ|D, S) is proportional to h₂(ϕ)νe^–νϕ, and they are independent. We use the Metropolis–Hastings algorithm to sample from π(θ|D, S) and π(ϕ|D, S). The MCMC algorithm to sample the posterior distribution can be implemented as follows:

Step 1. Generate θ and ϕ independently given the data using the Metropolis algorithm having the following target densities:

π (θ ∣ D, S) \propto μ e^{- μ θ} \prod_{i \in A} \frac{B (a_{i ∣ j} θ + n_{i ∣ j}, b_{i ∣ j} θ + n - n_{i ∣ j})}{B (a_{i ∣ j} θ, b_{i ∣ j} θ)}

and

π (ϕ ∣ D, S) \propto ν e^{- ν ϕ} \prod_{i \in I} \frac{B (b_{i ∣ j} ϕ + n_{i ∣ j}, a_{i ∣ j} ϕ + n - n_{i ∣ j})}{B (b_{i ∣ j} ϕ, a_{i ∣ j} ϕ)} .

Since θ > 0, the local Metropolis algorithm in this step is done by sampling ξ = log(θ) instead of θ using the following steps:

1.1
Obtain the conditional density function π(ξ|D, S) by the transformation from π(θ|D, S).
1.2
Obtain the proposal distribution $N (\hat{ξ}, {\hat{σ}}_{ξ}^{2})$ , where $\hat{ξ}$ maximizes the logarithm of π(ξ|D, S) over ξ, and $1 ∕ {\hat{σ}}_{ξ}^{2}$ is minus the second derivative of the logarithm of π(ξ|D, S) with respect to ξ evaluated at $\hat{ξ}$ .
1.3
Let θ₀ be the current value of θ. Then ξ has the current value ξ₀ = log(θ₀).
1.4
Generate a proposal value ξ₁ from the proposal distribution $N (\hat{ξ}, {\hat{σ}}_{ξ}^{2})$ .
1.5
Update ξ from ξ₀ to ξ₁ with probability $\min \{\frac{π (ξ_{1} ∣ D, S) φ (\frac{ξ_{0} - \hat{ξ}}{{\hat{σ}}_{ξ}})}{π (ξ_{0} ∣ D, S) φ (\frac{ξ_{1} - \hat{ξ}}{{\hat{σ}}_{ξ}})}, 1\}$ , where φ is the probability density function of a standard normal variate.
1.6
Calculate θ₁ = exp(ξ₁).

Similarly, we can sample ϕ independently through the above steps 1.1–1.6 by defining ζ = log(ϕ) and using π(ϕ|D, S).

Step 2. Given the current values of θ, ϕ and data, update the transition probabilities:

θ_{i ∣ j} ∣ θ, D, S \sim β e (a_{i ∣ j} θ + n_{i ∣ j}, b_{i ∣ j} θ + n - n_{i ∣ j}) if i \in A

and

ϕ_{i ∣ j} ∣ ϕ, D, S \sim β e (b_{i ∣ j} ϕ + n_{i ∣ j}, a_{i ∣ j} ϕ + n - n_{i ∣ j}) if i \in I .
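
Steps 1 and 2 can be sketched in Python for the activated genes of the toy pathway S₁ (Section 4). This is an illustrative sketch under stated assumptions rather than the authors' implementation: the names `log_target_xi`, `fit_proposal`, `sample_theta` and `sample_transitions` are hypothetical, the mode is located by a golden-section search on an assumed bracket [-8, 5] (assuming the log density of ξ is unimodal there) instead of R's `optim`, and the curvature is obtained by a central finite difference.

```python
import math
import random

# (a, b, n_conc, n_disc) for the activated genes g2|1 and g3|2 of toy pathway S1
ACTIVATED = [(7, 3, 1, 2), (8, 1, 2, 1)]


def log_beta(p, q):
    return math.lgamma(p) + math.lgamma(q) - math.lgamma(p + q)


def log_target_xi(xi, genes, rate):
    """log pi(xi | D, S) up to a constant; the trailing + xi is the Jacobian."""
    t = math.exp(xi)
    value = -rate * t + xi
    for a, b, nc, nd in genes:
        value += log_beta(a * t + nc, b * t + nd) - log_beta(a * t, b * t)
    return value


def fit_proposal(genes, rate, lo=-8.0, hi=5.0, eps=1e-3):
    """Step 1.2: mode by golden-section search, scale from the curvature."""
    f = lambda x: log_target_xi(x, genes, rate)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    while hi - lo > 1e-8:
        c = hi - ratio * (hi - lo)
        d = lo + ratio * (hi - lo)
        if f(c) > f(d):
            hi = d
        else:
            lo = c
    mode = 0.5 * (lo + hi)
    curvature = (f(mode + eps) - 2.0 * f(mode) + f(mode - eps)) / eps ** 2
    return mode, math.sqrt(-1.0 / curvature)


def sample_theta(genes, rate, n_iter, rng):
    """Steps 1.3 to 1.6: independence Metropolis sampler on xi = log(theta)."""
    mode, sd = fit_proposal(genes, rate)

    def log_weight(x):  # log target minus log normal proposal (constants dropped)
        z = (x - mode) / sd
        return log_target_xi(x, genes, rate) + 0.5 * z * z

    xi, draws, accepted = mode, [], 0
    for _ in range(n_iter):
        candidate = rng.gauss(mode, sd)
        if math.log(1.0 - rng.random()) < log_weight(candidate) - log_weight(xi):
            xi = candidate
            accepted += 1
        draws.append(math.exp(xi))
    return draws, accepted / n_iter


def sample_transitions(theta, genes, rng):
    """Step 2: draw each transition probability from its conditional beta."""
    return [rng.betavariate(a * theta + nc, b * theta + nd) for a, b, nc, nd in genes]


rng = random.Random(7)  # arbitrary seed
theta_draws, acc_rate = sample_theta(ACTIVATED, rate=1.0, n_iter=4000, rng=rng)
transitions = sample_transitions(theta_draws[-1], ACTIVATED, rng)
theta_mean = sum(theta_draws) / len(theta_draws)
print(f"acceptance rate = {acc_rate:.2f}, mean of theta draws = {theta_mean:.3f}")
```

The same functions apply to ϕ by passing the inhibited genes with the roles of a and b exchanged, for example `[(3, 3, 2, 1), (9, 6, 2, 1)]` for S₁.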

In step 1.2, we use the optimization program optim in the R stats package. The optimization method is an implementation of the conjugate gradients method based on that by Fletcher and Reeves [6]. It also returns a numerically differentiated Hessian matrix (the second derivative in the univariate case) on request. For convergence diagnostics, we use the R coda package and apply the Geweke [8] method, which is based on a test for equality of the means of the first and last parts of the samples from the Markov chain. If the samples are drawn from the stationary distribution of the chain, the two means are expected to be equal and Geweke's statistic has an asymptotically standard normal distribution.
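
To make the idea behind the diagnostic concrete, the following Python sketch computes a simplified Geweke-style z-score. It is not the coda implementation: coda estimates the variances of the two segment means from spectral densities, whereas this sketch substitutes a batch-means estimate (an assumption made to keep the example short), and the names `var_of_mean` and `geweke_z` are hypothetical. The default fractions (first 10% and last 50% of the chain) follow coda's defaults.

```python
import math
import random


def var_of_mean(x, n_batches=20):
    """Batch-means estimate of the variance of the sample mean of x."""
    size = len(x) // n_batches
    means = [sum(x[k * size:(k + 1) * size]) / size for k in range(n_batches)]
    grand = sum(means) / n_batches
    s2 = sum((m - grand) ** 2 for m in means) / (n_batches - 1)
    return s2 / n_batches


def geweke_z(chain, first=0.1, last=0.5):
    """Compare the mean of the first part of the chain with that of the last part."""
    n = len(chain)
    head = chain[: int(first * n)]
    tail = chain[n - int(last * n):]
    diff = sum(head) / len(head) - sum(tail) / len(tail)
    return diff / math.sqrt(var_of_mean(head) + var_of_mean(tail))


rng = random.Random(3)  # arbitrary seed
iid_chain = [rng.gauss(0.0, 1.0) for _ in range(5000)]            # stationary
drift_chain = [k / 1000.0 + rng.gauss(0.0, 1.0) for k in range(5000)]  # trending
z_iid = geweke_z(iid_chain)
z_drift = geweke_z(drift_chain)
print(f"z (stationary) = {z_iid:.2f}, z (drifting) = {z_drift:.2f}")
```

A stationary chain should give a z-score of moderate size, while a chain whose mean is still drifting gives a large absolute value.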

4 Examples

Suppose we have the following three pathways S₁, S₂ and S₃ suggested by KEGG. We overlay the information collected by PrimeDB on the pathway structure. For example, the pair (7, 3) plotted on top of the right arrow between g₁ and g₂ indicates that 7 journal articles report that g₁ activates g₂ and 3 articles report the contrary, that g₁ inhibits g₂. The KEGG information is given by the structures of the weighted graphs shown, so the right arrows and T (stop) arrows in the graphs are those suggested by KEGG. For example, in all the pathways, KEGG suggests that g₁ activates g₂, as pictured with the right arrow. However, we believe this only to a certain degree, so we allow for the possibility that g₁ actually inhibits g₂ with a probability that may be small. This framework is consistent with the PrimeDB results. We summarize the prior information from both KEGG and PrimeDB as follows:

Pathway 1 (S₁):
Pathway 2 (S₂):
Pathway 3 (S₃):

Let us first consider the simple Bayesian model. The prior distribution for S₁ is the product of the following five distributions: θ₁ ~ βe(1, 1), θ_2|1 ~ βe(7, 3), θ_3|2 ~ βe(8, 1), ϕ_4|1 ~ βe(3, 3) and ϕ_5|4 ~ βe(9, 6). Note that the PrimeDB entry for the g₄ to g₅ reaction is (6, 9), indicating 6 journal articles reporting activation and 9 journal articles reporting inhibition. Since ϕ_5|4 is the transition probability for g₄ inhibiting g₅, we construct ϕ_5|4 ~ βe(9, 6) from the PrimeDB counts. The prior distribution for S₂ is the product of θ₁ ~ βe(1, 1), θ_2|1 ~ βe(7, 3), θ_3|2 ~ βe(8, 1), ϕ_4|3 ~ βe(10, 1) and ϕ_5|4 ~ βe(9, 6). Note that the parameters of the beta distributions in the latter two components are in reverse order because of the inhibition effect proposed by KEGG. Suppose we have three microarray experiments yielding the following results ( $\cup, \cap, \cup, \cap, \cup$ ), ( $\cap, \cap, \cap, \cup, \cup$ ), ( $\cup, \cap, \cap, \cup, \cap$ ) for g₁, . . . , g₅. Then the likelihood for the data under S₁ is $θ_{1}^{2} (1 - θ_{1}) θ_{2 ∣ 1} {(1 - θ_{2 ∣ 1})}^{2} θ_{3 ∣ 2}^{2} (1 - θ_{3 ∣ 2}) ϕ_{4 ∣ 1}^{2} (1 - ϕ_{4 ∣ 1}) ϕ_{5 ∣ 4}^{2} (1 - ϕ_{5 ∣ 4})$ . So the posterior distribution is the joint distribution of the independent components θ₁ ~ βe(3, 2), θ_2|1 ~ βe(8, 5), θ_3|2 ~ βe(10, 2), ϕ_4|1 ~ βe(5, 4) and ϕ_5|4 ~ βe(11, 7). The likelihood under S₂ is the same as that under S₁ except that $ϕ_{4 ∣ 1}^{2} (1 - ϕ_{4 ∣ 1})$ is replaced by $ϕ_{4 ∣ 3}^{3}$ , so the posterior distribution for S₂ is the same as that for S₁ except that the posterior component for ϕ_4|1 is replaced by ϕ_4|3 ~ βe(13, 1).
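
The counting behind these posterior parameters can be checked with a short Python sketch. The encoding is an assumption made for illustration: "U" stands for ∪ and "D" for ∩, an activated child is concordant when it shares its parent's state, an inhibited child is concordant when its state is opposite, and the names `concordant_count` and `posterior_params` are hypothetical.

```python
# Outcomes for g1..g5 in the three experiments: "U" = up (∪), "D" = down (∩).
experiments = [
    ("U", "D", "U", "D", "U"),
    ("D", "D", "D", "U", "U"),
    ("U", "D", "D", "U", "D"),
]

# Edges of S1 as (child, parent, kind, a, b); genes are numbered from 1 and
# (a, b) are the PrimeDB (activation, inhibition) article counts.
s1_edges = [
    (2, 1, "act", 7, 3),
    (3, 2, "act", 8, 1),
    (4, 1, "inh", 3, 3),
    (5, 4, "inh", 6, 9),
]
s2_edge_g4 = (4, 3, "inh", 1, 10)  # in S2, g4 is inhibited by g3 instead of g1


def concordant_count(child, parent, kind, data):
    """Number of experiments in which the child agrees with the edge type."""
    same = sum(1 for row in data if row[child - 1] == row[parent - 1])
    return same if kind == "act" else len(data) - same


def posterior_params(edge, data):
    """Conjugate beta update; inhibition edges use the prior Beta(b, a)."""
    child, parent, kind, a, b = edge
    n_conc = concordant_count(child, parent, kind, data)
    n_disc = len(data) - n_conc
    alpha, beta = (a, b) if kind == "act" else (b, a)
    return alpha + n_conc, beta + n_disc


# Root gene g1 has a Beta(1, 1) prior on being up-regulated.
ups = sum(1 for row in experiments if row[0] == "U")
root_posterior = (1 + ups, 1 + len(experiments) - ups)

s1_posteriors = [posterior_params(edge, experiments) for edge in s1_edges]
print(root_posterior, s1_posteriors, posterior_params(s2_edge_g4, experiments))
```

The posterior means follow as α/(α + β); for example, 8/13 ≈ 0.6154 for θ_2|1.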

For pathway selection, we need to evaluate the symmetric KL divergence for each pathway. We tabulate the activation and inhibition counts (a_i|j, b_i|j) from PrimeDB, and the concordant and discordant counts (n_i|j, n – n_i|j) from the microarray experiments, in Table 3. We use the notation SKL_i|j for the symmetric KL divergence for gene i with parent j. SKL_i|j can be easily calculated using (2.6) and (2.7).
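
As a numerical check, the per-gene and per-pathway values reported in Tables 3 and 4 can be recomputed with the Python sketch below. The closed form used in `skl_child` is read off the digamma expressions in Table 3 rather than quoted from (2.6) and (2.7), the digamma approximation is a standard recurrence plus asymptotic series, and all names are hypothetical.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence, then an asymptotic series)."""
    shift = 0.0
    while x < 6.0:
        shift -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return shift + math.log(x) - 0.5 / x - series


def skl_child(alpha, beta, n_conc, n_disc):
    """Symmetric KL between Beta(alpha, beta) and Beta(alpha + n_conc, beta + n_disc)."""
    n = n_conc + n_disc
    return (n_conc * (digamma(alpha + n_conc) - digamma(alpha))
            + n_disc * (digamma(beta + n_disc) - digamma(beta))
            - n * (digamma(alpha + beta + n) - digamma(alpha + beta)))


# Priors are Beta(a, b) for activation edges and Beta(b, a) for inhibition edges.
skl = {
    "g1": skl_child(1, 1, 2, 1),      # root gene, uniform prior
    "g2|1": skl_child(7, 3, 1, 2),    # activation, (a, b) = (7, 3)
    "g3|2": skl_child(8, 1, 2, 1),    # activation, (a, b) = (8, 1)
    "g4|1": skl_child(3, 3, 2, 1),    # inhibition, (a, b) = (3, 3)
    "g5|4": skl_child(9, 6, 2, 1),    # inhibition, (a, b) = (6, 9)
    "g4|3": skl_child(10, 1, 3, 0),   # inhibition, (a, b) = (1, 10)
}

pathways = {
    "S1": ["g1", "g2|1", "g3|2", "g4|1", "g5|4"],
    "S2": ["g1", "g2|1", "g3|2", "g4|3", "g5|4"],
    "S3": ["g1", "g2|1", "g4|1"],
}
pathway_skl = {name: sum(skl[g] for g in genes) / len(genes)
               for name, genes in pathways.items()}

for name in sorted(pathway_skl, key=pathway_skl.get):
    print(name, round(pathway_skl[name], 4))
```

The pathway with the smallest average, S₂, is listed first.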

Table 3.

Summary of the directional counts and SKL per gene

Genes	a_i	b_i	n̄_i	n – n̄_i	SKL_i
g ₁	1	1	2	1	2ψ(2) + ψ(1) – 3ψ(5) + 5 = 0.75

Genes	a _i\|j	b _i\|j	n _i\|j	n – n _i\|j	SKL _i\|j
g _2\|1	7	3	1	2	ψ(8) – ψ(7) + 2[ψ(5) – ψ(3)] – 3[ψ(13) – ψ(10)] = 0.4868
g _3\|2	8	1	2	1	2[ψ(10) – ψ(8)] + ψ(2) – ψ(1) – 3[ψ(12) – ψ(9)] = 0.5662
g _4\|1	3	3	2	1	ψ(4) – ψ(3) + 2[ψ(5) – ψ(3)] – 3[ψ(9) – ψ(6)] = 0.1964
g _5\|4	6	9	2	1	ψ(7) – ψ(6) + 2[ψ(11) – ψ(9)] – 3[ψ(18) – ψ(15)] = 0.0249
g _4\|3	1	10	3	0	3[ψ(13) – ψ(10)] – 3[ψ(14) – ψ(11)] = 0.0692

We summarize the symmetric KL divergence for the three pathways in Table 4, and conclude that S₂ is best supported by the microarray experiments.

Table 4.

Pathway selection based on the simple model

Pathway SKL(S)
S ₁	⅕(SKL₁ + SKL_2\|1 + SKL_3\|2 + SKL_4\|1 + SKL_5\|4) = 0.4049
S ₂	⅕(SKL₁ + SKL_2\|1 + SKL_3\|2 + SKL_4\|3 + SKL_5\|4) = 0.3794
S ₃	⅓(SKL₁ + SKL_2\|1 + SKL_4\|1) = 0.4777

Figure 2 displays the discrepancy between the prior and posterior densities for each transition probability in these three pathways. Among the six graphs, the prior and posterior of ϕ_5|4 overlap to the largest extent. For ϕ_4|3, the discrepancy mainly lies in the tiny area underneath the peak, while the moderate difference for ϕ_4|1 ranges from 0.1 to 0.8. The large piece of the prior for θ_3|2 that protrudes beyond its posterior results in a larger discrepancy than that of θ_2|1. So the order of the differences shown in the figure and the ranking of SKL_i|j in Table 3 are coherent.

Fig. 2 — Prior and posterior densities for transition probabilities

Now we consider the multilevel models with the same data. We first generate θ^(t) and ϕ^(t) independently from $E (μ)$ and $E (ν)$ , for t = 1, . . ., 100000. We then give the g-functions and h-functions in three pathways. For S₁, the activated gene set is $A = (g_{2}, g_{3})$ . So we have

\begin{matrix} g_{2 A}^{*} (θ) & = ψ (7 θ + 1) + 2 ψ (3 θ + 2) - 3 ψ (10 θ + 3), \\ g_{3 A}^{*} (θ) & = 2 ψ (8 θ + 2) + ψ (θ + 1) - 3 ψ (9 θ + 3), \\ g_{2 A} (θ) & = ψ (7 θ) + 2 ψ (3 θ) - 3 ψ (10 θ), \\ g_{3 A} (θ) & = 2 ψ (8 θ) + ψ (θ) - 3 ψ (9 θ), \\ h_{1} (θ) & = \frac{B (7 θ + 1, 3 θ + 2)}{B (7 θ, 3 θ)} \frac{B (8 θ + 2, θ + 1)}{B (8 θ, θ)} . \end{matrix}

Moreover, the inhibited gene set is $I = (g_{4}, g_{5})$ , so

\begin{matrix} g_{4 I}^{*} (ϕ) & = 2 ψ (3 ϕ + 2) + ψ (3 ϕ + 1) - 3 ψ (6 ϕ + 3), \\ g_{5 I}^{*} (ϕ) & = 2 ψ (9 ϕ + 2) + ψ (6 ϕ + 1) - 3 ψ (15 ϕ + 3), \\ g_{4 I} (ϕ) & = 2 ψ (3 ϕ) + ψ (3 ϕ) - 3 ψ (6 ϕ), \\ g_{5 I} (ϕ) & = 2 ψ (9 ϕ) + ψ (6 ϕ) - 3 ψ (15 ϕ), \\ h_{2} (ϕ) & = \frac{B (3 ϕ + 2, 3 ϕ + 1)}{B (3 ϕ, 3 ϕ)} \frac{B (9 ϕ + 2, 6 ϕ + 1)}{B (9 ϕ, 6 ϕ)} . \end{matrix}

Observe that S₂ differs from S₁ only in the arc into g₄, which is inhibited by g₃ rather than by g₁. So in S₂, we replace $g_{4 I}^{*}$ , $g_{4 I}$ , and h₂(ϕ) with

\begin{matrix} g_{4 I}^{*} (ϕ) & = 3 ψ (10 ϕ + 3) - 3 ψ (11 ϕ + 3), \\ g_{4 I} (ϕ) & = 3 ψ (10 ϕ) - 3 ψ (11 ϕ), \\ h_{2} (ϕ) & = \frac{B (10 ϕ + 3, ϕ)}{B (10 ϕ, ϕ)} \frac{B (9 ϕ + 2, 6 ϕ + 1)}{B (9 ϕ, 6 ϕ)} . \end{matrix}

S₃ is a subgraph of S₁ with g₁ activating g₂ and inhibiting g₄. We use the corresponding g-functions, and change $h_{1} (θ) = \frac{B (7 θ + 1, 3 θ + 2)}{B (7 θ, 3 θ)}$ and $h_{2} (ϕ) = \frac{B (3 ϕ + 2, 3 ϕ + 1)}{B (3 ϕ, 3 ϕ)}$ . In addition to letting μ = 1 and ν = 1, in which case the transition probabilities conditional on the hyperparameters are simply the same as those in the simple model, we vary the values of the hyperparameters. The results in Table 5 show that lowering both μ and ν to 0.5 and 0.25, i.e., lessening our prior belief in the activation and inhibition effects indicated by the PrimeDB, does not alter the ranking of the pathways. However, if we put dramatically different weights on activation and inhibition, the ranking may change. In this situation, biologists would need to apply strong expert knowledge to assign the weights.

Table 5.

SKL(S) based on the multilevel model

Pathways	μ = 1	μ = 0.5	μ = 0.5	μ = 1	μ = 0.25	μ = 1	μ = 0.25
	ν = 1	ν = 0.5	ν = 1	ν = 0.5	ν = 0.25	ν = 0.25	ν = 1
S ₁	0.7194	0.5090	0.5608	0.6680	0.3694	0.6330	0.4540
S ₂	0.6285	0.4535	0.4700	0.6131	0.3376	0.6009	0.3636
S ₃	0.7216	0.5539	0.6225	0.6524	0.4388	0.6070	0.5521

The posterior estimates for the parameters for the simple model and the multilevel model with μ = 1 and ν = 1 are listed in Tables 6 and 7, respectively. The results show that the two sets of estimates of the transitional probabilities, one with the simple model and the other with the multilevel model, are very close. Note that in the simple model, the posterior estimates for S₃ are exactly the same as their counterparts in S₁.

Table 6.

Posterior estimates based on the simple model

S ₁			S ₂
Param.	Mean	Std. Error	Param.	Mean	Std. Error
θ _2\|1	0.6154	0.1300	θ _2\|1	0.6154	0.1300
θ _3\|2	0.8333	0.1034	θ _3\|2	0.8333	0.1034
ϕ _4\|1	0.5556	0.1571	ϕ _4\|3	0.9286	0.0665
ϕ _5\|4	0.6111	0.1118	ϕ _5\|4	0.6111	0.1118

Table 7.

Posterior estimates based on the multilevel model

S ₁			S ₂			S ₃
Param.	Mean	Std. Error	Param.	Mean	Std. Error	Param.	Mean	Std. Error
θ	1.2642	1.0880	θ	1.2457	1.1083	θ	1.0593	1.0273
ϕ	1.2905	1.0494	ϕ	1.1019	1.0084	ϕ	0.9984	0.9873
θ _2\|1	0.5922	0.1549	θ _2\|1	0.5948	0.1578	θ _2\|1	0.5653	0.1768
θ _3\|2	0.8260	0.1194	θ _3\|2	0.8217	0.1217
ϕ _4\|1	0.5653	0.1676	ϕ _4\|3	0.9381	0.0668	ϕ _4\|1	0.5752	0.1766
ϕ _5\|4	0.6159	0.1245	ϕ _5\|4	0.6141	0.1358

5 Application to an Osteoblast Lineage Study

Osteoblast differentiation is regulated by a number of systemic hormones and local factors that induce different signaling pathways in cells within the osteoprogenitor lineage. We use four biological pathways as benchmarks in the present study to test the model performance. They include the Wnt signaling pathway, the bone morphogenetic protein (BMP) signaling pathway, a specified calcium signaling pathway, and the adipocytokine signaling pathway. The pathway structures (i.e., the molecules involved and the interactions among the molecules) can be retrieved from the literature and from public databases such as KEGG and BioCarta. Kalajzic et al. [16] report the essential roles of the first two pathways in modulating bone mass. The latter two pathways are considered to be biologically irrelevant to osteoprogenitor cell differentiation, as pointed out by other biological studies including [9, 12–15, 18, 27, 28].

We simplified the complicated pathway structures by keeping their main trunks and pruning off most of the branches. The “stripped” versions of the pathways are shown in Fig. 3. Essentially, only the key players along the signal transduction path from the beginning (ligand) to the end (usually a transcription factor) and their direct regulators are kept. We made this simplification for the following reasons. First, a few key players are usually enough to determine whether a pathway is active or not. For example, in the simplest scenario of a single experiment, knowing that the expression levels of wnt, fzd and tcf are up-regulated, a biologist is inclined to predict that the Wnt signaling pathway is active, since the three molecules are the ligand, receptor and final effector of the pathway, respectively [16]. Going beyond this judgment call, our models not only allow the synthesis of multiple experimental outcomes but also numerically measure the pathways’ activities. Second, it is not necessarily true that all molecules in a pathway will behave consistently when it is active. Some of them may have functions not exclusive to the pathway, and hence show no change or even a change opposite to what the pathway predicts.

Fig. 3 — Four simplified signaling pathways

We validated our models through a microarray study [16] in which mouse calvarial cultures at days 7 and 17 underwent Affymetrix microarray analysis to understand the gene expression patterns at distinct stages of osteoprogenitor maturation. Within a primary bone cell culture, a limited number of cells become mature osteoblasts, and these represent only a small proportion of the total cell population. Therefore, it is never certain whether the observed gene expression changes based on a heterogeneous cell mixture are associated with fully differentiated osteoblasts only or with other cell populations. To overcome this problem, Kalajzic et al. utilized Col1a1 promoter-green fluorescent protein transgenic mouse lines to generate more homogeneous cell populations at the preosteoblastic stage and the mature osteoblast stage. They demonstrated the importance of this cell separation for valid microarray interpretation. For illustration purposes, we focused on the gene intensities in sorted mature osteoblasts only. They were taken from the cells with 2.3GFP^pos and the cells with 2.3GFP^neg in the 17-day-old cultures.

In this three-replicate data set, we first categorized the genes in the pathways of interest as up-regulated (∪), down-regulated (∩), or equivalently expressed (EE) if their fold changes are greater than 2, less than 1/2, or in between, respectively. If a gene has a single activator parent, we counted the number of coherent outcomes (denoted as nceq in Table 8) for the ordered pair (parent, child), i.e., the outcomes ( $\cup, \cup$ ), ( $\cap, \cap$ ) or (EE, EE). If a gene has a single inhibitor parent, nceq counts the (parent, child) outcomes ( $\cup, \cap$ ), ( $\cap, \cup$ ) or (EE, EE). If a gene has an odd number of multiple parents, then in each replication we checked coherence for each individual parent with the child as in the single-parent case, and used the majority rule to decide whether the gene is coherent overall. Here, nceq is the count of overall coherence over all replications. Likewise, if a gene has an even number of multiple parents and inhibitor parents are present, we used the inhibitor outcomes only to decide overall coherence per replication. If a gene has an even number of single-type parents (i.e., all activators or all inhibitors), we used the majority rule to decide overall coherence per replication when there is no tie, and favored coherence in the case of a tie.
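
The categorization and coherence rules above can be sketched in Python. The fold changes below are made-up numbers used only to exercise the code, the names `classify`, `pair_coherent`, `child_coherent` and `nceq` are hypothetical, and one reading of the text is assumed for the mixed even-parent case: the inhibitor votes are combined by majority, with ties counted as coherent.

```python
def classify(fold_change):
    """Map a fold change to "U" (up), "D" (down) or "EE" (equivalently expressed)."""
    if fold_change > 2:
        return "U"
    if fold_change < 0.5:
        return "D"
    return "EE"


def pair_coherent(parent, child, kind):
    """Coherence of one (parent, child) pair for an activator or inhibitor."""
    if parent == "EE" or child == "EE":
        return parent == child  # only (EE, EE) is coherent
    same = parent == child
    return same if kind == "act" else not same


def child_coherent(parents, child, kinds):
    """Overall coherence of a child in one replication."""
    votes = [pair_coherent(p, child, k) for p, k in zip(parents, kinds)]
    if len(votes) % 2 == 0 and len(set(kinds)) > 1:
        # Even number of mixed-type parents: keep the inhibitor votes only.
        votes = [v for v, k in zip(votes, kinds) if k == "inh"]
    return 2 * sum(votes) >= len(votes)  # majority rule, ties count as coherent


def nceq(parent_fcs, child_fcs, kinds):
    """Count coherent replications; parent_fcs[r] lists the parents' fold changes."""
    count = 0
    for parents, child in zip(parent_fcs, child_fcs):
        states = [classify(fc) for fc in parents]
        count += child_coherent(states, classify(child), kinds)
    return count


# Hypothetical fold changes for one activator parent and its child, three replicates.
single = nceq([[3.0], [0.3], [1.1]], [2.5, 0.4, 2.2], ["act"])
# Hypothetical fold changes for three parents (two activators, one inhibitor).
triple = nceq([[3.0, 2.4, 0.2], [0.3, 3.1, 2.6], [1.0, 1.2, 0.9]],
              [2.5, 2.8, 1.1], ["act", "act", "inh"])
print(single, triple)
```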

Table 8.

Prior counts and individual SKL scores for the four pathways

Pathway	Parent^a	Child^a	Dtype^b	nceq	a	b	skl.g
Wnt	Wnt5a	Fzd1	1	3	4.5	0.5	0.145
	Dkk1	Lrp6	0	2	1.5	0.5	0.883
	FL^c	Dvl2	2	1	3.5	2.5	0.354
	Dvl2	Gsk3b	0	3	4.5	1.5	0.370
	GAAC^c	Ctnnb1	2	3	5.5	4.5	0.584
	Ctnnb1	Tcf7	1	1	8.5	1.5	1.428
Bmp	Bmp8a	Amhr2	1	3	1.5	0.5	0.807
	AS^c	Smad1	2	2	1.5	0.5	0.883
	Smad1	Smad4	1	2	1.5	0.5	0.883
	ST^c	Smad2	2	3	1.5	4.5	2.754
	Smad2	E2f4	1	3	1.5	0.5	0.807
	E2f4	Myc	1	1	1.5	0.5	2.750
	Smad2	Sp1	1	2	1.5	0.5	0.883
	SM^c	Cdkn2b	2	1	1.5	2.5	0.188
	Tgfb1	Tgfbr1	1	2	7.5	0.5	1.494
Calcium	Htr5a	Gnas	1	3	1.5	0.5	0.807
	Gnas	Adcy8	1	3	4.5	1.5	0.370
	Adcy8	Prkacb	1	3	5.5	1.5	0.270
	Prkacb	Pln	0	0	1.5	0.5	5.950
	Pln	Atp2a1	0	0	1.5	0.5	5.950
	Atp2a1	Calm3	1	1	1.5	1.5	0.450
	Calm3	1500003O03Rik	1	2	2.5	1.5	0.188
	Calm3	Camk2g	1	2	5.5	1.5	0.201
Adipocytokine	Tnfrsf1b	Traf2	1	1	1.5	0.5	2.750
	Traf2	Mtor	1	2	1.5	1.5	0.450
	Traf2	Mapk9	1	2	1.5	2.5	0.683
	Traf2	Ikbkb	1	3	1.5	4.5	2.754
	MMLS^c	Irs1	2	2	3.5	0.5	1.166
	Lepr	Jak2	1	1	2.5	0.5	3.383
	Jak2	Stat3	1	2	2.5	0.5	1.021
	Stat3	Socs3	1	3	2.5	0.5	0.374

^a Parent and Child columns list the official gene symbols

^b Dtype = 1, 0 and 2 for single activator, single inhibitor and multiple parents, respectively

^c Multiple parents. FL: Fzd1 and Lrp6; GAAC: Gsk3b, Axin, Apc and Csnk1a1; AS: Amhr2 and Smad1; ST: Smad6 and Tgfbr1; SM: Sp1 and Myc; MMLS: Mtor, Mapk9, Ikbkb and Socs3

Given that the PrimeDB is still under construction, we use the KEGG database as a proxy for the PrimeDB in this example. We queried the biological pathway information stored in the KEGG database to obtain the prior knowledge of the pathways. Each pairwise interaction in the pathways was checked against all the pathways in the KEGG database. For a gene with a single activator, we counted the number of pathways in which the parent–child activation interaction exists. Parameter a is then defined as this count plus 0.5; the number 0.5 is added to all prior counts to avoid an improper prior caused by zero counts. We defined parameter b similarly from the number of pathways in which both parent and child exist but have no activation association between them. Likewise, the activation interaction was replaced with inhibition in defining a and b for genes with a single inhibitor parent. For the multiple-parent case, we defined parameter a similarly from the number of pathways in which at least one of the parents has an interaction pointing to the child, and parameter b from the number of pathways in which at least one parent exists together with the child but there is no interaction between any parent–child pair. The interaction type between the parents and the child, i.e., activation or inhibition, was determined by the majority rule. All the KEGG pathway information was downloaded and stored in a relational database, so that the prior parameters can be retrieved automatically.

Table 8 lists the parent–child directional interaction (denoted as Dtype), the prior parameters, the number of coherent outcomes (nceq), and the individual SKL score per child (skl.g) based on the simple model for the four signaling pathways. Dtype takes the values 1, 0 and 2 to stand for activation, inhibition and the multiple-parent case, respectively. Table 9 gives the SKL scores for the four pathways based on the simple model and the multilevel model. The two signaling pathways that play essential roles in osteoblast lineage progression, Wnt and BMP, have smaller SKL scores than the other two pathways. In multilevel models 1–3, we set the hyperparameters that govern our belief in the prior counts acquired from KEGG all equal to 1, 2 and 0.5, respectively. The ranking of the pathways is not sensitive to the choice of hyperparameter values.

Table 9.

SKL(S) of the four pathways

Pathways	Simple Model	Multilevel Model1	Multilevel Model2	Multilevel Model3
Wnt	1.0922	1.0673	1.1565	0.9828
Bmp	1.3916	1.2471	1.4976	1.0197
Calcium	1.8263	1.7778	2.0835	1.4889
Adipocytokine	1.7081	1.4572	1.7602	1.1823

It is worth noting that the above extension to multiple parents does not follow the usual line of a Bayesian network. It is possible to use the usual Bayesian network extension to construct the conditional distribution given multiple parental nodes. However, the lack of prior information from the literature search in practice has made us resort to a more realistic approach, as presented here.

6 Discussion

We have proposed a novel methodology to integrate the high-throughput data, pathway structure and medical literature regarding the gene-gene directional interactions (PrimeDB). We construct BN from a pathway database, and use PrimeDB to guide the choices of prior parameters or hyperparameters in the BN. Then we show how to update these information using high-throughput genomics experiments. Our method numerically measures the strength of agreement between each pathway and the experiment using the symmetric Kullback Leibler measure incorporating the activation/inhibition association permeated in the biomedical literature. So we can rank the importance of these pathways in terms of their relatedness to the biological experiments to gain further knowledge in system biology. When a pathway agrees with the experimental data structurally, we would expect the pathway has a small SKL divergence measure. However, this cannot be guaranteed if the prior belief is terrible. So our method also relies on good choices of the prior distribution that should be flat and has huge support. In the illustration using real data, we have chosen the KEGG pathway database as the primary source of pathway structure and its gene-gene interaction as representative of medical journal counts (PrimeDB). The result might rely on the extant knowledge from a single data repository. However, our methodology is general enough that can be applied to any good databases that include gene-gene direction relationships as a substitute of the PrimeDB.

When a pathway is identified, it suggests the pathway is most agreeable with the data presented. On the other hand, the pathway with the highest K–L divergence suggests the pathway structure is not well supported by the data. This may be caused by (1) unexpected data, (2) dubious pathway structure, or both. So it has potential to provide more insights into the pathways.

We have used small sample size in our simulated and real examples, primarily small sample size is common in microarray experiments. Nevertheless, our method can handle any sample size as well. Microarray techniques are known to be noisy for biologists, so their results are often questioned by the biologists. Small sample exacerbates this situation. So incorporating a Bayesian frame work and borrowing information from the literature will add credibility to the microarray studies. Moreover, the multilevel model provides a more robust framework for sharing information among similar genes in evaluating the pathways.

Our method can handle large networks quite efficiently. It first computes the SKL score for each gene independently, then takes their averages as the pathway SKL score. In the simple model, the gene-specific SKL score reduces to a linear function of digamma functions. It hence can readily handle large pathways at fast speed. For the multilevel model, we use a collapsing technique and as a result sample from the prior distribution only instead of from both prior and posterior distributions of θ and ϕ, the two parameters that govern prior belief of the literature count. This greatly improves the efficiency in the numerical evaluation of the integrals in I₂′ and I₃′, the sums of the SKL scores for activated genes and inhibited genes. R programs have been developed based on this method. In the osteoblast lineage study, the four pathways have 11, 12, 9 and 10 genes, respectively. It takes less than 13 seconds in total to compute their SKL scores of the multilevel model using R 2.13.0 on a laptop computer with 2nd generation Intel Core i5-2410M processor 2.30 GHz. So it should not be a burden to compute SKL for large pathways which consist of about a hundred nodes at most as shown in KEGG.

Starting with the gene expression data from microarray, we first applied some statistical tests to classify the genes as up-regulated, down-regulated, or equivalently expressed. Fold change was used in our real data analysis. Then we construct Bayesian network for each pathway. So our study applies to the continuous gene expressions. However, we simplify the agreement assessment by discretizing the data. It would be interesting to extend our method to directly using the continuous gene expression. Nevertheless, we do not think the extension will be straightforward especially on a realistic prior construction.

We think the strength of this paper lies in its ability to handle directed graphs with directed prior information on gene functions. Our method can nevertheless be modified to handle undirected graphs: we no longer differentiate between activation and inhibition and instead treat every edge as a connection, so that the method applies with a reduced set of parameters. For mixed graphs with both directed and undirected edges, we need to add a connection component for the edges of unknown direction, after which they can be handled similarly.

In this paper, we also assume that microarray outcomes are available for all the genes in the pathway (BN). In practice, however, a pathway often includes genes that the microarray does not probe, which leads to a missing data problem in the BN. Conditional inference with incomplete data and model selection can still be carried out using the expectation-maximization (EM) algorithm or MCMC. Further investigation of this issue would be worthwhile.

Acknowledgements

The work of Yifang Zhao, Baikang Pei, David Rowe, Dong-Guk Shin, Wangang Xie, Fang Yu, and Lynn Kuo was partially supported by Grants NIH/NIGMS P20GM65764, NIH/NIDCR U24DE016495, and State of Connecticut Stem Cell Initiative 06SCC04. Ming-Hui Chen's work was partially supported by NIH grants GM70335 and CA74015.

Contributor Information

Yifang Zhao, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA yifang.zhao@gmail.com.

Ming-Hui Chen, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA ming-hui.chen@uconn.edu.

Baikang Pei, MYSM School of Medicine, Yale University, New Haven, USA baikang.pei@yale.edu.

David Rowe, School of Dental Medicine, University of Connecticut Health Center, Farmington, CT 06030, USA rowe@neuron.uchc.edu.

Dong-Guk Shin, Computer Science & Engineering, University of Connecticut, Storrs, CT 06269, USA shin@engr.uconn.edu.

Wangang Xie, Abbott Lab, Chicago, USA wangang.xie@abbott.com.

Fang Yu, Department of Biostatistics, University of Nebraska Medical Center, Omaha, USA fangyu@unmc.edu.

Lynn Kuo, Department of Statistics, University of Connecticut, Storrs, CT 06269, USA lynn.kuo@uconn.edu.

Genes	a_{i|j}	b_{i|j}	n_{i|j}	n − n_{i|j}	SKL_{i|j}
g_{2|1}	7	3	1	2	ψ(8) − ψ(7) + 2[ψ(5) − ψ(3)] − 3[ψ(13) − ψ(10)] = 0.4868
g_{3|2}	8	1	2	1	2[ψ(10) − ψ(8)] + ψ(2) − ψ(1) − 3[ψ(12) − ψ(9)] = 0.5662
g_{4|1}	3	3	2	1	ψ(4) − ψ(3) + 2[ψ(5) − ψ(3)] − 3[ψ(9) − ψ(6)] = 0.1964
g_{5|4}	6	9	2	1	ψ(7) − ψ(6) + 2[ψ(11) − ψ(9)] − 3[ψ(18) − ψ(15)] = 0.0249
g_{4|3}	1	10	3	0	3[ψ(13) − ψ(10)] − 3[ψ(14) − ψ(11)] = 0.0692
