Author manuscript; available in PMC: 2017 Dec 1.
Published in final edited form as: IEEE Trans Signal Process. 2016 Sep 1;64(23):6243–6253. doi: 10.1109/TSP.2016.2605072

Bayesian Regression with Network Prior: Optimal Bayesian Filtering Perspective

Xiaoning Qian 1, Edward R Dougherty 2
PMCID: PMC5560447  NIHMSID: NIHMS821740  PMID: 28824268

Abstract

The recently introduced intrinsically Bayesian robust filter (IBRF) provides fully optimal filtering relative to a prior distribution over an uncertainty class of joint random process models, whereas formerly the theory was limited to model-constrained Bayesian robust filters, for which optimization was limited to the filters that are optimal for models in the uncertainty class. This paper extends the IBRF theory to the situation where there are both a prior on the uncertainty class and sample data. The result is optimal Bayesian filtering (OBF), where optimality is relative to the posterior distribution derived from the prior and the data. The IBRF theories for effective characteristics and canonical expansions extend to the OBF setting. A salient focus of the present work is to demonstrate the advantages of Bayesian regression within the OBF setting over the classical Bayesian approach in the context of linear Gaussian models.

I. Introduction

Owing to increased computational ability, there is a growing desire to model complex systems. Unfortunately, rarely is there sufficient data to obtain satisfactory system identification. The problem exists across many areas of science, engineering, economics, and social science. Even when a system is not overly complex, accurate system identification is often impossible due to a paucity of data. Hence, there is a growing need to model system uncertainty in the sense that there is an uncertainty class of models, to which the underlying model belongs, and then to design a robust operator that is optimal over the uncertainty class with respect to some cost function.

Consider biology, where there is interaction across tens of thousands of genes and proteins in a single cell and billions of cells interacting in a multi-scale fashion within a single organism [1]–[3]. One would like to model gene regulatory networks in order to design classifiers for disease diagnosis or intervention strategies for treatment [4]–[6]. Owing to model complexity, intervention may need to take uncertainty into account [7]–[9].

In signal processing, robust filtering dates back to the 1970s in the design of optimal linear filters with uncertain power spectra and minimax optimization [10]–[14]. Minimax robust filters have been studied for various kinds of filters, including matched [15] and Kalman filters [16]. Bayesian robust filtering was proposed first for binary filtering [17] and then for linear filtering [18]. In the Bayesian framework, a prior distribution is assumed over the uncertainty class and optimization is relative to that distribution. In [17] and [18], optimization is constrained to the filters that are optimal for some model in the uncertainty class. In [19] that restriction is lifted.

In this paper, we extend the robust filtering framework in [19] by considering posterior distributions resulting from incorporating new data with the prior distribution in the Bayesian framework. In particular, we are interested in the regression formulation from a Bayesian perspective. This will allow us to systematically quantify the uncertainty of the prior knowledge on the system under study when regression accuracy is the operational objective of interest. As we would like to have closed-form results in this paper, we make simplifying model assumptions when needed.

The model formulation of optimal Bayesian regression in this paper is fundamentally different from the existing Bayesian regression literature, in which regression functions or models are pre-defined in certain functional forms, for example, Bayesian linear regression (BLR) [20]–[24]. Existing BLR models take the pre-defined functional form Y = ψ(X; β) = β^T X from covariates X (denoting measurement data) to predict the outcome response Y (denoting system characteristics). They typically enforce prior distribution assumptions on model coefficients β, including traditional multivariate Gaussian prior assumptions [20]–[23] as well as other sparse prior assumptions, including spike-and-slab prior, Bernoulli-Gaussian prior, Laplace prior, and other structured sparse prior assumptions arising from LASSO, group LASSO, and graphical LASSO [24]–[31]. In these existing BLR models, the connection of the regression functions and prior assumptions with the underlying systems or random processes is vague. There is a scientific gap in constructing functional models and making prior assumptions on model parameters when the actual uncertainty applies to the underlying random processes. Thus, given partial prior knowledge for the underlying random processes, the prior distribution should be placed directly on the random processes themselves, which is the approach taken in [32], [19], and here.

Our main contribution in this paper is to formulate Bayesian regression from the perspective of optimal Bayesian filtering (OBF), directly modeling the problem based on the joint distribution of covariates X and outcome response Y. Instead of directly imposing model assumptions on model coefficients without systematic modeling of interdependence relationships of the involved random variables, as in classical Bayesian regression methods, our new optimal Bayesian regression, based on optimal Bayesian filtering theory, models the joint distribution of involved random variables, both covariates and responses. We emphasize that the prior on interdependence relationships in the OBF framework, rather than directly enforcing ad hoc prior assumptions on model coefficients of the pre-defined regression functions as in classical Bayesian regression, can help improve prediction accuracy when data are limited.

The remainder of the paper is organized as follows. In Section II, we review the robust filtering theory in [19]. Section III derives the general framework for optimal Bayesian filtering by extending the robust filtering theory. Section IV describes the proposed optimal Bayesian regression (OBR) via the OBF theory, with the linear Gaussian model being the network prior as an example. In Section V, we provide experimental results with simulated data as well as a network example motivated by modeling colon cancer to demonstrate that the OBR can naturally integrate the network prior and improve the regression accuracy when there are not enough data. Finally, we conclude the paper in Section VI.

II. Intrinsically Bayesian Robust Filters

Intrinsically Bayesian robust filtering extends classical filtering by optimizing over random processes in the presence of uncertainty regarding those processes. Optimal Bayesian filtering extends Bayesian robust filtering by including data and can directly apply results for Bayesian robust filtering to its own setting. This section briefly reviews some basics of Bayesian robust filtering as developed in [19].

The basic filtering problem involves jointly distributed observation and signal random processes, (X(t), Y(v)), t ∈ T, v ∈ V, with T and V being index sets, possessing a joint probability distribution function:

$$F(x, y; t, v) = P\big(X(t) \le x,\; Y(v) \le y\big). \tag{1}$$

If X is a p-dimensional random vector, then X(t) ≤ x is replaced by p component inequalities. To keep the notation simple, we write F(x, y; t, v) in either case. It is commonplace for the distribution of the process not to be known, in which case we have known or partially known statistical properties. Optimal signal estimation involves estimating a signal Y(v) at time v via a filter ψ given observations {X(t)}_{t∈T}, namely, Ŷ(v) = ψ(X)(v). Optimization is relative to a family of filters, ℱ, where each ψ ∈ ℱ is a mapping ψ : 𝒪 → ℂ and 𝒪 is the space of possible observed signals, and a cost function C(Y(v), Ŷ(v)). If an optimal filter (OF) exists for a fixed v ∈ V (with finite error), then it can be expressed as

$$\psi_{\mathrm{OF}}(X)(v) = \arg\min_{\psi \in \mathcal{F}} C\big(Y(v), \psi(X)(v)\big), \tag{2}$$

where the minimum may be achieved by more than a single ψ. For the mean-square error (MSE) cost function,

$$C\big(Y(v), \psi(X)(v)\big) = E_F\big[\,|Y(v) - \psi(X)(v)|^2\,\big], \tag{3}$$

where EF denotes the expectation with respect to F and an optimal filter is referred to as a minimum-mean-square-error (MMSE) filter.

When the model is not known with certainty, we assume that the joint process belongs to an uncertainty class of processes defined by the parameter set Θ, each θ ∈ Θ corresponding to a distribution Fθ(x, y; t, v), and optimality is defined relative to Θ. As characterized in [19], given an uncertainty class Θ, a prior distribution π(θ), a cost function C, and a filter class ℱ, an intrinsically Bayesian robust filter (IBRF) in ℱ relative to π is defined by

$$\psi_{\mathrm{IBRF}}(X)(v) = \arg\min_{\psi \in \mathcal{F}} E_\pi\big[C\big(Y_\theta(v), \psi(X_\theta)(v)\big)\big]. \tag{4}$$

The adverb “intrinsically” refers to the fact that optimization is over all filters in ℱ, as opposed to only being taken with respect to filters that are optimal for some θ ∈ Θ, in which case an optimal filter is called a model-constrained Bayesian robust filter (MCBRF) [17], [18]. In the terminology of [33], the IBRF Bayesian framework is M-closed because the uncertainty class Θ consists of a parametric class of possible models.

A. Effective Characteristics

A key to finding an IBRF is that finding an optimal filter relative to a prior distribution over an uncertainty class of joint random processes often reduces to finding a single effective joint random process whose optimal filter is the IBRF for the uncertainty class [19].

Definition 1: Let C : ℂ × ℂ → ℝ+ be a measurable cost function of two variables, where ℝ+ is the set of non-negative real numbers. Let ℱ be a function class, where each ψ ∈ ℱ maps from a signal space, 𝒪, to ℂ. An observation and signal pair, (X(t), Y(v)), is solvable under cost C and function class ℱ if there exists a function ψOF ∈ ℱ minimizing C(Y(v), ψ(X)(v)) over all ψ ∈ ℱ.

Definition 2: Let Θ be an uncertainty class of process pairs having prior distribution π. An observation and signal pair (Xeff(t), Yeff(v)) is an effective process under cost C, function class ℱ, and uncertainty class Θ if for all ψ ∈ ℱ, both Eπ[C(Yθ(v), ψ(Xθ)(v))] and C(Yeff(v), ψ(Xeff)(v)) exist and

$$E_\pi\big[C\big(Y_\theta(v), \psi(X_\theta)(v)\big)\big] = C\big(Y_{\mathrm{eff}}(v), \psi(X_{\mathrm{eff}})(v)\big). \tag{5}$$

Theorem 1: Let Θ be an uncertainty class of process pairs having the prior distribution π. If there exists a solvable effective process, (Xeff(t), Yeff(v)), with optimal filter ψeff, then ψIBRF = ψeff.

The effective process may or may not belong to the uncertainty class. The general procedure is to establish a class, Φ, of solvable processes possessing known optimal filters and find a process {Xeff(t), Yeff(s)} ∈ Φ, for which (5) is satisfied. According to Theorem 1, an IBR filter is given by the optimal filter corresponding to the effective process.

Given a class of filters, filter error and optimality often depend on only certain characteristics, a characteristic being a deterministic function derived from the random process. For instance, the MMSE linear filter depends on only second-order moments. In addition to effective processes, an IBRF can often be expressed in the same closed form as a model-specific optimal filter with the original characteristics replaced by effective characteristics [19].

Definition 3: A class, Λ, of process pairs, {Xλ(t), Yλ(s)}, is reducible under cost C and function class ℱ if there exists a cost functional 𝒢 such that for each λ ∈ Λ and ψ ∈ ℱ,

$$C\big(Y_\lambda(s), \psi(X_\lambda)(s)\big) = \mathcal{G}(\omega_\lambda, \kappa_\psi), \tag{6}$$

where ωλ is a collection of process characteristics and κψ represents parameters for filter ψ.

Definition 4: A collection of characteristics, ω, is solvable in the weak sense under cost functional 𝒢 and function class ℱ if there exists a solution to

$$\psi_{\mathrm{weak}} = \arg\min_{\psi \in \mathcal{F}} \mathcal{G}(\omega, \kappa_\psi). \tag{7}$$

Given a set of characteristics, ω, that are solvable in the weak sense, there is an optimal filter, ψweak, that possesses a functional, 𝒢(ω, κ_{ψweak}).

Definition 5: Let Θ be an uncertainty class of process pairs contained in a reducible class. The characteristic ωeff is an effective characteristic in the weak sense under cost functional 𝒢, function class ℱ, and uncertainty class Θ if for all ψ ∈ ℱ, both EΘ[𝒢(ωθ, κψ)] and 𝒢(ωeff, κψ) exist and

$$E_\Theta\big[\mathcal{G}(\omega_\theta, \kappa_\psi)\big] = \mathcal{G}(\omega_{\mathrm{eff}}, \kappa_\psi). \tag{8}$$

Theorem 2: Let Θ be an uncertainty class of process pairs contained in a reducible class. If there exist weak-sense solvable weak-sense effective characteristics, ωeff, with optimal filter ψeff, then ψIBRF = ψeff.

If there exists an effective process providing the effective characteristics, then these are said to be effective in the strong sense; otherwise, they are effective in the weak sense.
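As a small worked illustration of Definitions 3 through 5 (not taken from [19]; the scalar, real-valued, zero-mean setting and the single gain g are assumptions made here for brevity), consider the linear filter class ψ(X) = gX under the MSE cost. The cost is linear in the second-order moments, so the prior expectation passes through to the characteristics:

```latex
% Assumed setting: zero-mean real scalars X_\theta, Y_\theta at a single index;
% filter class \psi(X) = gX, so that \kappa_\psi = g;
% characteristics \omega_\theta = (R_{Y_\theta}, R_{Y_\theta X_\theta}, R_{X_\theta}).
\begin{align*}
\mathcal{G}(\omega_\theta, g)
  &= E_{F_\theta}\big[(Y_\theta - gX_\theta)^2\big]
   = R_{Y_\theta} - 2g\,R_{Y_\theta X_\theta} + g^2 R_{X_\theta}, \\
E_\Theta\big[\mathcal{G}(\omega_\theta, g)\big]
  &= E_\Theta[R_{Y_\theta}] - 2g\,E_\Theta[R_{Y_\theta X_\theta}] + g^2\,E_\Theta[R_{X_\theta}]
   = \mathcal{G}(\omega_{\mathrm{eff}}, g), \\
\omega_{\mathrm{eff}}
  &= \big(E_\Theta[R_{Y_\theta}],\; E_\Theta[R_{Y_\theta X_\theta}],\; E_\Theta[R_{X_\theta}]\big), \\
g_{\mathrm{IBRF}}
  &= \arg\min_{g}\, \mathcal{G}(\omega_{\mathrm{eff}}, g)
   = \frac{E_\Theta[R_{Y_\theta X_\theta}]}{E_\Theta[R_{X_\theta}]}.
\end{align*}
```

In this sketch the robust gain is the ratio of the averaged moments, which in general differs from the average of the model-specific gains.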

B. Canonical Expansions

Many problems in random processes reduce to finding an integral canonical expansion of a random process X(t). Examples include optimal filtering, signal detection, and Karhunen-Loève compression. Assuming zero-mean processes, a canonical expansion takes the form

$$X(t) = \int_\Phi Z(\xi)\, x(t, \xi)\, d\xi, \tag{9}$$

where Z(ξ) is the white noise over Φ (the domain of ξ) and the coordinate functions x(t, ξ) are deterministic. The covariance function of continuous white noise is the generalized function I(ξ)δ(ξ − ξ′), where I(ξ) is the intensity of the white noise and the theory of integral representation is interpreted in the generalized sense. X(t) has the auto-correlation function

$$R_X(t, t') = \int_\Phi I(\xi)\, x(t, \xi)\, \overline{x(t', \xi)}\, d\xi \tag{10}$$

and x(t, ξ) = R_{XZ}(t, ξ) I(ξ)^{−1}, where R_{XZ}(t, ξ) is the cross-correlation between X(t) and Z(ξ).

Integral canonical expansions are formed via a kernel a(t, ξ) by defining

$$Z(\xi) = \int_T X(t)\, \overline{a(t, \xi)}\, dt. \tag{11}$$

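Three conditions are necessary and sufficient for a canonical expansion to result [34], [35]. In [19], these conditions are extended to IBR filtering by taking the expectations of the involved covariances with respect to the prior distribution and take the form

$$x(t, \xi) = \frac{1}{I(\xi)} \int_T a(v, \xi)\, E_\pi\big[R_{X_\theta}(t, v)\big]\, dv, \tag{12}$$

$$\int_T \overline{a(t, \xi)}\, x(t, \xi')\, dt = \delta(\xi - \xi'), \tag{13}$$

$$\int_\Phi x(t, \xi)\, \overline{a(t', \xi)}\, d\xi = \delta(t - t'), \tag{14}$$

where the intensity of the white noise is given by

$$I(\xi) = \int_\Phi \int_T \int_T E_\pi\big[R_{X_\theta}(t, t')\big]\, \overline{a(t, \xi)}\, a(t', \xi')\, dt\, dt'\, d\xi'. \tag{15}$$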

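Consider linear filters of the form

$$\psi(X)(v) = \int_T g(v, t)\, X(t)\, dt, \tag{16}$$

where the integral has a finite second moment. For an effective process (Xeff(t), Yeff(v)), if there exists an optimal MSE linear filter to estimate Yeff via Xeff, then it is given by

$$\hat{Y}_{\mathrm{eff}}(v) = \int_T \hat{g}(v, t)\, X_{\mathrm{eff}}(t)\, dt, \tag{17}$$

where

$$\hat{g}(v, t) = \int_\Xi \frac{\overline{a(t, \xi)}}{I(\xi)} \int_T E_\pi\big[R_{Y_\theta X_\theta}(v, u)\big]\, a(u, \xi)\, du\, d\xi. \tag{18}$$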

III. Optimal Bayesian Filters

As opposed to the setting in [19], suppose that, in addition to an uncertainty class and prior distribution, data are drawn from the joint random process to produce the random sample S = {(x(t1), y(v1)), … , (x(tn), y(vn))}, and π(θ, S) is the joint distribution over Θ and the sampling process. We refer to the operator providing the minimum expectation conditioned on the data as an optimal Bayesian filter (OBF):

$$\psi_{\mathrm{OBF}}(X, S)(v) = \arg\min_{\psi \in \mathcal{F}} E_\pi\big[C\big(Y_\theta(v), \psi(X_\theta)(v)\big) \,\big|\, S\big]. \tag{19}$$

The conditional expectation is relative to the posterior density π*(θ) = π(θ|S). When expressing the OBF in terms of the posterior, the expectation Eπ[ · |S] in (19) is changed to the expectation Eπ*[ · ]. The dependence of (X, Y) on θ is made explicit by writing (Xθ, Yθ); should the parameterization not be explicit, then the expectation takes the form Eπ[C(Y(v), ψ(X)(v))|θ|S] or Eπ*[C(Y(v), ψ(X)(v))|θ]. The OBF terminology is analogous to the corresponding problem in classification, where an optimal classifier relative to the posterior over an uncertainty class of feature-label distributions is called an “optimal Bayesian classifier” [32].

Note that we are not postulating a filtering model from X(t) to Y(v) and placing a prior on the model parameters. Since the family {Fθ (x, y; t, v) : θ ∈ Θ} represents our incomplete knowledge regarding the processes, prior knowledge relates directly to the joint distributions in the uncertainty class. If there are no data, then the OBF reduces to the IBRF. Moreover, the OBF for π and S is the IBRF for the posterior π*.

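For the MSE cost function,

$$\psi_{\mathrm{OBF}}(X, S)(v) = \arg\min_{\psi \in \mathcal{F}} E_{\pi^*} E_F\big[\,|Y_\theta(v) - \psi(X_\theta)(v)|^2\,\big]. \tag{20}$$

If ℱ consists of all measurable functions, then the MSE optimal filter is given by the conditional expectation. Thus,

$$\psi_{\mathrm{OBF}}(X, S)(v) = E_{\pi^*} E_F\big[Y(v) \mid X, \theta\big] = E_{\pi^*}\big[E_F[Y(v) \mid X] \,\big|\, \theta\big] = E_{\pi^*}\big[\psi_{\mathrm{OF}}(X)(v) \,\big|\, \theta\big]. \tag{21}$$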

Hence, the OBF can be found by taking the optimal filter (conditional expectation) for each θ and then averaging over the posterior.

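A different view of the MSE OBF can be obtained by rewriting (21):

$$\psi_{\mathrm{OBF}}(X, S)(v) = E_{\pi^*}\big[E_F[Y(v) \mid X] \,\big|\, \theta\big] = \int_\Theta \Big[\int y\, dF_\theta(y \mid X)\Big] \pi^*(\theta)\, d\theta = \int y\, d\Big[\int_\Theta F_\theta(y \mid X)\, \pi^*(\theta)\, d\theta\Big]. \tag{22}$$

Here the OBF is found by averaging the conditional distribution for Y given X over the posterior and then taking the conditional expectation of Y relative to that average. This average is called the effective conditional distribution for Y given X, relative to the posterior distribution π*. It is denoted by F^S_eff(y|X) and (22) becomes

$$\psi_{\mathrm{OBF}}(X, S)(v) = \int y\, dF^{S}_{\mathrm{eff}}(y \mid X) = E_{F^{S}_{\mathrm{eff}}}[Y \mid X], \tag{23}$$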

which is simply the usual conditional expectation form of the optimal MSE filter, but relative to the effective conditional distribution. The optimal MSE filter can be derived for any uncertainty class of models F.

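To compute required statistical properties such as moments, we define the effective joint distribution of (X, Y) by

$$F^{S}_{\mathrm{eff}}(x, y) = \int_\Theta F_\theta(x, y)\, \pi^*(\theta)\, d\theta. \tag{24}$$

In general, the expectation of the p-q moment relative to the posterior distribution can be computed by

$$E_{\pi^*} E_F\big[X^p(t_a) Y^q(v_b) \mid \theta\big] = \int_\Theta \int x^p(t_a)\, y^q(v_b)\, dF_\theta(x, y)\, \pi^*(\theta)\, d\theta = \int x^p(t_a)\, y^q(v_b)\, dF^{S}_{\mathrm{eff}}(x, y) = E_{F^{S}_{\mathrm{eff}}}\big[X^p(t_a) Y^q(v_b)\big], \tag{25}$$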

which is the joint moment relative to the effective joint distribution.

Given the random sample S, the posterior distribution is given by

$$\pi^*(\theta) = \pi(\theta \mid S) \propto \pi(\theta)\, \pi(S \mid \theta) = \pi(\theta) \prod_{i=1}^{n} f_\theta(x_i, y_i; t_i, v_i), \tag{26}$$

where we have assumed that Fθ(xi, yi; ti, vi) possesses the density fθ(xi, yi; ti, vi). The constant of proportionality can be found by normalizing the integral of π*(θ) to 1. It is important to recognize that in (26) the density fθ(xi, yi; ti, vi) is not fixed in general and depends on the index i (unless the joint process happens to be stationary).
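As a minimal numerical sketch of (26), assume a finite uncertainty class and a density that does not change with the index i. The model used here (x ~ N(0, 1), y | x ~ N(slope · x, 1)), the function names, and the three data points are all hypothetical and serve only to show the normalization step, which is done in log space to avoid underflow for larger samples.

```python
import math

def gaussian_logpdf(value, mean, std):
    """Log density of a univariate Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * std * std) - 0.5 * ((value - mean) / std) ** 2

def joint_logpdf(x, y, slope):
    """Hypothetical model density f_theta(x, y): x ~ N(0, 1), y | x ~ N(slope * x, 1)."""
    return gaussian_logpdf(x, 0.0, 1.0) + gaussian_logpdf(y, slope * x, 1.0)

def posterior_weights(prior, slopes, sample):
    """Posterior over a finite class: prior times the product of densities, normalized."""
    log_post = []
    for prior_mass, slope in zip(prior, slopes):
        log_lik = sum(joint_logpdf(x, y, slope) for x, y in sample)
        log_post.append(math.log(prior_mass) + log_lik)
    peak = max(log_post)                       # subtract the max before exponentiating
    unnormalized = [math.exp(v - peak) for v in log_post]
    total = sum(unnormalized)                  # constant of proportionality
    return [u / total for u in unnormalized]

slopes = [0.0, 1.0]                            # two candidate models theta_1, theta_2
prior = [0.5, 0.5]
sample = [(1.0, 1.1), (-0.5, -0.4), (2.0, 1.8)]
weights = posterior_weights(prior, slopes, sample)
```

With these illustrative data, most of the posterior mass moves to the model with slope 1, since the observed pairs lie close to the line y = x.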

In general, π(θ) is not required to be a valid density function. The priors are called “improper” if the integral of π(θ) is infinite, i.e., if π(θ) induces a σ-finite measure but not a finite probability measure. Such priors can be used to represent uniform weight for all parameters in an unbounded range, rather than truncating the range of each parameter to a finite range. The use of improper priors can lead to improper posterior distributions. When improper priors are used, Bayes’ rule does not apply, and (26) is taken as a definition, posterior distributions being normalized to possess unit integral. Hence, it is important to check that the normalizing integral does not diverge.

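As an illustration, consider the simple one-dimensional case where T = V = {0} and Fθ(x, y; 0, 0) is Gaussian for all θ ∈ Θ, as it is related to our later derivation of optimal Bayesian regression with the network prior. Then

$$E_F\big[Y(0) \mid X(0) = x, \theta\big] = \mu_{Y_\theta \mid x} = \mu_{Y_\theta} + \rho_\theta \frac{\sigma_{Y_\theta}}{\sigma_{X_\theta}} (x - \mu_{X_\theta})$$

is the classical linear Gaussian regression, where μXθ and μYθ are the means, σXθ and σYθ are the standard deviations of Xθ and Yθ, respectively, and ρθ is their correlation coefficient. Hence, (21) yields

$$\psi_{\mathrm{OBF}}(X, S)(0) = E_{\pi^*}[\mu_{Y_\theta}] - E_{\pi^*}\Big[\rho_\theta\, \mu_{X_\theta} \frac{\sigma_{Y_\theta}}{\sigma_{X_\theta}}\Big] + E_{\pi^*}\Big[\rho_\theta \frac{\sigma_{Y_\theta}}{\sigma_{X_\theta}}\Big]\, x,$$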

which is still a linear function of x. In the OBF setting, the resulting regression function requires computing the effective statistical properties, which are the moments with respect to the posterior in this special case.
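The posterior averages in this one-dimensional expression can be sketched numerically as follows, assuming a finite uncertainty class whose posterior weights are already available. The dictionary keys, the function name `obf_line`, and the two example models are hypothetical; the computation only averages the model-specific intercepts and slopes, as the displayed formula prescribes.

```python
def obf_line(models, weights):
    """Posterior-averaged intercept and slope of the one-dimensional Gaussian OBF.

    models:  list of dicts with keys mu_x, mu_y, sd_x, sd_y, rho (one per theta).
    weights: posterior probabilities pi*(theta) of the models, summing to 1.
    """
    intercept = 0.0
    slope = 0.0
    for model, weight in zip(models, weights):
        gain = model["rho"] * model["sd_y"] / model["sd_x"]   # rho * sigma_Y / sigma_X
        slope += weight * gain
        intercept += weight * (model["mu_y"] - gain * model["mu_x"])
    return intercept, slope

models = [
    {"mu_x": 0.0, "mu_y": 0.0, "sd_x": 1.0, "sd_y": 1.0, "rho": 0.5},
    {"mu_x": 1.0, "mu_y": 2.0, "sd_x": 1.0, "sd_y": 2.0, "rho": 0.5},
]
intercept, slope = obf_line(models, [0.5, 0.5])
prediction = intercept + slope * 2.0   # OBF estimate of Y(0) given X(0) = 2
```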

We emphasize that here the resulting linear regression function for predicting Y naturally comes from the Gaussian joint distribution assumption on F, which is fundamentally different from the classical Bayesian Linear Regression (BLR) that starts by assuming a linear regression function. In the classical BLR approach for this setup, a linear function of the form ψ(x) = βTx = β0 + β1x is assumed with a prior distribution on the parameter pair (β0, β1), thereby leading to a posterior distribution. The weakness of classical BLR is that the prior distribution is ad hoc relative to the underlying process Fθ(x, y; 0, 0), which in this case is a Gaussian. No matter the choice of prior for (β0, β1), the resulting regression can be no better than the OBF since, according to (21), the OBF is optimal. Structurally, the OBF is based on the uncertainty in the distribution of (X, Y ), which is characterized via π(θ), and the comparative performance of the BLR filter depends on the degree to which the prior on (β0, β1) happens to capture the knowledge in π(θ).

A. Canonical Expansions: Optimal Bayesian Wiener Filter

The IBRF theory for canonical expansions extends to OBFs by replacing the prior π in (11) through (15) with the posterior π*. We illustrate the use of canonical representations by deriving the optimal Bayesian Wiener filter via the theory of effective processes with respect to the posterior distribution. Let ℱ be the class of all linear integral operators of the form (16). The solvable class consists of all process pairs, (X, Y), such that ψ(X)(v) has a finite second moment for any g(v, t) and there exists ĝ(v, t) for which the Wiener-Hopf equation is satisfied. Define RΘ,Y(v, v) = Eπ*[RYθ(v, v)], RΘ,X(t, u) = Eπ*[RXθ(t, u)], and RΘ,YX(v, t) = Eπ*[RYθXθ(v, t)]. Since RΘ,X(t, u) satisfies the properties of a valid auto-correlation function, there exists a zero-mean Gaussian process Xeff with auto-correlation function RΘ,X(t, u). Similarly, there exists a zero-mean Gaussian process Yeff with auto-correlation function RΘ,Y(v, v) at v and cross-correlation function RΘ,YX(v, t) at v for all t ∈ T. Since (5) is satisfied, (Xeff(t), Yeff(v)) is an effective process. Thus, if (Xeff, Yeff) is a solvable process, then by Theorem 1 the OBF is found via ĝ(v, t) in (18) with RΘ,YX(v, u) in place of Eπ[RYθXθ(v, u)].

Consider the usual Wiener setting in which the processes are wide-sense stationary (WSS) with auto-correlation functions rXθ (τ) and rY (τ), power spectral densities SXθ (ω) and SY (ω), and cross-correlation function rYθ Xθ (τ) having Fourier transform SYθ Xθ (ω). The effective process is solvable and has auto-correlation function rΘ,X (τ) = Eπ* [rXθ (τ)] and cross-correlation function rΘ,Y X (τ) = Eπ* [rYθXθ (τ)]. The Fourier transforms of these are denoted by SΘ,X (ω) and SΘ,Y X (ω), respectively, and are called the effective power spectra. The classical theory applies to the effective process, so we know that it is solvable.

Letting T = (−∞, ∞) and a(t, ω) = e^{jωt}, (11) defines Zeff(ω) in terms of Xeff(t). Equations (12) through (14) are satisfied for the effective process with π* in place of π, and Xeff(t) has the integral canonical expansion (9)

$$X_{\mathrm{eff}}(t) = \frac{1}{2\pi} \int_{-\infty}^{\infty} Z_{\mathrm{eff}}(\omega)\, e^{j\omega t}\, d\omega. \tag{27}$$

From (15), Zeff(ω) has intensity IΘ(ω) = 2πSΘ,X(ω). Substitution into (18) with τ = v − t yields

$$\hat{g}_\Theta(\tau) = F^{-1}\left[\frac{S_{\Theta, YX}(\omega)}{S_{\Theta, X}(\omega)}\right](\tau), \tag{28}$$

where F denotes the Fourier transform. According to Theorem 1, g^ϴ(τ) defines an OBF for Θ and is called the optimal Bayesian Wiener filter. It is of the same form as the classical Wiener filter but with the power spectra replaced by the effective power spectra relative to the posterior distribution, these being the effective characteristics for the OBF.
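The frequency-domain form of (28) can be sketched numerically as follows. This is only an illustration under stated assumptions: a finite uncertainty class, a signal-plus-uncorrelated-white-noise observation model (so that S_X = S_Y + noise power and S_YX = S_Y), an arbitrary signal spectrum, and hypothetical names such as `bayesian_wiener_response`. The sketch returns the frequency response; the impulse response ĝΘ(τ) would follow from an inverse Fourier transform of it.

```python
import numpy as np

def bayesian_wiener_response(S_yx, S_x, weights):
    """Ratio of effective cross spectrum to effective power spectrum on a frequency grid.

    S_yx, S_x: arrays of shape (K, M), one row per model theta_k, M frequencies.
    weights:   posterior probabilities of the K models (summing to 1).
    """
    w = np.asarray(weights, dtype=float)
    S_yx_eff = w @ np.asarray(S_yx, dtype=float)   # posterior average of the cross spectra
    S_x_eff = w @ np.asarray(S_x, dtype=float)     # posterior average of the power spectra
    return S_yx_eff / S_x_eff

omega = np.linspace(-np.pi, np.pi, 9)
S_y = 1.0 / (1.0 + omega ** 2)                  # assumed signal spectrum
noise_levels = np.array([0.1, 0.5])             # uncertain white-noise power, two models
S_x = S_y[None, :] + noise_levels[:, None]      # observation spectra, shape (2, 9)
S_yx = np.tile(S_y, (2, 1))                     # cross spectra equal the signal spectrum
H = bayesian_wiener_response(S_yx, S_x, [0.5, 0.5])

# For comparison: the plain average of the two model-specific Wiener responses.
H_naive = 0.5 * (S_yx[0] / S_x[0] + S_yx[1] / S_x[1])
```

In this toy setting the result coincides with the classical Wiener response for the averaged noise power 0.3, and it differs from `H_naive`, which averages the filters rather than the spectra.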

IV. Optimal Bayesian Regression for Networks

As discussed in Section III, when we consider regressing the outcome responses Y(v) on covariates X(t) given the available sample data S, the optimal Bayesian regression (OBR) with respect to MMSE is the conditional expectation Eπ* EF[Y(v)|X; θ]. We approach the problem from the perspective of a linear Gaussian model as our underlying system.

A. Linear Gaussian Models

Assume that the underlying regulations of a given network drive the network dynamics according to

X(t+1)=AX(t)+ϵ, (29)

in which t denotes discrete time points and ϵ ~ 𝒩(0, Σϵ) is the noise with covariance matrix Σϵ. Note that the generality of A allows loops/cycles in the directed network model. At each time point t, the network dynamic evolves according to (29). This is a well-known time-homogeneous linear Gaussian model and X(t) is a first-order Gauss-Markov random process [36]. The main problem of this model is that, depending on the eigen-system structure of A, the network dynamic state X(t) may shrink or explode exponentially in the direction of the leading eigenvector of A. Hence, given a network topology, how can we guarantee that X(t) has a steady-state distribution as t → ∞? When the leading eigenvalue is 1, the long-run behavior of the system is determined by the eigenvectors with the corresponding unit eigenvalue [36]. If we have that guarantee, then X(t) follows a Gaussian distribution determined by A and Σt = cov(X(t)). To be specific,

$$\mathrm{cov}(X(t+1)) = A\, \mathrm{cov}(X(t))\, A^T + \Sigma_\epsilon = A \Sigma_t A^T + \Sigma_\epsilon. \tag{30}$$

If the system is stable, then the covariance also converges and there exists a positive semi-definite matrix Σ such that lim_{t→∞} Σt = Σ, which is the unique solution of the discrete-time Lyapunov equation

$$\Sigma = A \Sigma A^T + \Sigma_\epsilon \tag{31}$$

from the previous recursive equation [36]. Therefore, with prior knowledge on a given linear Gaussian model, the system’s steady-state behavior x = X(∞) follows a Gaussian distribution with the converging covariance matrix Σ.
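As a minimal sketch of how the converging covariance could be computed for a given A and noise covariance, the discrete-time Lyapunov equation (31) can be vectorized with a Kronecker product and solved as an ordinary linear system. The function name, the stability check via the spectral radius, and the 2 × 2 example are assumptions made here; a dedicated solver such as SciPy's `solve_discrete_lyapunov` would be the more usual choice for larger networks.

```python
import numpy as np

def steady_state_covariance(A, noise_cov):
    """Solve Sigma = A Sigma A^T + noise_cov for a stable transition matrix A."""
    A = np.asarray(A, dtype=float)
    Q = np.asarray(noise_cov, dtype=float)
    if np.max(np.abs(np.linalg.eigvals(A))) >= 1.0:
        raise ValueError("A must have spectral radius below 1 for the covariance to converge")
    p = A.shape[0]
    # vec(A Sigma A^T) = (A kron A) vec(Sigma), so (I - A kron A) vec(Sigma) = vec(Q).
    vec_sigma = np.linalg.solve(np.eye(p * p) - np.kron(A, A), Q.reshape(-1))
    return vec_sigma.reshape(p, p)

A = np.array([[0.5, 0.1],
              [0.0, 0.3]])
Q = np.eye(2)
Sigma = steady_state_covariance(A, Q)

# Cross-check against the recursion (30), started from the zero matrix.
Sigma_t = np.zeros((2, 2))
for _ in range(200):
    Sigma_t = A @ Sigma_t @ A.T + Q
```

The Kronecker approach builds a p² × p² system, so it is only practical for small p; it is used here because it makes the link between (30) and (31) explicit.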

Finally, we assume that in real-world applications we are only interested in some outcome characteristics determined by the system’s steady-state behavior x:

$$y = Y(\infty) = \beta^T X(\infty) + \epsilon' = \beta^T x + \epsilon'. \tag{32}$$

Often β is a sparse vector. With this setup, we can derive the joint distribution F (x, y; ∞, ∞) and apply the OBF for optimal Bayesian regression (OBR) to estimate y given x.

We note here that we focus on regression problems with linear Gaussian models due to the closed-form OBR solutions resulting from the model assumption. However, the proposed OBR in the optimal Bayesian filtering framework can be generalized to non-Gaussian models for any family of joint distributions F(x, y; ∞, ∞) if they can be derived based on the prior systems knowledge, including the exponential family [37]. For example, as in Poisson regression, y is assumed to follow the Poisson distribution: y ~ Poisson(λ) with the rate parameter λ dependent on x through the canonical link function λ = exp(β^T x). However, even such a modest extension from linear Gaussian models will make the computation of the OBF difficult, as it is well known that the posterior cannot be obtained in closed form for Bayesian Poisson regression [38], [39]. We have to resort to Markov chain Monte Carlo (MCMC) methods to estimate effective characteristics for finding the OBF. The difficulty is greater when we study multivariate exponential family joint distributions F(x, y; ∞, ∞), for which some caution should be exercised as they may not have closed and/or tractable forms, for example multivariate Poisson distributions [40]. In this paper, we focus on linear Gaussian models to illustrate the idea of OBF-based regression.

B. How does a network prior help regression?

We examine whether, and how, a network prior on X(t), such as the linear Gaussian model assumption, provides better regression for predicting y from the steady-state behavior as t → ∞.

1) Bayesian Linear Regression (BLR)

First consider the commonly adopted Bayesian linear regression (BLR) framework [21]–[30]. In BLR, given the sample S = {(x1, y1), … , (xn, yn)}, β is typically assumed to follow a prior distribution. For a conjugate prior, β is assumed to follow a Gaussian distribution, for example, p(β) ~ 𝒩(0, σ²I). BLR computes the posterior

$$p(\beta \mid S) \propto p(\beta)\, p(S \mid \beta) \propto p(\beta) \prod_i p(y_i, x_i \mid \beta) \propto p(\beta) \prod_i p(y_i \mid x_i; \beta)\, p(x_i \mid \beta). \tag{33}$$

As xi and β are independent, it is clear that the prior distributions p(β) and p(xi) factorize. Hence, the prior knowledge on x, such as the linear Gaussian model assumption, does not affect the posterior distribution p(β|S); hence, it does not lead to better prediction.
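To make this point concrete, the following sketch computes the standard posterior mean of β for the Gaussian prior 𝒩(0, σ²I) with Gaussian observation noise of known variance. The closed form used is the textbook conjugate result rather than an expression stated in this paper, and the function name and the tiny data set are hypothetical. Note that the inputs are only the data, the prior variance of β, and the noise variance; no distributional knowledge about x appears anywhere.

```python
import numpy as np

def blr_posterior_mean(D, y, prior_var, noise_var):
    """Posterior mean of beta under beta ~ N(0, prior_var * I), y_i = beta^T x_i + noise.

    D is the n-by-p design matrix of stacked x_i, y the vector of responses.
    """
    D = np.asarray(D, dtype=float)
    y = np.asarray(y, dtype=float)
    p = D.shape[1]
    # (D^T D + (noise_var / prior_var) I)^{-1} D^T y, a ridge-type estimate.
    gram = D.T @ D + (noise_var / prior_var) * np.eye(p)
    return np.linalg.solve(gram, D.T @ y)

D = np.eye(2)
y = np.array([1.0, 2.0])
beta_blr = blr_posterior_mean(D, y, prior_var=1.0, noise_var=1.0)
```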

2) Optimal Bayesian Regression (OBR) through OBF

Now consider OBR. Instead of blindly imposing the linear model and the prior distribution assumptions on the regression model coefficients β as in BLR, we directly investigate the joint distribution F(x, y; t, v) and derive the estimates based on it. As we are interested in steady-state behavior by the joint distribution p(x, y) = F(x, y; ∞, ∞), based on the linear Gaussian model, we know p(x) ~ 𝒩(μ(X(0), A), Σ(A)) with both the mean and covariance depending on the network model A. On the other hand, p(y|x) ~ 𝒩(β^T x, σ′²). Hence the joint distribution p(x, y) ~ 𝒩(μ, Σ), in which θ = {μ, Σ} parametrizes the uncertainty class constrained by β, A, and X(0). To be more specific, based on the linear Gaussian model we can derive [36]

$$\mu = [\mu_x, \mu_y]^T = [\bar{x}, \beta^T \bar{x}]^T; \qquad \Sigma = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xy} \\ \Sigma_{yx} & \Sigma_{yy} \end{bmatrix} = \begin{bmatrix} \Sigma_{xx} & \Sigma_{xx}\beta \\ \beta^T \Sigma_{xx} & \beta^T \Sigma_{xx} \beta + \sigma'^2 \end{bmatrix},$$

in which x̄ is determined by the eigen-system of A and X(0) of the linear Gaussian model, and Σxx is the converging covariance matrix of x for the stable linear Gaussian model, that is, the solution of the Lyapunov equation (31). Note that in practice we do not have these relationships between Σ and the model parameters. The prior knowledge for the joint distribution p(x, y) = F(x, y; ∞, ∞) will be directly on θ = {μ, Σ} rather than individual model parameters, including β, A, and X(0). The prior model with the uncertainty class of joint distributions governed by μ and Σ may come from accumulated scientific understanding of the system of interest, for example, the underlying systems model. In the present work, to get closed-form solutions, we focus on linear Gaussian models and demonstrate that our OBR through the OBF theory can naturally incorporate such network prior knowledge and improve the prediction accuracy compared to the classical Bayesian regression methods, especially when there are not enough data.
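The block structure of the joint mean and covariance can be checked with a short sketch. The function name, argument names, and numerical values below are hypothetical; the code assembles the joint parameters from a steady-state mean, the steady-state covariance of x, the coefficients β, and the response noise variance, and then recovers β from the blocks as the coefficient of the Gaussian conditional mean of y given x.

```python
import numpy as np

def joint_gaussian(x_bar, sigma_x, beta, noise_var):
    """Mean and covariance of (x, y) when x ~ N(x_bar, sigma_x) and y = beta^T x + noise."""
    x_bar = np.asarray(x_bar, dtype=float)
    sigma_x = np.asarray(sigma_x, dtype=float)
    beta = np.asarray(beta, dtype=float)
    mu = np.append(x_bar, beta @ x_bar)              # [x_bar, beta^T x_bar]
    sigma_xy = sigma_x @ beta                        # cross-covariance block
    sigma_yy = beta @ sigma_xy + noise_var           # beta^T sigma_x beta + noise variance
    joint = np.block([[sigma_x, sigma_xy[:, None]],
                      [sigma_xy[None, :], np.array([[sigma_yy]])]])
    return mu, joint

x_bar = [1.0, -1.0]
sigma_x = [[2.0, 0.5],
           [0.5, 1.0]]
beta = np.array([1.5, -0.5])
mu, joint = joint_gaussian(x_bar, sigma_x, beta, noise_var=0.25)

# Coefficient of the conditional mean of y given x: Sigma_yx Sigma_xx^{-1}.
recovered = np.linalg.solve(joint[:2, :2], joint[:2, 2])
```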

As in the simple one-dimensional case derived in Section III, we have the following conditional expectation as the MSE optimal filter for multivariate Gaussian processes with p(x, y) being jointly Gaussian:

$$E_F[y \mid x; \theta] = \mu_y + \Sigma_{yx} \Sigma_{xx}^{-1} (x - \mu_x), \tag{34}$$

which is a linear function of x. We emphasize here again that such a linear OBR naturally emerges from the linear Gaussian model assumption. As derived in Section III, the MMSE OBR with the uncertainty class parametrized by θ = {μ, Σ} is the expectation

$$\psi_{\mathrm{OBR}}(x) = \psi_{\mathrm{OBF}}(x) = E_{\pi^*} E_F[y \mid x; \theta]. \tag{35}$$

Hence,

$$\psi_{\mathrm{OBR}}(x) = E_{\pi^*}[\mu_y] - E_{\pi^*}\big[\Sigma_{yx} \Sigma_{xx}^{-1} \mu_x\big] + E_{\pi^*}\big[\Sigma_{yx} \Sigma_{xx}^{-1}\big]\, x. \tag{36}$$

This resulting OBR ψOBR(x) under the linear Gaussian model assumption is indeed a linear regression function with respect to x with the regression model parameter estimates β̂^T = Eπ*[Σyx Σxx^{−1}]. The task here is to derive the moments with respect to the posterior distribution of uncertain models as effective characteristics. Clearly, the prior knowledge regarding the covariance of x may help derive better estimates Eπ*[Σyx Σxx^{−1}] for a better regressor. As an extreme case without observations, the correct network model may lead to better estimates of the involved parameters and hence give better prediction.

The main message of this paper is to illustrate how prior network knowledge may help derive better regression models when there is a limited number of samples. To illustrate this main idea, we specifically focus on the uncertainty parameter Σxx without making assumptions on either β or Σyx directly. Hence, with the previous linear Gaussian model assumption, OBR resorts to the Bayesian estimates of the precision matrix [xx1] in classical Bayesian learning [20], [21], [24]. With x being multivariate Gaussian, a common conjugate prior for the model parameters for a multivariate Gaussian is to assume that the mean μx and the precision matrix Λ=xx1 are Gaussian-Wishart [21]. Without loss of generality, we focus on the Bayesian estimate of the precision matrix Eπ[Λ]=Eπ[xx1]. When we have the network prior on x as the linear Gaussian model x~𝒩(x,) we can take the following Wishart distribution as the prior for the precision matrix Λ [21], [24]:

Λ~Wi(Ξ,ν)=Λνp122νp2Γp(ν2)Ξν2exp(12tr(ΛΞ1)) (37)

in which the conjugate prior Wishart distribution has two hyper-parameters, ν (degree of freedom) and Ξ (scale matrix), and p is the dimension of x with ν > p − 1. With the linear Gaussian network prior with the converging covariance matrix Σ, Ξ can be set to (νp − 1)Σ. Given the sample data S with n sample points with zero mean, the posterior distribution of the precision matrix can be derived as another Wishart distribution with the new parameters υ+n2 and Ξ+n2ΞS:

$$\Lambda_{\pi^*} \sim \mathrm{Wi}\!\left(\Xi + \frac{n}{2}\Xi_S,\; \nu + \frac{n}{2}\right), \tag{38}$$

where $\Xi_S = \sum_i x_i x_i^T = D^T D$ is the estimated scatter matrix and D is the design matrix with the stacked $x_i$'s from the training samples S. Via this posterior distribution, we can now estimate the desired effective characteristics [21]:

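$$E_{\pi^*}\!\left[\Sigma_{xx}^{-1}\right] = E_{\pi^*}[\Lambda] = \left(\nu - p - 1 + \frac{n}{2}\right) \mathrm{inv}\!\left(\Xi + \frac{n}{2}\Xi_S\right). \tag{39}$$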

This leads to the corresponding OBR assuming non-informative priors on the other model parameters in θ = {μ, Σ}:

$$\psi_{OBR}(x) \propto \left[\mathrm{inv}\!\left(\Xi + \frac{n}{2}\Xi_S\right) D^T y\right]^T x, \tag{40}$$

with absorbed corresponding constants based on the selection of the degree-of-freedom parameter ν, and here y denotes the stacked response vector of yi’s from the training samples S. Clearly, in the OBR, the linear Gaussian network prior on x contributes to the final regression model and is expected to help enhance the prediction accuracy, especially when there are a limited number of training samples.
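The posterior update and the resulting regression coefficients can be sketched numerically. The following is a minimal illustration, assuming NumPy, zero-mean data, and the n/2 weighting of the scatter matrix used in (38) through (40). The function names, the toy covariance `Sigma`, the choice of ν, and the scaling used to build the prior matrix `Xi` are hypothetical choices made only for this example, and the coefficients are computed up to the constant absorbed in (40).

```python
import numpy as np

def posterior_mean_precision(Xi, nu, D):
    """Posterior mean of the precision matrix, following (38)-(39)."""
    n, p = D.shape
    Xi_S = D.T @ D                       # scatter matrix of the zero-mean samples
    updated = Xi + 0.5 * n * Xi_S        # updated matrix parameter of the posterior
    return (nu - p - 1 + 0.5 * n) * np.linalg.inv(updated)

def obr_coefficients(Xi, D, y):
    """Regression coefficients of (40), up to the absorbed constant."""
    n = D.shape[0]
    return np.linalg.solve(Xi + 0.5 * n * (D.T @ D), D.T @ y)

# Toy setup (all values are illustrative assumptions).
rng = np.random.default_rng(0)
p, n, nu = 3, 8, 6.0
Sigma = np.array([[1.0, 0.5, 0.2],
                  [0.5, 1.0, 0.3],
                  [0.2, 0.3, 1.0]])      # assumed network-prior covariance
Xi = (nu - p - 1) * Sigma                # assumed scaling of the prior covariance
D = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
y = D @ np.array([1.0, -0.5, 0.25]) + rng.normal(size=n)

Lambda_hat = posterior_mean_precision(Xi, nu, D)
beta_hat = obr_coefficients(Xi, D, y)
```

Because the prior matrix enters additively, the linear system stays well conditioned even when n is smaller than p, which is the regime where the network prior is expected to matter most.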

With the multivariate Gaussian assumption on the joint distribution p(x, y) = F(x, y; ∞, ∞), the OBR ends up with the linear regression model. To better connect the OBR under the Gaussian assumption and classical BLR methods, we study the desired regression model by directly investigating the characteristics of the outcome response, $y = \beta^T x + \epsilon'$, to estimate $E_{\pi^*} E_F[yx]$ and $E_{\pi^*} E_F[(\beta^T x + \epsilon')x]$ with respect to both sides of the model equation. Denoting $\mathrm{cov}^*(y, x) = E_{\pi^*} E_F[yx]$ as our effective characteristics, we have

$$\mathrm{cov}^*(y, x) = \mathrm{cov}^*(\beta^T x + \epsilon', x). \tag{41}$$

With a relatively weak assumption that EF [ϵ′|x] = 0, it is obvious that cov*(ϵ′, x) = 0. Therefore,

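$$\mathrm{cov}^*(y, x) = \mathrm{cov}^*(\beta^T x + \epsilon', x) = \mathrm{cov}^*(\beta^T x, x) + \mathrm{cov}^*(\epsilon', x) = \mathrm{cov}^*(x, x)\,\beta. \tag{42}$$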

Hence, by straightforward algebraic manipulations,

$$\beta = \mathrm{inv}\!\left(\mathrm{cov}^*(x, x)\right) \mathrm{cov}^*(y, x). \tag{43}$$

Along this direction, so far we have not imposed any distributional assumptions on x. What interests us is whether any prior knowledge on x helps better prediction as in our OBR (40). In (43), we see clearly that the parameter estimates of β depend on whether the covariance estimates cov*(y, x) and cov*(x, x) are accurate, which again turns towards the estimates of the second-order moments with respect to the posterior as effective characteristics. Hence, better estimates of cov*(x, x) may help better model inference and therefore lead to better prediction accuracy. Especially, with small sample size compared to an extremely high number of covariates x, it is difficult to achieve accurate and robust estimates of the covariance matrix of x due to the quadratic number of the involved parameters with respect to the dimension of x. Hence, we expect that prior knowledge on the covariance matrix can help provide more stable and accurate estimates of β.
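To make the moment relation (43) concrete, the sketch below plugs plain sample moments into it and checks that the result coincides with ordinary least squares on centered data. It assumes NumPy; the data, the coefficient vector `beta_true`, and the noise level are hypothetical and serve only to illustrate that the quality of the coefficient estimate is inherited from the quality of the two covariance estimates.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
beta_true = np.array([0.8, -0.3, 0.0, 0.5])   # illustrative coefficients

# Centered covariates and response, so second moments equal covariances.
D = rng.normal(size=(n, p))
D -= D.mean(axis=0)
y = D @ beta_true + 0.1 * rng.normal(size=n)
y -= y.mean()

# Sample versions of cov(x, x) and cov(y, x).
cov_xx = D.T @ D / n
cov_yx = D.T @ y / n

# Equation (43): beta = inv(cov(x, x)) cov(y, x).
beta_moment = np.linalg.solve(cov_xx, cov_yx)

# Ordinary least squares on the same data, for comparison.
beta_ols, *_ = np.linalg.lstsq(D, y, rcond=None)
```

With sample moments the two estimates agree exactly; replacing `cov_xx` by a posterior estimate that incorporates prior knowledge is what distinguishes the OBR from this baseline.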

From the previous reasoning, the critical step for the OBR is to estimate

$$\psi_{OBR}(x) = \hat{\beta}^T x = E_{\pi^*}\!\left[\Sigma_{yx}\Sigma_{xx}^{-1}\right] x. \tag{44}$$

As discussed, when we only impose prior knowledge with respect to $\Sigma_{xx}$ and assume the zero-mean joint Gaussian distribution with $\mu_x$ and $\mu_y$ being zero, the MMSE OBR with the uncertainty class given by θ = {$\Sigma_{xx}$} is the expectation $\psi_{OBR}(x) = \hat{\beta}^T x$ given by (40). Based on (42), deriving $\hat{\beta}$ requires accurate estimation of the inverse of the expected covariance matrix

$$\mathrm{inv}\!\left(\mathrm{cov}^*(x, x)\right) = E_{\pi^*}^{-1}[\Sigma_{xx}] = \Sigma_{\pi^*}^{-1}. \tag{45}$$

Due to the inherent relationships between Wishart distributions and Inverse-Wishart (IW) distributions [21], [24], it can be shown that we will have an equivalent regression model of (40) through (42). Indeed, the derivation of β^ resorts to the Bayesian Covariance Estimates (BCE) in classical Bayesian learning [20], [21], [24] with the previous linear Gaussian model assumption. For BCE, we assume that the conjugate prior for the model parameters μx and Σxx is Gaussian-Inverse-Wishart (GIW). Focusing on the covariance matrix Σxx, we can take the following Inverse-Wishart distribution as the prior:

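$$\Sigma \sim \mathrm{IW}(\Xi, \nu) = \frac{|\Xi|^{\frac{\nu}{2}}}{2^{\frac{\nu p}{2}}\, \Gamma_p\!\left(\frac{\nu}{2}\right)}\, |\Sigma|^{-\frac{\nu + p + 1}{2}} \exp\!\left(-\frac{1}{2}\,\mathrm{tr}\!\left(\Xi \Sigma^{-1}\right)\right), \tag{46}$$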

with two hyper-parameters ν and Ξ. Similarly, Ξ can be set to incorporate the available network prior on the converging covariance Σ. The posterior distribution of $\Sigma_{xx}$ given the training sample S can be derived as $\mathrm{IW}\!\left(\Xi + \frac{n}{2}\Xi_S,\; \nu + \frac{n}{2}\right)$.

We can now first estimate the desired effective characteristics, $\mathrm{cov}^*(x, x) = \Sigma_{\pi^*}$. For example, either the Maximum A Posteriori (MAP, or mode) estimate

$$\Sigma_{map} = \frac{\Xi + \frac{n}{2}\Xi_S}{\nu + p + 1 + \frac{n}{2}} \tag{47}$$

or the mean estimate

$$\Sigma_{mean} = \frac{\Xi + \frac{n}{2}\Xi_S}{\nu - p - 1 + \frac{n}{2}} \tag{48}$$

can be used as the appropriate estimates for Σπ*. We then can derive the corresponding parameter estimate for the OBR under linear Gaussian models:

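$$\hat{\beta} \propto \mathrm{inv}\!\left(\frac{n}{2}\Xi_S + \Xi\right) D^T y = \mathrm{inv}\!\left(\frac{n}{2} D^T D + \Xi\right) D^T y, \tag{49}$$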

with absorbed corresponding constants depending on the choice of ν and whether we adopt the mode estimate Σmap in (47) or mean estimate Σmean in (48). It is clear that the OBR (40) and (49) are equivalent with appropriate hyper-parameter setups, specifically, by setting the degree-of-freedom parameter ν to make sure the contributions from the prior knowledge Σ and training samples ΞS have the same weights. Note that taking the prior scatter matrix in the Inverse-Wishart conjugate prior as a diagonal matrix λI yields the equivalent solution to the BLR with Gaussian prior of the specific diagonal covariance assumption, whose Maximum Likelihood Estimate (MLE) results in the ridge regression formulation.
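The ridge-regression special case can be verified numerically. The sketch below assumes NumPy and uses hypothetical helper names (`obr_beta`, `ridge_beta`) and toy data. It evaluates the right-hand side of (49) with the diagonal choice Ξ = λI and compares it with the usual ridge solution inv(DᵀD + αI)Dᵀy, where α = 2λ/n is the penalty implied by the n/2 weighting; the two differ only by the scalar 2/n, which is one of the constants absorbed in (49).

```python
import numpy as np

def obr_beta(Xi, D, y):
    """Right-hand side of (49): inv((n/2) D^T D + Xi) D^T y."""
    n = D.shape[0]
    return np.linalg.solve(0.5 * n * (D.T @ D) + Xi, D.T @ y)

def ridge_beta(alpha, D, y):
    """Standard ridge solution inv(D^T D + alpha I) D^T y."""
    p = D.shape[1]
    return np.linalg.solve(D.T @ D + alpha * np.eye(p), D.T @ y)

# Toy data (illustrative only).
rng = np.random.default_rng(2)
n, p, lam = 6, 4, 3.0
D = rng.normal(size=(n, p))
y = rng.normal(size=n)

beta_obr = obr_beta(lam * np.eye(p), D, y)      # diagonal prior matrix lambda * I
beta_ridge = ridge_beta(2.0 * lam / n, D, y)    # matching ridge penalty
scale = 2.0 / n                                 # absorbed proportionality constant
```

A non-diagonal Ξ built from a network covariance replaces the isotropic shrinkage of ridge regression with shrinkage shaped by the assumed dependencies among the covariates.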

In summary, with appropriate prior knowledge on the linear Gaussian model for x, we expect better model estimates for the regression problem, especially with small sample sizes.

V. Experiments and Discussion

A. Simulations with Linear Gaussian Networks

Simulation experiments based on stable linear Gaussian models have been implemented to demonstrate the improvement of OBR with network prior knowledge over classical BLR. We randomly generate linear Gaussian networks by randomly simulating A, with rejection sampling to keep only stable models with the magnitude of the leading eigenvalue less than or equal to 1. With that we are guaranteed stable X(∞). We simulate the network dynamics with A and take the steady-state samples X(∞) upon reaching the converging covariance matrix. Then we simulate the outcome $y = \beta^T X(\infty) + \epsilon'$, where $\beta \sim \mathcal{N}(0, \sigma^2 I)$ and $\epsilon' \sim \mathcal{N}(0, \sigma^2)$ with σ = 1.
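The data-generating procedure can be sketched as follows. This is a minimal illustration under stated assumptions rather than the code used for the reported experiments: it assumes NumPy, a first-order recursion X(t+1) = A X(t) + w(t) with identity noise covariance Q, a strict spectral-radius bound below 1 so that the stationary covariance exists, and direct sampling from the stationary Gaussian instead of running the dynamics to convergence. The function names and the entry scale used to draw A are hypothetical.

```python
import numpy as np

def sample_stable_A(p, rng, max_tries=1000):
    """Rejection sampling: keep the first random A with spectral radius < 1."""
    for _ in range(max_tries):
        A = rng.normal(scale=0.4, size=(p, p))
        if np.max(np.abs(np.linalg.eigvals(A))) < 1.0:
            return A
    raise RuntimeError("no stable matrix found")

def stationary_covariance(A, Q):
    """Solve Sigma = A Sigma A^T + Q via vec(Sigma) = inv(I - A kron A) vec(Q)."""
    p = A.shape[0]
    vec_sigma = np.linalg.solve(np.eye(p * p) - np.kron(A, A), Q.reshape(-1))
    Sigma = vec_sigma.reshape(p, p)
    return 0.5 * (Sigma + Sigma.T)       # remove round-off asymmetry

rng = np.random.default_rng(3)
p, n, sigma = 5, 20, 1.0
A = sample_stable_A(p, rng)
Sigma_inf = stationary_covariance(A, np.eye(p))

# Steady-state covariates and the linear outcome y = beta^T x + eps.
beta = rng.normal(scale=sigma, size=p)
X = rng.multivariate_normal(np.zeros(p), Sigma_inf, size=n)
y = X @ beta + rng.normal(scale=sigma, size=n)
```

The matrix `Sigma_inf` plays the role of the converging covariance that the network prior supplies to the Wishart or Inverse-Wishart prior.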

We have simulated data based on both the linear Gaussian network model with five nodes and linear regression for y. Based on the implementations of the BLR methods in the Probabilistic Modeling ToolKit (PMTK3: https://github.com/probml/pmtk3) [24], we have compared the MSE from both the OBR and classical BLR methods with different prior assumptions based on 1,000 simulated test samples using different training sample sizes. Our aim is to demonstrate that the prior knowledge of the linear Gaussian model (on x only) can help better predict y in OBR. Note that we assume no prior model knowledge on the regression function parameters β in both our OBR implementation and BLR. For BLR, we have tested the non-informative Jeffreys prior [41], [42], Zellner’s g-prior [43], [44], as well as a Gaussian prior assuming a diagonal Gaussian covariance matrix for the regression coefficient β, which corresponds to ridge regression for its MLE solution [21], [22], [24], [45]. For the hyper-parameters in these models, we have taken empirical Bayesian estimates. We note that for the Jeffreys prior, performance is quite unstable with higher MSE when the sample size is smaller than the number of covariates. For such cases with much worse performance than OBR, including Zellner’s g-prior, we do not plot the results. As illustrated in Figure 1, OBR performs consistently better than BLR methods, especially with small samples (≤ 10). When the sample size is close to 20, different methods achieve similar MSE performance. All methods converge to the Bayes error (1.0) when the number of samples increases.

Figure 1. Performance comparison between OBR and BLR for linear regression with covariates from a linear Gaussian network model with five nodes.

Based on the simulation results, it is clear that the interdependence relationships among measurements x themselves as well as the interrelationships between x and y affect the solutions. In BLR, as the derivation is based on the factorization p(y|x; β) and p(β), where β denotes the linear model parameters, the prior knowledge on the interdependence relationships among x may not help improve the predictive models. OBR is fundamentally different from BLR and is naturally derived based on the multivariate Gaussian prior knowledge considering all the interdependence relationships, which leads to more accurate and robust prediction.

B. A Colon Cancer Example

We now consider phenotypic prediction with pathway models and compare OBR performance with BLR methods. Motivated by modeling colon cancer disease mechanisms as described in [46], [47], we assume that the simplified regulatory model illustrated in Figure 2, abstracted from three basic pathways, the Ras/Raf/Mek pathway, the PI3K pathway, and the JAK/STAT pathway, can approximately model the dynamic genome behavior of colon cancer. In Figure 2, we model the regulatory relationships among 11 proteins or protein complexes: EGF, Ras, MEK1/2, PIK3CA, STAT3, mTORC1, HGF, IL6, PKC, SPYR4, and TSC1/TSC2. System dynamics are represented by the vector z = [EGF, HGF, IL6, Ras, PIK3CA, STAT3, TSC1/TSC2, mTORC1, SPYR4, PKC, MEK1/2]. As MEK1/2 is a typical downstream marker for colon cancer, we consider the expression level of MEK1/2 to be indicative of colon cancer status. Hence, in the regression setup, x = [EGF, HGF, IL6, Ras, PIK3CA, STAT3, TSC1/TSC2, mTORC1, SPYR4, PKC] and the disease outcome y = [MEK1/2]. Here, EGF, HGF, and IL6 are upstream ligands or stimulation factors regulating downstream dynamics. We assume that the network dynamics follow a linear Gaussian network model, which has been one of the commonly adopted gene regulatory network models due to its simplicity [48]–[58]. To assign the corresponding model parameters, we first set the regulatory relationships among these entities as derived from the prior pathway knowledge described in [46]:

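$$\begin{aligned}
\mathrm{Ras} &= \tfrac{1}{3}\mathrm{EGF} + \tfrac{1}{3}\mathrm{HGF} + \tfrac{1}{3}\mathrm{IL6} + \epsilon;\\
\mathrm{PIK3CA} &= \tfrac{1}{2}\mathrm{HGF} + \tfrac{1}{2}\mathrm{Ras} + \epsilon;\\
\mathrm{STAT3} &= \tfrac{1}{3}\mathrm{EGF} + \tfrac{1}{3}\mathrm{IL6} - \tfrac{1}{3}\mathrm{PIK3CA} + \epsilon;\\
\mathrm{TSC1/TSC2} &= \epsilon';\\
\mathrm{mTORC1} &= \mathrm{TSC1/TSC2} + \epsilon;\\
\mathrm{SPYR4} &= \tfrac{1}{2}\mathrm{STAT3} + \tfrac{1}{2}\mathrm{mTORC1} + \epsilon;\\
\mathrm{PKC} &= \tfrac{1}{2}\mathrm{IL6} - \tfrac{1}{2}\mathrm{SPYR4} + \epsilon;\\
\mathrm{MEK1/2} &= \tfrac{1}{2}\mathrm{Ras} + \tfrac{1}{2}\mathrm{PKC} + \epsilon,
\end{aligned}$$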

where TSC1/TSC2 is stuck at 0 with a small probability of being changed: ϵ′ ~ 𝒩 (0, 0.05). This setup is motivated by the finding that the mutation for the tumor suppressor complex TSC1/TSC2 increases the colon cancer risk [46], [59]. For the upstream entities, [EGF, HGF, IL6] ~ 𝒩 (μ, Σ) with μ = [−0.3, −0.3, −0.3] and

$$\Sigma = \begin{bmatrix} 2.0 & 0.4 & 0.4 \\ 0.4 & 2.0 & 0.4 \\ 0.4 & 0.4 & 2.0 \end{bmatrix}.$$

Finally, ϵ ~ 𝒩 (0, 0.2). This simplified colon cancer model is a linear Gaussian network approximation. From the assumed dependencies, we can construct the covariance matrix of all the involved entities. The task is to show that OBR outperforms traditional Bayesian regression methods when estimating the expression level of MEK1/2.
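The construction of the covariance matrix from the assumed dependencies can be sketched as below. This is an illustration under stated assumptions, not the authors' implementation: it assumes NumPy, treats the second arguments of the noise distributions as variances, writes the model in the structural form z = Bz + u so that cov(z) = inv(I − B) cov(u) inv(I − B)ᵀ, and takes the edge coefficients as read from the structural equations above. The negative signs on the repressing edges and the sign of the TSC1/TSC2 to mTORC1 edge are assumptions of this sketch.

```python
import numpy as np

names = ["EGF", "HGF", "IL6", "Ras", "PIK3CA", "STAT3",
         "TSC1/TSC2", "mTORC1", "SPYR4", "PKC", "MEK1/2"]
idx = {name: i for i, name in enumerate(names)}

# child -> {parent: coefficient}; signs of repressing edges are assumed.
edges = {
    "Ras":    {"EGF": 1 / 3, "HGF": 1 / 3, "IL6": 1 / 3},
    "PIK3CA": {"HGF": 1 / 2, "Ras": 1 / 2},
    "STAT3":  {"EGF": 1 / 3, "IL6": 1 / 3, "PIK3CA": -1 / 3},
    "mTORC1": {"TSC1/TSC2": 1.0},
    "SPYR4":  {"STAT3": 1 / 2, "mTORC1": 1 / 2},
    "PKC":    {"IL6": 1 / 2, "SPYR4": -1 / 2},
    "MEK1/2": {"Ras": 1 / 2, "PKC": 1 / 2},
}
B = np.zeros((11, 11))
for child, parents in edges.items():
    for parent, coef in parents.items():
        B[idx[child], idx[parent]] = coef

# Covariance of the exogenous terms u: upstream block, then independent noise.
cov_u = 0.2 * np.eye(11)                          # variance of eps
cov_u[:3, :3] = np.array([[2.0, 0.4, 0.4],
                          [0.4, 2.0, 0.4],
                          [0.4, 0.4, 2.0]])       # [EGF, HGF, IL6]
cov_u[idx["TSC1/TSC2"], idx["TSC1/TSC2"]] = 0.05  # variance of eps'

M = np.linalg.inv(np.eye(11) - B)
cov_z = M @ cov_u @ M.T

# Population regression of y = MEK1/2 on the ten covariates x.
S_xx, S_xy, S_yy = cov_z[:10, :10], cov_z[:10, 10], cov_z[10, 10]
beta_pop = np.linalg.solve(S_xx, S_xy)
bayes_error = S_yy - S_xy @ beta_pop
```

Under these assumptions the population coefficients are 1/2 on Ras and PKC and zero elsewhere, and the residual variance equals the noise variance of the MEK1/2 equation, which matches the Bayes error of 0.2 quoted for this experiment.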

Figure 2. A simplified colon cancer network [46]: regular arrows represent activating regulations while blunt arrows represent repressing regulations. To approximately model colon cancer, TSC1/TSC2 is stuck at “0”.

Analogous to the previous set of experiments, we simulate data based on the colon cancer network model and compare the MSE for OBR and classical BLR methods with different prior assumptions based on 1,000 simulated test samples using different sample sizes. When testing OBR, we have perturbed the theoretically derived system covariance matrix to be the scale matrix of the Wishart prior distribution, since in reality we often do not have a complete and accurate network prior. For BLR, we have tested the non-informative Jeffreys prior and a Gaussian prior with empirical Bayesian estimates for the required hyper-parameters. As illustrated in Figure 3, OBR performs significantly better than BLR methods with limited sample sizes (≤ 40), which are typical in genomic applications [60]. All of the methods converge to the Bayes error (0.2) when the sample size increases. With small sample sizes, OBR with the network prior can provide more accurate and stable estimates.

Figure 3. Performance comparison between OBR and BLR for the simplified colon cancer network model: (A) with all tested training sample sizes; (B) with the limited number of samples from 5 to 45.

VI. Conclusions

Given sample data, the theory of intrinsically Bayesian robust filters extends to optimal Bayesian filtering simply by forming a posterior distribution from the prior distribution and the data. The IBRF theory for effective characteristics and canonical expansions directly extends to the OBF setting. The main issue addressed here has been to demonstrate the advantage of regression (OBR) within that setting over the classical Bayesian regression in the context of linear Gaussian models.

We plan to generalize the essential idea to develop new network models and OBR for non-Gaussian random processes. When the family of joint distributions is complicated, we cannot expect to derive closed-form solutions for model parameter updates for the posterior distribution and OBR. As in the case of optimal Bayesian classification [32], MCMC methods will be utilized [38], [39].

Finally, we note that our OBR has the critical assumption that there exists complete or partial knowledge on the underlying systems to achieve better prediction performance. As discussed in [33], we adopt an M-closed Bayesian framework. However, the prior knowledge is often an approximation to the true underlying systems. With the OBF framework that we develop here, we may quantify the uncertainty of the imposed model with the operative objective of regression accuracy to derive the corresponding guidelines for the potential iterative network prior construction and experimental design tasks based on the criterion of the Mean Objective Cost of Uncertainty (MOCU) [9]. Such extensions of OBR with M-open perspectives, for example by considering the uncertainty class of all models constrained by a Kullback-Leibler ball given a nominal joint distribution F(x, y; t, v) [33], are admittedly challenging and will be studied in our future research.

Acknowledgments

We would like to thank three anonymous reviewers for their constructive suggestions to help improve the manuscript. This research was supported by the NSF Award CCF-1553281 from the National Science Foundation.

Biographies


Xiaoning Qian (S’01–M’07) received the Ph.D. degree in Electrical Engineering from Yale University, New Haven, CT, in 2005. Currently, he is an assistant professor with the Department of Electrical & Computer Engineering, Texas A&M University, College Station, TX. He is affiliated with the Center for Bioinformatics & Genomic Systems Engineering (CBGSE) and the Center for Translational Environmental Health Research (CTEHR) at Texas A&M. He also is a courtesy assistant professor in the Department of Computer Science & Engineering and the Department of Pediatrics at the University of South Florida, Tampa FL. He was with the Bioinformatics Training Program at Texas A&M University, sponsored by the National Cancer Institute (NCI). His recent honors include the US NSF CAREER Award, Montague-Center for Teaching Excellence Scholar at Texas A&M, and the Best Paper Award at the 11th Asia Pacific Bioinformatics Conference. His current research interests include computational network biology, genomic signal processing, and biomedical image analysis.


Edward R. Dougherty is a Distinguished Professor in the Department of Electrical and Computer Engineering at Texas A&M University in College Station, TX, where he holds the Robert M. Kennedy ‘26 Chair in Electrical Engineering and is Scientific Director of the Center for Bioinformatics and Genomic Systems Engineering. He holds a Ph.D. in mathematics from Rutgers University and an M.S. in Computer Science from Stevens Institute of Technology, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology. He is a fellow of both IEEE and SPIE, has received the SPIE President’s Award, and served as the editor of the SPIE/IS&T Journal of Electronic Imaging. At Texas A&M University he has received the Association of Former Students Distinguished Achievement Award in Research, been named Fellow of the Texas Engineering Experiment Station and Halliburton Professor of the Dwight Look College of Engineering. Prof. Dougherty is author of 16 books and author of more than 300 journal papers.

Contributor Information

Xiaoning Qian, Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA.

Edward R. Dougherty, Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843 USA, and the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ 85004 USA.

REFERENCES

  • [1]. Kauffman S. The Origins of Order. Oxford University Press; New York: 1991.
  • [2]. Huang S, Ingber D. The structural and mechanical complexity of cell-growth control. Nature Cell Biology. 1999;1:E131–138. doi: 10.1038/13043.
  • [3]. Mazzocchi F. Complexity in biology. Exceeding the limits of reductionism and determinism using complexity theory. EMBO Reports. 2008;1:10–14. doi: 10.1038/sj.embor.7401147.
  • [4]. Pal R, Datta A, Dougherty E. Optimal infinite horizon control for probabilistic Boolean networks. IEEE Trans. on Signal Processing. 2006;54(6):2375–2387. Part 2.
  • [5]. Qian X, Dougherty E. Effect of function perturbation on the steady-state distribution of genetic regulatory networks: Optimal structural intervention. IEEE Trans. on Signal Processing. 2008;56(10):4966–4975. Part 1.
  • [6]. Yousefi M, Datta A, Dougherty E. Optimal intervention in Markovian gene regulatory networks with random-length therapeutic response to antitumor drug. IEEE Trans. on Biomedical Engineering. 2013;60(12):3542–3552. doi: 10.1109/TBME.2013.2272891.
  • [7]. Pal R, Datta A, Dougherty E. Robust intervention in probabilistic Boolean networks. IEEE Trans. on Signal Processing. 2008;56(3):1280–1294.
  • [8]. Pal R, Datta A, Dougherty E. Bayesian robustness in the control of gene regulatory networks. IEEE Trans. on Signal Processing. 2009;57(9):3667–3678.
  • [9]. Yoon B-J, Qian X, Dougherty E. Quantifying the objective cost of uncertainty in complex dynamical systems. IEEE Trans. on Signal Processing. 2013;61(9):2256–2266.
  • [10]. Kuznetsov V. Stable detection when the signal and spectrum of normal noise are inaccurately known. Telecommun. Radio Eng. 1976;30:31.
  • [11]. Kassam S, Lim T. Robust Wiener filters. Journal of the Franklin Institute. 1977;304(4-5):171–185.
  • [12]. Poor H. On robust Wiener filtering. IEEE Transactions on Automatic Control. 1980;25(3):531–536.
  • [13]. Vastola K, Poor H. Robust Wiener-Kolmogorov theory. IEEE Transactions on Information Theory. 1984;30(2):316–327.
  • [14]. Verdu S, Poor H. On minimax robustness: A general approach and applications. IEEE Transactions on Information Theory. 1984;30(2):328–340.
  • [15]. Poor H. Robust matched filters. IEEE Transactions on Information Theory. 1983;29(5):677–687.
  • [16]. Yang T, Lee J-H, Lee K, Sung K-M. On robust Kalman filtering with forgetting factor for sequential speech analysis. Signal Processing. 1997;63:151–156.
  • [17]. Grigoryan A, Dougherty E. Design and analysis of robust binary filters in the context of a prior distribution for the states of nature. Math. Imag. Vision. 1999;11(3):239–254.
  • [18]. Grigoryan A, Dougherty E. Bayesian robust optimal linear filters. Signal Processing. 2001;81(12):2503–2521.
  • [19]. Dalton L, Dougherty E. Intrinsically optimal Bayesian robust filtering. IEEE Trans. on Signal Processing. 2014;62(3):657–670.
  • [20]. Gelman A, Carlin J, Stern H, Rubin D. Bayesian Data Analysis. Chapman and Hall; Boca Raton: 1995.
  • [21]. Bernardo J, Smith A. Bayesian Theory. Wiley; Chichester: 2000.
  • [22]. Bishop C. Pattern Recognition and Machine Learning. Springer-Verlag; New York: 2006.
  • [23]. Hastie T, Tibshirani R, Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer; 2009.
  • [24]. Murphy K. Machine Learning: A Probabilistic Perspective. The MIT Press; 2012.
  • [25]. Mitchell T, Beauchamp J. Bayesian variable selection in linear regression. J. of the Amer. Stat. Assoc. 1988;83:1023–1036.
  • [26]. Kuo J, Mallick B. Variable selection for regression models. Sankhya Series B. 1998;60:65–81.
  • [27]. Zou H, Hastie T. Regularization and variable selection via the elastic net. J. of Royal. Stat. Soc. Series B. 2005;67(2):301–320.
  • [28]. Park T, Casella G. Bayesian LASSO. J. of Amer. Stat. Assoc. 2008;103(482):681–686.
  • [29]. Seeger M. Bayesian inference and optimal design in the sparse linear model. J. of Machine Learning Research. 2008;9:759–813.
  • [30]. Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J. of Royal. Stat. Soc. Series B. 2006;68(1):49–67.
  • [31]. Yuan M, Lin Y. Model selection and estimation in the Gaussian graphical model. Biometrika. 2007;94(1):15–35.
  • [32]. Dalton L, Dougherty E. Optimal classifiers with minimum expected error within a Bayesian framework – Part I: Discrete and Gaussian models. Pattern Recognition. 2013;46(5):1288–1300.
  • [33]. Watson J, Holmes C. Approximate models and robust decisions. Statistical Science. 2016 in press.
  • [34]. Pugachev V. Theory of Random Functions and Its Applications to Control Problems. Pergamon, Oxford, U.K.: Distributed by Addison-Wesley; Reading, MA, USA: 1965.
  • [35]. Dougherty E. Random Processes for Image and Signal Processing. SPIE/IEEE Ser. Imag. Sci. Eng.; Bellingham, WA, USA: 1999.
  • [36]. Kumar P, Varaiya P. Stochastic Systems: Estimation, Identification and Adaptive Control. Prentice Hall; 1986.
  • [37]. Wainwright M, Jordan M. Graphical models, exponential families and variational inference. Foundations and Trends in Machine Learning. 2003;2003:1–305.
  • [38]. Tierney L. Markov chains for exploring posterior distributions. Ann. Statist. 1994;22(4):1701–1728.
  • [39]. Knight J, Ivanov I, Dougherty E. MCMC implementation of the optimal Bayesian classifier for non-Gaussian models: Model-based RNA-Seq classification. BMC Bioinformatics. 2014;15:401. doi: 10.1186/s12859-014-0401-3.
  • [40]. Yang E, Ravikumar PP, Allen G, Liu Z. On Poisson graphical models. Neural Information Processing Systems (NIPS) 2013.
  • [41]. Jeffreys H. An invariant form for the prior probability in estimation problems. Proc. Royal Soc. of London, Series A. Math. and Physical Sciences. 1946;186(1007):453–461. doi: 10.1098/rspa.1946.0056.
  • [42]. Jaynes E. Prior probabilities. IEEE Trans. on Systems Science and Cybernetics. 1968;4(3):227–241.
  • [43]. Zellner A. Models, prior information, and Bayesian analysis. J. Econometrics. 1996;75(1):51–68.
  • [44]. Zellner A. Past and recent results on maximal data information priors. J. Statistical Research. 1998;32:1–22.
  • [45]. Berger J, Bernardo J. On the development of reference priors. Bayesian Statistics. 1992;4(4):35–60.
  • [46]. Esfahani M, Dougherty E. Incorporation of biological pathway knowledge in the construction of priors for optimal Bayesian classification. IEEE/ACM Trans. on Comput. Biol. & Bioinfo. 2014;11(1):202–218. doi: 10.1109/TCBB.2013.143.
  • [47]. Hua J, Sima C, Cypert M, Gooden G, Shack S, Alla L, Smith E, Trent J, Dougherty E, Bittner M. Tracking transcriptional activities with high-content epifluorescent imaging. J. Biomedical Optics. 2012;17(4):046008-1–046008-15. doi: 10.1117/1.JBO.17.4.046008.
  • [48]. D’haeseleer P, Wen X, Fuhrman S. Linear modeling of mRNA expression levels during CNS development and injury. Pac Symp Biocomput. 1999;4:41–52. doi: 10.1142/9789814447300_0005.
  • [49]. Yeung M, Tegner J, Collins J. Reverse engineering gene networks using singular value decomposition and robust regression. Proc Natl Acad Sci USA. 2003;99(9):6163–6168. doi: 10.1073/pnas.092576199.
  • [50]. Bernard A, Hartemink A. Informative structure priors: joint learning of dynamic regulatory networks from multiple types of data. Pac Symp Biocomput. 2005;10:449–470.
  • [51]. Rogers S, Girolami M. A Bayesian regression approach to the inference of regulatory networks from gene expression data. Bioinformatics. 2005;21(14):3131–3137. doi: 10.1093/bioinformatics/bti487.
  • [52]. Schäfer J, Strimmer K. An empirical Bayes approach to inferring large-scale gene association networks. Bioinformatics. 2005;21:754–764. doi: 10.1093/bioinformatics/bti062.
  • [53]. Bansal M, Gatta G, di Bernardo D. Inference of gene regulatory networks and compound mode of action from time course gene expression profiles. Bioinformatics. 2006;22:815–822. doi: 10.1093/bioinformatics/btl003.
  • [54]. Cantone I, Marucci L, Iorio F, Ricci M, Belcastro V, Bansal M, Santini S, di Bernardo M, di Bernardo D, Cosma M. A yeast synthetic network for in vivo assessment of reverse-engineering and modeling approaches. Cell. 2009;137(1):172–181. doi: 10.1016/j.cell.2009.01.055.
  • [55]. Marbach D, Costello J, Küffner R, Vega N, Prill R, Camacho D, Allison K, DREAM Challenge. Kellis M, Collins J, Stolovitzky G. Wisdom of crowds for robust gene network inference. Nat Methods. 2012;9(8):796–804. doi: 10.1038/nmeth.2016.
  • [56]. Hasegawa T, Yamaguchi R, Nagasaki M, Miyano S, Imoto S. Inference of gene regulatory networks incorporating multi-source biological knowledge via a state space model with L1 regularization. PLoS ONE. 2014;9(8):e105942. doi: 10.1371/journal.pone.0105942.
  • [57]. She Y, He Y, Li S, Wu D. Joint association graph screening and decomposition for large-scale linear dynamical systems. IEEE Trans. Signal Processing. 2015;63(2):389–401.
  • [58]. Omranian N, Eloundou-Mbebi J, Mueller-Roeber B, Nikoloski Z. Gene regulatory network inference using fused LASSO on multiple data sets: application to escherichia coli. Scientific Reports. 2016;6:20533. doi: 10.1038/srep20533.
  • [59]. Gao X, Zhang Y, Arrazola P, Hino O, Kobayashi T, Yeung R, Ru B, Pan D. TSC tumour suppressor proteins antagonize Amino-Acid-TOR signalling. Nature Cell Biology. 2002;4(9):699–704. doi: 10.1038/ncb847.
  • [60]. Boulesteix A-L. Over-optimism in bioinformatics research. Bioinformatics. 2010;26(3):437–439. doi: 10.1093/bioinformatics/btp648.
