Abstract
Motivated by the increasing use of and rapid changes in array technologies, we consider the prediction problem of fitting a linear regression relating a continuous outcome Y to a large number of covariates X, eg measurements from current, state-of-the-art technology. For most of the samples, only the outcome Y and surrogate covariates, W, are available. These surrogates may be data from prior studies using older technologies. Owing to the dimension of the problem and the large fraction of missing information, a critical issue is appropriate shrinkage of model parameters for an optimal bias-variance tradeoff. We discuss a variety of fully Bayesian and Empirical Bayes algorithms which account for uncertainty in the missing data and adaptively shrink parameter estimates for superior prediction. These methods are evaluated via a comprehensive simulation study. In addition, we apply our methods to a lung cancer dataset, predicting survival time (Y) using qRT-PCR (X) and microarray (W) measurements.
Keywords and phrases: High-dimensional data, Markov chain Monte Carlo, missing data, measurement error, shrinkage
1. Introduction
The ongoing development of array technologies for assaying genomic information has resulted in an abundance of datasets with many predictors and presents both statistical opportunities and challenges. As an example, Chen et al. (2011) analyzed a gene-expression microarray dataset of 439 lung adenocarcinomas from four cancer centers in the United States, with the goal of using gene expression to improve predictions of survival time relative to using clinical covariates alone. Expression was measured using Affymetrix oligonucleotide microarray technology. After pre-screening the probes for consistency between centers, the authors initially evaluated 13,306 probes for construction of their predictor.
A clinical challenge to a candidate model which uses Affymetrix data is its application for predictions in new patients. The underlying complexity of Affymetrix data, including necessary preprocessing, requires specialized laboratory facilities, which will be locally unavailable at smaller hospitals. On the other hand, quantitative real-time polymerase chain reaction (qRT-PCR) offers a faster and more efficient assay of the same underlying genomic information, making a qRT-PCR-based prediction model clinically applicable. The tradeoff comes from the limited number of genes which may be assayed on a single qRT-PCR card. Thus, from the Affymetrix data, 91 promising genes were first identified. These 91 genes were then re-assayed with qRT-PCR. Because of tissue availability issues owing to the multi-center nature of the study, only 47 out of 439 tumors were re-assayed by qRT-PCR, creating a significant missing data problem.
Motivated by this problem, in this paper we consider the analysis of a dataset with many predictors in which a large block of covariates are missing, a situation for which there is limited previous literature. To maintain relevance to the application which drives our methodology, we assume the data have two distinctive features. First, the number of covariates, ie genes, is of moderate size, approximately the same order as the number of observations. This precludes both a more traditional regression situation as well as an “ultra-high-dimensional” regression and reflects that an initial screening has identified a subset of potentially informative genes. Second, there are two versions of the genomic data: measurements from a prior technology, which are complete for all observations, and measurements from a newer, more efficient technology, which are observed only on a small subset of the observations. Owing to the inherent variability in parameter estimates induced by both the missing data and the dimensionality of the problem, we consider Bayesian approaches, which allow for the application of shrinkage methods, in turn offering better prediction.
Translating this into statistical terminology, we consider predicting an outcome Y given length-p covariates X. Assuming Y is continuous and fully observed, we use the linear model
(1)  Y = β0 + X⊤β + ε,  ε ~ N{0, σ²}.
All observations contain Y and W, which is an error-prone length-p surrogate for the true covariate X. On a small number of observations of size nA, subsample A, we also observe X, which is missing for the remaining subjects, constituting subsample B, of size nB. Complete observations, then, contain an outcome Y, covariates X, and surrogates W. Subsample A is written as {yA, xA, wA} and subsample B as {yB, wB}. The true covariates from subsample B, xB, are unmeasured. The data are schematically presented in Figure S1 of the supplemental article (Boonstra, Mukherjee and Taylor, 2013).
Our goal is a predictive model for Y|X as in equation (1), but because W is correlated with X, subsample B contains information about β. Moreover, shrinkage of regression coefficients may alleviate problems associated with multicollinearity of covariates. Boonstra, Taylor and Mukherjee (2013) proposed a class of targeted ridge (TR) estimators of β, shrinking estimates toward a target constructed using subsample B, making a bias-variance tradeoff. The amount of shrinkage can be data-adaptive with a tuning parameter, say λ. In a simulation study of datasets with many predictors, they showed that two biased methods, a modified regression calibration algorithm and a “hybrid” estimator, which is a linear combination of multiple TR estimators with data-adaptive weights, uniformly out-perform standard regression calibration, an unbiased method, in terms of mean-squared prediction error (MSPE):
(2)  MSPE(β̂) = E(Ynew − β̂0 − Xnew⊤β̂)²,
where the expectation is over Ynew, Xnew, yA, yB|xA, wA, wB.
However, there are reasons to consider alternative strategies. The authors showed the TR estimator can be viewed as a missing data technique: make an imputation x̃B of the missing xB and calculate β̂ treating the data as complete. When the shrinkage is data-adaptive through the tuning parameter λ, there is an intermediate stage: choose λ given x̃B. Uncertainty in x̃B or λ is not propagated in the TR estimators, thus it can be viewed as improper imputation (Little and Rubin, 2002). Moreover, to choose λ, a generalized cross validation (GCV) criterion was applied to subsample A. Although GCV asymptotically chooses the optimal value of λ (Craven and Wahba, 1979), it can overfit in finite sample sizes, and an approach for estimating λ which also uses information in subsample B is preferred. Finally, constructing prediction intervals corresponding to the point-wise predictions generated by the class of TR estimators requires use of the bootstrap. This resampling process is computationally intensive and provides coverage which may not be nominal.
These reasons, ie characterizing prediction uncertainty and unifying shrinkage, imputation of missing data, and an adaptive choice of λ, motivate a fully Bayesian approach toward the same goal of improving predictions using auxiliary data. Consider the generic hierarchical model presented in Figure 1. Known (unknown, respectively) quantities are bounded by square (circular) nodes. Instead of splitting the data into subsamples (cf. Figure S1), we classify it more broadly into observed (Uobs) and missing (Umis) components. Let φ denote parameters of interest and nuisance parameters in the underlying joint likelihood of {Uobs, Umis}. Regularization of φ is achieved through the shrinkage parameter η, equivalently interpreted in Figure 1 as the hyperparameters which index a prior distribution on φ. One can impose another level of hierarchy through a hyperprior distribution on η. Using [·] and [·|·] to denote marginal and conditional distributions, draws from [Umis, φ, η|Uobs], the distribution of unknown random quantities conditional on the observed data, constitute proper imputation and incorporate all of the information in the data. Summary values, like posterior means, as well as measurements of uncertainty, like highest posterior density credible intervals and prediction intervals, can easily be calculated based on posterior draws.
Fig 1.
A hierarchical model with missing data Umis and observed data Uobs. The shrinkage penalty parameters η are the hyperparameters of φ, the quantity(ies) of primary interest
Placing the shrinkage parameter η in a hierarchical framework allows the flexibility to determine both which components of φ to shrink and to what extent. As an example of the former, Boonstra, Taylor and Mukherjee (2013) shrink estimates of the regression coefficients β, tuned by the parameter λ. However, for improved prediction of the outcome Y, it may be beneficial to shrink the parameters generating the missing data xB. For example, in a non-missing-data context, the SCOUT method (Witten and Tibshirani, 2009) shrinks the estimate of Var(X) for better prediction. As for the extent of shrinkage, the hyperparameter-equivalence of the tuning parameters allows for the use of Empirical Bayes algorithms to estimate η. This has been used in the Bayesian Lasso (Park and Casella, 2008; Yi and Xu, 2008).
This paper makes two primary contributions. First, in Section 2, we discuss variants of the Gibbs sampler (Geman and Geman, 1984), a key algorithm for fitting hierarchical models with missing data. Here, we keep the context broad, assuming a generic hierarchical model indexed by φ with missing data Umis and unspecified hyperparameters η, as in Figure 1. One variant, Data Augmentation (Tanner and Wong, 1987), is a standard Bayesian approach to missing data, and all unknown quantities have prior distributions. Two others are Empirical Bayes methods: the Monte Carlo Expectation-Maximization Algorithm (Wei and Tanner, 1990) and the Empirical Bayes Gibbs sampler (Casella, 2001). Although proposed for seemingly different problems, we argue that the sampling strategies in each are special cases of that in Figure 1: variants of the same general algorithm, which we call EM-within-Gibbs. This previously-unrecognized link is important, given the increasing role Empirical Bayes methods play in modern applications. The second primary contribution builds on this proposed framework (Section 3), namely a comparison of several fully Bayesian and Empirical Bayes options and their application to our motivating genomic analysis. Of note in the data are two crucial features: first, φ, comprised of β0, β, σ2 plus parameters for modeling the distribution of X, is of a significant dimension, so that fitting a model with no missing data would still be somewhat challenging, and second, the number of partial observations where X is missing is larger than the number of complete observations. Meaningful analysis then requires the regularization, or shrinkage, of φ via an appropriate specification of the hierarchy and choice of η. We propose to shrink several different components of φ, making use of the simultaneous interpretation of η as a shrinkage penalty and a hyperparameter on φ. We evaluate these methods via a comprehensive simulation study (Section 4), also considering robustness of these methods under model misspecification. Finally, we analyze the Chen et al. dataset (Section 5). We include ridge regression (Hoerl and Kennard, 1970) as a reference, because the additional modeling assumptions of the other likelihood-based methods offer efficiency gains only when they are satisfied.
2. Gibbs Sampler Variants
In this Section, we discuss four existing variants of the Gibbs sampler relevant to our analysis. As we will argue, two of these are special cases of a more general variant, which we call “Empirical Bayes Within Gibbs” (EWiG), an equivalence that has not been established previously, leaving three distinct variants. We define a “variant” here as the characterization of a posterior distribution plus an algorithm for fitting the model. All variants are summarized in Table 1.
Table 1.
A comparison of the general form of the Gibbs sampler variants from Section 2 as they were originally proposed. Differences between posteriors depend on the presence of missing data Umis and whether the hyperparameters η are fully known. Differences in algorithms depend on how the lowest level of the hierarchy, which is unknown, is treated. In particular, MCEM differs from DA because it returns only an estimate of the posterior mode.
Variant | Posterior | Prior on η
---|---|---
DA (Tanner and Wong, 1987) | [φ, Umis|Uobs, η] ∝ [Uobs, Umis|φ] × [φ|η] | No
DA+ (Gelfand and Smith, 1990) | [φ, Umis, η|Uobs] ∝ [Uobs, Umis|φ] × [φ|η] × [η] | Yes
MCEM (Wei and Tanner, 1990) | [φ, Umis|Uobs, η] ∝ [Uobs, Umis|φ] × [φ|η] | No
EBGS (Casella, 2001) | [φ|Uobs, η] ∝ [Uobs|φ] × [φ|η] | No
EWiG | [φ, Umis|Uobs, η] ∝ [Uobs, Umis|φ] × [φ|η] | No
Data Augmentation (DA+, DA) (Tanner and Wong, 1987)
These two variants are natural Bayesian treatments of missing data: Umis and φ are both unobserved random variables. In DA+, which is given above, the hyperparameters η are also unknown (Gelfand and Smith, 1990). In DA, a value for η is chosen. In either case, draws of φ and Umis are sequentially made from their conditional posteriors. In DA+ only, η is also sampled from its conditional posterior. Then, in either DA or DA+, the whole process is iterated. Tanner and Wong prove that iterations will eventually yield a draw from the true posterior distribution of interest, [φ, Umis, η|Uobs] for DA+ or [φ, Umis|Uobs, η] for DA. The full conditional distribution [φ|Uobs, Umis, η] may be difficult to specify. Suppose instead a set of partial conditional distributions is available, [φJ|φ(J), Uobs, Umis, η], where the set of J’s forms a partition of the vector φ. Then under mild conditions, repeated iterative sampling from these partial conditional distributions will also yield draws from the true posterior (Geman and Geman, 1984).
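To make the DA recipe concrete, the sketch below implements it for a one-covariate analogue of our motivating problem: y = b0 + b·x + e, with a surrogate w = x + u observed everywhere and x missing in subsample B. This is our own illustrative code, not the implementation used in the paper; for brevity the surrogate error variance tau2, the variance of x, ψ = 0, and ν = 1 are treated as known, and the remaining parameters receive flat/Jeffreys priors so that η is empty (DA rather than DA+).

```r
set.seed(1)
nA <- 30; nB <- 100; tau2 <- 0.5; sx2 <- 1            # assumed known in this toy
xA <- rnorm(nA, 0, sqrt(sx2)); xB <- rnorm(nB, 0, sqrt(sx2))
wA <- xA + rnorm(nA, 0, sqrt(tau2)); wB <- xB + rnorm(nB, 0, sqrt(tau2))
yA <- 1 + 2 * xA + rnorm(nA); yB <- 1 + 2 * xB + rnorm(nB)
# xB is now treated as missing: only {yA, xA, wA} and {yB, wB} are used below

b0 <- 0; b <- 0; s2 <- 1                              # initial values of phi
draws <- matrix(NA, 2000, 3, dimnames = list(NULL, c("b0", "b", "s2")))
for (t in 1:2000) {
  ## 1. Impute Umis = x_B from its normal full conditional
  prec <- b^2 / s2 + 1 / tau2 + 1 / sx2
  m <- (b * (yB - b0) / s2 + wB / tau2) / prec
  x_mis <- rnorm(nB, m, sqrt(1 / prec))
  ## 2. Draw phi = (b0, b, s2) given the completed data: a blocked draw
  ##    under the Jeffreys prior 1 / s2
  X <- cbind(1, c(xA, x_mis)); y <- c(yA, yB)
  XtX_inv <- solve(crossprod(X))
  bhat <- XtX_inv %*% crossprod(X, y)
  s2 <- 1 / rgamma(1, shape = (length(y) - 2) / 2,
                   rate = sum((y - X %*% bhat)^2) / 2)
  coefs <- bhat + t(chol(s2 * XtX_inv)) %*% rnorm(2)
  b0 <- coefs[1]; b <- coefs[2]
  draws[t, ] <- c(b0, b, s2)
}
colMeans(draws[-(1:500), ])                           # posterior means after burn-in
```

Adding full conditional draws for the parameters held fixed here, together with a draw of any hyperparameters from their own conditional posterior, turns this DA sketch into DA+.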
Monte Carlo Expectation-Maximization (MCEM) (Wei and Tanner, 1990)
MCEM provides a point estimate of φ rather than an estimate of the posterior distribution, as DA/DA+ do. It is a modification of the original EM algorithm (Dempster, Laird and Rubin, 1977), replacing an intractable expectation with a Monte Carlo average over multiple imputations. K draws of Umis are sampled conditional on the current value of φ, denoted φ(i−1). The expected complete-data log-posterior is then approximated by a Monte Carlo average over these imputations and maximized with respect to φ to obtain φ(i). When φ has a flat prior, as in the originally proposed MCEM, {φ(i)} will converge to the maximum likelihood estimate (MLE) of φ. If an informative prior is specified through a particular choice of η, the sequence will converge to a penalized MLE (Green, 1990).
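For the same toy model as in the DA sketch, the following is a hedged sketch of MCEM (our illustration under the same simplifying assumptions, not the paper's code): the E-step imputes K copies of x_B given the current estimate of φ, and the M-step maximizes the Monte Carlo average of the complete-data log-likelihood, which for a Gaussian outcome model reduces to least squares on the K stacked completed datasets.

```r
set.seed(1)
nA <- 30; nB <- 100; tau2 <- 0.5; sx2 <- 1            # assumed known in this toy
xA <- rnorm(nA, 0, sqrt(sx2)); xB <- rnorm(nB, 0, sqrt(sx2))
wA <- xA + rnorm(nA, 0, sqrt(tau2)); wB <- xB + rnorm(nB, 0, sqrt(tau2))
yA <- 1 + 2 * xA + rnorm(nA); yB <- 1 + 2 * xB + rnorm(nB)  # xB treated as missing

theta <- c(b0 = 0, b = 0, s2 = 1)                     # current estimate of phi
K <- 25                                               # imputations per E-step
for (iter in 1:50) {
  ## E-step: K imputations of x_B given the current theta
  prec <- theta[["b"]]^2 / theta[["s2"]] + 1 / tau2 + 1 / sx2
  m <- (theta[["b"]] * (yB - theta[["b0"]]) / theta[["s2"]] + wB / tau2) / prec
  x_imp <- replicate(K, rnorm(nB, m, sqrt(1 / prec))) # nB x K matrix
  ## M-step: maximize the Monte Carlo average of the complete-data
  ## log-likelihood; with a flat prior this is OLS on the stacked datasets
  y_stack <- c(rep(yA, K), rep(yB, K))
  x_stack <- c(rep(xA, K), as.vector(x_imp))
  fit <- lm(y_stack ~ x_stack)
  theta <- c(b0 = unname(coef(fit)[1]), b = unname(coef(fit)[2]),
             s2 = mean(residuals(fit)^2))
}
theta                                                 # approximate MLE of {b0, b, s2}
```

Because the surrogate and covariate-distribution parameters are held fixed in this toy, only the outcome model enters the M-step; in the full problem the M-step would also update ψ, ν, τ², μX, and ΣX.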
Empirical Bayes Gibbs Sampling (EBGS) (Casella, 2001)
EBGS allows the data to determine a value for the hyperparameter η. In the context of Casella, there are no missing data Umis. However, φ is considered missing for purposes of determining η: choose η which maximizes its marginal log-likelihood, ln[Uobs|η]. As in MCEM, an EM-type algorithm can maximize this intractable log-likelihood. K draws of φ are made from the current estimate of its posterior, and η is updated by maximizing a Monte Carlo estimate of E [ln[φ|η]], where the expectation is over the distribution [φ|Uobs, η(i)]. This expected complete-data log-likelihood relates to the desired marginal log-likelihood as follows. First observe that

ln[Uobs|η] = ln[Uobs|φ] + ln[φ|η] − ln[φ|Uobs, η],

which holds for every value of φ, so that taking expectations of both sides with respect to [φ|Uobs, η(i)] leaves the left-hand side unchanged. Let C = E [ln[Uobs|φ]], which is constant with respect to η. Then,

ln[Uobs|η] = C + E [ln[φ|η]] − E [ln[φ|Uobs, η]].

Because E [ln[φ|Uobs, η]] ≤ E [ln[φ|Uobs, η(i)]] for any η, we have the result that maximizing E [ln[φ|η]] (or a Monte Carlo approximation thereof) over η will increase ln[Uobs|η], and iterating these updates converges to a local maximum.
EM-within-Gibbs (EWiG)
Importantly, both MCEM and EBGS allow the lowest level of the hierarchy to be adaptively determined by the data rather than chosen a priori. In MCEM, this lowest level is φ, and in EBGS, it is η. However, MCEM can be expanded in the presence of an unknown η by putting both Umis and φ into the imputation step, so φ is sampled rather than optimized. The maximization step determines η. This returns to the original goal of DA+/DA, which is determining the posterior distribution of φ. Equivalently, we can take the perspective of expanding EBGS: add an imputation step for sampling Umis, keeping the maximization step the same. As a result of this equivalence, expanding either MCEM or EBGS yields the same result, what we call EWiG, given above. Because η is unknown, the hierarchical model here is the same as that given in Figure 1.
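The common structure can be written as a single loop. The sketch below is our schematic skeleton of EWiG, not code from the paper; draw_Umis(), draw_phi(), and update_eta() are hypothetical, model-specific functions supplied by the user.

```r
# EWiG skeleton: Gibbs draws of (Umis, phi) with a periodic M-step for eta.
# draw_Umis(phi) samples the missing data, draw_phi(Umis, eta) samples the
# parameters given completed data, and update_eta(phi_draws) maximizes a Monte
# Carlo estimate of E[ln [phi | eta]] over eta using the last K draws of phi.
ewig <- function(draw_Umis, draw_phi, update_eta, phi0, eta0, n_iter, K) {
  phi <- phi0; eta <- eta0
  recent_phi <- vector("list", K)          # buffer holding the last K draws of phi
  out <- vector("list", n_iter)
  for (t in seq_len(n_iter)) {
    Umis <- draw_Umis(phi)                 # imputation step, part 1
    phi  <- draw_phi(Umis, eta)            # imputation step, part 2
    recent_phi[[1 + (t - 1) %% K]] <- phi
    if (t %% K == 0) eta <- update_eta(recent_phi)   # maximization step
    out[[t]] <- list(phi = phi, eta = eta)
  }
  out
}
```

Dropping the missing-data draw recovers EBGS; fixing eta and dropping the M-step recovers DA; replacing the M-step with a draw of eta from its full conditional recovers DA+.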
In summary, we have asserted that MCEM and EBGS are special cases of EWiG, so there are three distinct variants which we apply to our problem in the following section: DA, DA+, and EWiG.
3. Specification of Likelihood and Priors
The discussion so far has been deliberately generic. We now specify a likelihood for our problem of interest, which in turn gives φ, and apply these Gibbs variants to several combinations of (i) choices of priors [φ|η] and (ii) values of the hyperparameter η. Translating the quantities in Figure 1 to our problem, we have Uobs = {yA, yB, xA, wA, wB} and Umis = xB. A commonly used factorization of the joint likelihood is [Y, X, W] = [Y|X][W|X][X], which makes a conditional independence assumption [Y|X, W] = [Y|X]. An alternative factorization is [Y|X][X|W], which we do not consider as it is inconsistent with the application-driven measurement error structure of W and X. We make the following assumptions:
(3)  Y|X ~ N{β0 + X⊤β, σ²},  W|X ~ Np{ψ1p + νX, τ²Ip},  X ~ Np{μX, ΣX}.
The likelihood has an outcome model relating Y to X, a measurement error model relating the error-prone W to X, and a multivariate distribution for X. Thus, φ = {β0, β, σ, ψ, ν, τ, μX, ΣX}, and η is described below. Of interest is prediction of a new value Ynew given Xnew, eg Ŷnew = β0* + Xnew⊤β*, where β0* and β* are posterior summaries of β0 and β. Uncertainty is quantified using the empirical distribution of {Ŷnew(t)}, where {β0(t), β(t), σ²(t)} is the set of posterior draws and Ŷnew(t) = β0(t) + Xnew⊤β(t) + ε(t), ε(t) ~ N{0, σ²(t)}. If xB = Umis were observed, the complete log-likelihood would be
(4)  ℓ(φ; Uobs, xB) = Σi [ ln N{yi; β0 + xi⊤β, σ²} + ln Np{wi; ψ1p + νxi, τ²Ip} + ln Np{xi; μX, ΣX} ],

where the sum runs over all nA + nB observations and N{·; m, v} denotes the corresponding normal density.
The log-likelihood gives the imputation step:
(5)  xB,i | yB,i, wB,i, φ ~ Np{mi, V},  i = 1, …, nB,

where V = (σ⁻²ββ⊤ + ν²τ⁻²Ip + ΣX⁻¹)⁻¹ and mi = V{σ⁻²β(yB,i − β0) + ντ⁻²(wB,i − ψ1p) + ΣX⁻¹μX}. Note that, stacking the mi, the mean is an nB × p matrix, each row representing the mean vector corresponding to a length-p observation, but the covariance is shared. The imputation is defined only by the likelihood and is common to all methods we consider; the differences lie in the choice of prior [φ|η] and the hyperparameter η. These crucially determine the nature and extent of shrinkage induced on φ. In what follows, we propose several options, summarized in Table 2.
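As an illustration of the imputation step in (5), the following sketch (our own notation and function name, with ψ and ν scalars as in (3)) draws the missing rows of xB from their common-covariance normal full conditional, given current values of the parameters.

```r
library(MASS)  # for mvrnorm()

# A sketch of the imputation step (5): draw the nB missing rows of x_B (nB > 1
# assumed) given the current parameter draws. wB is an nB x p matrix, yB a
# length-nB vector, and SigmaXinv the current draw of the precision of X.
impute_xB <- function(yB, wB, beta0, beta, sigma2, psi, nu, tau2, muX, SigmaXinv) {
  p <- length(beta)
  V <- solve(tcrossprod(beta) / sigma2 + (nu^2 / tau2) * diag(p) + SigmaXinv)
  # nB x p matrix whose i-th row is the conditional mean m_i
  M <- (outer(yB - beta0, beta) / sigma2 +
        (nu / tau2) * (wB - psi) +
        matrix(SigmaXinv %*% muX, nrow(wB), p, byrow = TRUE)) %*% V
  M + mvrnorm(nrow(wB), mu = rep(0, p), Sigma = V)
}
```

Because the covariance V is shared across the nB rows, it is computed once per Gibbs iteration.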
Table 2.
A summary of all Gibbs samplers and choices of priors we considered. Λ is constrained to the class of diagonal matrices. VANILLA and EBSIGMAX require that p ≤ nA + nB.
Method | [β|η] | [ΣX⁻¹|η] | η | Variant
---|---|---|---|---
VANILLA | ∝ 1 (flat) | W{3p, (3p Σ̂D)⁻¹} | {} | DA
HIERBETAS | Np{0p, (σ²/λ)Ip} | W{3p, (3p Σ̂D)⁻¹} | {λ} | DA+
EBBETAS | Np{0p, (σ²/λ)Ip} | W{3p, (3p Σ̂D)⁻¹} | {λ} | EWiG
EBSIGMAX | ∝ 1 (flat) | W{3p, Λ⁻¹} | {Λ} | EWiG
EBBOTH | Np{0p, (σ²/λ)Ip} | W{3p, Λ⁻¹} | {λ, Λ} | EWiG
VANILLA
As a baseline approach, we apply DA to the problem. The choice of prior is
(6)  [φ|η] ∝ (σ²)⁻¹ (τ²)⁻¹ × [ΣX⁻¹],  with ΣX⁻¹ ~ W{3p, (3p Σ̂D)⁻¹},

where Σ̂D is the diagonal part of the empirical covariance of xA. This is a Jeffreys prior on each component of φ except ΣX⁻¹ (see Remark 1 below), and η is known. The product of expressions (4) and (6) yields the full conditional distributions of each component of φ. For brevity, we present only the Gibbs steps for β and ΣX⁻¹; the complete set of full conditional distributions is given in the supplemental article (Boonstra, Mukherjee and Taylor, 2013).
(7)  β | · ~ Np{(x⊤x)⁻¹x⊤(y − β0 1n), σ²(x⊤x)⁻¹},
     ΣX⁻¹ | · ~ W{n + 3p, (Σi (xi − μX)(xi − μX)⊤ + 3p Σ̂D)⁻¹},

where n = nA + nB, y = (yA⊤, yB⊤)⊤, x is the n × p matrix stacking xA and the current draw of xB, and 1n is a length-n vector of ones.
The Wishart distribution with d degrees of freedom, W{d, S}, has mean dS. As made clear in the matrix inversion in (7), VANILLA may only be implemented when p ≤ nA + nB.
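For concreteness, the two steps displayed in (7) can be implemented as below. This is a sketch in our notation (function names and arguments are ours): x stacks xA and the current draw of xB, y stacks yA and yB, and SD_A denotes Σ̂D, the diagonal part of the empirical covariance of xA.

```r
# Draw beta from its full conditional under the flat prior in (6),
# conditioning on beta0 and sigma2; requires p <= nA + nB.
draw_beta <- function(x, y, beta0, sigma2) {
  XtX_inv <- solve(crossprod(x))
  m <- XtX_inv %*% crossprod(x, y - beta0)
  drop(m + t(chol(sigma2 * XtX_inv)) %*% rnorm(ncol(x)))
}

# Draw the precision of X from its Wishart full conditional, using base R's
# rWishart(); SD_A is the diagonal part of the empirical covariance of xA.
draw_SigmaXinv <- function(x, muX, SD_A) {
  p <- ncol(x); n <- nrow(x)
  S <- crossprod(sweep(x, 2, muX, "-"))   # sum_i (x_i - muX)(x_i - muX)^T
  rWishart(1, df = n + 3 * p, Sigma = solve(S + 3 * p * SD_A))[, , 1]
}
```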
Remark 1
A Jeffreys prior on ΣX⁻¹ may result in an improper joint posterior if nB ≫ nA and p is large, ie when the fraction of missing data is large. From our numerical studies and monitoring of trace plots, even a minimally proper prior on ΣX⁻¹, that is, using p + 1 degrees of freedom, does not ensure a proper posterior. We assume a priori ΣX⁻¹ ~ W{3p, (3p Σ̂D)⁻¹}, a data-driven choice, the density of which is given in (6). The prior mean of ΣX⁻¹ is Σ̂D⁻¹, and the prior mean of ΣX is 3p Σ̂D/(2p − 1) ≈ (3/2)Σ̂D. Heuristic numeric evidence shows that 3p degrees of freedom works well, but we have not demonstrated a theoretical optimality for this. Other values that ensure convergence are equally defensible.
We call the Gibbs sampler which uses this mildly informative prior specification VANILLA. All the other methods we propose will have modified Gibbs steps for two components of φ: β and ΣX⁻¹. Shrinking β is a clear choice: from (3), β is closely tied to prediction of Y|X. As for ΣX⁻¹, this determines in part the posterior variance of xB (5); as this variance increases, the posterior variance of β decreases (7), thereby shrinking draws of β. Other factors in the variance of xB, like τ², are additional candidates for shrinkage, but we do not pursue this here.
3.1. Adaptive Prior on β
Since we are interested in regularizing predictions of the outcome Y, a natural candidate for shrinkage via an informative prior is the parameter vector β, which yields the conditional mean of Y|X. Ridge regression offers favorable predictive capabilities (Frank and Friedman, 1993), and the ℓ2 penalty on the norm of β is conjugate to the Normal log-likelihood. For these reasons, we replace the Jeffreys prior on β in (6) with
(8)  [β | σ², λ] ∝ (λ/σ²)^(p/2) exp{−λ β⊤β/(2σ²)},  ie β | σ², λ ~ Np{0p, (σ²/λ)Ip}.
This Normal prior on β is analogous to Bayesian ridge regression. λ is a hyperparameter, ie η = {λ}. Conditional upon λ, the Gibbs step for β is

β | · ~ Np{(x⊤x + λIp)⁻¹x⊤(y − β0 1n), σ²(x⊤x + λIp)⁻¹}.
Thus, the posterior mean of β is shrunk toward zero, with smaller posterior variance. As we have outlined in Section 2, there are several options for the treatment of λ.
HIERBETAS
Following Gelfand and Smith (1990), we can treat the hyperparameter λ as random (DA+) with prior distribution [λ] ∝ λ⁻¹. Then, we have the following additional posterior step: λ ~ G{p/2, β⊤β/(2σ²)}. This Bayesian ridge regression with posterior sampling of λ is denoted by HIERBETAS.
EBBETAS
Alternatively, we may apply EWiG to estimate λ. That is, integrate ln[β|σ², λ] with respect to the density [φ|Uobs, λ], differentiate with respect to λ, and solve for λ. The resulting EWiG update is λ = pK{Σt β(t)⊤β(t)/σ²(t)}⁻¹. This is a Monte Carlo estimate of p{E [(β⊤β)/σ²]}⁻¹, the maximizer of the marginal likelihood of λ. The update occurs at every Kth iteration of the algorithm using the previous K draws of β and σ²; larger values of K yield a more precise estimate. This Bayesian ridge with an Empirical Bayes update of λ is denoted by EBBETAS.
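In code, this update is a one-line Monte Carlo plug-in. The sketch below uses our own (assumed) storage format: beta_draws is a K × p matrix of the previous K draws of β and sigma2_draws the corresponding length-K vector of draws of σ².

```r
# EWiG update of lambda: a Monte Carlo estimate of p / E[beta' beta / sigma2],
# computed from the most recent K posterior draws.
update_lambda <- function(beta_draws, sigma2_draws) {
  p <- ncol(beta_draws)
  K <- nrow(beta_draws)
  p * K / sum(rowSums(beta_draws^2) / sigma2_draws)
}
```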
3.2. Adaptive Prior on ΣX⁻¹
(EBSIGMAX, EBBOTH) We noted previously that an informative prior on ΣX⁻¹ is necessary to ensure a proper joint posterior: ΣX⁻¹ ~ W{3p, (3p Σ̂D)⁻¹}, which has inverse scale matrix 3p Σ̂D. As we have noted, shrinkage of ΣX⁻¹ is closely related to that of β. This was exploited by Witten and Tibshirani (2009) in the SCOUT procedure, suggesting that prediction can be improved through adaptive regularization of ΣX⁻¹. Leaving the inverse scale matrix unspecified, the prior is
(9)  ΣX⁻¹ | Λ ~ W{3p, Λ⁻¹},  ie [ΣX⁻¹ | Λ] ∝ |Λ|^(3p/2) |ΣX⁻¹|^((2p−1)/2) exp{−tr(Λ ΣX⁻¹)/2}.
Λ is the unknown positive-definite matrix of hyperparameters. The full conditional distribution of ΣX⁻¹ becomes
(10)  ΣX⁻¹ | · ~ W{nA + nB + 3p, (Σi (xi − μX)(xi − μX)⊤ + Λ)⁻¹}.
Λ may be random, or it can be updated with an EWiG step. Given the potential difficulty in precisely estimating an unconstrained matrix which maximizes the marginal likelihood, we constrain Λ to be diagonal. Under this constraint, the EWiG update for the ith diagonal of Λ is Λii = 3pK{Σt (ΣX⁻¹(t))ii}⁻¹, where (ΣX⁻¹)ii indicates the ith diagonal element of ΣX⁻¹. Then, Λ = diag{Λ11, …, Λpp}. This is a Monte Carlo approximation of 3p{E [(ΣX⁻¹)ii]}⁻¹, the minimizer of −E [ln[ΣX⁻¹|Λ]] with respect to Λ, subject to the diagonal constraint, with [ΣX⁻¹|Λ] as given in (9). This approach is denoted as EBSIGMAX. Like VANILLA, EBSIGMAX may only be implemented when p ≤ nA + nB. Finally, let EBBOTH be the approach which uses both priors in (8) and (9) with EWiG updates for λ and Λ. These alternatives are all summarized in Table 2.
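A sketch of the corresponding update for Λ, assuming (our storage convention) that the previous K draws of ΣX⁻¹ are kept in a list of p × p matrices:

```r
# Diagonal-constrained EWiG update: Lambda_ii = 3p / (average of the i-th
# diagonal element of SigmaX_inv over the last K draws).
update_Lambda <- function(SigmaXinv_draws) {
  p <- ncol(SigmaXinv_draws[[1]])
  K <- length(SigmaXinv_draws)
  diag_sums <- Reduce(`+`, lapply(SigmaXinv_draws, diag))  # elementwise sums of diagonals
  diag(3 * p * K / diag_sums, nrow = p)
}
```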
Remark 2
Adaptively estimating the diagonal inverse scale matrix parameter Λ modifies the variance components of X. Alternatively, one might apply an EWiG update to the degrees of freedom parameter, say d, which modifies the partial correlations of X. For example, when d = p+1, the induced prior on each partial correlation is uniform on [−1, 1] (Gelman and Hill, 2006). Larger values of d place more prior mass closer to zero. Allowing the data to specify d is a reasonable alternative; however, we encountered numerical difficulties in implementing this approach. The EWiG update cannot be expressed in closed form and must be estimated numerically. Additionally, the “complete-data log-likelihood” in the M-step is often flat, and a wide range of values for d will return nearly equivalent log-likelihoods.
3.3. Estimation under Predictive Loss
A fitted model may be summarized by measures of uncertainty, eg a posterior predictive interval for Ynew, as well as point predictions, using summary values β0* and β*. These are calculated with draws from the posterior distribution, {φ(t)}. Predictive intervals are given by empirical quantiles of {Ŷnew(t)}, where Ŷnew(t) = β0(t) + Xnew⊤β(t) + ε(t) and ε(t) ~ N{0, σ²(t)}. For point predictions, a summary value of β0 is given by its posterior mean, (1/T)Σt β0(t), where T is the number of stored draws. For β, we minimize posterior predictive loss of Ŷnew. Specifically we define the posterior predictive mean by βppm = arg minb E Xnew,β|Uobs (Xnew⊤β − Xnew⊤b)². This is in contrast to the posterior mean: βpm = arg minb Eβ|Uobs (β − b)⊤ (β − b). Estimates of these quantities are given by
(11)  β̂ppm = {Σt (ΣX(t) + μX(t)μX(t)⊤)}⁻¹ {Σt (ΣX(t) + μX(t)μX(t)⊤) β(t)},

(12)  β̂pm = (1/T) Σt β(t),

where the sums run over the T stored posterior draws {φ(t)}.
To summarize, different posterior summaries of β come from minimizing different loss functions; we have two estimates of β for each method and, as a consequence, two choices of point predictions for Ynew. In contrast, we have only one posterior predictive interval, that derived from the empirical quantiles of {Ŷnew(t)}.
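The sketch below illustrates how these summaries can be computed from stored draws (a toy helper in our own notation, not the paper's code); for simplicity its point prediction uses the posterior mean β̂pm of (12) rather than β̂ppm.

```r
# Posterior predictive draws and interval for one new observation.
# beta0_draws and sigma2_draws are length-T vectors, beta_draws is T x p,
# and Xnew is a length-p covariate vector.
predict_from_draws <- function(Xnew, beta0_draws, beta_draws, sigma2_draws,
                               level = 0.95) {
  y_draws <- beta0_draws + drop(beta_draws %*% Xnew) +
             rnorm(length(beta0_draws), 0, sqrt(sigma2_draws))
  alpha <- (1 - level) / 2
  list(point    = mean(beta0_draws) + sum(colMeans(beta_draws) * Xnew),
       interval = quantile(y_draws, c(alpha, 1 - alpha)))
}
```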
4. Simulation Study
We conducted a simulation study based upon the motivating data to evaluate these methods. The data were generated from the assumed model (3); violations of these modeling assumptions are considered later. We fixed nA = 50 and nB = 400. The diagonal and off-diagonal elements of ΣX were 1 and 0.15, respectively. The regression coefficients β were set to either a diffuse signal or a signal concentrated in a limited number of coefficients. Values of R² were either 0.1 or 0.4. Given β, ΣX, and R², σ² was determined by solving β⊤ΣXβ/(β⊤ΣXβ + σ²) = R². β0 was set to zero. This yielded four unique simulation settings: two choices each for β and R². The covariates xA and xB were sampled from Np{0p, ΣX}, and yA|xA and yB|xB were drawn for each combination of β and σ². We set ψ = 0 and ν = 1 and repeated each of the four settings over a range of values of τ in (0, 2), drawing wA|xA and wB|xB, the auxiliary data, based on the measurement error model in (3).
After a burn-in period of 2500, we stored 1000 posterior draws. We calculated β̂0, β̂ppm (11) and β̂pm (12). For VANILLA, HIERBETAS, EBBETAS, EBSIGMAX, and EBBOTH, we estimated the MSPE using β̂ppm on 1000 new observations: MŜPE(β̂ppm) = (1/1000) Σj (Ynew,j − β̂0 − Xnew,j⊤β̂ppm)². {Ynew,j, Xnew,j} are resampled from the same generating distribution for each simulation. As a comparison, we fit a ridge regression (RIDG) on subsample A only, choosing the tuning parameter with the GCV function. Figure 2 plots MŜPE(β̂ppm), averaged over 250 simulated datasets, against τ. Smaller values are better, and the smallest theoretical value is σ², which is also plotted for reference. We also estimated MSPE using β̂pm. Numerical values are given in Tables S1 and S2, which also contain results from additional parameter configurations. Finally, we computed prediction intervals for the new observations (Section 3.3). Although coverage is a frequentist criterion, it is still desirable for a Bayesian prediction interval to achieve nominal coverage; the average coverage rates of Ynew,j, nominally 95%, are given in Figure 3.
Fig. 2.
MSPE(β̂ppm) plotted against τ, the standard deviation of the ME model, for four simulation settings. For each method, β was estimated from 250 independent training datasets, and MSPE was estimated from 250 validation datasets of size 1000. The thick, solid bar (σ2) corresponds to predictions made using the true generating parameters. The three best-performing methods, HIERBETAS, EBBETAS and EBBOTH, are virtually indistinguishable.
Fig. 3.
Average coverage of prediction intervals plotted against τ, the standard deviation of the ME model, for four simulation settings. For each method, prediction intervals were created using draws of β from the converged Gibbs sampler, and coverage was averaged over 250 validation datasets of size 1000. Nominal coverage is 95%. The lines for the three methods that are closest to maintaining nominal coverage, HIERBETAS, EBBETAS and EBBOTH, are virtually indistinguishable.
From Figure 2, HIERBETAS, EBBETAS and EBBOTH give about equally good predictions and are consistently the best over all scenarios. EBSIGMAX, which corresponds to shrinkage of ΣX⁻¹ alone, predicts poorly, and VANILLA does only slightly better. RIDG does not beat the better-performing Bayesian methods. Even though the quality of the imputations for xB depends on the signal in the ME model, the resulting prediction error of HIERBETAS, EBBETAS, and EBBOTH varies little over the values of τ we evaluated.
Coverage Properties
HIERBETAS, EBBETAS, and EBBOTH maintain close-to-nominal prediction coverage (Figure 3). In contrast, larger values of τ drastically decrease the coverage of VANILLA and EBSIGMAX. Prediction intervals for RIDG are not automatic but may be calculated using the bootstrap. This is included in our primary data analysis.
Mean Squared Error
The results discussed above and reported in Figure 2 use β̂ppm, which minimizes predictive loss, and are evaluated by MSPE. If we instead use β̂pm, or evaluate the estimates by mean squared error of the coefficients, E(β̂ − β)⊤(β̂ − β), HIERBETAS, EBBETAS, and EBBOTH remain the preferred methods (results not given).
Computation Time
All Bayesian methods had approximately equal run-times, each requiring about 110 seconds per dataset under these simulation settings; run-times would increase with p, the dimension of β. While RIDG required only 1–2 seconds for each dataset, it does not give automatic prediction intervals, so a direct comparison of run-times here would be improper. In the data analysis (Section 5), we implement a bootstrap algorithm to construct prediction intervals, allowing for a fair comparison of computational time. Full computational details are in the supplemental article (Boonstra, Mukherjee and Taylor, 2013).
Violations to Modeling Assumptions
As we have noted, these likelihood-based approaches depend on the assumed model approximately matching the true generating model. We evaluated robustness by considering the following violations of the model assumptions: (i) the distribution of ε is skewed, shifted to maintain a zero mean: ε + 1 ~ G{1, 1}, (ii) the measurement error model is misspecified, W|X ~ Np{ψ1p + νX², τ²Ip}, where we use X² to denote the element-wise square, or (iii) X comes from a mixture of distributions: X|Z ~ Np{1[Z=2](3 × 1p) − 1[Z=3](3 × 1p), ΣX}, where 1[·] is the indicator function and Z is a latent class label.
The results of these modeling violations are given in Tables S3–S8. When ε is skewed (S3, S4), the rankings change little; the Bayesian ridge methods are equally preferred. The case is similar for the misspecified measurement error model (S5, S6). When X comes from a mixture of distributions, the results change depending on whether the signal in β is concentrated (S7) or diffuse (S8). In the former, EBBOTH is best by a large margin for larger values of τ, even over the other Bayesian ridge methods, HIERBETAS and EBBETAS. In this case, then, what is required is the joint, adaptive shrinkage of and β. This difference in performance is not observed when the signal is diffuse (S8), and the Bayesian ridge methods are all equally good.
A general conclusion of this study is that the shrinkage induced by a Bayesian ridge regression is adaptable to many scenarios and robust to modeling violations. The Gibbs sampler allows for the use of the additional information in subsample B despite xB being missing, and the ridge prior on β is effective at controlling variability, thereby increasing precision in predictions. Most important is that this holds even when the signals in the outcome model and the ME model are both very weak, a challenge commonly encountered in the analysis of genomic data.
5. Data Analysis
We now consider the motivating problem of efficiently using the auxiliary information in the data from Chen et al. (2011), containing 91 genes representing a broad spectrum of relevant biological functions, to build a predictive model for survival. Expression using Affymetrix is measured on 439 tumors, and qRT-PCR measurements are collected on a subset of 47 of these. Clinical covariates, age, gender and stage of cancer [I–III], are also available. Because qRT-PCR is the clinically applicable measurement for future observations, the goal is a qRT-PCR + clinical covariate model for predicting survival time after surgery. An independent cohort of 101 tumors with qRT-PCR measurements and clinical covariates is available for validation. After some necessary preprocessing of the data, as described in the supplemental article (Boonstra, Mukherjee and Taylor, 2013), the available data had nA = 47 and nB = 389, and the validation sample was of size 100.
Because our methodology was developed for continuous outcomes, censoring necessitated some adjustments to the data in order to fit our models. We first imputed each censored log-survival time from a linear model of the clinical covariates, conditional upon the censoring time. This model was fit to the training data, but censored survival times in both the training and validation data were imputed. Given completed log-survival times, we re-fit this same model and calculated residuals from both the training and validation data. These residuals were considered as outcomes, and the question is whether any additional variation in the residuals is explained by gene expression. While there are other ways of dealing with coarsened data and additional covariates in the likelihood-based framework, processing the data this way allows for RIDG to serve as a reference. To more realistically model the data, we allow for a gene-specific ME model: wij = ψj + νjxij + τξij. To incorporate this modification into our model, we put independent flat priors on ψj and νj, j = 1, …, p. The modified Gibbs steps are included in the supplemental article (Boonstra, Mukherjee and Taylor, 2013).
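For intuition, one natural form of the added Gibbs step is sketched below; this is our own sketch under the stated flat priors, and the paper's exact modified steps are in the supplemental article. Under an independent flat prior, the full conditional of (ψj, νj), given the completed covariates, τ², and the jth column of w, is the usual Normal posterior from a simple linear regression of the jth surrogate on the jth covariate.

```r
# Per-gene draw of (psi_j, nu_j) under independent flat priors: w_j and x_j are
# the length-n vectors of surrogate and completed covariate values for gene j.
draw_psi_nu <- function(w_j, x_j, tau2) {
  Z <- cbind(1, x_j)
  ZtZ_inv <- solve(crossprod(Z))
  m <- ZtZ_inv %*% crossprod(Z, w_j)
  drop(m + t(chol(tau2 * ZtZ_inv)) %*% rnorm(2))   # returns c(psi_j, nu_j)
}
```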
We applied each Bayesian approach, running each chain of the Gibbs sampler for 4000 iterations and storing posterior draws from the subsequent 4000 iterations. Table 3 presents numerical results: the estimated MSPE from predicting the uncensored residuals in the validation data and the average prediction coverage of these residuals. Additionally, Table 3 presents the Scaled Integrated Brier Score (SIBS, Graf et al., 1999), which is a scoring method for right-censored data, on the original, un-adjusted validation data.
Table 3.
Results from lung adenocarcinoma analysis. MŜPE is the empirical prediction error in the validation data, SIBS is the Scaled Integrated Brier Score, Avg. Coverage is average coverage of the prediction intervals, Avg. Length gives the average prediction interval length for the validation sample, and Computation gives the time, in seconds, to calculate coefficient estimates and prediction intervals.
 | RIDG | VANILLA | HIERBETAS | EBBETAS | EBSIGMAX | EBBOTH
---|---|---|---|---|---|---
MŜPE (β̂ppm) | 0.620 | 1.251 | 0.555 | 0.555 | 1.230 | 0.561
MŜPE (β̂pm) | – | 1.768 | 0.559 | 0.558 | 1.932 | 0.560
SIBS (β̂ppm) | 0.544 | 0.629 | 0.394 | 0.393 | 0.632 | 0.396
SIBS (β̂pm) | – | 0.796 | 0.395 | 0.395 | 0.848 | 0.395
Avg. Coverage | 0.92 | 0.88 | 0.96 | 0.97 | 0.87 | 0.96
Avg. Length | 3.37 | 3.98 | 3.11 | 3.11 | 3.93 | 3.09
Computation (sec) | 298 | 268 | 269 | 268 | 269 | 269
To calculate the SIBS, which is a function of predicted survival probabilities, we used the survival function from the Normal distribution, estimating the mean log-survival time by adding the linear predictor of the genomic data to the linear predictor of the clinical covariates. At each unique time of last follow-up (either time of death or censoring), the squared difference between each individual's predicted survival probability and his or her current dead/alive status was calculated, averaged over all individuals, and then integrated over all time points, with censored individuals contributing to the score only until their censoring time. This quantity was scaled by a reference score, that from plugging in 0.5 as a predicted survival probability everywhere, to get the SIBS. Thus, any model that does better than random guessing has a SIBS in the interval (0,1), and a smaller SIBS is better.
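As a rough illustration of the computation just described, the sketch below implements this simplified scaled score. It reflects our reading of the description above, uses an average over an evaluation grid as a crude stand-in for the time integral, and is not the inverse-probability-of-censoring-weighted Brier score of Graf et al. (1999); all names are ours.

```r
# surv_prob: n x K matrix of predicted survival probabilities at the K times in
# eval_times; time/status: follow-up times and death indicators (1 = death).
# Assumes each evaluation time has at least one contributing patient.
sibs <- function(surv_prob, eval_times, time, status) {
  brier_at <- function(probs_k, t_k) {
    contributes <- (time > t_k) | (status == 1)      # censored patients drop out after t_k
    alive <- as.numeric(time > t_k)
    mean((probs_k[contributes] - alive[contributes])^2)
  }
  bs <- function(P) mean(sapply(seq_along(eval_times),
                                function(k) brier_at(P[, k], eval_times[k])))
  # reference score: a predicted survival probability of 0.5 everywhere
  bs(surv_prob) / bs(matrix(0.5, nrow(surv_prob), ncol(surv_prob)))
}
```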
Based upon MSPE, HIERBETAS, EBBETAS, and EBBOTH were about equally good, with MSPEs of 0.555, 0.555, and 0.561, respectively, using β̂ppm. These MSPEs are smaller than those from RIDG (0.620) as well as VANILLA (1.251), and EBSIGMAX (1.230). Using β̂pm, the estimated posterior mean of β, the three best methods gave almost identical results, while VANILLA and EBSIGMAX had worse prediction error. Similarly, HIERBETAS, EBBETAS, and EBBOTH had the smallest SIBS (respectively 0.394, 0.393, and 0.396), and the remaining methods had larger SIBS.
Considering coverage of the prediction intervals, HIERBETAS (0.96), EBBETAS (0.97) and EBBOTH (0.96) all had rates close to their nominal values, and their prediction intervals widths are smallest. This contrasts with VANILLA and EBSIGMAX, whose coverage rates are less than nominal (0.88, 0.87). We created prediction intervals for RIDG using a bootstrap algorithm; the resulting coverage is 0.92. The required computational time is 298 seconds for RIDG, including the bootstrap algorithm to calculate prediction intervals, and about 268–269 seconds for each Bayesian method. Although p, nA, and nB were about the same as in the simulation study, fitting the methods took longer (268 vs. 110 seconds) because the number of total MCMC iterations increased (8000 vs. 3500).
To summarize the analysis thus far, a Bayesian ridge regression, which uses all observations in the data, offers better overall predictive performance in our validation data and, compared to a ridge regression on the complete observations alone, narrower prediction intervals that still achieve nominal coverage. This is a reflection of the extra information that is available in the incomplete observations. Beyond the question of how to use the auxiliary genomic information in a prediction model, which has already been covered in detail, more fundamental to the application is whether one of the Bayesian ridge regressions, for example EBBETAS, can do better than an analysis using clinical covariates alone, for which complete information is available on all observations. The natural comparison would be an accelerated failure time (AFT) regression, modeling censored log-survival time as a linear function of the clinical covariates and Gaussian noise. Predictions from this AFT model could be directly compared to the outcome model in (1).
The SIBS from fitting the AFT model is 0.394, nearly equal to that of EBBETAS. Exploring this comparison further, Figure 4 gives risk-indexed Kaplan-Meier plots of the validation data, comparing predictions using EBBETAS (calculated by adding the genomic linear predictor to the clinical covariate linear predictor described at the beginning of this section) to those of the AFT model. For each model, patients in the validation sample were indexed based on their predicted survival time: less than 30 months, between 30 and 60 months, or longer than 60 months. From the figure, the clearest distinction is in the low-risk group, those predicted to live longer than 60 months. In the low-risk, “>60 month” group as defined by EBBETAS, 25 out of 31 patients, or about 80%, were alive at 60 months’ time. This contrasts with the AFT model: 56 patients were predicted to live beyond 60 months, and 36, or about 64%, were alive at 60 months’ time. Also distinctive is that the survival curves for the medium- and high-risk groups of the AFT model cross several times and generally show less separation compared to EBBETAS. The estimated median survival times for these two groups are: 28.6 (high) and 47.5 (med.) months under the EBBETAS-based grouping versus 32.3 (high) and 31.1 (med.) under the AFT grouping. Thus, despite nearly equal values of the SIBS, which are aggregate measures of predictive performance, EBBETAS appears to have better individual predictions and discrimination between the three groups.
Fig. 4.
Comparison of risk-indexed Kaplan-Meier plots. For both EBBETAS and an accelerated failure time model using only the clinical covariates, the validation data was grouped based on predicted survival time (less than 30 months, between 30 and 60 months, and longer than 60 months).
6. Discussion
Driven by a need to incorporate genomic information into prediction models, we have considered the problem of shrinkage in a model with many covariates when a large proportion of the data are missing. Predictions for future observations are of primary interest. We discuss the primary contributions of this paper in two parts as follows.
6.1. Shrinkage via the Gibbs Sampler
A likelihood-based approach confers a number of advantages, these being the inclusion of shrinkage into the likelihood and the proper accounting of uncertainty in predictions coming from the unobserved data. A number of existing Bayesian approaches for the treatment of missing data and/or implementation of shrinkage methods are easily adapted here. We have shown how two such approaches, the Monte Carlo EM (Wei and Tanner, 1990), a Gibbs sampler which multiply imputes missing data, and the Empirical Bayes Gibbs Sampler (Casella, 2001), a Gibbs sampler which adaptively shrinks parameter estimates, generalize to the same algorithm, which we call EM-within-Gibbs.
We proposed specific choices of prior specification aimed at improving prediction with shrinkage methods. The various flavors of the Bayesian ridge, denoted as HIERBETAS, EBBETAS and EBBOTH, stand out as the methods of choice, indicating that shrinkage of β, which is the vector of regression coefficients in the outcome model, is most crucial, over and above no shrinkage at all (VANILLA) or shrinkage of ΣX⁻¹ alone (EBSIGMAX). Our simulation study and data analysis showed the Bayesian ridge to be best under a number of scenarios using several criteria, including MSPE and prediction coverage, and robust to several modeling violations. In addition, the Bayesian ridge does not require p ≤ nA + nB, in contrast to VANILLA or EBSIGMAX. As for the specific choice of which Bayesian ridge regression is best, we found little evidence to recommend any one variant.
That shrinkage of ΣX⁻¹ alone, as we have implemented it, does not improve predictions (and sometimes actually worsens predictions) may be due to the specific nature of the shrinkage we implemented. The mean of the conditional distribution of ΣX⁻¹ given in (10) is (nA + nB + 3p){Σi (xi − μX)(xi − μX)⊤ + Λ}⁻¹; its inverse is a convex combination of Λ/(3p), which is the inverse of its prior mean, and the sample variance of xA and xB. In contrast, ridge regression may be viewed as simply adding λIp to the sample variance of the covariates. The Wishart prior cannot mimic this effect, and the construction of a different, non-conjugate prior for ΣX⁻¹ may be required to induce ridge-type shrinkage.
6.2. Using Genomic Information in Prediction Models
Figure 5 plots coefficient estimates and 95% credible intervals for the 91 genes according to EBBETAS. They are ordered by the ratio of their posterior mean to posterior standard deviation, an estimate of statistical significance. The ten most significant genes are annotated, according to the R package annotate (Gentleman, 2012). Even the most significant gene, ERBB3, is not significant at the 0.05 level. Although these are pre-selected genes that were deliberately chosen to represent a wide spectrum of biological functions, many of which have already been implicated in different cancers, this lack of significance for individual genes is not unexpected. The genomic effect is likely to be at the pathway-level rather than individual expressions, which a plot like Figure 5 is too coarse to detect. Despite this lack of individual significance, the small genomic effects collectively yield an overall improvement, albeit small, in predictive ability when the information is properly incorporated, and the Bayesian ridge regression appears best-equipped to do so.
Fig. 5.
Coefficient estimates (X) and 95% credible intervals (- -) of the 91 genes according to EBBETAS, ordered from top to bottom by the magnitude of the ratio of posterior predictive mean to posterior standard deviation. The top ten genes are highlighted and annotated.
Supplementary Material
Supplement to “Bayesian shrinkage methods for partially observed data with many predictors” (.pdf). Here we give the full derivation of the Gibbs steps, computational details, and the results from the simulation study. The data from Section 5 and the code for its analysis are available at http://www-personal.umich.edu/~philb.

Footnotes

We thank the area Editor, an Associate Editor and two reviewers for comments that greatly improved the manuscript. This work was supported by the National Science Foundation [DMS1007494]; and the National Institutes of Health [CA129102, CA156608, ES020811].
Contributor Information
Philip S. Boonstra, Email: philb@umich.edu.
Bhramar Mukherjee, Email: bhramar@umich.edu.
Jeremy MG Taylor, Email: jmgt@umich.edu.
References
- Boonstra PS, Mukherjee B, Taylor JMG (2013). Supplement to “Bayesian shrinkage methods for partially observed data with many predictors.”
- Boonstra PS, Taylor JMG, Mukherjee B (2013). Incorporating auxiliary information for improved prediction in high-dimensional datasets: an ensemble of shrinkage approaches. Biostatistics 14 259–272. doi: 10.1093/biostatistics/kxs036.
- Casella G (2001). Empirical Bayes Gibbs sampling. Biostatistics 2 485–500. doi: 10.1093/biostatistics/2.4.485.
- Chen G, Kim S, Taylor JMG, Wang Z, Lee O, Ramnath N, Reddy RM, Lin J, Chang AC, Orringer MB, Beer DG (2011). Development and validation of a qRT-PCR classifier for lung cancer prognosis. Journal of Thoracic Oncology 6 1481–1487. doi: 10.1097/JTO.0b013e31822918bd.
- Craven P, Wahba G (1979). Smoothing noisy data with spline functions. Numerische Mathematik 31 377–403.
- Dempster AP, Laird NM, Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B 39 1–38.
- Frank IE, Friedman JH (1993). A statistical view of some chemometrics regression tools. Technometrics 35 109–135.
- Gelfand AE, Smith AFM (1990). Sampling-based approaches to calculating marginal densities. Journal of the American Statistical Association 85 398–409.
- Gelman A, Hill J (2006). Data Analysis Using Regression and Multilevel/Hierarchical Models. Cambridge University Press, New York.
- Geman S, Geman D (1984). Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence PAMI-6 721–741.
- Gentleman R (2012). annotate: Annotation for microarrays. R package version 1.36.0.
- Graf E, Schmoor C, Sauerbrei W, Schumacher M (1999). Assessment and comparison of prognostic classification schemes for survival data. Statistics in Medicine 18 2529–2545.
- Green PJ (1990). On use of the EM algorithm for penalized likelihood estimation. Journal of the Royal Statistical Society: Series B 52 443–452.
- Hoerl AE, Kennard RW (1970). Ridge regression: biased estimation for nonorthogonal problems. Technometrics 12 55–67.
- Little RJA, Rubin DB (2002). Statistical Analysis with Missing Data, 2nd ed. John Wiley & Sons, Hoboken, NJ.
- Park T, Casella G (2008). The Bayesian Lasso. Journal of the American Statistical Association 103 681–686.
- Tanner MA, Wong WH (1987). The calculation of posterior distributions by data augmentation. Journal of the American Statistical Association 82 528–540.
- Wei GCG, Tanner MA (1990). A Monte Carlo implementation of the EM algorithm and the Poor Man’s Data Augmentation algorithms. Journal of the American Statistical Association 85 699–704.
- Witten DM, Tibshirani R (2009). Covariance-regularized regression and classification for high dimensional problems. Journal of the Royal Statistical Society: Series B 71 615–636.
- Yi N, Xu S (2008). Bayesian Lasso for quantitative trait loci mapping. Genetics 179 1045–1055.