Author manuscript; available in PMC: 2024 Apr 17.
Published in final edited form as: Biometrics. 2022 Mar 22;79(2):1472–1484. doi: 10.1111/biom.13629

Leveraging a surrogate outcome to improve inference on a partially missing target outcome

Zachary R McCaw 1,*, Sheila M Gaynor 1, Ryan Sun 2, Xihong Lin 1,3,**
PMCID: PMC11023615  NIHMSID: NIHMS1976689  PMID: 35218565

SUMMARY:

Sample sizes vary substantially across tissues in the Genotype-Tissue Expression (GTEx) project, where considerably fewer samples are available from certain inaccessible tissues, such as the substantia nigra (SSN), than from accessible tissues, such as blood. This severely limits power for identifying tissue-specific expression quantitative trait loci (eQTL) in undersampled tissues. Here we propose Surrogate Phenotype Regression Analysis (Spray) for leveraging information from a correlated surrogate outcome (e.g. expression in blood) to improve inference on a partially missing target outcome (e.g. expression in SSN). Rather than regarding the surrogate outcome as a proxy for the target outcome, Spray jointly models the target and surrogate outcomes within a bivariate regression framework. Unobserved values of either outcome are treated as missing data. We describe and implement an expectation conditional maximization algorithm for performing estimation in the presence of bilateral outcome missingness. Spray estimates the same association parameter estimated by standard eQTL mapping and controls the type I error even when the target and surrogate outcomes are truly uncorrelated. We demonstrate analytically and empirically, using simulations and GTEx data, that in comparison with marginally modeling the target outcome, jointly modeling the target and surrogate outcomes increases estimation precision and improves power.

Keywords: EM Algorithm, Genetic Association Analysis, Missing Data, Multivariate Analysis, Surrogate Outcomes

1. Introduction

Tissue-specific expression quantitative trait loci (eQTL) are of substantial biological interest as mechanisms for explaining how the genetic variants identified in genome-wide association studies (GWAS) influence complex traits and diseases (Gamazon et al., 2015; Gusev et al., 2016; Hormozdiari et al., 2016; Zhu et al., 2016; Visscher et al., 2017). Traditional eQTL studies have focused on accessible tissues such as blood (McKenzie et al., 2014; Westra et al., 2013), while eQTL discovery in inaccessible tissues, such as the substantia nigra (SSN), has been impeded by insufficient sample sizes. Cross-tissue studies, including the Genotype-Tissue Expression Project (GTEx), have demonstrated that the effect sizes of eQTL are heterogeneous across tissues (Consortium, 2017). Consequently, studying only accessible tissues is insufficient to understand the genetic basis of gene regulation. Larger sample sizes are needed to provide sufficient power for reliable eQTL detection in inaccessible tissues, and there is great interest in borrowing information from accessible tissues to increase the effective sample sizes of inaccessible tissues.

Our work was motivated by the goal of improving power for eQTL mapping in the SSN, a region of the midbrain implicated in the development of Parkinson’s disease (Poewe et al., 2017). Due to the scarcity of expression data, no previous studies have focused on eQTL mapping in this region. At the time of our analysis, only 80 genotyped subjects with expression data in SSN were available from GTEx, in contrast to 369 with expression in whole blood. Among subjects with expression in blood, nearly 90% were missing expression in SSN. The methodology developed here leverages gene expression from a correlated surrogate tissue, such as blood, to improve power for identifying eQTL in the target tissue, SSN.

Several methods have been developed to address the related problem of multi-tissue eQTL mapping. Flutre et al. (2013) developed eQtlBma, a fixed-effects, heteroscedastic ANOVA model that jointly models gene expression in multiple tissues. Evidence against the global null hypothesis, that a SNP has no effect on gene expression in any tissue, is quantified using a Bayes factor averaged across potential non-null configurations. Sul et al. (2013) proposed Meta-Tissue, which jointly estimates the effect of a SNP on gene expression in multiple tissues using a mixed-effects model, then combines effect size estimates across tissues via meta-analysis. Li et al. (2018) developed MT-eQTL and its extension HT-eQTL, which model the vector of Fisher-transformed genotype-expression correlations across tissues. They propose a generative hierarchical model for the multivariate correlation vector and an empirical Bayes procedure for identifying multi-tissue eQTL based on the local false discovery rate.

Our approach differs from existing methods in two key respects. First, we are interested in identifying target-tissue eQTL not multi-tissue eQTL. That is, our null hypothesis is that a SNP has no effect on gene expression in the target tissue, not that a SNP has no effect on gene expression in any tissue. Moreover, we focus on the setting where the target tissue is subject to missing data, and empower eQTL analysis of the target tissue by leveraging data from the surrogate. Second, we are interested in frequentist rather than Bayesian inference, and specifically in asymptotic inference, which does not depend on computationally-intensive permutation procedures that are intractable at genome-scale.

In this paper, we propose improving power for eQTL mapping in an inaccessible tissue (e.g. SSN), for which expression is partially missing, by augmenting the sample with expression data from an accessible surrogate tissue, for which the sample size is substantially larger. Specifically, we propose jointly modeling expression in the target and surrogate tissues while regarding unobserved measurements in either tissue as missing data. We refer to this approach as Surrogate Phenotype Regression Analysis (Spray). Spray leverages the correlation in expression levels across tissues to increase the effective sample size, but maintains eQTL in the target tissue as the focus of inference. We note that Spray is unrelated to Surrogate Variable Analysis (Leek and Storey, 2007; Lee et al., 2017), a method developed to identify latent factors of variation present in microarray data.

For estimation, we implement a computationally efficient Expectation Conditional Maximization Either (ECME) algorithm (Meng and Rubin, 1993; Liu and Rubin, 1994), which is adapted to fitting the association model in the presence of bilateral outcome missingness. The algorithm iterates between conditional maximization of the observed data log likelihood with respect to the regression parameters and conditional maximization of the EM objective function with respect to the covariance parameters. In addition, we derive the covariance estimators of all model parameters and implement a flexible Wald test for evaluating hypotheses about the target regression parameters.

We show analytically that the asymptotic relative efficiency of jointly modeling the target and surrogate outcomes, compared with marginally modeling the target outcome only, increases with the target missingness and the square of the target-surrogate correlation. We numerically demonstrate the analytical results through extensive simulations evaluating the empirical efficiency of the Spray Wald test.

Compared to complete case analysis, maximum likelihood estimation as implemented by Spray is efficient, making full use of the available data, and provides more precise estimates of the target regression parameters. All estimation and inference procedures described in this article have been implemented in an easy-to-use R package (SurrogateRegression), which is available on CRAN (McCaw, 2020).

Using data from GTEx, we applied Spray to identify eQTL in the SSN, considering expression in blood, skeletal muscle, and the cerebellum as candidate surrogate outcomes. Compared with marginal eQTL mapping using expression in SSN only, Spray identified 4 to 5 times as many Bonferroni-significant eQTL, including all those identified by marginal analysis. Importantly, while the effect sizes estimated by Spray were nearly identical to those obtained via traditional, marginal eQTL mapping ($R^2 \geq 0.995$), the sampling variance of the estimates was reduced by up to 26%, on average, indicating that Spray increased power primarily by drawing on the correlated surrogate outcome to improve precision. Moreover, the effect sizes estimated by Spray are robust to the choice of surrogate outcome.

The remainder of this paper is organized as follows: Section 2 introduces the setting and model. Sections 3 and 4 detail the estimation and inference procedures. Section 5 addresses the estimand of Spray, and the asymptotic relative efficiency of jointly versus marginally modeling the target outcome. Section 6 presents the results of simulation studies, and Section 7 the application to the GTEx data. We conclude with discussions in Section 8.

2. Model and Setting

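For each of $i = 1, \dots, n$ independent subjects, suppose that two continuous outcomes are potentially observed: the target outcome $T_i$ and the surrogate outcome $S_i$. Consider the model:

$$
\begin{pmatrix} T_i \\ S_i \end{pmatrix} \Bigg|\, (x_i, z_i) = \begin{pmatrix} x_i'\beta \\ z_i'\alpha \end{pmatrix} + \begin{pmatrix} \epsilon_{T,i} \\ \epsilon_{S,i} \end{pmatrix}, \tag{1}
$$

where $x_i$ is a $p \times 1$ vector of covariates for the target outcome, with regression coefficients $\beta$; $z_i$ is a $q \times 1$ vector of covariates for the surrogate outcome, with regression coefficients $\alpha$; and $\epsilon_i = (\epsilon_{T,i}, \epsilon_{S,i})' \sim N(0, \Sigma)$, with $\Sigma = \begin{pmatrix} \Sigma_{TT} & \Sigma_{TS} \\ \Sigma_{ST} & \Sigma_{SS} \end{pmatrix}$. Let $y_i = \mathrm{vec}(T_i, S_i) \in \mathbb{R}^2$ denote the $2 \times 1$ outcome vector, $\mathcal{X}_i = \mathrm{diag}(x_i', z_i')$ the $2 \times (p+q)$ subject-specific design matrix, and $\gamma = \mathrm{vec}(\beta, \alpha)$ the $(p+q) \times 1$ overall regression coefficient. With this notation, model (1) is succinctly expressible as: $y_i \mid \mathcal{X}_i \sim N(\mathcal{X}_i\gamma, \Sigma)$.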

Our derivations proceed under the assumption of residual normality. However, because in many applications, including eQTL mapping, the target and surrogate outcomes may be non-normal, we apply the rank-based inverse normal transformation (INT) to each outcome prior to analysis (McCaw et al., 2020). Application of INT, which ensures that the marginal distribution of each outcome is univariate normal, is common in eQTL studies, including all published analyses from GTEx (Consortium, 2017). While marginal normality of each outcome does not guarantee bivariate normality, our simulation studies demonstrate that this strategy provides unbiased estimation and valid inference even under residual distributions that are far from bivariate normal.
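As a rough illustration of the transformation described above, the following Python sketch applies a rank-based INT to a single outcome. The function name `rank_int`, the offset `k = 0.375` (the Blom offset, a common default), and the averaging of tied ranks are assumptions made for this example rather than details stated in the text.

```python
from statistics import NormalDist


def rank_int(values, k=0.375):
    """Rank-based inverse normal transform: Phi^{-1}((rank - k) / (n - 2k + 1)).

    The offset k = 0.375 is an assumed default; ties receive averaged ranks.
    """
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        # Extend the block [start, end] over values tied with values[order[start]].
        end = start
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        average_rank = (start + end) / 2 + 1  # mean of the 1-based ranks in the block
        for j in range(start, end + 1):
            ranks[order[j]] = average_rank
        start = end + 1
    std_normal = NormalDist()
    return [std_normal.inv_cdf((r - k) / (n - 2 * k + 1)) for r in ranks]


# A strongly right-skewed toy outcome.
skewed = [0.1, 2.5, 0.7, 40.0, 1.3]
transformed = rank_int(skewed)
```

The transform depends on the data only through their ranks, so the extreme value 40.0 is mapped to the largest normal score without dominating the scale, and the sample median is mapped to zero.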

Unbiased estimation of model parameters requires that the target and surrogate outcomes are missing at random (MAR). For the $i$th subject, define the target $R_{T,i}$ and surrogate $R_{S,i}$ response indicators:

$$
R_{T,i} = \begin{cases} 1, & T_i \text{ is observed}, \\ 0, & T_i \text{ is missing}, \end{cases} \qquad R_{S,i} = \begin{cases} 1, & S_i \text{ is observed}, \\ 0, & S_i \text{ is missing}. \end{cases}
$$

These indicators partition the $n$ subjects into 3 missingness patterns: complete cases ($R_{T,i} = 1$ and $R_{S,i} = 1$); subjects with target missingness ($R_{T,i} = 0$ and $R_{S,i} = 1$); and subjects with surrogate missingness ($R_{T,i} = 1$ and $R_{S,i} = 0$). Subjects with neither outcome observed ($R_{T,i} = 0$ and $R_{S,i} = 0$) make no likelihood contribution and are not considered further. Supposing $n_0$ complete cases, $n_1$ subjects with target missingness, and $n_2$ subjects with surrogate missingness, the total sample size is $n = n_0 + n_1 + n_2$.
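The partition into missingness patterns can be sketched in a few lines of Python. Representing a missing outcome by `None`, and the names `missingness_pattern` and `pattern_counts`, are choices made for this illustration only.

```python
from collections import Counter


def missingness_pattern(target, surrogate):
    """Label a subject by which of the two outcomes is observed (None = missing)."""
    if target is not None and surrogate is not None:
        return "complete"
    if surrogate is not None:
        return "target_missing"
    if target is not None:
        return "surrogate_missing"
    return "excluded"  # neither outcome observed: no likelihood contribution


def pattern_counts(pairs):
    """Return n0, n1, n2 and the analysed sample size n = n0 + n1 + n2."""
    counts = Counter(missingness_pattern(t, s) for t, s in pairs)
    n0 = counts["complete"]
    n1 = counts["target_missing"]
    n2 = counts["surrogate_missing"]
    return {"n0": n0, "n1": n1, "n2": n2, "n": n0 + n1 + n2}


# Toy (target, surrogate) pairs; the last subject is dropped from the analysis.
toy = [(0.3, 1.1), (None, -0.4), (None, 0.9), (1.2, None), (None, None)]
counts = pattern_counts(toy)
```

In this toy example there is one complete case, two subjects with target missingness, and one with surrogate missingness, so four of the five subjects enter the likelihood.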

MAR requires that observation of the target outcome $R_{T,i}$ is unrelated to its value $T_i$, given the remaining data $(S_i, x_i, z_i)$, and likewise that $R_{S,i}$ is unrelated to $S_i$, given $(T_i, x_i, z_i)$. In our analysis of GTEx, the MAR assumption is plausible because donors were selected to be free of major diseases and the collection of tissue specimens was based on factors such as provision of consent and on the availability of sufficient tissue from the autopsy or surgical procedure (NCI, 2013; Consortium, 2013). Importantly, the decision to ascertain a tissue sample was not directly based on gene expression.

3. Estimation

3.1. Regression Parameters

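Define the response indicator matrix $R_i = \mathrm{diag}(R_{T,i}, R_{S,i})$, and note that $R_i$ is a projection matrix. The distribution of the observed data is expressible as:

$$
y_i \mid (R_i, \mathcal{X}_i) \sim N\left(R_i\mathcal{X}_i\gamma,\; R_i\Sigma R_i\right),
$$

and the observed data log likelihood is:

$$
\ell_{\text{obs}}(\gamma, \Sigma) \propto -\frac{1}{2}\sum_{i=1}^{n} \ln\det(R_i\Sigma R_i) - \frac{1}{2}\sum_{i=1}^{n} (R_i y_i - R_i\mathcal{X}_i\gamma)'(R_i\Sigma R_i)^{-1}(R_i y_i - R_i\mathcal{X}_i\gamma). \tag{2}
$$

The observed data score equation for the regression parameters $\gamma$ is:

$$
\mathcal{U}_{\gamma}(\gamma, \Sigma) \equiv \frac{\partial \ell_{\text{obs}}}{\partial \gamma} = \sum_{i=1}^{n} \mathcal{X}_i' R_i (R_i\Sigma R_i)^{-1}(R_i y_i - R_i\mathcal{X}_i\gamma).
$$

Conditional on $\Sigma$, the maximum likelihood estimator (MLE) of $\gamma$ is the generalized least squares (GLS) estimator:

$$
\hat{\gamma}(\Sigma) = \left\{\sum_{i=1}^{n} \mathcal{X}_i' R_i (R_i\Sigma R_i)^{-1} R_i\mathcal{X}_i\right\}^{-1} \sum_{i=1}^{n} \mathcal{X}_i' R_i (R_i\Sigma R_i)^{-1} y_i. \tag{3}
$$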

3.2. Covariance Matrix

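Let $\epsilon_i = y_i - \mathcal{X}_i\gamma$ denote the residual vector. The observed data score equation for $\Sigma$ is:

$$
\mathcal{U}_{\Sigma}(\gamma, \Sigma) \equiv \frac{\partial \ell_{\text{obs}}}{\partial \Sigma} = -\frac{1}{2}\sum_{i=1}^{n} R_i(R_i\Sigma R_i)^{-1}R_i + \frac{1}{2}\sum_{i=1}^{n} R_i(R_i\Sigma R_i)^{-1}\epsilon_i\epsilon_i'(R_i\Sigma R_i)^{-1}R_i.
$$

However, the score equation for $\Sigma$ does not admit a closed-form solution. To obtain the MLE, we apply the ECME algorithm (Meng and Rubin, 1993; Liu and Rubin, 1994). Define the $2 \times 2$ residual outer product matrix:

$$
V_i \equiv \epsilon_i\epsilon_i' = \begin{pmatrix} (T_i - x_i'\beta)^2 & (T_i - x_i'\beta)(S_i - z_i'\alpha) \\ (S_i - z_i'\alpha)(T_i - x_i'\beta) & (S_i - z_i'\alpha)^2 \end{pmatrix}.
$$

The complete data log likelihood is now expressible as:

$$
\ell(\gamma, \Sigma) \propto -\frac{n}{2}\ln\det(\Sigma) - \frac{1}{2}\mathrm{tr}\left(\Sigma^{-1}\sum_{i=1}^{n} V_i\right). \tag{4}
$$

The EM objective is the expectation of the complete data log likelihood in (4) given the observed data $\mathcal{D}_{\text{obs}}$ and the current parameter estimates $(\gamma^{(r)}, \Sigma^{(r)})$:

$$
Q(\gamma, \Sigma \mid \gamma^{(r)}, \Sigma^{(r)}) \equiv \mathbb{E}\left\{\ell(\gamma, \Sigma) \mid \mathcal{D}_{\text{obs}}; \gamma^{(r)}, \Sigma^{(r)}\right\}. \tag{5}
$$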

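To obtain an expression for (5), define the working outcome vector:

$$
\hat{y}_i^{(r)} \equiv \mathbb{E}\left(y_i \mid \mathcal{D}_{\text{obs}}; \gamma^{(r)}, \Sigma^{(r)}\right) = \begin{cases} (T_i, S_i)', & R_{T,i} = 1,\ R_{S,i} = 1, \\ (\hat{T}_i^{(r)}, S_i)', & R_{T,i} = 0,\ R_{S,i} = 1, \\ (T_i, \hat{S}_i^{(r)})', & R_{T,i} = 1,\ R_{S,i} = 0. \end{cases}
$$

For complete cases, the working outcome vector is identically the observed outcome vector. For subjects with target missingness, the unobserved value of $T_i$ is replaced by its conditional expectation given the surrogate outcome and covariates:

$$
\hat{T}_i^{(r)} \equiv \mathbb{E}\left(T_i \mid S_i, \mathcal{X}_i; \gamma^{(r)}, \Sigma^{(r)}\right) = x_i'\beta^{(r)} + \left(\Sigma_{TS}\Sigma_{SS}^{-1}\right)^{(r)}\left(S_i - z_i'\alpha^{(r)}\right).
$$

Note that we adopt the convention that $\Sigma_{SS}^{-1}$ refers to subsetting the $(S,S)$th element of $\Sigma$ then taking its inverse, as opposed to subsetting the $(S,S)$th element of $\Sigma^{-1}$. For subjects with surrogate missingness, the unobserved value of $S_i$ is replaced by its conditional expectation given the target outcome and covariates:

$$
\hat{S}_i^{(r)} \equiv \mathbb{E}\left(S_i \mid T_i, \mathcal{X}_i; \gamma^{(r)}, \Sigma^{(r)}\right) = z_i'\alpha^{(r)} + \left(\Sigma_{ST}\Sigma_{TT}^{-1}\right)^{(r)}\left(T_i - x_i'\beta^{(r)}\right).
$$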

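Let $\Lambda = \Sigma^{-1}$ denote the precision matrix. Define the working residual outer product:

$$
\hat{V}_i^{(r)} \equiv \mathbb{E}\left(V_i \mid \mathcal{D}_{\text{obs}}; \gamma^{(r)}, \Sigma^{(r)}\right) = \left(\hat{y}_i^{(r)} - \mathcal{X}_i\gamma\right)\left(\hat{y}_i^{(r)} - \mathcal{X}_i\gamma\right)' + \begin{cases} \mathrm{diag}(0, 0), & R_{T,i} = 1,\ R_{S,i} = 1, \\ \mathrm{diag}\left(\Lambda_{TT}^{-1,(r)}, 0\right), & R_{T,i} = 0,\ R_{S,i} = 1, \\ \mathrm{diag}\left(0, \Lambda_{SS}^{-1,(r)}\right), & R_{T,i} = 1,\ R_{S,i} = 0. \end{cases}
$$

Expressed in terms of the working residual outer product, the EM objective function is:

$$
Q(\gamma, \Sigma \mid \gamma^{(r)}, \Sigma^{(r)}) = -\frac{n}{2}\ln\det(\Sigma) - \frac{1}{2}\mathrm{tr}\left(\Sigma^{-1}\sum_{i=1}^{n}\hat{V}_i^{(r)}\right).
$$

The EM score equation for $\Sigma$ is:

$$
\mathcal{U}_{\Sigma}(\gamma, \Sigma \mid \gamma^{(r)}, \Sigma^{(r)}) \equiv \frac{\partial Q}{\partial \Sigma} = -\frac{n}{2}\Sigma^{-1} + \frac{1}{2}\Sigma^{-1}\left(\sum_{i=1}^{n}\hat{V}_i^{(r)}\right)\Sigma^{-1}.
$$

Conditional on $\gamma$, the EM update for $\Sigma$ is:

$$
\hat{\Sigma}^{(r)}(\gamma) = \frac{1}{n}\sum_{i=1}^{n}\hat{V}_i\left(\gamma \mid \gamma^{(r)}, \Sigma^{(r)}\right). \tag{6}
$$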

3.3. Optimization

Spray implements the following ECME algorithm, in which the regression parameters γ are updated via conditional maximization of the observed data log likelihood in (2), and the covariance matrix Σ is updated via conditional maximization of the EM objective in (5).

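Algorithm 1: ECME for Bivariate Normal Regression

Require: For each subject, observed response and covariate data $(R_i y_i, \mathcal{X}_i)$.
Require: Initial estimates of the regression $\gamma^{(0)}$ and covariance $\Sigma^{(0)}$ parameters.
1: repeat
2:   GLS step: Update $\gamma^{(r+1)} \leftarrow \hat{\gamma}(\Sigma^{(r)})$ via (3).
3:   ECM step: Update $\Sigma^{(r+1)} \leftarrow \hat{\Sigma}^{(r)}(\gamma^{(r+1)})$ via (6).
4:   Update observed data log likelihood $\ell_{\text{obs}}^{(r+1)} \leftarrow \ell_{\text{obs}}(\gamma^{(r+1)}, \Sigma^{(r+1)})$ via (2).
5: until $\ell_{\text{obs}}^{(r+1)} - \ell_{\text{obs}}^{(r)} < \epsilon$, where $\epsilon$ is the tolerance.
6: return Final estimates of the regression $\hat{\gamma}$ and covariance $\hat{\Sigma}$ parameters.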

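The accompanying R package initializes $\gamma$ via ordinary least squares using all observed data:

$$
\gamma^{(0)} = \left(\sum_{i=1}^{n}\mathcal{X}_i' R_i\mathcal{X}_i\right)^{-1}\sum_{i=1}^{n}\mathcal{X}_i' R_i y_i.
$$

Given $\gamma^{(0)}$, $\Sigma$ is initialized using the residual outer product of the $n_0$ complete cases:

$$
\Sigma^{(0)} = \frac{1}{n_0}\sum_{i=1}^{n_0}\left(y_i - \mathcal{X}_i\gamma^{(0)}\right)\left(y_i - \mathcal{X}_i\gamma^{(0)}\right)'.
$$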
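To make the iteration concrete, the following standard-library Python sketch fits a simplified special case in which each outcome depends on a single centered covariate (a genotype `g`) with no intercept, so that $\beta$ and $\alpha$ are scalars. It is not the SurrogateRegression implementation. All function names are hypothetical, and one detail is an assumption of this sketch: the working residuals in the ECM step are computed at the freshly updated regression coefficients together with the previous covariance, a choice that keeps the observed data log likelihood non-decreasing.

```python
import math
import random

def inv2(m):
    """Invert a 2x2 matrix stored as ((a, b), (c, d))."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return ((d / det, -b / det), (-c / det, a / det))

def gls_step(data, sigma):
    """Conditional maximisation over (beta, alpha) given sigma, as in (3)."""
    lam = inv2(sigma)
    a_tt = a_ts = a_ss = b_t = b_s = 0.0
    for g, t, s in data:
        if t is not None and s is not None:   # complete case
            a_tt += g * g * lam[0][0]
            a_ts += g * g * lam[0][1]
            a_ss += g * g * lam[1][1]
            b_t += g * (lam[0][0] * t + lam[0][1] * s)
            b_s += g * (lam[1][0] * t + lam[1][1] * s)
        elif s is not None:                   # target missing
            a_ss += g * g / sigma[1][1]
            b_s += g * s / sigma[1][1]
        else:                                 # surrogate missing
            a_tt += g * g / sigma[0][0]
            b_t += g * t / sigma[0][0]
    inv = inv2(((a_tt, a_ts), (a_ts, a_ss)))
    return (inv[0][0] * b_t + inv[0][1] * b_s, inv[1][0] * b_t + inv[1][1] * b_s)

def ecm_step(data, gamma, sigma):
    """Covariance update in the spirit of (6): mean working residual outer product."""
    beta, alpha = gamma
    lam = inv2(sigma)
    v_tt = v_ts = v_ss = 0.0
    for g, t, s in data:
        if t is None:                         # impute the target residual
            e_s = s - g * alpha
            e_t = sigma[0][1] / sigma[1][1] * e_s
            v_tt += 1.0 / lam[0][0]           # Var(T | S) = 1 / Lambda_TT
        elif s is None:                       # impute the surrogate residual
            e_t = t - g * beta
            e_s = sigma[1][0] / sigma[0][0] * e_t
            v_ss += 1.0 / lam[1][1]           # Var(S | T) = 1 / Lambda_SS
        else:
            e_t, e_s = t - g * beta, s - g * alpha
        v_tt += e_t * e_t
        v_ts += e_t * e_s
        v_ss += e_s * e_s
    n = len(data)
    return ((v_tt / n, v_ts / n), (v_ts / n, v_ss / n))

def obs_loglik(data, gamma, sigma):
    """Observed data log likelihood (2), up to an additive constant."""
    beta, alpha = gamma
    lam = inv2(sigma)
    det = sigma[0][0] * sigma[1][1] - sigma[0][1] * sigma[1][0]
    total = 0.0
    for g, t, s in data:
        if t is not None and s is not None:
            e_t, e_s = t - g * beta, s - g * alpha
            quad = lam[0][0] * e_t ** 2 + 2 * lam[0][1] * e_t * e_s + lam[1][1] * e_s ** 2
            total -= 0.5 * (math.log(det) + quad)
        elif s is not None:
            total -= 0.5 * (math.log(sigma[1][1]) + (s - g * alpha) ** 2 / sigma[1][1])
        else:
            total -= 0.5 * (math.log(sigma[0][0]) + (t - g * beta) ** 2 / sigma[0][0])
    return total

def fit_ecme(data, tol=1e-8, max_iter=500):
    """Fit the single-covariate model; rows are (g, T or None, S or None)."""
    data = [row for row in data if row[1] is not None or row[2] is not None]
    t_rows = [(g, t) for g, t, _ in data if t is not None]
    s_rows = [(g, s) for g, _, s in data if s is not None]
    gamma = (sum(g * t for g, t in t_rows) / sum(g * g for g, _ in t_rows),
             sum(g * s for g, s in s_rows) / sum(g * g for g, _ in s_rows))
    resid = [(t - g * gamma[0], s - g * gamma[1])
             for g, t, s in data if t is not None and s is not None]
    n0 = len(resid)                           # assumes at least a few complete cases
    s_ts = sum(a * b for a, b in resid) / n0
    sigma = ((sum(a * a for a, _ in resid) / n0, s_ts),
             (s_ts, sum(b * b for _, b in resid) / n0))
    trace = [obs_loglik(data, gamma, sigma)]
    for _ in range(max_iter):
        gamma = gls_step(data, sigma)         # GLS step
        sigma = ecm_step(data, gamma, sigma)  # ECM step
        trace.append(obs_loglik(data, gamma, sigma))
        if trace[-1] - trace[-2] < tol:
            break
    return gamma, sigma, trace

# Toy data: 200 complete cases plus 600 subjects with the target missing.
rng = random.Random(2022)
rows = []
for i in range(800):
    g, e_s, e_u = (rng.gauss(0, 1) for _ in range(3))
    t = 0.3 * g + 0.75 * e_s + math.sqrt(1 - 0.75 ** 2) * e_u
    rows.append((g, t if i < 200 else None, 0.1 * g + e_s))
gamma_hat, sigma_hat, loglik_trace = fit_ecme(rows)
```

With no missing data the GLS step reduces to ordinary least squares for each outcome, so the loop stops after a single pass. With missing targets, `loglik_trace` records the monotone ascent that the stopping rule in step 5 of Algorithm 1 relies on.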

4. Inference

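The ECME algorithm presented in the previous section does not provide the asymptotic information of the MLEs. The observed-data information matrices were obtained using the following identity:

$$
\mathbb{V}\left\{\mathbb{E}\left(\mathcal{U}_{\theta} \mid \mathcal{D}_{\text{obs}}\right)\right\} = \mathbb{V}\left(\mathcal{U}_{\theta}\right) - \mathbb{E}\left\{\mathbb{V}\left(\mathcal{U}_{\theta} \mid \mathcal{D}_{\text{obs}}\right)\right\},
$$

where $\mathcal{U}_{\theta}$ is the complete-data score, and $\mathcal{D}_{\text{obs}}$ is the observed data. The observed-data information for the regression parameters $\gamma$ decomposes as:

$$
\mathcal{I}_{\gamma\gamma} \equiv \begin{pmatrix} \mathcal{I}_{\beta\beta} & \mathcal{I}_{\beta\alpha} \\ \mathcal{I}_{\alpha\beta} & \mathcal{I}_{\alpha\alpha} \end{pmatrix} = \mathcal{I}_{\gamma\gamma,0} + \mathcal{I}_{\gamma\gamma,1} + \mathcal{I}_{\gamma\gamma,2}. \tag{7}
$$

$\mathcal{I}_{\gamma\gamma,0}$ is the contribution of complete cases and takes the form:

$$
\mathcal{I}_{\gamma\gamma,0} = \sum_{i_0=1}^{n_0}\begin{pmatrix} x_{i_0}\Lambda_{TT}x_{i_0}' & x_{i_0}\Lambda_{TS}z_{i_0}' \\ z_{i_0}\Lambda_{ST}x_{i_0}' & z_{i_0}\Lambda_{SS}z_{i_0}' \end{pmatrix}.
$$

$\mathcal{I}_{\gamma\gamma,1}$ is the contribution of subjects with target missingness and $\mathcal{I}_{\gamma\gamma,2}$ is the contribution of subjects with surrogate missingness; these take the following forms respectively:

$$
\mathcal{I}_{\gamma\gamma,1} = \sum_{i_1=1}^{n_1}\begin{pmatrix} 0 & 0 \\ 0 & z_{i_1}\Sigma_{SS}^{-1}z_{i_1}' \end{pmatrix}, \qquad \mathcal{I}_{\gamma\gamma,2} = \sum_{i_2=1}^{n_2}\begin{pmatrix} x_{i_2}\Sigma_{TT}^{-1}x_{i_2}' & 0 \\ 0 & 0 \end{pmatrix}.
$$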

Complete cases contribute to the information for all regression parameters. Subjects with target missingness contribute to the information for the surrogate regression parameters α only, while subjects with surrogate missingness contribute to the information for the target regression parameters β only.

The observed-data information matrix for the covariance parameters ($\Sigma_{TT}$, $\Sigma_{TS}$, $\Sigma_{SS}$) is presented in the supporting information, and follows a similar pattern of contributions. The cross information $\mathcal{I}_{\gamma\varsigma}$ between the regression $\gamma$ and covariance $\varsigma$ parameters is zero. Thus, the MLEs $\hat{\gamma}$ and $\hat{\Sigma}$ are asymptotically independent. For eQTL mapping, inference on the target regression parameter $\beta \subseteq \gamma$ is performed using the standard Wald test, the details of which are also presented in the supporting information. Standard errors for all model parameters are provided by the accompanying R package, allowing for inference on $\alpha$ and $\Sigma$ in addition to $\beta$.
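As an illustration of how the decomposition in (7) feeds a Wald test, the sketch below treats the special case of one scalar covariate per outcome (the same genotype for both), so that the information matrix is $2 \times 2$. The function name `wald_test_beta` and its arguments are hypothetical. The test shown is a generic one-degree-of-freedom Wald test built from the inverse information, which may differ in detail from the procedure in the supporting information.

```python
import math


def wald_test_beta(beta_hat, sigma, ss_complete, ss_target_missing=0.0,
                   ss_surrogate_missing=0.0):
    """Standard error, Wald statistic and p-value for a scalar beta.

    sigma is ((S_TT, S_TS), (S_ST, S_SS)); each ss_* argument is the sum of
    squared genotypes over the subjects in that missingness pattern.
    """
    (s_tt, s_ts), (_, s_ss) = sigma
    det = s_tt * s_ss - s_ts * s_ts
    lam_tt, lam_ts, lam_ss = s_ss / det, -s_ts / det, s_tt / det
    # Information blocks: complete cases contribute through Lambda, incomplete
    # cases through the marginal variance of the outcome that was observed.
    i_bb = ss_complete * lam_tt + ss_surrogate_missing / s_tt
    i_ba = ss_complete * lam_ts
    i_aa = ss_complete * lam_ss + ss_target_missing / s_ss
    var_beta = 1.0 / (i_bb - i_ba * i_ba / i_aa)    # (beta, beta) entry of the inverse
    wald = beta_hat ** 2 / var_beta
    p_value = math.erfc(math.sqrt(wald / 2.0))      # upper tail of chi-square, 1 df
    return math.sqrt(var_beta), wald, p_value


# Uncorrelated outcomes: target-missing subjects add nothing to the precision of beta.
se_null, wald_null, p_null = wald_test_beta(0.196, ((1.0, 0.0), (0.0, 1.0)), 100.0, 300.0)
# Correlated outcomes: the same 300 subjects now shrink the standard error.
se_corr, wald_corr, p_corr = wald_test_beta(0.196, ((1.0, 0.75), (0.75, 1.0)), 100.0, 300.0)
```

The two calls differ only in the target-surrogate covariance, which isolates the contribution of subjects with target missingness to the standard error of $\hat{\beta}$.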

5. Analytical Considerations

5.1. Marginal Interpretation of the Regression Parameter

The choice to jointly model the target and surrogate outcomes, rather than conditioning on the surrogate to predict the target, has important ramifications when interpreting the regression parameters estimated by Spray. For exposition, suppose (1) is the generative model, and consider the setting where the target and surrogate means each depend on genotype $g_i$ only:

$$
\begin{pmatrix} T_i \\ S_i \end{pmatrix} \Bigg|\, g_i \sim N\left\{\begin{pmatrix} g_i\beta_G \\ g_i\alpha_G \end{pmatrix}, \begin{pmatrix} \Sigma_{TT} & \Sigma_{TS} \\ \Sigma_{ST} & \Sigma_{SS} \end{pmatrix}\right\}. \tag{8}
$$

The implied marginal distribution of the target outcome is:

$$
T_i \mid g_i \sim N\left(g_i\beta_G, \Sigma_{TT}\right). \tag{9}
$$

Observe that the regression parameter for genotype $\beta_G$ from the joint model (8) is identical to that appearing in the marginal model (9). This equality is unchanged by the presence or absence of an association $\alpha_G$ between genotype $g_i$ and the surrogate outcome $S_i$. Importantly, as is confirmed by our simulation studies, this implies that inference on $\beta_G$ under the joint model (1) does not depend on the value of $\alpha_G$. The same is not true of a model that conditions on the surrogate outcome. In particular, when conditioning on the surrogate outcome, the target outcome is distributed as:

$$
T_i \mid (S_i, g_i) \sim N\left\{\left(\beta_G - \Sigma_{TS}\Sigma_{SS}^{-1}\alpha_G\right)g_i + \Sigma_{TS}\Sigma_{SS}^{-1}S_i,\; \Sigma_{TT} - \Sigma_{TS}\Sigma_{SS}^{-1}\Sigma_{ST}\right\}.
$$

Suppose that the target and surrogate outcomes are associated ($\Sigma_{TS} \neq 0$), which is a prerequisite for modeling the surrogate outcome to improve inference on $\beta_G$. Then, in a model that regresses $T_i$ on both $(S_i, g_i)$, the magnitude and direction of the regression coefficient for genotype, $\beta_G - \Sigma_{TS}\Sigma_{SS}^{-1}\alpha_G$, depend on whether and to what extent genotype is associated with the surrogate outcome (i.e. $\alpha_G$).
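The contrast between the two coefficients can be checked numerically. In the sketch below, the function name and the parameter values are illustrative assumptions, chosen so that the arithmetic is exact.

```python
def conditional_genotype_coefficient(beta_g, alpha_g, sigma_ts, sigma_ss):
    """Coefficient of genotype when T is regressed on both S and g."""
    return beta_g - (sigma_ts / sigma_ss) * alpha_g


# Genotype has no effect on the target (beta_G = 0) but does affect the surrogate.
null_target = conditional_genotype_coefficient(
    beta_g=0.0, alpha_g=0.5, sigma_ts=0.5, sigma_ss=1.0)

# With no surrogate association (alpha_G = 0) the two coefficients coincide.
no_surrogate_effect = conditional_genotype_coefficient(
    beta_g=0.25, alpha_g=0.0, sigma_ts=0.5, sigma_ss=1.0)
```

In the first case the conditional coefficient equals -0.25 even though $\beta_G = 0$, so a test of the genotype coefficient in the conditional model would not be a test of the marginal null hypothesis on $\beta_G$.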

5.2. Efficiency Analysis

Consider again the genotype only model in (8). Suppose initially that all subjects are complete cases, and that the genotypes have been scaled such that $\sum_{i_0=1}^{n_0} g_{i_0}^2 = n_0$. Under these assumptions, the efficient information for $\beta_G$ from (8) is:

$$
\mathcal{I}_{\beta_G\beta_G \mid \alpha_G} = n_0\left(\Lambda_{TT} - \Lambda_{TS}\Lambda_{SS}^{-1}\Lambda_{ST}\right) = n_0\Sigma_{TT}^{-1}.
$$

This is identical to the information for $\beta_G$ from the marginal model in (9). Thus, in the absence of missingness, inference on $\beta_G$ under the joint model (8) is asymptotically equivalent to inference on $\beta_G$ under the marginal model (9).

Now suppose there are $n_0$ complete cases and $n_1$ subjects with target missingness. For simplicity, assume no subjects have surrogate missingness, $n_2 = 0$. Genotypes have again been scaled, within outcome missingness groups, such that $\sum_{i_0=1}^{n_0} g_{i_0}^2 = n_0$ for complete cases and $\sum_{i_1=1}^{n_1} g_{i_1}^2 = n_1$ for subjects with target missingness. The efficient information from (8) becomes:

$$
\mathcal{I}_{\beta_G\beta_G \mid \alpha_G} = n_0\left(\Lambda_{TT} - \Lambda_{TS}\,\frac{n_0}{n_0\Lambda_{SS} + n_1\Sigma_{SS}^{-1}}\,\Lambda_{ST}\right),
$$

while the information for $\beta_G$ from the marginal model remains $\mathcal{I}_{\beta_G\beta_G} = n_0\Sigma_{TT}^{-1}$. The asymptotic relative efficiency (ARE) of inference under the joint model (8) versus inference under marginal model (9) is:

$$
\mathrm{ARE} = \frac{\mathcal{I}_{\beta_G\beta_G \mid \alpha_G}}{\mathcal{I}_{\beta_G\beta_G}} = \Sigma_{TT}\left(\Lambda_{TT} - \Lambda_{TS}\,\frac{n_0}{n_0\Lambda_{SS} + n_1\Sigma_{SS}^{-1}}\,\Lambda_{ST}\right). \tag{10}
$$

To better understand (10), suppose the covariance matrix in (8) is a correlation matrix, with $\Sigma_{TT} = \Sigma_{SS} = 1$, and correlation $\Sigma_{TS} = \rho \in (-1, 1)$. The ARE simplifies to:

$$
\mathrm{ARE} = \frac{1}{1 - \rho^2}\left\{1 - \frac{\rho^2}{1 - \rho^2}\cdot\frac{n_0}{n_0(1 - \rho^2)^{-1} + n_1}\right\} = \frac{1}{1 - \pi_T\rho^2},
$$

where $\pi_T = n_1/(n_0 + n_1)$ is the proportion of subjects with target missingness. Now, if the target and surrogate outcomes are uncorrelated ($\rho = 0$), or if there is no target missingness ($\pi_T = 0$), then the ARE is 1, and inference based on the marginal model is asymptotically equivalent to inference based on the joint model. For fixed target missingness $\pi_T$, the ARE increases monotonically in the squared target-surrogate correlation $\rho^2$. In the limit as $\rho \to 1$, the ARE is maximized at $(1 - \pi_T)^{-1} = 1 + n_1/n_0$. Likewise, for fixed target-surrogate correlation $\rho^2$, the ARE increases monotonically in the target missingness $\pi_T$. In the limit as $\pi_T \to 1$, which occurs when $n_1 \to \infty$, the ARE is maximized at $(1 - \rho^2)^{-1}$. Overall, the power gain attributable to jointly modeling the target and surrogate outcomes is expected to increase with the squared target-surrogate correlation $\rho^2$, and with the number of subjects with target missingness $n_1$. This demonstrates an interesting property of the surrogate model: by leveraging the target-surrogate correlation, inference on the target outcome can be improved by incorporating information from subjects whose target outcomes are missing.
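The two expressions for the ARE can be compared numerically. The following sketch evaluates the general form in (10) and the simplified correlation-matrix form; the function names and the sample sizes (1000 complete cases, 3000 subjects with target missingness) are assumptions chosen for illustration.

```python
def are_general(n0, n1, sigma_tt, sigma_ts, sigma_ss):
    """ARE from (10) for n0 complete cases and n1 subjects with target missingness."""
    det = sigma_tt * sigma_ss - sigma_ts ** 2
    lam_tt, lam_ts, lam_ss = sigma_ss / det, -sigma_ts / det, sigma_tt / det
    shrink = n0 / (n0 * lam_ss + n1 / sigma_ss)
    return sigma_tt * (lam_tt - lam_ts * shrink * lam_ts)


def are_correlation(rho, pi_t):
    """Simplified ARE when the covariance matrix is a correlation matrix."""
    return 1.0 / (1.0 - pi_t * rho ** 2)


# rho = 0.75 with 75% target missingness.
general = are_general(1000, 3000, 1.0, 0.75, 1.0)
simplified = are_correlation(0.75, 3000 / (1000 + 3000))
```

Both expressions evaluate to approximately 1.73 in this setting, and `are_correlation(0.0, 0.75)` returns exactly 1, reflecting that an uncorrelated surrogate yields no asymptotic gain.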

6. Simulation Studies

6.1. Brief Methods

The simulation methods are described in detail in the supporting information. Briefly, for each subject, the target $T_i$ and surrogate $S_i$ outcomes were simulated to depend on genotype $g_i$ and covariates $x_i$, including age, sex, and genetic PCs. The simulations considered both normally and non-normally distributed residuals $(\epsilon_{T,i}, \epsilon_{S,i})$. INT was always applied to $T_i$ and $S_i$ prior to analysis. The number of complete cases was fixed at $n_0 = 10^3$. The numbers of subjects with missing outcomes $(n_1, n_2)$ were varied to change the proportions $(\pi_T, \pi_S)$ of subjects with target and surrogate missingness. Seven (target, surrogate) missingness patterns $(\pi_T, \pi_S)$ were considered: no missingness $(0.00, 0.00)$; unilateral target missingness $(0.25, 0.00)$, $(0.50, 0.00)$, $(0.75, 0.00)$; and bilateral outcome missingness $(0.25, 0.25)$, $(0.50, 0.25)$, $(0.25, 0.50)$. For each missingness pattern, the target-surrogate correlation $\rho$ spanned $\{0.00, 0.25, 0.50, 0.75\}$.

6.2. Estimation

Table 1 considers estimation of the target genetic effect $\beta_G$, the target variance $\Sigma_{TT}$, and the target-surrogate correlation $\rho$ both in the absence of missingness and in the presence of unilateral missingness in the target outcome. In all cases, parameter estimation was essentially unbiased, and the model-based standard errors (SEs), obtained from equation (7), agreed closely with the empirical standard deviations of the point estimates. Analogous tables for estimation of ($\beta_G$, $\Sigma_{TT}$, $\rho$) in the presence of bilateral missingness (S1), and for estimation of ($\alpha_G$, $\Sigma_{SS}$) in the presence of both unilateral and bilateral missingness (S2) are presented in the supporting information.

Table 1: Target parameter estimation and standard error calibration across $R = 5 \times 10^7$ simulations in the presence of unilateral missingness.

The number of complete cases was $n_0 = 10^3$. The true regression coefficient $\beta_G \approx 0.08$ was chosen such that the heritability of the target outcome was 0.5% and the true variance of the target outcome was $\Sigma_{TT} = 1.00$. The surrogate missingness was fixed at $\pi_S = 0.00$ while the target missingness $\pi_T$ and target-surrogate correlation $\rho$ were varied. The point estimate (EST) is the average across simulation replicates. The standard error is presented as the root mean square model-based standard error $\mathrm{SE}_M$, followed by the empirical standard error $\mathrm{SE}_E$ in parentheses, which is the standard deviation of the simulation point estimates.

Settings βG ΣTT ρ

ρ πT EST SEM SEE EST SEM SEE EST SEM SEE
0.00 0.00 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.00 0.03 (0.03)
0.25 0.00 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.25 0.03 (0.03)
0.50 0.00 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.50 0.04 (0.04)
0.75 0.00 0.08 0.05 (0.05) 0.99 0.04 (0.05) 0.75 0.04 (0.04)

0.00 0.25 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.00 0.03 (0.03)
0.25 0.25 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.25 0.03 (0.03)
0.50 0.25 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.50 0.03 (0.03)
0.75 0.25 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.75 0.04 (0.04)

0.00 0.50 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.00 0.03 (0.03)
0.25 0.50 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.25 0.03 (0.03)
0.50 0.50 0.08 0.05 (0.05) 1.00 0.04 (0.04) 0.50 0.03 (0.03)
0.75 0.50 0.08 0.04 (0.04) 1.00 0.04 (0.04) 0.75 0.03 (0.03)

0.00 0.75 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.00 0.03 (0.03)
0.25 0.75 0.08 0.05 (0.05) 0.99 0.04 (0.04) 0.25 0.03 (0.03)
0.50 0.75 0.08 0.05 (0.05) 1.00 0.04 (0.04) 0.50 0.03 (0.03)
0.75 0.75 0.08 0.04 (0.04) 1.00 0.04 (0.04) 0.75 0.03 (0.03)

To evaluate sensitivity of the estimation procedure to the bivariate normality assumption, additional simulations were conducted in which the target and surrogate residuals were generated from non-normal distributions, including bivariate versions of the exponential, log-normal, and Student $t_3$ distributions. The bias and SE for estimating the parameter of primary interest, target genetic effect $\beta_G$, are presented in supporting table S3. Even when applied to skewed and kurtotic phenotypes, the estimation procedure remained unbiased and the SEs correctly calibrated, suggesting robustness to the residual distribution.

6.3. Type I Error Simulations

Table 2 presents the empirical type I error and non-centrality parameter (NCP) of the Spray Wald test in the presence of unilateral missingness; estimates under bilateral missingness are presented in supporting table S4. For these simulations the genetic effects were set to zero ($\beta_G = 0.00$) and the null hypothesis $H_0: \beta_G = 0$ was evaluated. The type I error was controlled to within 0.8% of the nominal level, and the NCP was within 0.2% of the reference value; both were insensitive to outcome missingness and target-surrogate correlation. Supporting figures S1-S2 demonstrate that, across outcome missingness patterns and target-surrogate correlation levels, the p-values provided by the Spray Wald test were uniformly distributed under the null. Thus, Spray provides a valid test of association between genotype and the target outcome. Supporting tables S5-S7 and figures S3-S5 indicate that the type I error is well-controlled even when the distribution of the phenotypic residuals is non-normal. Supporting table S8 verifies that control of the type I error becomes increasingly tight as sample size increases, to within 0.2% of nominal by a sample size of $20 \times 10^3$.

Table 2: Empirical type I error and power of the Spray Wald test across $R = 5 \times 10^7$ simulation replicates in the presence of unilateral missingness.

The number of complete cases was $n_0 = 10^3$. The surrogate missingness was fixed at $\pi_S = 0$. For type I error, $\beta_G = 0$ while for power $\beta_G$ was selected to explain 0.5% of variation in the target outcome. The target missingness $\pi_T$ and target-surrogate correlation $\rho$ were varied. Prob refers to the rejection probability at a target type I error of 5% and NCP is the non-centrality parameter of the Wald test.

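| $\rho$ | $\pi_T$ | Type I error: Prob (%) | Type I error: NCP | Power: Prob (%) | Power: NCP |
|---|---|---|---|---|---|
| 0.00 | 0.00 | 5.01 | 1.00 | 72.08 | 7.50 |
| 0.25 | 0.00 | 5.02 | 1.00 | 72.09 | 7.50 |
| 0.50 | 0.00 | 5.02 | 1.00 | 72.31 | 7.52 |
| 0.75 | 0.00 | 5.01 | 1.00 | 72.04 | 7.48 |
| 0.00 | 0.25 | 5.01 | 1.00 | 72.33 | 7.53 |
| 0.25 | 0.25 | 5.03 | 1.00 | 72.94 | 7.59 |
| 0.50 | 0.25 | 5.03 | 1.00 | 75.14 | 7.96 |
| 0.75 | 0.25 | 5.02 | 1.00 | 78.49 | 8.58 |
| 0.00 | 0.50 | 5.02 | 1.00 | 72.16 | 7.52 |
| 0.25 | 0.50 | 5.03 | 1.00 | 73.57 | 7.73 |
| 0.50 | 0.50 | 5.01 | 1.00 | 77.83 | 8.45 |
| 0.75 | 0.50 | 5.03 | 1.00 | 85.26 | 10.06 |
| 0.00 | 0.75 | 5.03 | 1.00 | 72.38 | 7.55 |
| 0.25 | 0.75 | 5.04 | 1.00 | 74.18 | 7.86 |
| 0.50 | 0.75 | 5.03 | 1.00 | 80.84 | 9.06 |
| 0.75 | 0.75 | 5.02 | 1.00 | 91.93 | 12.34 |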

It is important to note that throughout the simulations, the target-surrogate correlation was estimated. For a given realization of the data, the MLE $\hat{\rho}$ will differ from 0 even when in truth $\rho = 0$. The type I error simulations verify that this spurious estimated correlation does not compromise inference on $\beta_G$.

6.4. Power Simulations

Table 2 presents the estimated power and NCP of the Spray Wald test for rejecting $H_0: \beta_G = 0$ in the presence of unilateral missingness; estimates under bilateral missingness are presented in supporting table S4. For these simulations, $\beta_G$ was chosen such that the proportion of variation in the target outcome explained by variation in genotype (i.e. the heritability) was 0.5%. Figures 1 and S6 present power curves describing how the probability of correctly rejecting the null hypothesis increases as the heritability increases from 0.1% to 1.0%. In the absence of target missingness, no additional power was gained by modeling the surrogate outcome. In the presence of target missingness, the power of the Spray Wald test increased with the target-surrogate correlation, and the relative improvement increased with the extent of target missingness. Supporting tables S5-S7 and figures S7-S9 demonstrate that similar trends with respect to power held under model misspecification. Whereas power under an exponential data generating process nearly matched that under a normal data generating process, power was attenuated in the more kurtotic cases of log-normal and Student $t_3$ residuals.
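As a rough consistency check on how the reported NCP and power values relate, the sketch below converts a mean Wald statistic into an approximate rejection probability. It rests on an interpretation that the text does not state explicitly: that the tabulated NCP is the average of the one-degree-of-freedom chi-square statistic, so that it equals 1 under the null and $1 + \lambda$ under the alternative, with $\lambda$ the usual noncentrality. The function name `wald_power` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_power(mean_statistic, alpha=0.05):
    """Approximate power of a 1-df Wald test from the mean of its statistic.

    Assumes mean_statistic = 1 + lambda, where lambda is the noncentrality of
    the chi-square statistic (an interpretation, not taken from the text).
    """
    std_normal = NormalDist()
    noncentrality = max(mean_statistic - 1.0, 0.0)
    z_crit = std_normal.inv_cdf(1.0 - alpha / 2.0)
    shift = sqrt(noncentrality)
    # Two-sided rejection region for a normal statistic with mean `shift`.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


power_no_missing = wald_power(7.50)     # rho = 0.00, pi_T = 0.00 row of Table 2
power_best_case = wald_power(12.34)     # rho = 0.75, pi_T = 0.75 row of Table 2
size_under_null = wald_power(1.00)
```

Under this reading, the approximation gives roughly 72% and 92% for the two rows, close to the tabulated rejection probabilities, and returns the nominal 5% when the mean statistic is 1.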

Figure 1: Power curves for the Spray test of association in the presence of unilateral missingness.

The number of complete cases was $n_0 = 10^3$, and the type I error was $\alpha = 0.05$. Each point on the curve is the average across $R = 5 \times 10^5$ simulation replicates. The standard errors of the point estimates were negligible. The target regression coefficient $\beta_G$ was varied between 0.037 and 0.14 to achieve heritabilities between 0.1% and 1.0%, while the surrogate regression coefficient $\alpha_G$ was fixed at zero. The surrogate missingness was held at $\pi_S = 0$, while the target missingness $\pi_T$ and target-surrogate correlation $\rho$ were varied. Note that this figure appears in color in the electronic version of this article, and any mention of color refers to that version.

6.5. Empirical Relative Efficiency

To validate the ARE formula in equation (10), we conducted simulations comparing the Spray estimator $\hat{\beta}_G^{\text{Spray}}$ of $\beta_G$ with the marginal estimator $\hat{\beta}_G^{\text{Marginal}}$ from the model:

$$
T_i \mid (g_i, x_i) \sim N\left(g_i\beta_G + x_i'\beta_X, \Sigma_{TT}\right).
$$

These simulations quantify the efficiency gain attributable to incorporating information from the surrogate. Table 3 compares the empirical variances of $\hat{\beta}_G^{\text{Spray}}$ and $\hat{\beta}_G^{\text{Marginal}}$ in the presence of unilateral missingness, while supporting table S9 compares the empirical variances under bilateral missingness. In the absence of target missingness ($\pi_T = 0$), or when the target-surrogate correlation was zero ($\rho = 0$), the empirical RE was one, as predicted by (10). Thus, while jointly modeling the target and surrogate outcomes is unnecessary in the absence of missingness, power is not substantially diminished by modeling an uninformative surrogate. In the presence of missingness, modeling an uninformative surrogate ($\rho = 0$) did not spuriously inflate the RE. As the target missingness $\pi_T$ and target-surrogate correlation $\rho$ increased, the empirical RE increased as predicted by (10). The precise agreement between the empirical and theoretical REs suggests that equation (10) could prove useful for study design.

Table 3: Empirical relative efficiency comparing the Spray estimator to the marginal estimator of $\beta_G$ across $R = 5 \times 10^7$ simulation replicates in the presence of unilateral missingness.

The number of complete cases was $n_0 = 10^3$. The true regression coefficient $\beta_G \approx 0.08$ was chosen such that the heritability of the target outcome was 0.5%. The true variances of the target and surrogate outcomes were $\Sigma_{TT} = \Sigma_{SS} = 1.00$. The surrogate missingness was fixed at $\pi_S = 0.00$. The target missingness $\pi_T$ and target-surrogate correlation $\rho$ were varied. Variance refers to the empirical variance of the corresponding estimator across simulation replicates. The empirical RE is the ratio of the variance of $\hat{\beta}_G^{\text{Marginal}}$ to that of $\hat{\beta}_G^{\text{Spray}}$, so that values above one favor Spray. The theoretical RE was obtained from (10).

| $\rho$ | $\pi_T$ | Variance (Marginal) | Variance (Spray) | RE (Empirical) | RE (Theoretical) |
|---|---|---|---|---|---|
| 0.00 | 0.00 | 0.0027 | 0.0027 | 1.0000 | 1.0000 |
| 0.25 | 0.00 | 0.0027 | 0.0027 | 1.0001 | 1.0000 |
| 0.50 | 0.00 | 0.0027 | 0.0027 | 1.0005 | 1.0000 |
| 0.75 | 0.00 | 0.0027 | 0.0027 | 1.0011 | 1.0000 |
| 0.00 | 0.25 | 0.0027 | 0.0027 | 0.9997 | 1.0000 |
| 0.25 | 0.25 | 0.0027 | 0.0026 | 1.0158 | 1.0159 |
| 0.50 | 0.25 | 0.0027 | 0.0025 | 1.0672 | 1.0667 |
| 0.75 | 0.25 | 0.0027 | 0.0023 | 1.1657 | 1.1636 |
| 0.00 | 0.50 | 0.0027 | 0.0027 | 0.9995 | 1.0000 |
| 0.25 | 0.50 | 0.0027 | 0.0026 | 1.0318 | 1.0323 |
| 0.50 | 0.50 | 0.0027 | 0.0023 | 1.1431 | 1.1429 |
| 0.75 | 0.50 | 0.0027 | 0.0019 | 1.3931 | 1.3913 |
| 0.00 | 0.75 | 0.0027 | 0.0027 | 0.9992 | 1.0000 |
| 0.25 | 0.75 | 0.0027 | 0.0026 | 1.0485 | 1.0492 |
| 0.50 | 0.75 | 0.0027 | 0.0022 | 1.2305 | 1.2308 |
| 0.75 | 0.75 | 0.0027 | 0.0016 | 1.7303 | 1.7297 |
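The theoretical column of Table 3 can be reproduced numerically. Equation (10) is not restated in this section, so the closed form used below, RE = 1 / (1 − πT ρ²), is an assumption inferred from the tabulated values under unilateral missingness (πS = 0); it should not be read as the general formula for bilateral missingness. The function name `theoretical_re` is hypothetical.

```python
def theoretical_re(rho: float, pi_t: float) -> float:
    """Relative efficiency under unilateral missingness.

    Assumed form, inferred from Table 3: 1 / (1 - pi_t * rho^2).
    """
    return 1.0 / (1.0 - pi_t * rho**2)


# Theoretical REs as printed in Table 3, keyed by (rho, pi_T).
table3_theoretical = {
    (0.25, 0.25): 1.0159, (0.50, 0.25): 1.0667, (0.75, 0.25): 1.1636,
    (0.25, 0.50): 1.0323, (0.50, 0.50): 1.1429, (0.75, 0.50): 1.3913,
    (0.25, 0.75): 1.0492, (0.50, 0.75): 1.2308, (0.75, 0.75): 1.7297,
}

for (rho, pi_t), tabulated in table3_theoretical.items():
    computed = theoretical_re(rho, pi_t)
    print(f"rho={rho:.2f} pi_T={pi_t:.2f} computed={computed:.4f} table={tabulated:.4f}")
```

Under this form the RE equals one whenever πT = 0 or ρ = 0, which agrees with the first block and the ρ = 0 rows of the table.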

7. Application to Identifying SSN eQTL in GTEx

7.1. Brief Data Analysis Methods

Details of the GTEx analysis are presented in the supporting information. Briefly, gene expression in SSN was the target outcome. Three surrogate analyses were conducted in parallel, using whole blood, skeletal muscle, and cerebellum, respectively, as the surrogate. We address the idea of using multiple surrogates simultaneously in the Discussion. For inclusion in the analysis, a transcript was required to be expressed in both the target and surrogate tissues. SNPs in cis to an expressed transcript were tested for association. Two association methods were applied: a marginal analysis that regresses the target outcome only on genotype and covariates, and a joint analysis (Spray) that regresses the target and surrogate outcomes on genotype and covariates. Significance was declared at the Bonferroni threshold, adjusted for the number of SNP-transcript pairs tested for association.
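The Bonferroni rule described above can be sketched as follows. The family-wise error rate of 0.05, the number of SNP-transcript pairs, and the use of a 1 degree-of-freedom chi-square reference distribution for a single-parameter test are assumptions made for illustration; the actual counts are given in the supporting information.

```python
import math


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold controlling the family-wise error rate."""
    return alpha / n_tests


def chi2_1df_pvalue(stat: float) -> float:
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom.

    Uses P(X > x) = erfc(sqrt(x / 2)) for X ~ chi-square(1).
    """
    return math.erfc(math.sqrt(stat / 2.0))


n_pairs = 1_000_000  # hypothetical number of SNP-transcript pairs tested
threshold = bonferroni_threshold(0.05, n_pairs)

# A hypothetical association statistic, tested against the threshold.
stat = 40.0
p_value = chi2_1df_pvalue(stat)
print(f"threshold={threshold:.2e} p={p_value:.2e} significant={p_value < threshold}")
```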

7.2. Results

There were 80 genotyped subjects with expression in SSN. Supporting table S10 presents the sample sizes available in the 3 candidate surrogate tissues. The total sample size was largest for muscle (n=507) and smallest for cerebellum (n=168). However, as figure S10 demonstrates, the correlation between cerebellum and SSN was typically higher than that between muscle or blood and SSN. The root-mean-square correlation between the target and surrogate tissues was 0.18 for blood and muscle in comparison to 0.31 for cerebellum. Moreover, the number of transcripts expressed in both SSN and the surrogate tissue was greatest for cerebellum (table S11).

Table 4 compares the marginal and joint (Spray) eQTL analyses of SSN by surrogate tissue. In all cases, joint analysis identified more Bonferroni significant associations and did so more efficiently. All eQTL identified by the marginal analysis were also identified by the joint analysis, but not conversely. The most eQTL were detected when using cerebellum as the surrogate, although muscle in fact provided a slightly more efficient surrogate, meaning the estimated standard errors were on average lower. More eQTL were identified with cerebellum because 19 transcripts containing 33 significant eQTL were expressed in cerebellum but not muscle.

Table 4: Comparison of marginal and Spray analyses by surrogate tissue.

Significant eQTL were identified at the Bonferroni threshold for each analysis. The mean $\chi^2$ statistic is calculated across those eQTL significant under either the marginal or joint (Spray) analysis. Relative efficiency is calculated as the mean of the ratio of the sampling variance of the marginal estimator to that of the Spray estimator.

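| Surrogate | Significant eQTL (Marginal) | Significant eQTL (Spray) | Mean $\chi^2$ (Marginal) | Mean $\chi^2$ (Spray) | Relative Efficiency |
|---|---|---|---|---|---|
| Blood | 24 | 111 | 40.8 | 48.0 | 1.18 |
| Muscle | 37 | 149 | 40.8 | 50.1 | 1.26 |
| Cerebellum | 42 | 176 | 40.3 | 49.1 | 1.25 |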
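As a reading aid for Table 4, the short calculation below converts each relative efficiency, which is a ratio of sampling variances, into an approximate ratio of standard errors, and reports the fold increase in significant eQTL. Because the tabulated RE is a mean of per-eQTL variance ratios, the square-root conversion is only an approximation and is labeled as such; the dictionary layout and variable names are illustrative.

```python
import math

# Values transcribed from Table 4.
table4 = {
    "Blood": {"marginal": 24, "spray": 111, "re": 1.18},
    "Muscle": {"marginal": 37, "spray": 149, "re": 1.26},
    "Cerebellum": {"marginal": 42, "spray": 176, "re": 1.25},
}

summary = {}
for tissue, row in table4.items():
    summary[tissue] = {
        # Fold increase in Bonferroni-significant eQTL (Spray vs. marginal).
        "fold_discoveries": row["spray"] / row["marginal"],
        # Approximate SE(Spray) / SE(marginal), since RE is a variance ratio.
        "se_ratio": 1.0 / math.sqrt(row["re"]),
    }

for tissue, stats in summary.items():
    print(f"{tissue}: fold={stats['fold_discoveries']:.2f} se_ratio={stats['se_ratio']:.3f}")
```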

Figure 2A compares the estimated effect sizes of the marginal and joint analyses using cerebellum as the surrogate. Analogous figures for blood and muscle are presented in supporting figures S11 and S12. In all cases, the effect sizes were tightly correlated, verifying that Spray estimates the same effect as traditional, marginal analyses. However, figure 2B demonstrates that Spray provides greater power to detect eQTL. From table 4, this is because, at eQTL considered significant by either marginal or joint analysis, the sampling variance of the marginal estimator was up to 26% larger, on average, than that of the Spray estimator. Finally, figure 3 considers the concordance in effect sizes and p-values among SNP-transcript pairs that were tested for association in at least 2 of the surrogate analyses and were significant in at least 1. The tight correlation in effect sizes suggests that Spray is robust to the choice of surrogate outcome.

Figure 2: Comparison of the marginal and joint (Spray) eQTL analyses of substantia nigra, using cerebellum as the surrogate tissue.

A. Estimated effect size from the joint analysis vs. the estimated effect size from the marginal analysis for eQTL significant in at least 1 of the analyses. B. P-value from the joint analysis vs. p-value from the marginal analysis for eQTL significant in at least 1 of the analyses. C. Mirrored Manhattan plots comparing the p-values of the joint and marginal analyses by genomic position. Dotted line is the Bonferroni significance threshold. Note that this figure appears in color in the electronic version of this article, and any mention of color refers to that version.

Figure 3: Effect of the surrogate outcome on the results of joint (Spray) analysis.

A. Estimated effect size by surrogate outcome for eQTL evaluated in at least 2 of the surrogate analyses and significant in at least 1. B. Association p-value by surrogate outcome, again for eQTL evaluated in at least 2 of the surrogate analyses and significant in at least 1. Note that this figure appears in color in the electronic version of this article, and any mention of color refers to that version.

8. Discussion

In this article, we have proposed leveraging a correlated surrogate outcome to improve inference on a partially missing target outcome, and derived a computationally efficient, ECME-type algorithm for fitting the association model. We demonstrated analytically and empirically, through extensive simulations and in real data, that the Spray test of association, which incorporates information from the target and surrogate outcomes, is more efficient than the marginal test of association, which incorporates information from the target outcome only. The efficiency of Spray increases with the target missingness, and with the square of the target-surrogate correlation. Moreover, we showed that modeling the surrogate as an outcome, rather than conditioning on it as a covariate, allows Spray to estimate the same effect size as traditional, marginal analysis. All estimation and inference procedures described in this article have been made available as an R package (McCaw, 2020).

We applied Spray to eQTL mapping in GTEx, using expression in SSN as the target outcome and expression in one of blood, muscle, or cerebellum as the surrogate outcome. Relative to marginal analysis, joint analysis using Spray consistently identified more Bonferroni significant associations. Although the joint and marginal effect size estimates were highly concordant ($R^2 \geq 0.995$), the Spray estimator was up to 26.0% more efficient, on average, at Bonferroni-significant eQTL. The choice of surrogate tissue highlighted a trade-off between the quality of the surrogate, as measured by its correlation with the target outcome, and the availability of the surrogate. Expression in muscle was available for 3 times as many subjects as expression in cerebellum, yet expression in cerebellum was better correlated with expression in SSN. Although the effect size estimated by Spray is unaffected by the choice of surrogate, the power is; sample sizes being equal, the better correlated surrogate is preferred. When the available sample sizes are not equal, equation (10) may be used to examine the trade-off.
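The trade-off between surrogate quality and surrogate availability can be explored numerically. The sketch below rests on strong assumptions that are not made in the text: it uses the unilateral-missingness form RE = 1 / (1 − πT ρ²) inferred from Table 3 in place of equation (10), treats every SSN subject as also having the surrogate measured, and plugs in the transcript-averaged root-mean-square correlations (0.18 and 0.31) as if they applied to a single transcript. All function names are hypothetical, and the GTEx analysis itself involved bilateral missingness, so the numbers are indicative only.

```python
import math


def target_missingness(n_target: int, n_total: int) -> float:
    """Fraction of subjects lacking the target, assuming all have the surrogate."""
    return 1.0 - n_target / n_total


def theoretical_re(rho: float, pi_t: float) -> float:
    """Assumed unilateral-missingness relative efficiency (inferred from Table 3)."""
    return 1.0 / (1.0 - pi_t * rho**2)


def breakeven_correlation(rho_ref: float, pi_ref: float, pi_alt: float) -> float:
    """Correlation an alternative surrogate needs to match the reference RE.

    Solves pi_alt * rho_alt^2 = pi_ref * rho_ref^2 for rho_alt.
    """
    return rho_ref * math.sqrt(pi_ref / pi_alt)


n_ssn = 80                                       # subjects with SSN expression
pi_muscle = target_missingness(n_ssn, 507)       # muscle: n = 507
pi_cerebellum = target_missingness(n_ssn, 168)   # cerebellum: n = 168

re_muscle = theoretical_re(0.18, pi_muscle)
re_cerebellum = theoretical_re(0.31, pi_cerebellum)

# Correlation muscle would need with SSN to match cerebellum at rho = 0.31.
rho_needed = breakeven_correlation(0.31, pi_cerebellum, pi_muscle)
print(f"RE muscle={re_muscle:.3f} RE cerebellum={re_cerebellum:.3f} "
      f"break-even rho for muscle={rho_needed:.3f}")
```

Framing the comparison as a break-even correlation makes the design question explicit: a more widely available surrogate is preferable only if its correlation with the target exceeds that threshold.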

Our work suggests several areas for further improvement. Although INT was applied to ensure marginal normality of the target and surrogate outcomes, joint bivariate normality is not guaranteed. While our results show that INT confers robustness to residual non-normality, a future direction is to develop association tests that allow for arbitrary patterns of outcome missingness but do not require specification of a joint distribution. Instead of maximum likelihood based estimation, this procedure could use a set of inverse probability weighted estimating equations (Robins et al., 1995).

Another avenue for future development is to incorporate multiple surrogate outcomes. One way to achieve this would be to extend the bivariate normal regression framework to a multivariate normal regression framework. However, there are drawbacks to directly modeling multiple surrogate outcomes: the number of nuisance covariance parameters increases quadratically with the number of surrogates, and the number of potential missingness patterns increases exponentially. Finally, although the current work was motivated by eQTL mapping, the idea of leveraging a surrogate outcome to improve inference on a partially missing target outcome is broadly applicable. For example, in large cohort studies such as the UK Biobank (Allen et al., 2014), the target outcome may be any incompletely ascertained phenotype, such as the concentration of a biomarker only measured for a subset of participants, while the surrogate outcome may be a readily ascertained phenotype, such as a risk score based on diagnostic codes from electronic health records.
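The growth in model complexity with multiple surrogates, noted in the preceding paragraph, can be made concrete with a simple count. The sketch below assumes an unstructured covariance matrix over the target and k surrogates, and counts missingness patterns as the non-empty subsets of outcomes that may be observed for a subject; the function names are illustrative.

```python
def covariance_parameters(n_outcomes: int) -> int:
    """Free parameters in an unstructured covariance matrix (variances plus covariances)."""
    return n_outcomes * (n_outcomes + 1) // 2


def missingness_patterns(n_outcomes: int) -> int:
    """Observed-outcome patterns, excluding the case where every outcome is missing."""
    return 2**n_outcomes - 1


for n_surrogates in (1, 2, 5, 10):
    n_outcomes = n_surrogates + 1  # target plus surrogates
    print(
        f"surrogates={n_surrogates:2d} "
        f"covariance parameters={covariance_parameters(n_outcomes):3d} "
        f"missingness patterns={missingness_patterns(n_outcomes)}"
    )
```

With a single surrogate this gives 3 covariance parameters and 3 patterns (both observed, target only, surrogate only), matching the bivariate setting with bilateral missingness.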

DATA AVAILABILITY

The data that support the findings in this paper are available from the Genotype-Tissue Expression (GTEx) Project. Restrictions apply to the availability of these data, which were used under license in this paper. Data are available at https://www.ncbi.nlm.nih.gov/projects/gap/cgi-bin/study.cgi?study_id=phs000424.v7.p2 with the permission of GTEx.

ACKNOWLEDGMENTS

This work was supported by the National Institutes of Health grants R35 CA197449 and F31 HL140822 (to Z.M.) and R35 CA197449, P01 CA134294, U01 HG009088, U01HG012064, 19 CA203654, and R01 HL113338 (to X.L.).

Footnotes

SUPPORTING INFORMATION

Web Appendices, Tables, and Figures referenced in Sections 4, 6, and 7 are available with this paper at the Biometrics website on Wiley Online Library. Code for replicating the simulations and data analyses in this paper is available online with the paper and on GitHub at https://github.com/zrmacc/Surrogate-Replication-eQTL. Spray is available as an R package on CRAN at https://CRAN.R-project.org/package=SurrogateRegression.

REFERENCES

  1. Allen NE, Sudlow C, Peakman T, Collins R, et al. (2014). UK Biobank data: come and get it.
  2. Consortium G (2013). The Genotype-Tissue Expression (GTEx) project. Nature Genetics 45, 580–585.
  3. Consortium G (2017). Genetic effects on gene expression across human tissues. Nature 550, 204.
  4. Flutre T, Wen X, Pritchard J, and Stephens M (2013). A statistical framework for joint eQTL analysis in multiple tissues. PLoS Genetics 9, e1003486.
  5. Gamazon ER, Wheeler HE, Shah KP, Mozaffari SV, Aquino-Michaels K, Carroll RJ, Eyler AE, Denny JC, Nicolae DL, Cox NJ, et al. (2015). A gene-based association method for mapping traits using reference transcriptome data. Nature Genetics 47, 1091.
  6. Gusev A, Ko A, Shi H, Bhatia G, Chung W, Penninx BW, Jansen R, De Geus EJ, Boomsma DI, Wright FA, et al. (2016). Integrative approaches for large-scale transcriptome-wide association studies. Nature Genetics 48, 245.
  7. Hormozdiari F, van de Bunt M, Segre AV, Li X, Joo JWJ, Bilow M, Sul JH, Sankararaman S, Pasaniuc B, and Eskin E (2016). Colocalization of GWAS and eQTL signals detects target genes. The American Journal of Human Genetics 99, 1245–1260.
  8. Lee S, Sun W, Wright F, and Zou F (2017). An improved and explicit surrogate variable analysis procedure by coefficient adjustment. Biometrika 104, 303–316.
  9. Leek J and Storey J (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genetics 3, 1724–1735.
  10. Li G, Shabalin A, Rusyn I, Wright F, and Nobel A (2018). An empirical Bayes approach for multiple tissue eQTL analysis. Biostatistics 19, 391–406.
  11. Liu C and Rubin D (1994). The ECME algorithm: A simple extension of EM and ECM with faster monotone convergence. Biometrika 81, 633–648.
  12. McCaw Z (2020). SurrogateRegression: Surrogate Outcome Regression Analysis. https://CRAN.R-project.org/package=SurrogateRegression.
  13. McCaw Z, Lane J, Saxena R, Redline S, and Lin X (2020). Operating characteristics of the rank-based inverse normal transformation for quantitative trait analysis in genome-wide association studies. Biometrics 76, 1262–1272.
  14. McKenzie M, Henders A, Caracella A, Wray N, and Powell J (2014). Overlap of expression quantitative trait loci (eQTL) in human brain and blood. BMC Medical Genomics 7, 1–11.
  15. Meng X-L and Rubin DB (1993). Maximum likelihood estimation via the ECM algorithm: A general framework. Biometrika 80, 267–278.
  16. NCI (2013). GTEx biobank donors. Accessed: 2021-03-15.
  17. Poewe W, Seppi K, Tanner C, Halliday G, Brundin P, et al. (2017). Parkinson disease. Nature Reviews Disease Primers 3, 1–21.
  18. Robins J, Rotnitzky A, and Zhou L (1995). Analysis of semiparametric regression models for repeated outcomes in the presence of missing data. Journal of the American Statistical Association 90, 106–121.
  19. Sul J, Han B, Ye C, Choi T, and Eskin E (2013). Effectively identifying eQTLs from multiple tissues by combining mixed model and meta-analytic approaches. PLoS Genetics 9, e1003491.
  20. Visscher PM, Wray NR, Zhang Q, Sklar P, McCarthy MI, Brown MA, and Yang J (2017). 10 years of GWAS discovery: biology, function, and translation. The American Journal of Human Genetics 101, 5–22.
  21. Westra H, Peters M, Esko T, Yaghootkar H, Schurmann C, et al. (2013). Systematic identification of trans eQTLs as putative drivers of known disease associations. Nature Genetics 45, 1238–1243.
  22. Zhu Z, Zhang F, Hu H, Bakshi A, Robinson M, et al. (2016). Integration of summary data from GWAS and eQTL studies predicts complex trait gene targets. Nature Genetics 48, 481–487.
