An improved and explicit surrogate variable analysis procedure by coefficient adjustment

Seunggeun Lee; Wei Sun; Fred A Wright; Fei Zou

doi:10.1093/biomet/asx018

Biometrika. 2017 Apr 21;104(2):303–316.

PMCID: PMC5627626 NIHMSID: NIHMS857072 PMID: 29430031

Summary

Unobserved environmental, demographic and technical factors can adversely affect the estimation and testing of the effects of primary variables. Surrogate variable analysis, proposed to tackle this problem, has been widely used in genomic studies. To estimate hidden factors that are correlated with the primary variables, surrogate variable analysis performs principal component analysis either on a subset of features or on all features, but weighting each differently. However, existing approaches may fail to identify hidden factors that are strongly correlated with the primary variables, and the extra step of feature selection and weight calculation makes the theoretical investigation of surrogate variable analysis challenging. In this paper, we propose an improved surrogate variable analysis, using all measured features, that has a natural connection with restricted least squares, which allows us to study its theoretical properties. Simulation studies and real-data analysis show that the method is competitive with state-of-the-art methods.

Keywords: Batch effect, High-dimensional data, Principal component analysis, Surrogate variable analysis

1. Introduction

In regression analysis, the existence of unobserved factors can cause biases in estimating parameters. Suppose that the true relationship in the data is

y = X β + Z δ + ϵ,

where $y$ is a vector of outcome measurements, $X$ is a matrix of the observed covariates including the primary variables, and $Z$ is a matrix of the unobserved factors. We are interested in estimating the regression parameter $β$. Since $Z$ is not observed, in practice we use the misspecified model

y = X β^{*} + ϵ^{*},

which can negatively impact inference on $β$.

With the development of high-throughput technologies in biomedical sciences, high-dimensional data are routinely collected and analysed to find biologically meaningful features. Unobserved factors can cause adverse effects, including inflation of Type I error and/or power loss (Stegle et al., 2010). Although in practice great efforts are made to control confounders, such efforts may be insufficient to avoid all confounding issues (Leek et al., 2010).

Principal component analysis on the original or residualized features after removing the effects of observed dependent variables has often been used to identify hidden factors, and has been successful in identifying and controlling for population stratification in genome-wide association studies (Price et al., 2006). However, principal component analysis-based approaches are less effective for gene expression studies, where the hidden factors can affect a subset of features with relatively large effects (Leek & Storey, 2007). To overcome this limitation, surrogate variable analysis has been proposed (Leek & Storey, 2007, 2008; Teschendorff et al., 2011; Chakraborty et al., 2012) for microarray data. Leek & Storey (2007) initially developed a two-step approach which involves first identifying a subset of features that may be affected by hidden factors but not by primary variables, and then performing principal component analysis on the selected features. Later, they modified the approach to a weighted principal component analysis, where each feature is weighted according to its probability of being affected by the hidden factors only (Leek & Storey, 2008). Surrogate variable analysis has been extended to factor analysis (Friguet et al., 2009) and mixed-effect models (Listgarten et al., 2010). Recently, assuming that negative control genes are known, Gagnon-Bartsch & Speed (2012) proposed a surrogate variable method.

Surrogate variable analysis has been successfully applied to many genomic studies (Dumeaux et al., 2010; Teschendorff et al., 2010), but existing methods may fail to identify hidden factors. Strong correlation between hidden factors and primary variables can prevent the two-step and weighted principal component-based surrogate variable methods from identifying features that are affected by hidden factors only. If negative control genes are affected by primary variables or if the observed variation in negative control genes does not reflect unwanted variations in the entire genome, the methods for removing unwanted variation can also fail to identify true hidden factors.

In this paper, we propose a simple and straightforward method for identifying hidden factors and adjusting for their effects. Our approach, called direct surrogate variable analysis, is based on the observation that naïve estimators of the effects of the primary variables are biased when the effects of hidden factors are ignored in the analysis, but the bias can be estimated and removed using singular value decomposition on residuals. We derive the asymptotic properties of our estimators using techniques recently developed for the ultrahigh-dimensional regime (Lee et al., 2014) and the connection between our estimating procedure and the restricted least-squares method (Greene & Seaks, 1991). An R package (R Development Core Team, 2017) implementing the proposed approach, dSVA, can be downloaded from the comprehensive R archive network.

2. Methods

2.1. Direct surrogate variable analysis

Suppose that $Y = (y_1, \dots, y_m)$ is an $n \times m$ matrix of measured features, where $m$ is the number of features and $n$ is the number of samples. For gene expression data, $Y$ represents RNA expression levels on $m$ genes. Further, suppose that $X$ is an $n \times p$ matrix of observed covariates, including an intercept, and $Z$ is an $n \times q$ matrix of unobserved hidden factors. The following model represents the true relationship between $Y$ and $(X, Z)$:

y_{i} = X β_{i} + Z δ_{i} + ϵ_{i},

(1)

where $y_i$ denotes the $i$th column of $Y$, $β_i$ is a $p \times 1$ vector of regression coefficients associated with $X$, $δ_i$ is a $q \times 1$ vector of regression coefficients associated with $Z$, and $ϵ_i$ is an $n \times 1$ random vector which follows $N(0, σ_i^2 I)$. We further define $B = (β_1, \dots, β_m)$ and $Δ = (δ_1, \dots, δ_m)$, which are $p \times m$ and $q \times m$ matrices of regression coefficients associated with $X$ and $Z$, respectively. In this model, $B$ and $Δ$ are assumed to be fixed. Later, to generate large numbers of $β_i$ and $δ_i$ values for the simulation studies, we use a specified correlation between $β_i$ and $δ_i$. However, we emphasize that the proposed method is frequentist: $B$ and $Δ$ are considered fixed and unknown.

In practice, since $Z$ is not observed, we effectively use the misspecified model

y_{i} = X β_{i}^{*} + ϵ_{i}^{*}

(2)

instead of (1). Under (2), the least-squares estimator of $β_i^{*}$ is

{\hat{β}}_{i}^{*} = (X^{T} X)^{- 1} X^{T} y_{i} = β_{i} + (X^{T} X)^{- 1} X^{T} Z δ_{i} + (X^{T} X)^{- 1} X^{T} ϵ_{i},

(3)

with residual vector

r_{i} = (I - M) y_{i} = (I - M) Z δ_{i} + (I - M) ϵ_{i},

(4)

where $M = X (X^{T} X)^{-1} X^{T}$ is the projection matrix onto the column space of $X$. Equations (3) and (4) indicate that $\hat{β}_i^{*}$ is a biased estimator of $β_i$ with bias $(X^{T} X)^{-1} X^{T} Z δ_i$. The conditional mean of the residual vector given $Z$ is $(I - M) Z δ_i$, which allows us to estimate $(I - M) Z$ via, for example, singular value decomposition.
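The decomposition in (3) and (4) can be checked numerically. The following is a minimal sketch on simulated data; the sample size, coefficient values and the way the hidden factors are correlated with the primary variable are illustrative assumptions, not settings from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100                                    # illustrative sample size

# X: intercept plus one primary variable; Z: two hidden factors, the first
# of which is correlated with the primary variable (illustrative choice).
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([0.8 * x + rng.normal(size=n), rng.normal(size=n)])

beta = np.array([1.0, 0.5])                # illustrative true coefficients
delta = np.array([2.0, -1.0])
eps = rng.normal(size=n)
y = X @ beta + Z @ delta + eps             # true model (1) for one feature

XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T) # (X^T X)^{-1} X^T
M = X @ XtX_inv_Xt                         # projection onto the column space of X

beta_naive = XtX_inv_Xt @ y                # least squares under the misspecified model (2)
bias = XtX_inv_Xt @ Z @ delta              # bias term in equation (3)
noise = XtX_inv_Xt @ eps                   # noise term in equation (3)

r = y - M @ y                              # residual vector, equation (4)
r_from_formula = (np.eye(n) - M) @ (Z @ delta + eps)

print("naive estimate:", beta_naive)
print("bias term     :", bias)
```

The residual vector is orthogonal to the columns of $X$, so only the part of $Z$ outside the column space of $X$ is visible in the residuals.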

Suppose that singular value decomposition is performed on the residual matrix $R = (r_1, \dots, r_m)$, where $R = U D V^{T}$, with $D$ being a diagonal matrix of ordered singular values, and $U$ and $V$ being matrices of left- and right-singular vectors. The first $q$ left-singular vectors can be viewed as estimators of linear combinations of the columns of $(I - M) Z$, which we denote by $(I - M) Γ$ where $Γ = Z Q$, with $Q$ being a $q \times q$ orthonormal matrix. Let $Ψ = Q^{T} Δ$. For any $Q$, the matrices $Z$ and $Γ$ have the same column space, so $Z Δ$ is identical to $Γ Ψ$. With an additional assumption that the row vectors of $B$ and the row vectors of $Δ$ are asymptotically orthogonal after mean centring, we can estimate $Γ$ and use it to remove the bias in $\hat{B}^{*}$. The proposed method is as follows.

Step 1.

Carry out singular value decomposition on the residual matrix $R = (I - M) Y$. Let $U_q$ be the matrix comprising the first $q$ columns of $U$ that are equivalent to the left-singular vectors corresponding to the $q$ largest singular values.

Step 2.

Obtain $\hat{B}^{*}$ and $\hat{Ψ}$ from the model $y_i = X β_i^{*} + U_q ψ_i + ϵ_i^{*}$. Since $X$ and $U_q$ are orthogonal to each other, $\hat{B}^{*}$ from this model equals that from model (3).

Step 3.

Let $\hat{B}^{*} = (\hat{β}_1^{*}, \dots, \hat{β}_m^{*})$ and $\hat{Ψ} = (\hat{ψ}_1, \dots, \hat{ψ}_m)$. We propose to estimate the surrogate variables as

$\hat{Γ} = U_{q} + X {\hat{B}}^{*} (I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{- 1},$

where $M_J = J (J^{T} J)^{-1} J^{T}$ is a projection matrix with $J = (1, \dots, 1)^{T}$.

Step 4.

Estimate and test $β_i$ from the model

$y_{i} = X β_{i} + \hat{Γ} ψ_{i} + ϵ_{i} .$ (5)

This method requires estimation of $q$, the number of surrogate variables, which can be obtained by permutation (Buja & Eyuboglu, 1992) or by analytical-asymptotic approaches (Johnstone, 2001; Leek, 2011). In this paper, we use the method of Buja & Eyuboglu (1992) for all numerical work. Since $Γ$ and $Ψ$ can always be rescaled, they are not identifiable, so we set , where is the th element of , and adjust to satisfy this restriction. In the Supplementary Material we show that $\hat{B}$ from (5) in Step 4 is the same as

\hat{B} = {\hat{B}}^{*} - {\hat{B}}^{*} (I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{- 1} \hat{Ψ},

(6)

in which ${\hat{B}}^{*} (I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{- 1} \hat{Ψ}$ is an estimate of the bias of the naïve estimator ${\hat{B}}^{*}$. In § 2.3, we show that (6) is related to the restricted least-squares method.
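Steps 1 to 4 can be sketched in a few lines of linear algebra. The code below is an illustrative sketch, not the dSVA R package: the function name `dsva_sketch`, the simulated data and all sizes are assumptions made for the example, $q$ is taken as known, and the testing part of Step 4 is omitted. It also returns the closed form (6) so that the two expressions for $\hat{B}$ can be compared.

```python
import numpy as np

def dsva_sketch(Y, X, q):
    """Estimate B by Steps 1-4 for an n x m matrix Y and an n x p matrix X."""
    p = X.shape[1]
    B_star = np.linalg.solve(X.T @ X, X.T @ Y)     # naive estimate, p x m
    R = Y - X @ B_star                             # residual matrix (I - M) Y

    # Step 1: leading q left-singular vectors of the residual matrix.
    U, _, _ = np.linalg.svd(R, full_matrices=False)
    U_q = U[:, :q]

    # Step 2: U_q has orthonormal columns that are orthogonal to X, so the
    # coefficient on U_q is U_q^T Y and the coefficient on X stays B_star.
    Psi_hat = U_q.T @ Y                            # q x m

    # Step 3: right-multiplying by (I - M_J) centres each row across features.
    B_c = B_star - B_star.mean(axis=1, keepdims=True)
    Psi_c = Psi_hat - Psi_hat.mean(axis=1, keepdims=True)
    G = Psi_c @ Psi_c.T                            # Psi_hat (I - M_J) Psi_hat^T
    A = np.linalg.solve(G, Psi_c @ B_c.T).T        # B_star (I - M_J) Psi_hat^T G^{-1}
    Gamma_hat = U_q + X @ A                        # estimated surrogate variables

    # Step 4: regress Y on [X, Gamma_hat] and keep the coefficients on X.
    W = np.column_stack([X, Gamma_hat])
    coef, *_ = np.linalg.lstsq(W, Y, rcond=None)
    B_hat = coef[:p]

    B_hat_eq6 = B_star - A @ Psi_hat               # closed form (6)
    return B_hat, B_hat_eq6, Gamma_hat, B_star

# Illustrative data: one primary variable and two hidden factors.
rng = np.random.default_rng(0)
n, m, q = 60, 400, 2
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([0.7 * x + rng.normal(size=n), rng.normal(size=n)])
B = rng.normal(size=(2, m))
Delta = rng.normal(size=(q, m))
Y = X @ B + Z @ Delta + rng.normal(size=(n, m))

B_hat, B_hat_eq6, Gamma_hat, B_star = dsva_sketch(Y, X, q)
print("max |regression (5) - closed form (6)|:", np.abs(B_hat - B_hat_eq6).max())
```

Because $[X, \hat{Γ}]$ and $[X, U_q]$ span the same column space, the regression in Step 4 and the closed form (6) agree up to rounding error.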

2.2. Consistency of the proposed estimators

Important questions are under what conditions the proposed $\hat{Γ}$ spans the same column space as $Z$, and whether $\hat{B}$ is a consistent estimator of $B$. For high-dimensional data, the number of features, $m$, can be substantially larger than the number of samples, $n$, and thus asymptotic results derived from the traditional low-dimensional setting where $m$ is fixed are inappropriate (Johnstone & Lu, 2009; Jung & Marron, 2009; Lee et al., 2010). Lee et al. (2014) considered a regime in which both $m$ and $n$ increase to infinity with . This regime is well-suited to high-throughput biomedical data, where the number of genes is in the tens of thousands and the number of samples is in the range of several dozens to hundreds. We work in this regime and investigate the asymptotic properties of the proposed method under the spiked-eigenvalue model of Johnstone (2001).

Before presenting our main results, let us define some additional notation. Suppose that Inline graphic and are two sequences. We write if and , and write if . We also define to be the function that returns the th largest singular value of an input matrix. Without loss of generality we assume that and , where and are the th columns of and , respectively, and is the vector norm. We introduce the following conditions.

Condition 1.

Both and increase to with .

Condition 2.

Let , where . Then , , and for .

Condition 3.

Let , where is the standard deviation of and . Then either of the following is satisfied: (i) ; or (ii) , and .

Condition 4.

Let be the matrix with columns formed by concatenating and . Then is nonsingular with and .

Condition 5.

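Suppose that $β_{k i}$ and $δ_{l i}$ are the $(k, i)$th and $(l, i)$th elements of $B$ and $Δ$, respectively, and that ${\bar{β}}_{k} = m^{-1} \sum_{i = 1}^{m} β_{k i}$ and ${\bar{δ}}_{l} = m^{-1} \sum_{i = 1}^{m} δ_{l i}$. For all $k$ and $l$,

$\frac{1}{m} \sum_{i = 1}^{m} (β_{k i} - {\bar{β}}_{k}) (δ_{l i} - {\bar{δ}}_{l}) = o_{p} \left\{\frac{1}{m} \sum_{i = 1}^{m} (δ_{l i} - {\bar{δ}}_{l})^{2}\right\} .$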

Condition 2 assumes the spiked-eigenvalue model (Johnstone, 2001), which ensures that the effects of hidden factors are large enough to be identified by singular value decomposition. Condition 3 comprises the sphericity conditions on nonspiked singular values (Lee et al., 2014). The relative growth rates of $m$ and $n$ play a key role in this condition. For example, when is greater than zero, must grow at a faster rate than to satisfy the first condition in Condition 3. The second part of Condition 3 relaxes the assumption on but adds an assumption on . Condition 5 requires that the row vectors of $B$ and the row vectors of $Δ$ be asymptotically orthogonal after mean centring.
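Condition 5 is an asymptotic statement about unknown coefficients, so it cannot be verified on real data. In a simulation, where the coefficients are known, a finite-sample analogue can be computed as the ratio of the two sides of the display in Condition 5. The helper below is a hypothetical diagnostic written for illustration; its name and the toy inputs are assumptions.

```python
import numpy as np

def centred_cross_ratio(beta_row, delta_row):
    """Mean-centred cross-product of one row of B and one row of Delta,
    divided by the mean-centred sum of squares of the Delta row."""
    b = np.asarray(beta_row, dtype=float)
    d = np.asarray(delta_row, dtype=float)
    b_c = b - b.mean()
    d_c = d - d.mean()
    return float(np.mean(b_c * d_c) / np.mean(d_c ** 2))

# Toy rows across four features.
orthogonal = centred_cross_ratio([1, -1, 1, -1], [1, 1, -1, -1])
proportional = centred_cross_ratio([5, 7, 1, -1], [1, 2, -1, -2])
print(orthogonal, proportional)
```

A ratio close to zero is in line with the condition, while a ratio far from zero corresponds to effect sizes of the primary variable and of the hidden factor that move together across features.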

Theorem 1.

Suppose that Conditions 1–5 are satisfied. Then the columns of $\hat{Γ}$ span the same column space as the columns of $Z$ with probability , and for .

Theorem 1 shows that the proposed method produces consistent estimates of $B$ and the hidden factors. The proof can be found in the Supplementary Material.

2.3. Relationship to the restricted least-squares method

We now show the connection between the proposed method and the restricted least-squares procedure of Greene & Seaks (1991). Suppose that $C β = c$ is a linear restriction on $β$. The restricted least-squares estimator of $β$ is the solution of

minimize (y - X β)^{T} (y - X β) subject to C β = c .

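It can be shown that ${\hat{β}}_{R L S} = \hat{β} - H (C \hat{β} - c)$ where $\hat{β}$ is the ordinary least-squares estimator and $H = (X^{T} X)^{- 1} C^{T} \{C (X^{T} X)^{- 1} C^{T}\}^{- 1}$.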
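The restricted least-squares solution has the standard closed form $\hat{β} - H (C \hat{β} - c)$ with $H = (X^{T} X)^{-1} C^{T} \{C (X^{T} X)^{-1} C^{T}\}^{-1}$. The sketch below evaluates it on simulated data and cross-checks it against a direct solution of the Lagrangian first-order conditions; the data and the restriction $β_1 + β_2 = 3$ are hypothetical choices for illustration.

```python
import numpy as np

def restricted_least_squares(X, y, C, c):
    """Closed-form restricted least squares under the restriction C beta = c."""
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_ols = XtX_inv @ X.T @ y
    H = XtX_inv @ C.T @ np.linalg.inv(C @ XtX_inv @ C.T)
    return beta_ols - H @ (C @ beta_ols - c), beta_ols

rng = np.random.default_rng(1)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)
C = np.array([[1.0, 1.0, 0.0]])            # hypothetical restriction: beta_1 + beta_2 = 3
c = np.array([3.0])

beta_rls, beta_ols = restricted_least_squares(X, y, C, c)

# Cross-check: solve the first-order conditions of the Lagrangian directly,
#   2 X^T X beta + C^T lambda = 2 X^T y,   C beta = c.
K = np.block([[2.0 * X.T @ X, C.T], [C, np.zeros((1, 1))]])
rhs = np.concatenate([2.0 * X.T @ y, c])
beta_kkt = np.linalg.solve(K, rhs)[:p]

print("restricted estimate:", beta_rls)
```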

When estimating $B$, we impose a restriction on $B$ and $Δ$, and hence on $B$ and $Ψ$, such that they are asymptotically orthogonal after mean centring. While this is not a linear restriction as in the restricted least-squares procedure, the similarity of the two approaches can be illustrated as follows. Let $vec(\cdot)$ be a function on a matrix which stacks the columns of the matrix into one long vector. Then model (2) for all $m$ features can be re-expressed as

vec (Y) = (I_{m} \otimes X) vec (B^{*}) + vec (E^{*}),

where Inline graphic , is the Kronecker product, and is the identity matrix. We further define and . The solution of

minimize \{vec (Y) - (I_{m} \otimes X) vec (B^{*})\}^{T} \{vec (Y) - (I_{m} \otimes X) vec (B^{*})\} subject to \hat{C} vec (B^{*}) = 0

is $vec ({\hat{B}}_{R L S}) = vec ({\hat{B}}^{*}) - H \hat{C} vec ({\hat{B}}^{*})$ where

\begin{aligned} H \hat{C} & = \{(I_{m} \otimes X)^{T} (I_{m} \otimes X)\}^{- 1} {\hat{C}}^{T} [\hat{C} \{(I_{m} \otimes X)^{T} (I_{m} \otimes X)\}^{- 1} {\hat{C}}^{T}]^{- 1} \hat{C} \\ & = [(I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{- 1} \hat{Ψ} (I - M_{J})] \otimes I_{p} . \end{aligned}

Now $vec ({\hat{B}}_{R L S})$ can be written as

vec ({\hat{B}}_{R L S}) = vec ({\hat{B}}^{*}) - [(I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{- 1} \hat{Ψ} (I - M_{J}) \otimes I_{p}] vec ({\hat{B}}^{*}),

which leads to

{\hat{B}}_{R L S} = {\hat{B}}^{*} - {\hat{B}}^{*} (I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{- 1} \hat{Ψ} (I - M_{J}) .

(7)
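The step from the vectorized expression to (7) uses the standard identity $vec(A X B) = (B^{T} \otimes A) vec(X)$. As a short worked step, write $P$ for the symmetric $m \times m$ matrix $(I - M_{J}) {\hat{Ψ}}^{T} \{\hat{Ψ} (I - M_{J}) {\hat{Ψ}}^{T}\}^{-1} \hat{Ψ} (I - M_{J})$; the symbol $P$ is shorthand introduced only for this illustration.

```latex
\begin{align*}
(P \otimes I_{p}) \, \mathrm{vec}(\hat{B}^{*})
  &= \mathrm{vec}(I_{p} \hat{B}^{*} P^{T})
  && \text{since } \mathrm{vec}(A X B) = (B^{T} \otimes A) \, \mathrm{vec}(X) \\
  &= \mathrm{vec}(\hat{B}^{*} P)
  && \text{since } P^{T} = P .
\end{align*}
```

Removing the $vec$ operator on both sides then gives ${\hat{B}}_{R L S} = {\hat{B}}^{*} - {\hat{B}}^{*} P$.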

Clearly, $\hat{B}$ in (6) and ${\hat{B}}_{R L S}$ in (7) are identical if $\hat{Ψ} M_{J} = 0$. Hence the proposed method and the restricted least-squares method are identical if the row means of $\hat{Ψ}$ are zero. Gene expression data are commonly normalized so that the row means of $Y$ are equal (Bolstad et al., 2003); this makes the row means of the residual matrix $R$, and consequently the row means of $\hat{Ψ}$, all zero. Thus, this zero-mean condition is easily satisfied by gene expression data. If $\hat{Ψ} M_{J} \neq 0$, then $\hat{B}$ and ${\hat{B}}_{R L S}$ will be different.
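The effect of normalization on the two estimators can be illustrated numerically. In the sketch below, each row (sample) of a simulated $Y$ is shifted to have the same mean, which stands in for the normalization step; the data, the common mean of 5 and the number of surrogate variables are illustrative assumptions. With an intercept in $X$, the row means of $\hat{Ψ}$ vanish and expressions (6) and (7) coincide.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, q = 40, 300, 2
X = np.column_stack([np.ones(n), rng.normal(size=n)])    # includes an intercept

# Noise plus one low-rank component, then force equal row means (here 5.0).
Y = rng.normal(size=(n, m)) + np.outer(rng.normal(size=n), rng.normal(size=m))
Y = Y - Y.mean(axis=1, keepdims=True) + 5.0

B_star = np.linalg.solve(X.T @ X, X.T @ Y)               # naive estimate
R = Y - X @ B_star                                       # residual matrix
U_q = np.linalg.svd(R, full_matrices=False)[0][:, :q]    # first q left-singular vectors
Psi_hat = U_q.T @ Y                                      # q x m

I_minus_MJ = np.eye(m) - np.full((m, m), 1.0 / m)        # centring across features
G_inv = np.linalg.inv(Psi_hat @ I_minus_MJ @ Psi_hat.T)
core = B_star @ I_minus_MJ @ Psi_hat.T @ G_inv

B_eq6 = B_star - core @ Psi_hat                          # expression (6)
B_eq7 = B_star - core @ Psi_hat @ I_minus_MJ             # expression (7)
row_means_psi = Psi_hat.mean(axis=1)

print("row means of Psi_hat:", row_means_psi)
print("max |(6) - (7)|     :", np.abs(B_eq6 - B_eq7).max())
```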

Since $\hat{C}$ is a random matrix, our procedure is not the same as the restricted least-squares procedure, which assumes that the restriction matrix is fixed. However, the discussion above highlights the similarity between the two approaches. The restricted least-squares estimator can have smaller mean squared error than the ordinary least-squares estimator if the restriction is satisfied (Greene & Seaks, 1991). From our simulations, we observe that our estimators tend to have smaller mean squared errors than the estimators from the true regression model, where the restriction is not utilized.

3. Numerical studies

3.1. Simulation studies

We performed simulations to compare the proposed approach with existing methods in a wide range of scenarios. For each simulated dataset, 5000 features and 100 samples were generated from the regression model

y_{j i} = β_{i} x_{j} + z_{j}^{T} δ_{i} + ϵ_{j i} (j = 1, \dots, 100; i = 1, \dots, 5000),

where $ϵ_{j i}$ was generated from $N (0, σ_{i}^{2})$ with $σ_{i}^{2}$ following an inverse gamma distribution, which yields and .

A total of 864 simulation settings are summarized in Table 1.

Table 1.

Simulation parameters; the total number of simulation settings is 864

Parameter	Values
$μ_{z}$	0, , 1
Percentage of nonzero $δ$	10%, 20%, 40%, 60%
Percentage of nonzero $β$	5%, 10%, 20%
Overlap among nonzero $δ$	total overlap, independent
Number of hidden factors	2, 4
Type of $x$	binary, continuous
Correlation between nonzero $β$ and nonzero $δ$	0, ,

The binary and continuous $x_{j}$ were simulated respectively from

x_{j} = \begin{cases} 0, & j \leq 50, \\ 1, & j > 50, \end{cases} \qquad x_{j} \sim \begin{cases} N (- 1, 0 \cdot 5), & j \leq 50, \\ N (1, 0 \cdot 5), & j > 50, \end{cases}

and the first two hidden factors were simulated from

z_{j 1} \sim \begin{cases} N (μ_{z}, 1), & j \leq 50, \\ N (- μ_{z}, 1), & j > 50, \end{cases} \qquad z_{j 2} \sim Ber (0 \cdot 5) .

When the number of hidden factors is four, i.e., $q = 4$, $z_{j 3}$ and $z_{j 4}$ were independently generated from . The parameter $μ_{z}$ determines the correlation between the primary variable, $x_{j}$, and the first hidden factor, $z_{j 1}$. Three different values of $μ_{z}$ were considered: 0, and 1. The regression coefficients $β_{i}$ and $δ_{i}$ were generated from the distribution

\begin{pmatrix} β_{i} \\ δ_{i 1} \\ δ_{i 2} \\ \vdots \\ δ_{i q} \end{pmatrix} = \begin{pmatrix} a_{i 0} ζ_{i 0} \\ a_{i 1} ζ_{i 1} \\ a_{i 2} ζ_{i 2} \\ \vdots \\ a_{i q} ζ_{i q} \end{pmatrix}, \qquad \begin{pmatrix} a_{i 0} \\ a_{i 1} \\ a_{i 2} \\ \vdots \\ a_{i q} \end{pmatrix} \sim N \left\{0, \begin{pmatrix} 1 & ρ & 0 & \dots & 0 \\ ρ & 1 & 0 & \dots & 0 \\ 0 & 0 & 1 & \dots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \dots & 1 \end{pmatrix}\right\},

where $ζ_{i 0}, \dots, ζ_{i q}$ are indicator variables, 0 or 1, that determine which of the $β_{i}$ and $δ_{i l}$ have nonzero values. To mimic real biological data where the primary variables and hidden factors are not associated with all features, we assumed that 5%, 10% or 20% of the $β_{i}$ were nonzero and that 10%, 20%, 40% or 60% of the $δ_{i l}$ were nonzero. The value of $ζ_{i 0}$ was independently assigned. For $ζ_{i 1}, \dots, ζ_{i q}$, we considered situations in which nonzero values were totally overlapping, $ζ_{i 1} = \dots = ζ_{i q}$, or independently selected. In the first situation, each feature either had no associated hidden factors or was associated with all hidden factors. The correlation between the nonzero $β_{i}$ and the nonzero $δ_{i 1}$, namely $ρ$, was set to 0, and , representing scenarios in which Condition 5 ranged from being satisfied to severely violated.
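A reduced version of this data-generating scheme can be sketched as follows. Several details are assumptions made for the example rather than settings from the paper: only two hidden factors are generated, the noise variance is fixed at one instead of being drawn from an inverse gamma distribution, the second argument of $N (\cdot, 0 \cdot 5)$ is read as a variance, the indicators are drawn with exact counts and independently across coefficients, and the default values of `mu_z` and `rho` are arbitrary. The function name `simulate_dataset` is hypothetical.

```python
import numpy as np

def simulate_dataset(rng, m=5000, n=100, mu_z=1.0, rho=0.0,
                     frac_beta=0.10, frac_delta=0.20, binary_x=True):
    """Simulate Y (n x m) from y_ji = beta_i x_j + z_j^T delta_i + eps_ji with q = 2."""
    half = n // 2
    if binary_x:
        x = np.repeat([0.0, 1.0], half)            # 0 for j <= 50, 1 for j > 50
    else:
        # 0.5 is treated as a variance here (assumption).
        x = np.concatenate([rng.normal(-1.0, np.sqrt(0.5), half),
                            rng.normal(1.0, np.sqrt(0.5), half)])

    # First hidden factor shifts with the group; second is Bernoulli(0.5).
    z1 = np.concatenate([rng.normal(mu_z, 1.0, half),
                         rng.normal(-mu_z, 1.0, half)])
    z2 = rng.binomial(1, 0.5, n).astype(float)
    Z = np.column_stack([z1, z2])

    # (a_i0, a_i1, a_i2): unit variances, correlation rho between a_i0 and a_i1.
    cov = np.array([[1.0, rho, 0.0], [rho, 1.0, 0.0], [0.0, 0.0, 1.0]])
    a = rng.multivariate_normal(np.zeros(3), cov, size=m)

    # Indicators zeta: fixed numbers of nonzero entries, selected independently.
    zeta = np.zeros((m, 3))
    zeta[rng.choice(m, int(round(frac_beta * m)), replace=False), 0] = 1.0
    for col in (1, 2):
        zeta[rng.choice(m, int(round(frac_delta * m)), replace=False), col] = 1.0

    coef = a * zeta
    beta, delta = coef[:, 0], coef[:, 1:]          # shapes (m,) and (m, 2)

    eps = rng.normal(size=(n, m))                  # unit noise variance (simplification)
    Y = np.outer(x, beta) + Z @ delta.T + eps
    return Y, x, Z, beta, delta

Y_sim, x_sim, Z_sim, beta_sim, delta_sim = simulate_dataset(
    np.random.default_rng(3), m=200, n=100)
print(Y_sim.shape, np.count_nonzero(beta_sim), np.count_nonzero(delta_sim[:, 0]))
```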

Nine different methods were compared: direct surrogate variable analysis; regression model (1), where the hidden factors are assumed to be known and included in the analysis; a no-adjustment model, i.e., regression model (2), where the hidden factors are ignored in the analysis; the iteratively reweighted surrogate variable analysis of Leek & Storey (2008); the two-step surrogate variable analysis of Leek & Storey (2007); principal component analysis on the residuals; principal component analysis on the original measurements of the features; latent effect adjustment after primary projection (Sun et al., 2012); and four-step remove unwanted variation (Gagnon-Bartsch et al., 2017). Latent effect adjustment after primary projection uses an outlier detection approach after initial data projection to adjust for hidden factors. We treated the second method as a gold standard. In both principal component analyses, top principal components were selected and treated as surrogate variables.

For the four-step remove unwanted variation method, we assumed that 6% of features were negative control genes, close to the proportion of housekeeping genes in the genome (Gagnon-Bartsch & Speed, 2012). We considered situations in which negative control genes were selected only among features with $β_{i} = 0$, i.e., high-quality control genes, or were randomly selected among all features, i.e., poor-quality control genes. In the second case, the assumption of negative control genes was violated. A method to estimate the number of surrogate variables for the four-step remove unwanted variation method has been developed (Gagnon-Bartsch et al., 2017). We used this method in conjunction with the method of Buja & Eyuboglu (1992) to estimate $q$ for the four-step remove unwanted variation method; for all other methods, the approach of Buja & Eyuboglu (1992) was used to estimate $q$.
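A permutation-based estimate of the number of surrogate variables can be sketched as follows. This is a simplified parallel-analysis-style procedure in the spirit of Buja & Eyuboglu (1992), not the exact procedure used in the paper or in the dSVA package: the function name, the number of permutations, the significance level and the stopping rule are all assumptions made for illustration.

```python
import numpy as np

def estimate_q_by_permutation(R, n_perm=20, alpha=0.05, rng=None):
    """Count leading singular values of R that exceed their permutation null."""
    rng = np.random.default_rng() if rng is None else rng
    observed = np.linalg.svd(R, compute_uv=False)
    exceed = np.zeros(observed.size)
    for _ in range(n_perm):
        # Shuffle the samples independently within each feature (column),
        # which breaks the shared structure across features.
        permuted = rng.permuted(R, axis=0)
        null = np.linalg.svd(permuted, compute_uv=False)
        exceed += null >= observed
    p_values = (exceed + 1.0) / (n_perm + 1.0)

    # Stop at the first component that is not significant.
    q = 0
    for p_value in p_values:
        if p_value > alpha:
            break
        q += 1
    return q

# Toy matrix with two strong planted factors plus unit-variance noise.
rng = np.random.default_rng(4)
n, m = 40, 300
signal = rng.normal(size=(n, 2)) @ (3.0 * rng.normal(size=(2, m)))
R_toy = signal + rng.normal(size=(n, m))
q_hat = estimate_q_by_permutation(R_toy, rng=rng)
print("estimated number of factors:", q_hat)
```

In the setting of this paper the input would be the residual matrix from regressing the features on the observed covariates.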

For each simulation set-up, 200 datasets were generated and the performance of each method was evaluated based on (i) empirical false discovery rates, where the significant findings were determined by the Benjamini & Hochberg (1995) procedure for a targeted false discovery rate of ; (ii) the mean squared errors of the ${\hat{β}}_{i}$; and (iii) the area under the receiver operating characteristic curve. For calculation of the false discovery rate and the area under the receiver operating characteristic curve, we define true and false positives as follows. If $β_{i} \neq 0$ and a statistical test for $β_{i} = 0$ is significant after applying the Benjamini–Hochberg procedure, it is a true positive. If the test is significant when $β_{i} = 0$, it is a false positive. In addition to the mean false discovery rates, we calculated the proportion of datasets with an empirical false discovery rate greater than .
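The evaluation criteria can be made concrete with a short sketch. The functions below implement the Benjamini–Hochberg step-up rule, the empirical false discovery proportion for one dataset, and a rank-based area under the receiver operating characteristic curve. The function names, the toy p-values and the level 0.05 are illustrative assumptions; the target level used in the paper is not recoverable from the text above.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    below = p[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest rank meeting its threshold
        reject[order[: k + 1]] = True
    return reject

def empirical_fdr(reject, beta_true):
    """False positives (beta = 0 but rejected) over the number of rejections."""
    reject = np.asarray(reject, dtype=bool)
    false_pos = np.sum(reject & (np.asarray(beta_true) == 0))
    return false_pos / max(reject.sum(), 1)

def auc(scores, beta_true):
    """Probability that a feature with nonzero beta outscores one with zero beta."""
    scores = np.asarray(scores, dtype=float)
    truth = np.asarray(beta_true) != 0
    diff = scores[truth][:, None] - scores[~truth][None, :]
    return float(np.mean((diff > 0) + 0.5 * (diff == 0)))

# Toy example with ten features, two of which have nonzero beta.
p_toy = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
beta_toy = [1.0, 0.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
reject_toy = benjamini_hochberg(p_toy, alpha=0.05)
fdr_toy = empirical_fdr(reject_toy, beta_toy)
auc_toy = auc(-np.log10(p_toy), beta_toy)
print(reject_toy.sum(), fdr_toy, auc_toy)
```

Averaging `empirical_fdr` over the datasets of one simulation setting would give the mean empirical false discovery rate for that setting.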

Figure 1 shows simulation results from a scenario where Condition 5 was satisfied, i.e., $ρ = 0$. Direct surrogate variable analysis performed well, as the observed area under the receiver operating characteristic curve and the mean squared errors are similar to those obtained from the approach assuming that the hidden factors are known, and the observed false discovery rates are only slightly inflated in a few simulation settings. Among 288 simulation settings, only four had mean false discovery rates higher than . As expected, the no-adjustment and principal component analysis-based approaches performed very poorly. When the negative control gene assumption was satisfied, the remove unwanted variation method performed only slightly worse than direct surrogate variable analysis: in ten of the simulation settings, the mean empirical false discovery rates were larger than ; however, when this assumption was violated, the remove unwanted variation method had substantially inflated false discovery rates. The latent effect adjustment after primary projection method had mean empirical false discovery rates above in some simulation settings. Since this method was not developed to estimate $β_{i}$, we did not obtain the mean squared errors. When the method developed for four-step remove unwanted variation was used to estimate the number of surrogate variables, the overall performance of the remove unwanted variation method declined substantially, indicating that the method of Buja & Eyuboglu (1992) performs better in estimating the number of surrogate variables. We compared different approaches to estimating $q$, and the method of Buja & Eyuboglu (1992) outperformed the others; see the Supplementary Material. In Fig. 2, we directly compare the two top-performing approaches: direct surrogate variable analysis and the four-step remove unwanted variation method with high-quality control genes. Direct surrogate variable analysis clearly does better in controlling the false discovery rates.

Fig. 1. Comparisons of the proposed and competing methods when $ρ = 0$. Each bar summarizes results from 288 different simulation settings, and in each setting 200 datasets were generated to calculate: (a) mean empirical false discovery rates, FDR; (b) the proportion of datasets with empirical FDR higher than ; (c) the mean area under the receiver operating characteristic curve, AUC; and (d) the mean squared errors, MSE. The methods compared are: dSVA, direct surrogate variable analysis; KnownZ, hidden factors known and included in the model; NoAdj, no adjustment for hidden factors; IRW, iteratively reweighted surrogate variable analysis; 2-SVA, two-step surrogate variable analysis; rPCA, principal component analysis on the residuals; PCA, principal component analysis on the original measured features; LEAPP, latent effect adjustment after primary projection; RUV4-High, four-step remove unwanted variation method with high-quality control genes; RUV4-Poor, four-step remove unwanted variation method with poor-quality control genes; RUV4-High2, RUV4-High with $q$ from Gagnon-Bartsch et al. (2017); RUV4-Poor2, RUV4-Poor with $q$ from Gagnon-Bartsch et al. (2017).

Fig. 2. Comparison of direct surrogate variable analysis, dSVA, and the four-step remove unwanted variation method with high-quality control genes, RUV4-High, when $ρ = 0$: (a) mean empirical false discovery rate; (b) mean area under the receiver operating characteristic curve; (c) mean squared errors.

To investigate the effect of each simulation parameter on the performance of the methods when $ρ = 0$, we created plots for each parameter value. Since the no-adjustment and principal component analysis-based methods performed substantially worse than the other methods, we did not include them in these plots. Among the parameters, $μ_{z}$ and the percentage of nonzero $δ$ had large effects on the performance of some methods. Figure 3(a) shows boxplots of the false discovery rates with different $μ_{z}$ values. The iteratively reweighted and two-step surrogate variable analysis approaches had well-controlled false discovery rates at the two smaller values of $μ_{z}$, but had inflated rates when $μ_{z} = 1$. Therefore, these two methods cannot efficiently estimate the hidden factors in the presence of a strong correlation between $x$ and $z$. Figure 3(b) shows that when the percentage of nonzero $δ$ was , direct surrogate variable analysis had slightly inflated false discovery rates, perhaps because direct surrogate variable analysis uses all features, instead of selecting features with nonzero $δ$. Since the four-step remove unwanted variation approach uses a small fraction of features to estimate the hidden factors, it had more inflated false discovery rates when the percentage of nonzero $δ$ was small. The performances of the different methods in terms of areas under the receiver operating characteristic curves and mean squared errors were largely similar.

Fig. 3. Comparison of mean empirical false discovery rates when $ρ = 0$ for: (a) the three values of $μ_{z}$ in Table 1; and (b) different proportions of nonzero $δ$: 10%, 20%, 40% or 60%. In each simulation setting, 200 datasets were generated to obtain the mean empirical false discovery rates. The methods compared are: dSVA, direct surrogate variable analysis; IRW, iteratively reweighted surrogate variable analysis; 2-SVA, two-step surrogate variable analysis; LEAPP, latent effect adjustment after primary projection; RUV4-High, four-step remove unwanted variation method with high-quality control genes; RUV4-Poor, four-step remove unwanted variation method with poor-quality control genes.

Additional simulation results are presented in the Supplementary Material. Our proposed approach was observed to perform well even when $q$ was overestimated and Condition 5 was moderately violated. Overall, our simulation study shows that the proposed method can outperform existing methods in diverse scenarios.

3.2. Application to real data

We downloaded the Hapmap dataset GSE5859 from the National Center for Biotechnology Information gene expression omnibus website to investigate differentially expressed genes between European and Asian populations (Spielman et al., 2007). This dataset contains 8793 genes, or features, and 208 samples from three continental populations: 102 European, 65 Chinese, and 41 Japanese. The affy package (Gautier et al., 2004) was used for background correction and quantile normalization (Bolstad et al., 2003). In the Supplementary Material we perform an analysis without quantile normalization as a sensitivity analysis. Similar to the original study, we restricted the analysis to 4044 reliably expressed genes in at least 80% of the samples in one population (Spielman et al., 2007).

The original study showed that nearly 70% of genes were differentially expressed across the European and Asian samples (Spielman et al., 2007), but it was subsequently discovered that the calendar year in which each sample was processed was a strong confounding factor (Akey et al., 2007; Leek et al., 2010), and many of the positive findings could potentially be false. In this analysis, we considered a scenario where the researchers did not record the calendar year of sample collection and investigated whether the proposed surrogate variable analysis could capture the year effect. We treated year as a categorical response variable and estimated the proportion of variability that can be explained by the surrogate variables.

Table 2 shows the proportion of the variability explained by surrogate variables estimated by four different methods. Since the estimated variability would increase with the number of surrogate variables, for a fair comparison we used the same value of $q$ for all the methods, which was estimated by the method of Buja & Eyuboglu (1992). Both direct surrogate variable analysis and the remove unwanted variation method performed well, as 70% and 73% of the variability was explained by the surrogate variables estimated from these respective methods. In contrast, the surrogate variables from the iteratively reweighted and two-step surrogate variable analysis approaches explained only 41% and 64% of the variability in year, respectively. We also considered different combinations of the populations. Direct surrogate variable analysis and the four-step remove unwanted variation method again consistently outperformed the other methods.

Table 2.

Proportion of variability in year explained by the estimated surrogate variables; for a fair comparison, the same number of surrogate variables was used in all the methods

Type	Number of surrogate variables	dSVA	IRW	2-SVA	RUV4
EUR vs (JPT + CHI)	25
JPT vs (EUR + CHI)	25
CHI vs (EUR + JPT)	25
JPT vs CHI	16

EUR, JPT and CHI, individuals of European, Japanese and Chinese ancestry, respectively; dSVA, direct surrogate variable analysis; IRW, iteratively reweighted surrogate variable analysis; 2-SVA, two-step surrogate variable analysis; RUV4, four-step remove unwanted variation method.

Without any hidden variable adjustment, 73% and 65% of genes were found to be differentially expressed between the European and Asian populations at false discovery rates of Inline graphic and , respectively. As pointed out elsewhere, it seems implausible that so many genes would be differentially expressed between the two populations (Akey et al., 2007). When direct surrogate variable analysis was applied, only 29% and 18% of genes were found to be significant at false discovery rates of Inline graphic and , respectively. Li et al. (2010) have reported that approximately 20% of genes in lymphoblastoid cell lines are differentially expressed between Hapmap2 European and African samples at a false discovery rate of . Given that the genetic difference between the European and African populations is greater than that between the European and Asian populations, 18% of genes differentially expressed between the European and Asian populations seems a reasonable estimate.

When we applied two-step surrogate variable analysis and the four-step remove unwanted variation method to the Hapmap data, 15% and 18% of genes, respectively, were declared to be differentially expressed between the European and Asian populations at false discovery rate Inline graphic . In contrast, 65% of genes were found to be significant by iteratively reweighted surrogate variable analysis at the same false discovery rate, indicating that this method fails to identify the effects of the hidden factors. When we included year as a covariate in the regression analysis, only 28 genes, i.e., Inline graphic of the tested genes, were significant at false discovery rate , because year was nearly nested within each population. All Asian samples were processed in 2005 and 2006, but only three European samples were processed in those two years. Among these 28 genes, 15 were significant according to direct surrogate variable analysis. On the other hand, 12 and 14 genes, respectively, were significant by two-step surrogate variable analysis and the four-step remove unwanted variation method.

We carried out an additional analysis using the same dataset to identify genes differentially expressed by gender. Since genes in sex chromosomes can be used as positive control genes, this additional analysis can be used to directly evaluate the performance of each method. The results show that our method had comparable or slightly better performance than the competing methods; see the Supplementary Material for details.

4. Discussion and conclusion

Surrogate variable analysis was originally proposed for gene expression data, but it has since been applied to epigenetic data as well (Teschendorff et al., 2011; Maksimovic et al., 2015). Recently, surrogate variable analysis has been extended to prediction and clustering problems. For example, Parker et al. (2014) developed frozen surrogate variable analysis to remove batch effects for prediction problems, and Jacob et al. (2016) extended the remove unwanted variation method to unsupervised learning. Direct surrogate variable analysis was mainly developed for differential expression analysis, but it can be extended to other types of -omics data, as well as to prediction problems, by adopting the approaches used in frozen surrogate variable analysis. We leave such extensions for future research.

One key assumption of the proposed method is Condition 5, which requires that the vector of primary-variable effect sizes across genes and the vector of hidden-factor effect sizes across genes be asymptotically orthogonal after mean centring. We think that this is reasonable for many biomedical datasets. In our real-data analysis, for example, batch effects are purely technical issues and their effect sizes would not be correlated with those of population differences. Moreover, our method is robust with respect to moderate violations of this condition. In simulation studies, for instance, our method shows better false discovery rate control than the competing methods when the condition is moderately violated. A similar assumption is implicitly used in existing methods. For example, in their simulation studies, Sun et al. (2012) generated the two sets of effect sizes independently. They also suggested that when the two are correlated, identification becomes difficult.
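To make the orthogonality requirement in Condition 5 concrete, the sketch below computes the cosine of the angle between two effect-size vectors after each has been mean centred; a value near zero corresponds to near-orthogonality. This is only one plausible finite-sample reading of the condition, not its formal statement, and the names `primary_effects` and `hidden_effects` are hypothetical.

```python
import math


def centred_cosine(primary_effects, hidden_effects):
    """Cosine between two per-gene effect-size vectors after mean centring.

    Both arguments are hypothetical inputs: one effect size per gene.
    """
    p = len(primary_effects)
    if p != len(hidden_effects):
        raise ValueError("both vectors must have one entry per gene")
    mean_a = sum(primary_effects) / p
    mean_b = sum(hidden_effects) / p
    a = [v - mean_a for v in primary_effects]
    b = [v - mean_b for v in hidden_effects]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("a constant vector has no direction after centring")
    return dot / (norm_a * norm_b)
```

In practice the true effect sizes are unknown, so a check of this kind is only possible in simulations, where both vectors are generated by the analyst.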

Principal component analysis was used to correct for batch effects and the effects of hidden confounders prior to the introduction of surrogate variable analysis. This approach has proven very successful for genome-wide association studies (Price et al., 2006). However, our simulation results show that naïve use of principal components for hidden factor adjustment can result in severe power loss, because the top principal components identified can be highly correlated with the primary variables when the effects of the primary variables are not too weak. When this is the case, principal component analysis should be avoided.
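The point about the top principal components tracking the primary variable can be illustrated with a small toy simulation. The sketch below is not the simulation design used in the paper: the sample size, number of genes, effect-size scales and the helper names are all assumptions chosen for illustration. The hidden factor is constructed to be exactly uncorrelated with the primary variable, so any correlation between the first principal component and the primary variable arises from the primary effects alone.

```python
import math
import random


def pearson(a, b):
    """Pearson correlation between two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def top_pc_scores(data, n_iter=200):
    """First principal component scores (one per sample) by power iteration.

    `data` is a list of genes; each gene is a list of n sample values.
    """
    n = len(data[0])
    centred = []
    for row in data:
        mean = sum(row) / n
        centred.append([v - mean for v in row])  # centre each gene across samples
    # n x n Gram matrix between samples; its top eigenvector gives the PC1 scores.
    gram = [[sum(r[i] * r[j] for r in centred) for j in range(n)] for i in range(n)]
    v = [math.sin(i + 1.0) for i in range(n)]  # arbitrary deterministic start vector
    for _ in range(n_iter):
        w = [sum(gram[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def simulate(primary_sd, n_genes=200, n_samples=40, seed=1):
    """Toy data: gene = beta * x + delta * z + noise, with beta ~ N(0, primary_sd^2)."""
    rng = random.Random(seed)
    half = n_samples // 2
    x = [0.0] * half + [1.0] * half  # primary variable: two equal groups
    # Hidden factor alternates in sign, so it is balanced within each group.
    z = [1.0 if j % 2 == 0 else -1.0 for j in range(n_samples)]
    data = []
    for _ in range(n_genes):
        beta = rng.gauss(0.0, primary_sd)
        delta = rng.gauss(0.0, 0.5)
        data.append([beta * x[j] + delta * z[j] + rng.gauss(0.0, 1.0)
                     for j in range(n_samples)])
    return x, data


if __name__ == "__main__":
    for label, sd in (("strong", 2.0), ("weak", 0.1)):
        x, data = simulate(sd)
        r = abs(pearson(top_pc_scores(data), x))
        print(f"{label} primary effects: |corr(PC1, x)| = {r:.2f}")
```

When the first principal component is nearly collinear with the primary variable, including it as a covariate absorbs most of the signal of interest, which is the mechanism behind the power loss described above.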

Acknowledgement

We thank the editor, associate editor and referees for their valuable comments and suggestions, which have greatly helped to improve the quality of the paper. This research was supported by the U.S. National Institutes of Health.

Supplementary material

Supplementary material available at Biometrika online includes proofs of the theoretical results as well as additional simulation and real-data analysis results.

References

Akey J. M., Biswas S., Leek J. T. & Storey J. D. (2007). On the design and analysis of gene expression studies in human populations. Nature Genet. 39, 807–8.
Benjamini Y. & Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300.
Bolstad B. M., Irizarry R. A., Åstrand M. & Speed T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–93.
Buja A. & Eyuboglu N. (1992). Remarks on parallel analysis. Mult. Behav. Res. 27, 509–40.
Chakraborty S., Datta S. & Datta S. (2012). Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies. Bioinformatics 28, 799–806.
Dumeaux V., Olsen K. S., Nuel G., Paulssen R. H., Børresen-Dale A. L. & Lund E. (2010). Deciphering normal blood gene expression variation—The NOWAC postgenome study. PLoS Genet. 6, e1000873.
Friguet C., Kloareg M. & Causeur D. (2009). A factor model approach to multiple testing under dependence. J. Am. Statist. Assoc. 104, 1406–15.
Gagnon-Bartsch J. A., Jacob L. & Speed T. P. (2017). Removing Unwanted Variation: Exploiting Negative Controls for High Dimensional Data Analysis. IMS Monographs. Cambridge: Cambridge University Press, in press.
Gagnon-Bartsch J. A. & Speed T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–52.
Gautier L., Cope L., Bolstad B. M. & Irizarry R. A. (2004). Affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–15.
Greene W. H. & Seaks T. G. (1991). The restricted least squares estimator: A pedagogical note. Rev. Econ. Statist. 73, 563–7.
Jacob L., Gagnon-Bartsch J. A. & Speed T. P. (2016). Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28.
Johnstone I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29, 295–327.
Johnstone I. M. & Lu A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Am. Statist. Assoc. 104, 682–93.
Jung S. & Marron J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37, 4104–30.
Lee S., Zou F. & Wright F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Ann. Statist. 38, 3605–29.
Lee S., Zou F. & Wright F. A. (2014). Convergence of sample eigenvalues, eigenvectors, and principal component scores for ultra-high dimensional data. Biometrika 101, 484–90.
Leek J. T. (2011). Asymptotic conditional singular value decomposition for high-dimensional genomic data. Biometrics 67, 344–52.
Leek J. T., Scharpf R. B., Bravo H. C., Simcha D., Langmead B., Johnson W. E., Geman D., Baggerly K. & Irizarry R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet. 11, 733–9.
Leek J. T. & Storey J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161.
Leek J. T. & Storey J. D. (2008). A general framework for multiple testing dependence. Proc. Nat. Acad. Sci. 105, 18718–23.
Li J., Liu Y., Kim T., Min R. & Zhang Z. (2010). Gene expression variability within and between human populations and implications toward disease susceptibility. PLoS Comp. Biol. 6, e1000910.
Listgarten J., Kadie C., Schadt E. E. & Heckerman D. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proc. Nat. Acad. Sci. 107, 16465–70.
Maksimovic J., Gagnon-Bartsch J. A., Speed T. P. & Oshlack A. (2015). Removing unwanted variation in a differential methylation analysis of Illumina HumanMethylation450 array data. Nucleic Acids Res. 43, e106.
Parker H. S., Bravo H. C. & Leek J. T. (2014). Removing batch effects for prediction problems with frozen surrogate variable analysis. PeerJ 2, e561.
Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A. & Reich D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–9.
R Development Core Team (2017). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
Spielman R. S., Bastone L. A., Burdick J. T., Morley M., Ewens W. J. & Cheung V. G. (2007). Common genetic variants account for differences in gene expression among ethnic groups. Nature Genet. 39, 226–31.
Stegle O., Parts L., Durbin R. & Winn J. (2010). A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comp. Biol. 6, e1000770.
Sun Y., Zhang N. R. & Owen A. B. (2012). Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Statist. 6, 1664–88.
Teschendorff A. E., Menon U., Gentry-Maharaj A., Ramus S. J., Weisenberger D. J., Shen H., Campan M., Noushmehr H., Bell C. G., Maxwell A. P. et al. (2010). Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 20, 440–6.
Teschendorff A. E., Zhuang J. & Widschwendter M. (2011). Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics 27, 1496–505.
