2017 Apr 21;104(2):303–316. doi: 10.1093/biomet/asx018

An improved and explicit surrogate variable analysis procedure by coefficient adjustment

Seunggeun Lee 1, Wei Sun 2, Fred A Wright 3, Fei Zou 4
PMCID: PMC5627626  NIHMSID: NIHMS857072  PMID: 29430031

Summary

Unobserved environmental, demographic and technical factors can adversely affect the estimation and testing of the effects of primary variables. Surrogate variable analysis, proposed to tackle this problem, has been widely used in genomic studies. To estimate hidden factors that are correlated with the primary variables, surrogate variable analysis performs principal component analysis either on a subset of features or on all features, but weighting each differently. However, existing approaches may fail to identify hidden factors that are strongly correlated with the primary variables, and the extra step of feature selection and weight calculation makes the theoretical investigation of surrogate variable analysis challenging. In this paper, we propose an improved surrogate variable analysis, using all measured features, that has a natural connection with restricted least squares, which allows us to study its theoretical properties. Simulation studies and real-data analysis show that the method is competitive with state-of-the-art methods.

Keywords: Batch effect, High-dimensional data, Principal component analysis, Surrogate variable analysis

1. Introduction

In regression analysis, the existence of unobserved factors can cause biases in estimating parameters. Suppose that the true relationship in the data is

$$y = X\beta + Z\delta + \epsilon,$$

where $y$ is a vector of outcome measurements, $X$ is a matrix of the observed covariates including the primary variables, and $Z$ is a matrix of the unobserved factors. We are interested in estimating the regression parameter $\beta$. Since $Z$ is not observed, in practice we use the misspecified model

$$y = X\beta + \epsilon,$$

which can negatively impact inference on $\beta$.

With the development of high-throughput technologies in biomedical sciences, high-dimensional data are routinely collected and analysed to find biologically meaningful features. Unobserved factors can cause adverse effects, including inflation of Type I error and/or power loss (Stegle et al., 2010). Although in practice great efforts are made to control confounders, such efforts may be insufficient to avoid all confounding issues (Leek et al., 2010).

Principal component analysis on the original or residualized features after removing the effects of observed dependent variables has often been used to identify hidden factors, and has been successful in identifying and controlling for population stratification in genome-wide association studies (Price et al., 2006). However, principal component analysis-based approaches are less effective for gene expression studies, where the hidden factors can affect a subset of features with relatively large effects (Leek & Storey, 2007). To overcome this limitation, surrogate variable analysis has been proposed (Leek & Storey, 2007, 2008; Teschendorff et al., 2011; Chakraborty et al., 2012) for microarray data. Leek & Storey (2007) initially developed a two-step approach which involves first identifying a subset of features that may be affected by hidden factors but not by primary variables, and then performing principal component analysis on the selected features. Later, they modified the approach to a weighted principal component analysis, where each feature is weighted according to its probability of being affected by the hidden factors only (Leek & Storey, 2008). Surrogate variable analysis has been extended to factor analysis (Friguet et al., 2009) and mixed-effect models (Listgarten et al., 2010). Recently, assuming that negative control genes are known, Gagnon-Bartsch & Speed (2012) proposed a surrogate variable method.

Surrogate variable analysis has been successfully applied to many genomic studies (Dumeaux et al., 2010; Teschendorff et al., 2010), but existing methods may fail to identify hidden factors. Strong correlation between hidden factors and primary variables can prevent the two-step and weighted principal component-based surrogate variable methods from identifying features that are affected by hidden factors only. If negative control genes are affected by primary variables or if the observed variation in negative control genes does not reflect unwanted variations in the entire genome, the methods for removing unwanted variation can also fail to identify true hidden factors.

In this paper, we propose a simple and straightforward method for identifying hidden factors and adjusting for their effects. Our approach, called direct surrogate variable analysis, is based on the observation that naïve estimators of the effects of the primary variables are biased when the effects of hidden factors are ignored in the analysis, but the bias can be estimated and removed using singular value decomposition on residuals. We derive the asymptotic properties of our estimators using techniques recently developed for the ultrahigh-dimensional regime (Lee et al., 2014) and the connection between our estimating procedure and the restricted least-squares method (Greene & Seaks, 1991). An R package (R Development Core Team, 2017) implementing the proposed approach, dSVA, can be downloaded from the comprehensive R archive network.

2. Methods

2.1. Direct surrogate variable analysis

Suppose that $Y$ is an $n \times m$ matrix of measured features, where $m$ is the number of features and $n$ is the number of samples. For gene expression data, $Y$ represents RNA expression levels on $m$ genes. Further, suppose that $X$ is an $n \times p$ matrix of observed covariates, including an intercept, and $Z$ is an $n \times q$ matrix of unobserved hidden factors. The following model represents the true relationship between $Y$ and $(X, Z)$:

$$y_i = X\beta_i + Z\delta_i + \epsilon_i, \tag{1}$$

where $y_i$ denotes the $i$th column of $Y$, $\beta_i$ is a $p \times 1$ vector of regression coefficients associated with $X$, $\delta_i$ is a $q \times 1$ vector of regression coefficients associated with $Z$, and $\epsilon_i$ is an $n \times 1$ random vector which follows $N(0, \sigma_i^2 I)$. We further define $B = (\beta_1, \ldots, \beta_m)$ and $\Delta = (\delta_1, \ldots, \delta_m)$, which are $p \times m$ and $q \times m$ matrices of regression coefficients associated with $X$ and $Z$, respectively. In this model, $B$ and $\Delta$ are assumed to be fixed. Later, to generate large numbers of $\beta_i$ and $\delta_i$ values for the simulation studies, we use a specified correlation between $\beta_i$ and $\delta_i$. However, we emphasize that the proposed method is frequentist: $B$ and $\Delta$ are considered fixed and unknown.

In practice, since $Z$ is not observed, we effectively use the misspecified model

$$y_i = X\beta_i + \epsilon_i \tag{2}$$

instead of (1). Under (2), the least-squares estimator of $\beta_i$ is

$$\hat\beta_i = (X^T X)^{-1} X^T y_i = \beta_i + (X^T X)^{-1} X^T Z\delta_i + (X^T X)^{-1} X^T \epsilon_i, \tag{3}$$

with residual vector

$$r_i = (I - M) y_i = (I - M) Z\delta_i + (I - M)\epsilon_i, \tag{4}$$

where $M = X(X^T X)^{-1} X^T$ is the projection matrix onto the column space of $X$. Equations (3) and (4) indicate that $\hat\beta_i$ is a biased estimator of $\beta_i$ with bias $(X^T X)^{-1} X^T Z\delta_i$. The conditional mean of the residual vector given $Z$ is $(I - M) Z\delta_i$, which allows us to estimate $(I - M) Z$ via, for example, singular value decomposition.
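The decompositions in (3) and (4) are exact algebraic identities, which a small numerical check makes concrete. The sketch below uses NumPy on a single simulated feature; the sample size, coefficient values and the way the hidden factor is tied to the primary variable are illustrative assumptions, not settings taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, q = 100, 1

# Observed covariates: an intercept plus one primary variable (assumed toy design).
x = rng.normal(size=n)
X = np.column_stack([np.ones(n), x])
# One hidden factor, deliberately correlated with the primary variable.
Z = (0.8 * x + rng.normal(scale=0.6, size=n)).reshape(n, q)

beta = np.array([1.0, 2.0])     # illustrative true coefficients
delta = np.array([1.5])
eps = rng.normal(size=n)
y = X @ beta + Z @ delta + eps  # true model (1) for one feature

XtX_inv_Xt = np.linalg.solve(X.T @ X, X.T)  # (X^T X)^{-1} X^T
M = X @ XtX_inv_Xt                          # projection onto the column space of X
I = np.eye(n)

beta_hat = XtX_inv_Xt @ y        # naive estimator under the misspecified model (2)
bias = XtX_inv_Xt @ Z @ delta    # bias term in (3)
noise = XtX_inv_Xt @ eps         # noise term in (3)
resid = (I - M) @ y              # residual vector in (4)

decomp_ok = np.allclose(beta_hat, beta + bias + noise)
resid_ok = np.allclose(resid, (I - M) @ Z @ delta + (I - M) @ eps)
print("bias of naive estimator:", bias)
print("(3) holds:", decomp_ok, " (4) holds:", resid_ok)
```

Because the hidden factor is built to track the primary variable, the bias on the slope is far from zero, while the residual still carries the part of $Z\delta_i$ that is orthogonal to $X$.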

Suppose that singular value decomposition is performed on the residual matrix Inline graphic, where Inline graphic, with Inline graphic being a diagonal matrix of ordered singular values, and Inline graphic and Inline graphic being matrices of left- and right-singular vectors. The first Inline graphic left-singular vectors can be viewed as estimators of linear combinations of the columns of Inline graphic, which we denote by Inline graphic where Inline graphic, with Inline graphic being a Inline graphic orthonormal matrix. Let Inline graphic. For any Inline graphic, the matrices Inline graphic and Inline graphic have the same column space, so Inline graphic is identical to Inline graphic. With an additional assumption that the row vectors of Inline graphic and the row vectors of Inline graphic are asymptotically orthogonal after mean centring, we can estimate Inline graphic and use it to remove the bias in Inline graphic. The proposed method is as follows.

Step 1.

Carry out singular value decomposition on the residual matrix $R = (I - M)Y = UDV^T$. Let $U_q$ be the matrix comprising the first $q$ columns of $U$ that are equivalent to the $q$ left-singular vectors corresponding to the $q$ largest singular values.

Step 2.

Obtain $\hat\beta_i$ and $\hat\psi_i$ from the model $y_i = X\beta_i + U_q\psi_i + \epsilon_i$. Since $X$ and $U_q$ are orthogonal to each other, $\hat\beta_i$ from this model equals that from model (3).

Step 3.

Let $\hat B = (\hat\beta_1, \ldots, \hat\beta_m)$ and $\hat\Psi = (\hat\psi_1, \ldots, \hat\psi_m)$. We propose to estimate the surrogate variables $\Gamma$ as

$$\hat\Gamma = U_q + X\hat B (I - M_J)\hat\Psi^T \{\hat\Psi (I - M_J)\hat\Psi^T\}^{-1},$$

where $M_J = J(J^T J)^{-1} J^T$ is a projection matrix with $J = (1, \ldots, 1)^T$ of length $m$.

Step 4.

Estimate and test $\beta_i$ from the model

$$y_i = X\beta_i + \hat\Gamma\psi_i + \epsilon_i. \tag{5}$$

This method requires estimation of Inline graphic, the number of surrogate variables, which can be obtained by permutation (Buja & Eyuboglu, 1992) or by analytical-asymptotic approaches (Johnstone, 2001; Leek, 2011). In this paper, we use the method of Buja & Eyuboglu (1992) for all numerical work. Since Inline graphic and Inline graphic can be always rescaled, they are not identifiable, so we set Inline graphic, where Inline graphic is the Inline graphicth element of Inline graphic, and adjust Inline graphic to satisfy this restriction. In the Supplementary Material we show that Inline graphic from (5) in Step 4 is the same as

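$$\hat B_{\mathrm{dSVA}} = \hat B - \hat B (I - M_J)\hat\Psi^T \{\hat\Psi (I - M_J)\hat\Psi^T\}^{-1}\hat\Psi, \tag{6}$$

in which $\hat B (I - M_J)\hat\Psi^T \{\hat\Psi (I - M_J)\hat\Psi^T\}^{-1}\hat\Psi$ is an estimate of the bias of the naïve estimator $\hat B$; the subscript on the left-hand side only marks the adjusted estimator as distinct from the naïve one. In § 2.3, we show that (6) is related to the restricted least-squares method.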
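Steps 1 to 4 and the closed form (6) can be sketched in a few lines of NumPy. This is a minimal illustration and not the dSVA package: the toy dimensions, the simulated data, the variable names and the assumption that $q$ is known are all choices made here, and the formulas follow a reading of the displayed equations in which the lost symbols are minus signs and an inverse.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 60, 400, 2   # samples, features, hidden factors (q treated as known)

# Toy data: intercept + binary primary variable; first hidden factor tracks x.
x = np.repeat([0.0, 1.0], n // 2)
X = np.column_stack([np.ones(n), x])
Z = np.column_stack([x + rng.normal(size=n), rng.normal(size=n)])
B = rng.normal(size=(2, m))
Delta = rng.normal(size=(q, m))
Y = X @ B + Z @ Delta + rng.normal(size=(n, m))

def ols(A, Y):
    """Least-squares coefficients of Y on A, one column per feature."""
    return np.linalg.lstsq(A, Y, rcond=None)[0]

# Step 1: SVD of the residual matrix R = (I - M) Y.
B_naive = ols(X, Y)
R = Y - X @ B_naive
U, d, Vt = np.linalg.svd(R, full_matrices=False)
Uq = U[:, :q]

# Step 2: regress Y on (X, U_q); X and U_q are orthogonal.
coef = ols(np.column_stack([X, Uq]), Y)
B_hat, Psi_hat = coef[:2], coef[2:]

# Step 3: surrogate variables. Psi_hat (I - M_J) centres each row over features.
Psi_c = Psi_hat - Psi_hat.mean(axis=1, keepdims=True)
adj = B_hat @ Psi_c.T @ np.linalg.inv(Psi_c @ Psi_hat.T)
Gamma_hat = Uq + X @ adj

# Step 4: refit with the surrogate variables, as in model (5).
B_step4 = ols(np.column_stack([X, Gamma_hat]), Y)[:2]

# Closed form (6): naive estimate minus the estimated bias.
B_eq6 = B_hat - adj @ Psi_hat

mse_naive = np.mean((B_naive - B) ** 2)
mse_dsva = np.mean((B_eq6 - B) ** 2)
print("Step 4 equals (6):", np.allclose(B_step4, B_eq6))
print("MSE naive:", mse_naive, " MSE adjusted:", mse_dsva)
```

In this toy setting the coefficients are drawn independently of the hidden-factor effects, so the orthogonality assumption holds approximately and the adjusted estimator should sit much closer to the simulated truth than the naive one.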

2.2. Consistency of the proposed estimators

Important questions are under what conditions does the proposed Inline graphic span the same column space as Inline graphic, and whether Inline graphic is a consistent estimator of Inline graphic. For high-dimensional data, the number of features, Inline graphic, can be substantially larger than the number of samples, Inline graphic, and thus asymptotic results derived from the traditional low-dimensional setting where Inline graphic is fixed are inappropriate (Johnstone & Lu, 2009; Jung & Marron, 2009; Lee et al., 2010). Lee et al. (2014) considered a regime in which both Inline graphic and Inline graphic increase to infinity with Inline graphic. This regime is well-suited to high-throughput biomedical data, where the number of genes is in the tens of thousands and the number of samples is in the range of several dozens to hundreds. We work in this regime and investigate the asymptotic properties of the proposed method under the spiked-eigenvalue model of Johnstone (2001).

Before presenting our main results, let us define some additional notation. Suppose that Inline graphic and Inline graphic are two sequences. We write Inline graphic if Inline graphic and Inline graphic, and write Inline graphic if Inline graphic. We also define Inline graphic to be the function that returns the Inline graphicth largest singular value of an input matrix. Without loss of generality we assume that Inline graphic and Inline graphic, where Inline graphic and Inline graphic are the Inline graphicth columns of Inline graphic and Inline graphic, respectively, and Inline graphic is the vector norm. We introduce the following conditions.

Condition 1.

Both Inline graphic and Inline graphic increase to Inline graphic with Inline graphic.

Condition 2.

Let Inline graphic, where Inline graphic. Then Inline graphic, Inline graphic, and Inline graphic for Inline graphic.

Condition 3.

Let Inline graphic, where Inline graphic is the standard deviation of Inline graphic and Inline graphic. Then either of the following is satisfied: (i) Inline graphic; or (ii) Inline graphic, Inline graphic and Inline graphic.

Condition 4.

Let Inline graphic be the matrix with Inline graphic columns formed by concatenating Inline graphic and Inline graphic. Then Inline graphic is nonsingular with Inline graphic and Inline graphic.

Condition 5.

Suppose that Inline graphic and Inline graphic are the Inline graphicth elements of Inline graphic and Inline graphic, respectively, and that Inline graphic and Inline graphic. For all Inline graphic,

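$$\frac{1}{m}\sum_{i=1}^{m}(\beta_{ki} - \bar\beta_k)(\delta_{li} - \bar\delta_l) = o_p\left\{\frac{1}{m}\sum_{i=1}^{m}(\delta_{li} - \bar\delta_l)^2\right\}.$$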

Condition 2 assumes the spiked-eigenvalue model (Johnstone, 2001), which ensures that the effects of hidden factors are large enough to be identified by singular value decomposition. Condition 3 comprises the sphericity conditions on nonspiked singular values (Lee et al., 2014). The relative growth rates of Inline graphic and Inline graphic play a key role in this condition. For example, when Inline graphic is greater than zero, Inline graphic must grow at a faster rate than Inline graphic to satisfy the first condition in Condition 3. The second part of Condition 3 relaxes the assumption on Inline graphic but adds an assumption on Inline graphic. Condition 5 requires that the row vectors of Inline graphic and the row vectors of Inline graphic be asymptotically orthogonal after mean centring.

Theorem 1.

Suppose that Conditions 1–5 are satisfied. Then the columns of Inline graphic span the same column space as the columns of Inline graphic with probability Inline graphic, and Inline graphic for Inline graphic.

Theorem 1 shows that the proposed method produces consistent estimates of Inline graphic and the hidden factors. The proof can be found in the Supplementary Material.

2.3. Relationship to the restricted least-squares method

We now show the connection between the proposed method and the restricted least-squares procedure of Greene & Seaks (1991). Suppose that $C\beta = c$ is a linear restriction on $\beta$. The restricted least-squares estimator $\hat\beta_{\mathrm{RLS}}$ of $\beta$ is the solution of

$$\text{minimize } (y - X\beta)^T (y - X\beta) \quad \text{subject to } C\beta = c.$$

It can be shown that Inline graphic where Inline graphic is the ordinary least-squares estimator and Inline graphic.
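For reference, the standard closed form of the restricted least-squares estimator is $\hat\beta - (X^T X)^{-1} C^T \{C (X^T X)^{-1} C^T\}^{-1} (C\hat\beta - c)$, where $\hat\beta$ is the ordinary least-squares estimator. The NumPy sketch below evaluates it on simulated data and cross-checks it against the Lagrangian linear system; the design, the coefficients and the sum-to-zero restriction are hypothetical choices for illustration only.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Hypothetical linear restriction C beta = c: the coefficients sum to zero.
C = np.array([[1.0, 1.0, 1.0]])
c = np.array([0.0])

XtX_inv = np.linalg.inv(X.T @ X)
beta_ols = XtX_inv @ X.T @ y

# Closed form: OLS estimate minus a correction proportional to (C beta_ols - c).
K = XtX_inv @ C.T @ np.linalg.inv(C @ XtX_inv @ C.T)
beta_rls = beta_ols - K @ (C @ beta_ols - c)

# Cross-check: solve the Lagrangian (KKT) system
#   [X'X  C'] [beta  ]   [X'y]
#   [C    0 ] [lambda] = [c  ]
k = C.shape[0]
kkt = np.block([[X.T @ X, C.T], [C, np.zeros((k, k))]])
rhs = np.concatenate([X.T @ y, c])
beta_kkt = np.linalg.solve(kkt, rhs)[:p]

rss_ols = np.sum((y - X @ beta_ols) ** 2)
rss_rls = np.sum((y - X @ beta_rls) ** 2)
print("restriction satisfied:", np.allclose(C @ beta_rls, c))
print("matches KKT solution:", np.allclose(beta_rls, beta_kkt))
```

The restricted fit can never have a smaller residual sum of squares than the unrestricted one; any gain in mean squared error comes from the restriction being true.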

When estimating Inline graphic, we impose a restriction on Inline graphic and Inline graphic, and hence on Inline graphic and Inline graphic, such that they are asymptotically orthogonal after mean centring. While this is not a linear restriction as in the restricted least-squares procedure, the similarity of the two approaches can be illustrated as follows. Let Inline graphic be a function on a matrix which stacks the columns of the matrix into one long vector. Then model (2) for all Inline graphic features can be re-expressed as

$$\operatorname{vec}(Y) = (I_m \otimes X)\operatorname{vec}(B) + \operatorname{vec}(E),$$

where Inline graphic, Inline graphic is the Kronecker product, and Inline graphic is the Inline graphic identity matrix. We further define Inline graphic and Inline graphic. The solution of

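$$\text{minimize } \{\operatorname{vec}(Y) - (I_m \otimes X)\operatorname{vec}(B)\}^T \{\operatorname{vec}(Y) - (I_m \otimes X)\operatorname{vec}(B)\} \quad \text{subject to } \hat C\operatorname{vec}(B) = 0$$

is $\operatorname{vec}(\hat B_{\mathrm{RLS}}) = \operatorname{vec}(\hat B) - H_{\hat C}\operatorname{vec}(\hat B)$, where

$$H_{\hat C} = \{(I_m \otimes X)^T (I_m \otimes X)\}^{-1}\hat C^T \left[\hat C \{(I_m \otimes X)^T (I_m \otimes X)\}^{-1}\hat C^T\right]^{-1}\hat C = \left[(I - M_J)\hat\Psi^T \{\hat\Psi (I - M_J)\hat\Psi^T\}^{-1}\hat\Psi (I - M_J)\right] \otimes I_p.$$

Now $\operatorname{vec}(\hat B_{\mathrm{RLS}})$ can be written as

$$\operatorname{vec}(\hat B_{\mathrm{RLS}}) = \operatorname{vec}(\hat B) - \left[(I - M_J)\hat\Psi^T \{\hat\Psi (I - M_J)\hat\Psi^T\}^{-1}\hat\Psi (I - M_J) \otimes I_p\right]\operatorname{vec}(\hat B),$$

which leads to

$$\hat B_{\mathrm{RLS}} = \hat B - \hat B (I - M_J)\hat\Psi^T \{\hat\Psi (I - M_J)\hat\Psi^T\}^{-1}\hat\Psi (I - M_J). \tag{7}$$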

Clearly, the estimator in (6) and $\hat B_{\mathrm{RLS}}$ in (7) are identical if $\hat\Psi M_J = 0$. Hence the proposed method and the restricted least-squares method are identical if the row means of $\hat\Psi$ are zero. Gene expression data are commonly normalized so that the row means of $Y$ are equal (Bolstad et al., 2003); this makes the row means of the residual matrix $R$, and consequently the row means of $\hat\Psi$, all zero. Thus, this zero-mean condition is easily satisfied by gene expression data. If $\hat\Psi M_J \neq 0$, then the two estimators will be different.
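The vectorized formulation and the comparison of (6) with (7) can be checked numerically on a small example. In the sketch below the restriction matrix is taken to be $\{\hat\Psi (I - M_J)\} \otimes I_p$, which is an assumption made here because it reproduces the stated form of $H_{\hat C}$; the matrix standing in for $\hat\Psi$ is random rather than estimated, and all dimensions are toy values.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p, q = 8, 6, 2, 2

X = np.column_stack([np.ones(n), rng.normal(size=n)])
B = rng.normal(size=(p, m))
Y = X @ B + rng.normal(size=(n, m))
Psi_hat = rng.normal(size=(q, m))   # stand-in for the Step 2 estimate

def vec(A):
    """Stack the columns of A into one long vector."""
    return A.reshape(-1, order="F")

# vec(XB) = (I_m kron X) vec(B)
A = np.kron(np.eye(m), X)
vec_ok = np.allclose(vec(X @ B), A @ vec(B))

J = np.ones((m, 1))
I_MJ = np.eye(m) - J @ J.T / m      # I - M_J: centring over features

# Restricted least squares in vectorized form (assumed restriction matrix).
C_hat = np.kron(Psi_hat @ I_MJ, np.eye(p))
G = np.linalg.inv(A.T @ A)
b_ols = G @ A.T @ vec(Y)
H = G @ C_hat.T @ np.linalg.inv(C_hat @ G @ C_hat.T) @ C_hat
B_rls_vec = (b_ols - H @ b_ols).reshape(p, m, order="F")
B_ols = b_ols.reshape(p, m, order="F")

def eq6_and_eq7(Psi):
    """Return the estimators in (6) and (7) for a given Psi."""
    P = I_MJ @ Psi.T @ np.linalg.inv(Psi @ I_MJ @ Psi.T) @ Psi
    return B_ols - B_ols @ P, B_ols - B_ols @ P @ I_MJ

eq6_raw, eq7_raw = eq6_and_eq7(Psi_hat)           # row means of Psi not zero
eq6_c, eq7_c = eq6_and_eq7(Psi_hat @ I_MJ)        # row means of Psi set to zero

print("vec identity holds:", vec_ok)
print("vectorized RLS matches (7):", np.allclose(B_rls_vec, eq7_raw))
print("(6) == (7), centred Psi:", np.allclose(eq6_c, eq7_c))
print("(6) == (7), uncentred Psi:", np.allclose(eq6_raw, eq7_raw))
```

With centred rows the two estimators coincide; with uncentred rows they differ, which is the distinction drawn in the paragraph above.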

Since $\hat C$ is a random matrix, our procedure is not the same as the restricted least-squares procedure, which assumes that the restriction matrix $C$ is fixed. However, the discussion above highlights the similarity between the two approaches. The restricted least-squares estimator can have smaller mean squared error than the ordinary least-squares estimator if the restriction is satisfied (Greene & Seaks, 1991). From our simulations, we observe that our estimators tend to have smaller mean squared errors than the estimators from the true regression model, where the restriction is not utilized.

3. Numerical studies

3.1. Simulation studies

We performed simulations to compare the proposed approach with existing methods in a wide range of scenarios. For each simulated dataset, 5000 features and 100 samples were generated from the regression model

$$y_{ji} = \beta_i x_j + z_j^T\delta_i + \epsilon_{ji} \quad (j = 1, \ldots, 100;\ i = 1, \ldots, 5000),$$

where Inline graphic was generated from Inline graphic with Inline graphic following an igInline graphic distribution, which yields Inline graphic and Inline graphic.

A total of 864 simulation settings are summarized in Table 1.

Table 1.

Simulation parameters; the total number of simulation settings is 864

| Parameter | Values |
| --- | --- |
| $\mu_z$ | 0, Inline graphic, 1 |
| Percentage of nonzero $\delta_{il}$ | 10%, 20%, 40%, 60% |
| Percentage of nonzero $\beta_i$ | 5%, 10%, 20% |
| Overlap among nonzero $\delta_{il}$ | total overlap, independent |
| Number of hidden factors | 2, 4 |
| Type of $x_j$ | binary, continuous |
| Correlation between nonzero $\beta_i$ and nonzero $\delta_{i1}$ | 0, Inline graphic, Inline graphic |

The binary and continuous $x_j$ were simulated respectively from

$$x_j = \begin{cases} 0, & j \le 50, \\ 1, & j > 50, \end{cases} \qquad x_j \sim \begin{cases} N(-1, 0.5), & j \le 50, \\ N(1, 0.5), & j > 50, \end{cases}$$

and the first two hidden factors were simulated from

$$z_{j1} \sim \begin{cases} N(-\mu_z, 1), & j \le 50, \\ N(\mu_z, 1), & j > 50, \end{cases} \qquad z_{j2} \sim \mathrm{Ber}(0.5).$$

When the number of hidden factors is four, i.e., Inline graphic, Inline graphic were independently generated from Inline graphic. The parameter Inline graphic determines the correlation between the primary variable, Inline graphic, and the first hidden factor, Inline graphic. Three different values of Inline graphic were considered: 0, Inline graphic and 1. The regression coefficients Inline graphic and Inline graphic were generated from the distribution

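$$\begin{pmatrix} \beta_i \\ \delta_{i1} \\ \delta_{i2} \\ \vdots \\ \delta_{iq} \end{pmatrix} = \begin{pmatrix} a_{i0}\zeta_{i0} \\ a_{i1}\zeta_{i1} \\ a_{i2}\zeta_{i2} \\ \vdots \\ a_{iq}\zeta_{iq} \end{pmatrix}, \qquad \begin{pmatrix} a_{i0} \\ a_{i1} \\ a_{i2} \\ \vdots \\ a_{iq} \end{pmatrix} \sim N\left\{0, \begin{pmatrix} 1 & \rho & 0 & \cdots & 0 \\ \rho & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & 0 & \cdots & 1 \end{pmatrix}\right\},$$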

where Inline graphic are indicator variables, Inline graphic or Inline graphic, that determine which of the Inline graphic and Inline graphic have nonzero values. To mimic real biological data where the primary variables and hidden factors are not associated with all features, we assumed that 5%, 10% or 20% of the Inline graphic were nonzero and that 10%, 20%, 40% or 60% of the Inline graphic were nonzero. The value of Inline graphic was independently assigned. For Inline graphic, we considered situations in which nonzero Inline graphic values were totally overlapping, Inline graphic, or independently selected. In the first situation, each feature either had no associated hidden factors or was associated with all Inline graphic hidden factors. The correlation between the nonzero Inline graphic and the nonzero Inline graphic, namely Inline graphic, was set to Inline graphic, Inline graphic and Inline graphic, representing scenarios in which Condition 5 ranged from being satisfied to severely violated.

Nine different methods were compared: direct surrogate variable analysis; regression model (1), where the hidden factors are assumed to be known and included in the analysis; a no-adjustment model, i.e., regression model (2), where the hidden factors are ignored in the analysis; the iteratively reweighted surrogate variable analysis of Leek & Storey (2008); the two-step surrogate variable analysis of Leek & Storey (2007); principal component analysis on the residuals; principal component analysis on the original measurements of the features; latent effect adjustment after primary projection (Sun et al., 2012); and four-step remove unwanted variation (Gagnon-Bartsch et al., 2017). Latent effect adjustment after primary projection uses an outlier detection approach after initial data projection to adjust for hidden factors. We treated the second method as a gold standard. In both principal component analyses, top principal components were selected and treated as surrogate variables.

For the four-step remove unwanted variation method, we assumed that 6Inline graphic of features were negative control genes, close to the proportion of housekeeping genes in the genome (Gagnon-Bartsch & Speed, 2012). We considered situations in which negative control genes were selected only among features with Inline graphic, i.e., high-quality control genes, or were randomly selected among all features, i.e., poor-quality control genes. In the second case, the assumption of negative control genes was violated. A method to estimate the number of surrogate variables for the four-step remove unwanted variation method has been developed (Gagnon-Bartsch et al., 2017). We used this method in conjunction with the method of Buja & Eyuboglu (1992) to estimate Inline graphic for the four-step remove unwanted variation method; for all other methods, the approach of Buja & Eyuboglu (1992) was used to estimate Inline graphic.
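The idea of choosing the number of surrogate variables by permutation can be illustrated with a generic parallel-analysis sketch: singular values of the residual matrix are compared with those obtained after permuting each feature independently, which breaks the shared structure across features. This is written in the spirit of permutation-based methods such as Buja & Eyuboglu (1992) and is not claimed to reproduce their procedure or the one used in the paper; the function name, the quantile rule, the number of permutations and the toy data are all assumptions.

```python
import numpy as np

def estimate_num_factors(R, n_perm=20, quantile=0.95, seed=0):
    """Count leading singular values of R that exceed their permutation null.

    R is an (n samples) x (m features) matrix. Each column is permuted
    independently, so any factor structure shared across features is destroyed.
    """
    rng = np.random.default_rng(seed)
    d_obs = np.linalg.svd(R, compute_uv=False)
    d_perm = np.empty((n_perm, d_obs.size))
    for b in range(n_perm):
        R_perm = rng.permuted(R, axis=0)   # shuffle samples within each feature
        d_perm[b] = np.linalg.svd(R_perm, compute_uv=False)
    threshold = np.quantile(d_perm, quantile, axis=0)
    exceeds = d_obs > threshold
    # Stop at the first component that fails to exceed its threshold.
    return exceeds.size if exceeds.all() else int(np.argmin(exceeds))

# Toy example: two strong hidden factors plus unit-variance noise.
rng = np.random.default_rng(4)
n, m, q_true = 40, 500, 2
Z = rng.normal(size=(n, q_true))
Delta = rng.normal(size=(q_true, m))
R_demo = Z @ Delta + rng.normal(size=(n, m))

q_hat = estimate_num_factors(R_demo)
print("estimated number of factors:", q_hat)
```

The sequential stopping rule is one simple design choice; other rules, such as counting all components above threshold, would behave differently when later components happen to exceed their null by chance.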

For each simulation set-up, 200 datasets were generated and the performance of each method was evaluated based on (i) empirical false discovery rates, where the significant findings were determined by the Benjamini & Hochberg (1995) procedure for a targeted false discovery rate of Inline graphic; (ii) the mean squared errors of the Inline graphic; and (iii) the area under the receiver operating characteristic curve. For calculation of the false discovery rate and the area under the receiver operating characteristic curve, we define true and false positives as follows. If Inline graphic and a statistical test for Inline graphic is significant after applying the Benjamini–Hochberg procedure, it is a true positive. If the test is significant when Inline graphic, it is a false positive. In addition to the mean false discovery rates, we calculated the proportion of datasets with an empirical false discovery rate greater than Inline graphic.
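The Benjamini & Hochberg (1995) step-up procedure and the empirical false discovery rate used as an evaluation metric can be written compactly. The sketch below is a plain NumPy version; the function names, the example p-values, the truth labels and the target level of 0.05 are illustrative assumptions rather than values from the study.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha):
    """Boolean rejection mask from the Benjamini-Hochberg step-up procedure."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = pvals[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])   # largest rank meeting its threshold
        reject[order[: k + 1]] = True      # reject everything up to that rank
    return reject

def empirical_fdr(reject, is_nonnull):
    """False positives divided by total rejections (0 if nothing is rejected)."""
    reject = np.asarray(reject, dtype=bool)
    is_nonnull = np.asarray(is_nonnull, dtype=bool)
    n_reject = reject.sum()
    if n_reject == 0:
        return 0.0
    return float((reject & ~is_nonnull).sum() / n_reject)

# Illustrative p-values and truth labels (True means the effect is nonzero).
pvals = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
truth = [True, False, True, True, False, False, False, False, False, False]

reject = benjamini_hochberg(pvals, alpha=0.05)
fdr = empirical_fdr(reject, truth)
print("rejected:", int(reject.sum()), " empirical FDR:", fdr)
```

In this example only the two smallest p-values fall below their thresholds, and one of them belongs to a null feature, so half of the discoveries are false.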

Figure 1 shows simulation results from a scenario where Condition 5 was satisfied, i.e., Inline graphic. Direct surrogate variable analysis performed well, as the observed area under the receiver operating characteristic curve and the mean squared errors are similar to those obtained from the approach assuming that the hidden factors are known, and the observed false discovery rates are only slightly inflated in a few simulation settings. Among 288 simulation settings, only four had mean false discovery rates higher than Inline graphic. As expected, the no-adjustment and principal component analysis-based approaches performed very poorly. When the negative control gene assumption was satisfied, the remove unwanted variation method performed only slightly worse than direct surrogate variable analysis: in ten of the simulation settings, the mean empirical false discovery rates were larger than Inline graphic; however, when this assumption was violated, the remove unwanted variation method had substantially inflated false discovery rates. The latent effect adjustment after primary projection method had mean empirical false discovery rates above Inline graphic in some simulation settings. Since this method was not developed to estimate Inline graphic, we did not obtain the mean squared errors. When the method developed for four-step remove unwanted variation was used to estimate the number of surrogate variables, the overall performance of the remove unwanted variation method declined substantially, indicating that the method of Buja & Eyuboglu (1992) performs better in estimating the number of surrogate variables. We compared different approaches to estimating Inline graphic, and the method of Buja & Eyuboglu (1992) outperformed the others; see the Supplementary Material. In Fig. 2, we directly compare the two top-performing approaches: direct surrogate variable analysis and the four-step remove unwanted variation method with high-quality control genes. Direct surrogate variable analysis clearly does better in controlling the false discovery rates.

Fig. 1.

Comparisons of the proposed and competing methods when Inline graphic. Each bar summarizes results from 288 different simulation settings, and in each setting 200 datasets were generated to calculate: (a) mean empirical false discovery rates, FDR; (b) the proportion of datasets with empirical FDR higher than Inline graphic; (c) the mean area under the receiver operating characteristic curve, AUC; and (d) the mean squared errors, MSE. The methods compared are: dSVA, direct surrogate variable analysis; KnownZ, hidden factors known and included in the model; NoAdj, no adjustment for hidden factors; IRW, iteratively reweighted surrogate variable analysis; 2-SVA, two-step surrogate variable analysis; rPCA, principal component analysis on the residuals; PCA, principal component analysis on the original measured features; LEAPP, latent effect adjustment after primary projection; RUV4-High, four-step remove unwanted variation method with high-quality control genes; RUV4-Poor, four-step remove unwanted variation method with poor-quality control genes; RUV4-High2, RUV4-High with Inline graphic from Gagnon-Bartsch et al. (2017); RUV4-Poor2, RUV4-Poor with Inline graphic from Gagnon-Bartsch et al. (2017).

Fig. 2.

Comparison of direct surrogate variable analysis, dSVA, and the four-step remove unwanted variation method with high-quality control genes, RUV4-High, when Inline graphic: (a) mean empirical false discovery rate; (b) mean area under the receiver operating characteristic curve; (c) mean squared errors.

To investigate the effect of each simulation parameter on the performance of the methods when Inline graphic, we created plots for each parameter value. Since the no-adjustment and principal component analysis-based methods performed substantially worse than the other methods, we did not include them in these plots. Among the parameters, Inline graphic and the percentage of nonzero Inline graphic had large effects on the performance of some methods. Figure 3(a) shows boxplots of the false discovery rates with different Inline graphic values. The iteratively reweighted and two-step surrogate variable analysis approaches had well-controlled false discovery rates when Inline graphic and Inline graphic, but had inflated rates when Inline graphic. Therefore, these two methods cannot efficiently estimate the hidden factors in the presence of a strong correlation between Inline graphic and Inline graphic. Figure 3(b) shows that when the percentage of nonzero Inline graphic was Inline graphic, direct surrogate variable analysis had slightly inflated false discovery rates, perhaps because direct surrogate variable analysis uses all features, instead of selecting features with nonzero Inline graphic. Since the four-step remove unwanted variation approach uses a small fraction of features to estimate the hidden factors, it had more inflated false discovery rates when the percentage of nonzero Inline graphic was small. The performances of the different methods in terms of areas under the receiver operating characteristic curves and mean squared errors were largely similar.

Fig. 3.

Comparison of mean empirical false discovery rates when Inline graphic for: (a) Inline graphic or Inline graphic; and (b) different proportions of nonzero Inline graphic; 10%, 20%, 40% or 60%. In each simulation setting, 200 datasets were generated to obtain the mean empirical false discovery rates. The methods compared are: dSVA, direct surrogate variable analysis; IRW, iteratively reweighted surrogate variable analysis; 2-SVA, two-step surrogate variable analysis; LEAPP, latent effect adjustment after primary projection; RUV4-High, four-step remove unwanted variation method with high-quality control genes; RUV4-Poor, four-step remove unwanted variation method with poor-quality control genes.

Additional simulation results are presented in the Supplementary Material. Our proposed approach was observed to perform well even when Inline graphic was overestimated and Condition 5 was moderately violated. Overall, our simulation study shows that the proposed method can outperform existing methods in diverse scenarios.

3.2. Application to real data

We downloaded the Hapmap dataset GSE5859 from the National Center for Biotechnology Information gene expression omnibus website to investigate differentially expressed genes between European and Asian populations (Spielman et al., 2007). This dataset contains 8793 genes, or features, and 208 samples from three continental populations: 102 European, 65 Chinese, and 41 Japanese. The affy package (Gautier et al., 2004) was used for background correction and quantile normalization (Bolstad et al., 2003). In the Supplementary Material we perform an analysis without quantile normalization as a sensitivity analysis. Similar to the original study, we restricted the analysis to 4044 reliably expressed genes in at least 80% of the samples in one population (Spielman et al., 2007).

The original study showed that nearly 70% of genes were differentially expressed across the European and Asian samples (Spielman et al., 2007), but it was subsequently discovered that the calendar year in which each sample was processed was a strong confounding factor (Akey et al., 2007; Leek et al., 2010), and many of the positive findings could potentially be false. In this analysis, we considered a scenario where the researchers did not record the calendar year of sample collection and investigated whether the proposed surrogate variable analysis could capture the year effect. We treated year as a categorical response variable and estimated the proportion of variability that can be explained by the surrogate variables.

Table 2 shows the proportion of the variability explained by surrogate variables estimated by four different methods. Since the estimated variability would increase with the number of surrogate variables, for a fair comparison we used Inline graphic for all the methods, which was estimated by the method of Buja & Eyuboglu (1992). Both direct surrogate variable analysis and the remove unwanted variation method performed well, as 70% and 73% of the variability was explained by the surrogate variables estimated from these respective methods. In contrast, the surrogate variables from the iteratively reweighted and two-step surrogate variable analysis approaches explained only 41% and 64% of the variability in year, respectively. We also considered different combinations of the populations. Direct surrogate variable analysis and the four-step remove unwanted variation method again consistently outperformed the other methods.

Table 2.

Proportion of variability in year explained by the estimated surrogate variables; for a fair comparison, the same number of surrogate variables was used in all the methods

Type Number of surrogate variables dSVA IRW 2-SVA RUV4
EUR vs (JPT + CHI) 25 Inline graphic Inline graphic Inline graphic Inline graphic
JPT vs (EUR + CHI) 25 Inline graphic Inline graphic Inline graphic Inline graphic
CHI vs (EUR + JPT) 25 Inline graphic Inline graphic Inline graphic Inline graphic
JPT vs CHI 16 Inline graphic Inline graphic Inline graphic Inline graphic

EUR, JPT and CHI, individuals of European, Japanese and Chinese ancestry, respectively; dSVA, direct surrogate variable analysis; IRW, iteratively reweighted surrogate variable analysis; 2-SVA, two-step surrogate variable analysis; RUV4, four-step remove unwanted variation method.

Without any hidden variable adjustment, 73% and 65% of genes were found to be differentially expressed between the European and Asian populations at false discovery rates of Inline graphic and Inline graphic, respectively. As pointed out elsewhere, it seems implausible that so many genes would be differentially expressed between the two populations (Akey et al., 2007). When direct surrogate variable analysis was applied, only 29% and 18% of genes were found to be significant at false discovery rates of Inline graphic and Inline graphic, respectively. Li et al. (2010) have reported that approximately 20% of genes in lymphoblastoid cell lines are differentially expressed between Hapmap2 European and African samples at a false discovery rate of Inline graphic. Given that the genetic difference between the European and African populations is greater than that between the European and Asian populations, 18% of genes differentially expressed between the European and Asian populations seems a reasonable estimate.
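This section does not restate which false discovery rate procedure was applied. Assuming the step-up procedure of Benjamini & Hochberg (1995), which appears in the reference list, the proportion of genes declared significant at a given false discovery rate could be computed along the following lines; the function name and the toy p-values are illustrative only.

```python
import numpy as np

def benjamini_hochberg(p_values, fdr):
    """Boolean array marking the hypotheses rejected by the step-up procedure."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)
    # The i-th smallest p-value is compared with fdr * i / m.
    thresholds = fdr * np.arange(1, m + 1) / m
    passed = p[order] <= thresholds
    rejected = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank meeting its threshold
        rejected[order[: k + 1]] = True
    return rejected

# Toy usage: proportion of "genes" declared significant at a 5% rate.
toy_p = [0.041, 0.001, 0.216, 0.008, 0.039, 0.042, 0.06, 0.074, 0.205, 0.212]
print(benjamini_hochberg(toy_p, 0.05).mean())
```

The percentages quoted in the text correspond to the mean of such a rejection indicator over all tested genes, computed from the p-values of each adjustment method.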

When we applied two-step surrogate variable analysis and the four-step remove unwanted variation method to the Hapmap data, 15% and 18% of genes, respectively, were declared to be differentially expressed between the European and Asian populations at false discovery rate Inline graphic. In contrast, 65% of genes were found to be significant by iteratively reweighted surrogate variable analysis at the same false discovery rate, indicating that this method fails to identify the effects of the hidden factors. When we included year as a covariate in the regression analysis, only 28 genes, i.e., Inline graphic of the tested genes, were significant at false discovery rate Inline graphic, because year was nearly nested within each population. All Asian samples were processed in 2005 and 2006, but only three European samples were processed in those two years. Among these 28 genes, 15 were significant according to direct surrogate variable analysis. On the other hand, 12 and 14 genes, respectively, were significant by two-step surrogate variable analysis and the four-step remove unwanted variation method.

We carried out an additional analysis using the same dataset to identify genes differentially expressed by gender. Since genes in sex chromosomes can be used as positive control genes, this additional analysis can be used to directly evaluate the performance of each method. The results show that our method had comparable or slightly better performance than the competing methods; see the Supplementary Material for details.

4. Discussion and conclusion

Surrogate variable analysis was originally proposed for gene expression data, but it has since been applied to epigenetic data as well (Teschendorff et al., 2011; Maksimovic et al., 2015). Recently, surrogate variable analysis has been extended to prediction and clustering problems. For example, Parker et al. (2014) developed frozen surrogate variable analysis to remove batch effects for prediction problems, and Jacob et al. (2016) extended the remove unwanted variation method to unsupervised learning. Direct surrogate variable analysis was mainly developed for differential expression analysis, but it can be extended to other types of -omics data, as well as to prediction problems, by adopting the approaches used in frozen surrogate variable analysis. We leave such extensions for future research.

One key assumption of the proposed method is Condition 5, which requires that the vector of Inline graphic values across Inline graphic genes and the vector of Inline graphic values across Inline graphic genes be asymptotically orthogonal after mean centring. We think that this is reasonable for many biomedical datasets. In our real-data analysis, for example, batch effects are purely technical issues and their effect sizes would not be correlated with those of population differences. Moreover, our method is robust with respect to moderate violations of this condition. In simulation studies, for instance, our method shows better false discovery rate control than the competing methods when Inline graphic. A similar assumption is implicitly used in existing methods. For example, in their simulation studies, Sun et al. (2012) generated Inline graphic and Inline graphic independently. They also suggested that when Inline graphic and Inline graphic are correlated, it will be difficult to identify Inline graphic.

Principal component analysis was used to correct for batch effects and the effects of hidden confounders prior to the introduction of surrogate variable analysis. This approach has proven very successful for genome-wide association studies (Price et al., 2006). However, our simulation results show that naïve use of principal components for hidden factor adjustment can result in severe power loss, because the top principal components identified can be highly correlated with the primary variables when the effects of the primary variables are not too weak. When this is the case, principal component analysis should be avoided.
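The mechanism behind this power loss can be seen in a toy simulation. The sketch below is not the simulation design of the paper; the model, sample sizes and effect sizes are illustrative assumptions. It generates data from a model of the form y = xβ + zδ + ε with one binary primary variable and one hidden factor, and reports the absolute correlation between the primary variable and the first principal component score.

```python
import numpy as np

def top_pc_correlation(effect_sd, n=100, m=500, seed=0):
    """|correlation| between the primary variable and the first PC score.

    Toy model (all settings are illustrative assumptions): a binary primary
    variable x, one hidden factor z generated independently of x, per-gene
    effects beta ~ N(0, effect_sd^2) and delta ~ N(0, 0.5^2), unit-variance noise.
    """
    rng = np.random.default_rng(seed)
    x = np.repeat([0.0, 1.0], n // 2)
    z = rng.normal(size=n)
    beta = rng.normal(scale=effect_sd, size=m)
    delta = rng.normal(scale=0.5, size=m)
    y = np.outer(x, beta) + np.outer(z, delta) + rng.normal(size=(n, m))
    y = y - y.mean(axis=0)  # centre each gene before extracting components
    u, s, _ = np.linalg.svd(y, full_matrices=False)
    pc1 = u[:, 0] * s[0]
    return abs(np.corrcoef(pc1, x)[0, 1])

# Strong primary effects: the top component largely tracks x itself.
print(top_pc_correlation(effect_sd=2.0))
# Weak primary effects: the top component mainly reflects the hidden factor.
print(top_pc_correlation(effect_sd=0.05))
```

When the first component is nearly collinear with the primary variable, including it as a covariate removes most of the signal of interest, which is the power loss described above.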

Acknowledgement

We thank the editor, associate editor and referees for their valuable comments and suggestions, which have greatly helped to improve the quality of the paper. This research was supported by the U.S. National Institutes of Health.

Supplementary material

Supplementary material available at Biometrika online includes proofs of the theoretical results as well as additional simulation and real-data analysis results.

References

  1. Akey J. M., Biswas S., Leek J. T. & Storey J. D. (2007). On the design and analysis of gene expression studies in human populations. Nature Genet. 39, 807–8.
  2. Benjamini Y. & Hochberg Y. (1995). Controlling the false discovery rate: A practical and powerful approach to multiple testing. J. R. Statist. Soc. B 57, 289–300.
  3. Bolstad B. M., Irizarry R. A., Åstrand M. & Speed T. P. (2003). A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19, 185–93.
  4. Buja A. & Eyuboglu N. (1992). Remarks on parallel analysis. Mult. Behav. Res. 27, 509–40.
  5. Chakraborty S., Datta S. & Datta S. (2012). Surrogate variable analysis using partial least squares (SVA-PLS) in gene expression studies. Bioinformatics 28, 799–806.
  6. Dumeaux V., Olsen K. S., Nuel G., Paulssen R. H., Børresen-Dale A. L. & Lund E. (2010). Deciphering normal blood gene expression variation—The NOWAC postgenome study. PLoS Genet. 6, e1000873.
  7. Friguet C., Kloareg M. & Causeur D. (2009). A factor model approach to multiple testing under dependence. J. Am. Statist. Assoc. 104, 1406–15.
  8. Gagnon-Bartsch J. A., Jacob L. & Speed T. P. (2017). Removing Unwanted Variation: Exploiting Negative Controls for High Dimensional Data Analysis. IMS Monographs. Cambridge: Cambridge University Press, in press.
  9. Gagnon-Bartsch J. A. & Speed T. P. (2012). Using control genes to correct for unwanted variation in microarray data. Biostatistics 13, 539–52.
  10. Gautier L., Cope L., Bolstad B. M. & Irizarry R. A. (2004). Affy-analysis of Affymetrix GeneChip data at the probe level. Bioinformatics 20, 307–15.
  11. Greene W. H. & Seaks T. G. (1991). The restricted least squares estimator: A pedagogical note. Rev. Econ. Statist. 73, 563–7.
  12. Jacob L., Gagnon-Bartsch J. A. & Speed T. P. (2016). Correcting gene expression data when neither the unwanted variation nor the factor of interest are observed. Biostatistics 17, 16–28.
  13. Johnstone I. M. (2001). On the distribution of the largest eigenvalue in principal components analysis. Ann. Statist. 29, 295–327.
  14. Johnstone I. M. & Lu A. Y. (2009). On consistency and sparsity for principal components analysis in high dimensions. J. Am. Statist. Assoc. 104, 682–93.
  15. Jung S. & Marron J. S. (2009). PCA consistency in high dimension, low sample size context. Ann. Statist. 37, 4104–30.
  16. Lee S., Zou F. & Wright F. A. (2010). Convergence and prediction of principal component scores in high-dimensional settings. Ann. Statist. 38, 3605–29.
  17. Lee S., Zou F. & Wright F. A. (2014). Convergence of sample eigenvalues, eigenvectors, and principal component scores for ultra-high dimensional data. Biometrika 101, 484–90.
  18. Leek J. T. (2011). Asymptotic conditional singular value decomposition for high-dimensional genomic data. Biometrics 67, 344–52.
  19. Leek J. T., Scharpf R. B., Bravo H. C., Simcha D., Langmead B., Johnson W. E., Geman D., Baggerly K. & Irizarry R. A. (2010). Tackling the widespread and critical impact of batch effects in high-throughput data. Nature Rev. Genet. 11, 733–9.
  20. Leek J. T. & Storey J. D. (2007). Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet. 3, e161.
  21. Leek J. T. & Storey J. D. (2008). A general framework for multiple testing dependence. Proc. Nat. Acad. Sci. 105, 18718–23.
  22. Li J., Liu Y., Kim T., Min R. & Zhang Z. (2010). Gene expression variability within and between human populations and implications toward disease susceptibility. PLoS Comp. Biol. 6, e1000910.
  23. Listgarten J., Kadie C., Schadt E. E. & Heckerman D. (2010). Correction for hidden confounders in the genetic analysis of gene expression. Proc. Nat. Acad. Sci. 107, 16465–70.
  24. Maksimovic J., Gagnon-Bartsch J. A., Speed T. P. & Oshlack A. (2015). Removing unwanted variation in a differential methylation analysis of Illumina HumanMethylation450 array data. Nucleic Acids Res. 43, e106.
  25. Parker H. S., Bravo H. C. & Leek J. T. (2014). Removing batch effects for prediction problems with frozen surrogate variable analysis. PeerJ 2, e561.
  26. Price A. L., Patterson N. J., Plenge R. M., Weinblatt M. E., Shadick N. A. & Reich D. (2006). Principal components analysis corrects for stratification in genome-wide association studies. Nature Genet. 38, 904–9.
  27. R Development Core Team (2017). R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing. ISBN 3-900051-07-0, http://www.R-project.org.
  28. Spielman R. S., Bastone L. A., Burdick J. T., Morley M., Ewens W. J. & Cheung V. G. (2007). Common genetic variants account for differences in gene expression among ethnic groups. Nature Genet. 39, 226–31.
  29. Stegle O., Parts L., Durbin R. & Winn J. (2010). A Bayesian framework to account for complex non-genetic factors in gene expression levels greatly increases power in eQTL studies. PLoS Comp. Biol. 6, e1000770.
  30. Sun Y., Zhang N. R. & Owen A. B. (2012). Multiple hypothesis testing adjusted for latent variables, with an application to the AGEMAP gene expression data. Ann. Appl. Statist. 6, 1664–88.
  31. Teschendorff A. E., Menon U., Gentry-Maharaj A., Ramus S. J., Weisenberger D. J., Shen H., Campan M., Noushmehr H., Bell C. G., Maxwell A. P. et al. (2010). Age-dependent DNA methylation of genes that are suppressed in stem cells is a hallmark of cancer. Genome Res. 20, 440–6.
  32. Teschendorff A. E., Zhuang J. & Widschwendter M. (2011). Independent surrogate variable analysis to deconvolve confounding factors in large-scale microarray profiling studies. Bioinformatics 27, 1496–505.

Articles from Biometrika are provided here courtesy of Oxford University Press
